I Replaced My Entire Code Review Process with AI — Here’s What Actually Happened

This post contains affiliate links.

Six months ago I made a decision that raised some eyebrows in our Slack channel: I told my team we were going to stop doing traditional async code reviews for most PRs and let AI handle the first pass. Everyone assumed I’d gone off the deep end. Today, our review turnaround time is down 70%, and honestly, the quality of the feedback has improved.

Let me be clear about what I mean before you close this tab thinking I’ve lost the plot. I’m not saying AI replaces human judgment. I’m saying AI is now the first reviewer — the one that catches the obvious stuff, enforces style consistency, flags potential bugs, and asks the dumb-but-important questions. Humans come in for architecture decisions, business logic, and anything that requires actual context about why we made a decision two years ago.

Here’s exactly how I got here, what broke along the way, and the workflow I actually use today.

Why Traditional Code Review Was Already Broken

Let me paint you a picture. Developer opens a PR. It sits for a day while everyone’s busy. Someone finally reviews it, leaves three comments about semicolons and one about a potential null reference. Developer fixes it. PR sits another day. Gets merged. Two weeks later someone finds a bug that a more thorough review would have caught, but nobody had time.

This wasn’t a people problem. It was a process problem. Code review had become a bottleneck that everyone resented. Reviewers felt guilty about the queue. Authors felt anxious waiting. The feedback loop was too slow to actually improve anyone’s code in a meaningful way.

When I started using Cursor heavily for my own coding, I noticed something: the AI feedback I was getting inline while writing code was often more thorough than the async review I’d get from a colleague two days later. That’s not an insult to my colleagues — it’s a structural observation. Real-time feedback during development is just fundamentally more useful than delayed feedback after the fact.

The First Attempt (And Why It Failed)

My first approach was naive. I set up a GitHub Action that piped the diff to the OpenAI API and posted comments. Technically it worked. Practically it was a disaster.

The AI would leave 40+ comments on a 200-line PR. Developers were overwhelmed. Half the comments were irrelevant given our specific codebase context. It had no idea we intentionally used a certain pattern in our auth module because of a legacy integration. It flagged things as bugs that were deliberate design decisions.

The team hated it within a week. I had to pull it back.

The problem wasn’t the AI — it was that I’d given it no context. I was treating it like a generic linter rather than a knowledgeable reviewer.

The Workflow That Actually Works

Here’s what I rebuilt after that failure. The key insight was that AI code review needs the same context a good human reviewer needs — maybe more.

Step 1: The Context File

Every repository now has a REVIEW_CONTEXT.md file in the root. This is not documentation for humans — it’s documentation for AI reviewers. It covers:

  • Known intentional patterns that might look wrong
  • Areas of the codebase that are legacy and shouldn’t be refactored
  • Our specific error handling conventions
  • Performance constraints that affect certain decisions
  • Things we explicitly don’t care about in reviews

This file gets included in every AI review prompt. It’s maybe 300-500 words per repo. Writing it took a couple of hours and saved us weeks of false positives.
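For illustration, here is a minimal sketch of loading the context file and checking that it stays within the word range mentioned above. It is written in Python for brevity and is not the production service. The function names and the 500-word ceiling as a hard limit are assumptions made for this example.

```python
from pathlib import Path

# Assumed budget, taken from the upper end of the 300-500 word range.
MAX_CONTEXT_WORDS = 500


def load_review_context(repo_root: str, filename: str = "REVIEW_CONTEXT.md") -> str:
    """Return the context file's text, or an empty string if the repo has none."""
    path = Path(repo_root) / filename
    if not path.is_file():
        return ""
    return path.read_text(encoding="utf-8").strip()


def context_word_count(text: str) -> int:
    """Count whitespace-separated words, a rough proxy for prompt size."""
    return len(text.split())


def is_within_budget(text: str, limit: int = MAX_CONTEXT_WORDS) -> bool:
    """True when the context is short enough to include in every prompt."""
    return context_word_count(text) <= limit
```

A check like this could run in CI so the file doesn't quietly grow into a second wiki that bloats every review prompt.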

Step 2: Tiered Review Logic

Not every PR gets the same treatment. I classify PRs into three tiers:

Tier 1 — AI Only: Changes under 100 lines, no changes to core business logic, touches only one module. AI reviews it, posts summary, auto-approves if no critical issues found. A human can still review if they want, but there’s no requirement.

Tier 2 — AI First, Human Required: Most PRs fall here. AI does the first pass, posts a structured review comment, and then a human reviewer uses that as a starting point rather than reviewing from scratch.

Tier 3 — Human Led: Architecture changes, security-sensitive code, anything touching payments or auth. AI still provides input but a senior dev leads the review.

This tiering alone cut the review burden nearly in half: roughly 40% of our PRs are legitimately Tier 1 and need no human reviewer at all.
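The tier rules above can be sketched as a small classifier. This is an illustrative Python version, not the real service code. The path prefixes, the "first two path segments" definition of a module, and the type names are all hypothetical. Architecture changes can't be detected from paths alone, so in practice a label or manual override would be needed to push a PR up to Tier 3.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AI_ONLY = 1      # Tier 1
    AI_FIRST = 2     # Tier 2
    HUMAN_LED = 3    # Tier 3


@dataclass
class PullRequest:
    changed_lines: int
    changed_paths: list[str]


# Hypothetical path prefixes; substitute your own repo layout.
SENSITIVE_PREFIXES = ("src/auth/", "src/payments/")
CORE_LOGIC_PREFIXES = ("src/core/",)
MAX_TIER1_LINES = 100  # "under 100 lines" from the Tier 1 description


def top_level_module(path: str) -> str:
    """Treat the first two path segments as the module, e.g. 'src/ui'."""
    return "/".join(path.split("/")[:2])


def classify(pr: PullRequest) -> Tier:
    paths = pr.changed_paths
    # Anything touching payments or auth is always human led.
    if any(p.startswith(SENSITIVE_PREFIXES) for p in paths):
        return Tier.HUMAN_LED
    modules = {top_level_module(p) for p in paths}
    small = pr.changed_lines < MAX_TIER1_LINES
    touches_core = any(p.startswith(CORE_LOGIC_PREFIXES) for p in paths)
    if small and len(modules) == 1 and not touches_core:
        return Tier.AI_ONLY
    # Everything else gets the AI first pass plus a required human review.
    return Tier.AI_FIRST
```

Checking the sensitive paths first means a tiny change to auth can never slip into the auto-approve tier, whatever its size.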

Step 3: The Prompt Structure

This is the part most people skip but it makes all the difference. Here’s the actual structure I use:

You are reviewing a pull request for [PROJECT NAME].

Context about this codebase:
[REVIEW_CONTEXT.md contents]

PR Description:
[PR description]

Changed files:
[diff]

Your review should:
1. Identify actual bugs or logic errors (HIGH priority)
2. Flag security concerns (HIGH priority)
3. Note performance issues with specific impact estimates (MEDIUM)
4. Suggest readability improvements briefly (LOW)
5. Skip style issues — we have automated linting

Format your response as:
- Summary (2-3 sentences max)
- Critical Issues (if any)
- Suggestions (max 5, ordered by impact)
- What looks good

Do not comment on things already covered in our REVIEW_CONTEXT.md as known patterns.

The explicit format instruction is crucial. Without it you get walls of text. With it you get something a developer can scan in 30 seconds and act on.
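As a rough sketch, the template above could be assembled in code like this. It is Python for illustration only. The function name, the placeholder strings for missing inputs, and the diff-size cap are assumptions, not part of the actual service.

```python
PROMPT_TEMPLATE = """\
You are reviewing a pull request for {project}.

Context about this codebase:
{context}

PR Description:
{description}

Changed files:
{diff}

Your review should:
1. Identify actual bugs or logic errors (HIGH priority)
2. Flag security concerns (HIGH priority)
3. Note performance issues with specific impact estimates (MEDIUM)
4. Suggest readability improvements briefly (LOW)
5. Skip style issues (we have automated linting)

Format your response as:
- Summary (2-3 sentences max)
- Critical Issues (if any)
- Suggestions (max 5, ordered by impact)
- What looks good

Do not comment on things already covered in our REVIEW_CONTEXT.md as known patterns.
"""


def build_review_prompt(
    project: str,
    context: str,
    description: str,
    diff: str,
    max_diff_chars: int = 60_000,  # assumed cap to keep the prompt bounded
) -> str:
    """Fill the review template; braces inside the diff are left untouched."""
    if len(diff) > max_diff_chars:
        diff = diff[:max_diff_chars] + "\n[diff truncated]"
    return PROMPT_TEMPLATE.format(
        project=project,
        context=context or "(none provided)",
        description=description or "(no description)",
        diff=diff,
    )
```

Keeping the template as one constant makes it easy to review changes to the prompt itself in a PR, like any other code.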

The Tooling Stack

For the implementation, I use a combination of GitHub Actions and a small Node.js service I host on DigitalOcean — a basic $6/month droplet handles the load fine for a team of eight. The service manages prompt construction, caching (so we’re not re-reviewing unchanged files), and posting the formatted review back to GitHub.

The GitHub Action triggers on PR open and on each new push to an open PR. It calls my service, which calls the API, and posts a structured comment. Total latency from PR open to first review comment is under 90 seconds.
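For the "posts a structured comment" step, one plausible approach is GitHub's REST endpoint for issue comments, which also handles top-level PR comments. The sketch below is Python using only the standard library, and it builds the request without sending it. The function name and the choice of this endpoint are assumptions about how such a service might work, not a description of the real one.

```python
import json
import urllib.request


def build_comment_request(
    repo: str, pr_number: int, body: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) the request that posts a PR comment.

    `repo` is "owner/name". Top-level PR comments go through the
    issues endpoint of GitHub's REST API.
    """
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    payload = json.dumps({"body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it. Separating construction from sending keeps the function easy to test without network access.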

Here’s the simplified Action config:

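name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history so the base branch is available to diff against

      - name: Generate diff
        run: |
          git diff "origin/${{ github.base_ref }}...HEAD" > pr_diff.txt

      - name: Run AI review
        env:
          REVIEW_SERVICE_URL: ${{ secrets.REVIEW_SERVICE_URL }}
          REVIEW_SERVICE_KEY: ${{ secrets.REVIEW_SERVICE_KEY }}
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
          # Payload shape (repo, pr_number, diff) is illustrative;
          # send whatever fields your review service expects.
          jq -n --arg repo "$REPO" --argjson pr_number "$PR_NUMBER" \
            --rawfile diff pr_diff.txt \
            '{repo: $repo, pr_number: $pr_number, diff: $diff}' |
          curl --fail -sS -X POST "$REVIEW_SERVICE_URL/review" \
            -H "Authorization: Bearer $REVIEW_SERVICE_KEY" \
            -H "Content-Type: application/json" \
            -d @-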

The service itself is about 200 lines of Node.js. It's not worth open-sourcing at this point since it's pretty specific to our setup, but the architecture is simple enough to replicate.
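If you do replicate it, the caching piece is probably the least obvious part. Here is a minimal sketch of one way to avoid re-reviewing unchanged files: split the diff per file and key each review on a hash of that file's diff plus the context file. It is in-memory Python for illustration, and the class, function names, and keying scheme are assumptions rather than the real implementation.

```python
import hashlib
from typing import Callable, Optional


def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split `git diff` output into {path: file-level diff} using 'diff --git' headers."""
    chunks: dict[str, str] = {}
    current_path: Optional[str] = None
    lines: list[str] = []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            if current_path is not None:
                chunks[current_path] = "".join(lines)
            # Header looks like: diff --git a/path b/path
            current_path = line.rstrip("\n").split(" b/", 1)[-1]
            lines = []
        lines.append(line)
    if current_path is not None:
        chunks[current_path] = "".join(lines)
    return chunks


class ReviewCache:
    """In-memory cache of per-file reviews, keyed by a content hash."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def key(path: str, file_diff: str, context: str) -> str:
        h = hashlib.sha256()
        for part in (path, file_diff, context):
            h.update(part.encode("utf-8"))
            h.update(b"\0")  # separator so adjacent parts cannot run together
        return h.hexdigest()

    def get_or_review(
        self,
        path: str,
        file_diff: str,
        context: str,
        review_fn: Callable[[str, str], str],
    ) -> str:
        """Return the cached review, calling review_fn only on a cache miss."""
        k = self.key(path, file_diff, context)
        if k not in self._store:
            self._store[k] = review_fn(path, file_diff)
        return self._store[k]
```

Including the context file in the key means that editing REVIEW_CONTEXT.md invalidates old reviews, which is usually what you want. A real service would swap the dictionary for something that survives restarts.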

What AI Review Is Actually Good At

After six months of data, here's where AI review consistently adds value:

Null/undefined edge cases. AI catches these constantly. Things like "what happens when this array is empty" or "this function assumes the user object exists but there's a path where it might not."

Inconsistent error handling. We throw in some places and return error objects in others. AI flags these inconsistencies every time.

Missing input validation. Especially on API endpoints. AI will almost always ask "where is this input validated?" which is the right question.

Obvious performance issues. N+1 queries, unnecessary loops inside loops, that kind of thing.
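To make the first category concrete, here is a made-up example (Python, not taken from any real codebase) of the "what happens when this array is empty" question. The function names are hypothetical.

```python
from typing import Optional


def average_latency_unsafe(samples: list[float]) -> float:
    # Raises ZeroDivisionError when samples is empty.
    return sum(samples) / len(samples)


def average_latency(samples: list[float]) -> Optional[float]:
    """Return the mean, or None when there is nothing to average."""
    if not samples:
        return None
    return sum(samples) / len(samples)
```

The first version works on every happy-path test and fails the first time a dashboard loads with no data. That's exactly the kind of bug a tired human reviewer skims past.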

What It's Still Bad At

I want to be honest here because I see a lot of hype that skips this part.

AI review is bad at understanding why something was built a certain way. It can't know that we have a weird auth pattern because of a specific enterprise customer requirement from 2023. It doesn't understand team dynamics or that a certain developer is learning and needs encouragement alongside criticism.

It also sometimes confidently flags things that are completely fine, which still requires a human to evaluate. This is less of a problem with good context files, but it doesn't go away entirely.

And for anything involving system design tradeoffs, it gives generic advice that often isn't applicable to our specific constraints. "You should use a queue here" is correct in the abstract and wrong for our situation.

The Honest Numbers

Before this workflow: average 18 hours from PR open to first human review comment. After: 90 seconds to AI review, average 4 hours to human review on Tier 2 PRs (because reviewers are using the AI summary as a starting point). Tier 1 PRs are fully handled with no human review time at all.

The number of bugs caught in review (AI + human combined) is up about 20% compared to six months ago. I attribute this mostly to the fact that AI catches the boring-but-important stuff, freeing human reviewers to think about the more complex issues.

Getting Started If You Want to Try This

Don't start with the automation. Start with manually feeding your PRs to an AI reviewer for two weeks. Use Cursor's chat feature or just paste diffs into Claude or GPT-4o. Get a feel for the quality of feedback you're getting before you invest in building the pipeline.

Write your context file first. Seriously, this is the highest-leverage thing you can do. Before you write a single line of automation code, document what an AI reviewer needs to know about your codebase.

Then tier your PRs manually. For a month, just decide for each PR whether a human review is actually required. You'll find the answer is "no" more often than you expect, and that insight shapes how you build the automation.

The goal isn't to eliminate human code review. The goal is to make human code review faster, more focused, and less of a bottleneck. Six months in, I'm convinced that's achievable, as long as you build it thoughtfully rather than throwing a diff at the API and hoping for the best.

If you're building something similar or have questions about the service architecture, drop a comment below or reach out on the WithStack Discord.