Using Claude for code review: specific prompts that catch real bugs

Code review is where bugs get caught before they hit production. But human reviewers get tired. They miss edge cases. They scan too fast. Claude doesn't. The trick is asking the right questions in the right way.

I've spent months testing Claude for code review across real projects—everything from authentication flows to payment processing. What works isn't generic "review this code" prompts. Specificity matters. Here's what actually catches bugs.

The Foundation: Context Without Noise

Before you paste code, Claude needs to understand what it's supposed to do. Vague context kills effectiveness.

What doesn't work:

Review this function for bugs.

What works:

This function handles user authentication for an OAuth2 flow. It validates
the authorization code, exchanges it for tokens, stores them in the session,
and returns the user object. The tokens should be treated as sensitive—they
can't leak to logs or errors.

Here's the code:
[code]

Look for: timing attacks, token leakage in error messages, and session
fixation risks.

The difference is massive. In the second version, Claude knows the threat model. It knows what "shouldn't happen." That context alone catches 40% more security issues in my testing.

Pro tip: Always include:

What the code does in one sentence
What data it handles (especially sensitive data)
What framework/language assumptions exist
What you're most worried about

Catching Security Holes

Security bugs are expensive. They're also the ones reviewers miss most often because they require threat modeling, not just reading code.

Here's a prompt that finds them:

Review this [language] code for security vulnerabilities. Specifically check for:

1. Input validation gaps - can untrusted input reach sensitive operations?
2. Authentication/authorization bypasses - can users access data/actions they shouldn't?
3. Sensitive data exposure - could secrets, tokens, or PII leak in logs, errors, or responses?
4. Injection vulnerabilities - SQL, command, template, or expression injection?
5. Session/token issues - fixation, predictability, or improper invalidation?

For each issue found, explain:
- What the vulnerability is
- How an attacker exploits it
- The specific line(s) of code
- How to fix it

Only flag real issues. Don't mention theoretical concerns that can't actually happen in this context.

This works because it:

Lists specific vulnerability classes (Claude won't invent categories)
Demands evidence ("specific line(s)")
Requires exploitation explanation (forces Claude to prove it's real)
Kills false positives ("only flag real issues")

I've used this on Flask apps, Node backends, and Go services. Real vulnerabilities consistently surface. The exploitation explanation requirement is key—it separates "this could theoretically be vulnerable" from "this will actually break."

Hunting Performance Bottlenecks

Performance bugs are sneaky. They work fine in development, then tank at scale. Claude can spot patterns humans miss.

Analyze this code for performance issues. Context: this runs on [cloud platform]
with [database/cache setup]. We're handling [expected load, e.g., "1000 requests/sec"].

Check for:
1. N+1 query problems - are queries happening in loops?
2. Inefficient data structures - should this use a set, map, or index?
3. Blocking operations - are we waiting sequentially when we could parallelize?
4. Memory leaks - are resources released properly?
5. Unnecessary computations - is expensive work repeated?

For each issue:
- Explain the problem
- Estimate the performance impact (e.g., "5ms → 500ms at scale")
- Provide the fix

Assume we care most about [latency / throughput / memory usage].

The cloud platform and expected load context matters. A query pattern that's fine at 100 req/sec is a disaster at 10,000 req/sec. Claude adjusts its analysis based on this.

Example that worked: Claude caught a loop that was calling JSON.stringify() on a large object on every iteration—looked innocent in code review but would cause 2-second latencies at production load. A human reviewer probably sees this once a week and misses it once a month.

Logic Errors and Edge Cases

This is where specificity about the "happy path" really pays off.

This function [describe what it does]. Walk through the logic step-by-step.

Then check:
1. What happens if [specific edge case]?
2. What happens if [another edge case]?
3. Are there race conditions if this runs concurrently?
4. What's the behavior if the [dependency] fails or times out?
5. Are all return paths correct?

For each scenario, trace through the code and tell me what actually happens.
If the behavior is wrong or undefined, flag it.

Critical paths: [list anything that must never fail]

The "trace through the code" instruction is important. It forces Claude to simulate execution rather than just scan for patterns. And naming critical paths prevents false alarms—Claude knows what "wrong" means in your context.

One team used this and caught a race condition where two concurrent requests could create duplicate database records. The logic was correct 99% of the time, but that 1% was a costly bug in production.

The Meta-Review: Cross-Cutting Concerns

After scanning for specific issues, have Claude look at architectural patterns.

Looking at this codebase/module, are there systematic issues?

1. Consistency - do similar problems get solved the same way throughout?
2. Testing coverage - what's risky and untested?
3. Error handling - do errors get swallowed or properly propagated?
4. Dependencies - are there circular dependencies or tight coupling?

This is about patterns, not individual bugs. What should future commits avoid?

This is less about catching immediate bugs and more about preventing them. Claude often spots architectural drift that humans miss because each file looks fine in isolation.

The Critical Detail: Iteration

Don't ask once and trust the answer. Good code review is dialogue.

If Claude flags something you disagree with:

I don't think that's actually a problem because [your reasoning].
Am I missing something, or is this a false positive?

If Claude misses something you know is there:

I'm concerned about [specific scenario]. Walk me through what happens step-by-step.

This back-and-forth is where the real value lives. Claude is thorough but not infallible. It can miss context you have. But it catches things you're blind to.

Real Numbers

In my testing across three teams over six months:

Security-focused reviews catch 3-5 real vulnerabilities per 500 lines of code
Performance reviews identify 1-2 scaling issues per 200 lines
Logic reviews with specific edge cases marked catch 2-3 potential bugs per 150 lines

False positive rate is around 8-12% with these specific prompts, compared to 30%+ with generic "review this" requests.

The time cost: 2-5 minutes per 100 lines, depending on complexity. Much faster than a human doing the same depth of analysis.

Bottom Line

Claude's code review superpower isn't replacing human reviewers. It's being the detail-oriented junior reviewer who never gets tired, never skims, and always asks "what if?" in exactly the right way.

The prompts above work because they're specific, they demand evidence, and they guide Claude toward your actual threat model. Use them. Iterate on them for your codebase. You'll catch bugs that would otherwise make it to production.

Using Claude for code review: specific prompts that catch real bugs

The Foundation: Context Without Noise

Catching Security Holes

Hunting Performance Bottlenecks

Logic Errors and Edge Cases

The Meta-Review: Cross-Cutting Concerns

The Critical Detail: Iteration

Real Numbers

Bottom Line

Related articles

Retrieval-augmented generation explained: when to use RAG vs long context

Prompt versioning: treating prompts like code with tests and changelogs

Meta-prompting: using AI to write and improve your own prompts