A real-world benchmark for AI code review

falloutx

Company creates a benchmark. Same company is best in that benchmark.

Story as old as time.

mattvv

Some feedback for the team: I looked at the pricing page and saw that it is more expensive ($30/dev/mo) and highly limiting (20 PRs per month per user). We have devs putting up that many PRs in a single day. With this kind of plan, there is pretty much no way we would even try this product.

esafak

I'm not as cynical as the others here; if there are no popular code review benchmarks why should they not design one?

Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...

> We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]

> Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.

> To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.

> Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
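To make the judge step in that quote concrete, here is a minimal sketch of what "remove duplicates, filter low-signal results" could look like. This is not Qodo's implementation; the `Finding` fields, the agent names, and the 0.8 threshold are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str         # hypothetical name of the specialized agent
    file: str
    line: int
    message: str
    confidence: float  # assumed to be a self-reported score in [0, 1]


def judge(findings, threshold=0.8):
    """Deduplicate findings and drop those below the confidence threshold.

    Two findings count as duplicates when they point at the same file and
    line with the same normalized message; the higher-confidence one wins.
    """
    best = {}
    for f in findings:
        key = (f.file, f.line, f.message.strip().lower())
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    kept = [f for f in best.values() if f.confidence >= threshold]
    return sorted(kept, key=lambda f: (f.file, f.line))
```

A real judge would presumably need fuzzier duplicate detection than exact message matching, which is where an LLM pass would come in.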

mbesto

Cmd+F - "Overfitting"...nothing.

Nope, no mention of how they do anything to alleviate overfitting. These benchmarks are getting tiresome.

zhubert

I'm trying to bring a slightly different take to the pricing of ShipItAI (https://shipitai.dev, brazen plug). I've got a $5/mo/active dev + Bring Your Own Key option for those that want better price controls.

Still early in development and has a much simpler goal, but I like simple things that work well.

mohsen1

> Qodo takes a different approach by starting with real, merged PRs

Merged PRs being considered good code?

CuriouslyC

I don't think LLMs are the right tool for pattern enforcement in general; it's better to get them to create custom lint rules.

Agents are pretty good at suggesting ways to improve a piece of code, though. If you get a bunch of agents to wear different hats and debate improvements to a piece of software, it can produce some very useful insights.
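As a sketch of the kind of custom lint rule I mean, here is a deterministic check built on Python's standard `ast` module. The rule itself (flagging bare `except:` clauses) and the names `BareExceptRule` and `lint` are just an illustrative choice, not taken from any particular tool.

```python
import ast


class BareExceptRule(ast.NodeVisitor):
    """Collects a violation for every `except:` clause with no exception type."""

    def __init__(self):
        self.violations = []

    def visit_ExceptHandler(self, node):
        # node.type is None only for a bare `except:` clause
        if node.type is None:
            self.violations.append(
                (node.lineno, "bare 'except:' hides errors; catch a specific exception")
            )
        self.generic_visit(node)


def lint(source):
    """Return a list of (line_number, message) pairs for the given source."""
    rule = BareExceptRule()
    rule.visit(ast.parse(source))
    return rule.violations
```

Once a rule like this exists it runs the same way on every PR, which is the property you want for pattern enforcement.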

mdeeks

I feel like pricing needs to be included here. I kind of don't care about 10 percentage points if the cost is dramatically higher. Cursor Bugbot is about the same cost but gives 10x the monthly quota of Qodo.

I know this is focused solely on performance, but cost is a major factor here.
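A rough per-review cost comparison, as a sketch. The $30/dev/mo price and 20 PR quota are the figures quoted elsewhere in this thread; treating "about the same cost" as exactly $30 and "10x the quota" as 200 PRs is my assumption, not a published price.

```python
def cost_per_review(monthly_price, monthly_quota):
    """Effective price per reviewed PR if the whole quota is used."""
    return monthly_price / monthly_quota


# Figures from this thread: $30/dev/mo, 20 PRs per month per user.
qodo = cost_per_review(30, 20)

# Assumption: same $30 price with 10x the quota (200 PRs).
ten_x_quota = cost_per_review(30, 200)

print(f"${qodo:.2f} vs ${ten_x_quota:.2f} per reviewed PR")
```

Under those assumptions the gap is an order of magnitude per review, which matters a lot more to me than 10 percentage points on a benchmark.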

logicx24

Where's the code for this? I'd love to run our tool, https://tachyon.so/, against it.

kachapopopow

CodeRabbit being the worst while (presumably) advertising the most seems to check out, at least. I wouldn't believe the recall %, though; it seems bogus.

aetherspawn

Your pricing page has a bug on it: the annual price is higher than the monthly price.
