
2 posts tagged with "development"


· 7 min read

One real benefit of modern LLMs is that you can use them as a linter or a pair programmer. Getting feedback on your code is easy: share it with the AI tool and ask a question. If the feedback is solid, your code improves. If it is poor, you can disregard it. Either way, you can gain a lot for very little effort.

In this article we focus on code review done with AI tools. We explore a few solutions available as of February 2026 and compare them based on our experience. This is not a thorough analysis, and we are not running any evals.

· 4 min read

We evaluated three Claude frontier models: Sonnet 4.6, Opus 4.6, and Opus 4.8. We ran them on a fixed set of five RHEL triage issues using our end-to-end test harness, which validates our triaging agent against known issues. Triage is the first and most consequential step in the workflow: the agent reads a Jira CVE issue and decides how it should be resolved (backport an upstream patch, rebase to a newer version, or mark it as not-affected). A wrong call here invalidates everything downstream, which is why it's the natural starting point for a model benchmark.

The 4.6 models ran with REASONING_EFFORT=high, which enables native extended thinking. Opus 4.8 ran without it — a LiteLLM/BeeAI provider ID mismatch caused the model to reject the thinking parameter, so we disabled it entirely (tracked here). The harness runs all five issues concurrently and captures per-issue metrics from the agent framework: wall-clock duration, tool call count, and total token usage.
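To make the measurement concrete, here is a minimal sketch of how a harness could run issues concurrently and record the three per-issue metrics named above. It is not our actual harness: `IssueMetrics`, `fake_triage`, `run_one`, and `run_all` are hypothetical names, and `fake_triage` is a stand-in that returns made-up counts instead of calling a real agent.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class IssueMetrics:
    """Per-issue measurements (hypothetical structure, for illustration)."""
    issue_key: str
    duration_s: float   # wall-clock duration of the triage run
    tool_calls: int     # number of tool calls the agent made
    total_tokens: int   # total token usage reported for the run


async def fake_triage(issue_key: str) -> tuple[int, int]:
    """Hypothetical stand-in for the triage agent.

    Returns (tool_calls, total_tokens) with placeholder values; a real
    harness would read these from the agent framework instead.
    """
    await asyncio.sleep(0.01)
    return 3, 1000 + len(issue_key)


async def run_one(issue_key: str) -> IssueMetrics:
    """Run triage for one issue and time it with a monotonic clock."""
    start = time.perf_counter()
    tool_calls, total_tokens = await fake_triage(issue_key)
    duration = time.perf_counter() - start
    return IssueMetrics(issue_key, duration, tool_calls, total_tokens)


async def run_all(issue_keys: list[str]) -> list[IssueMetrics]:
    """Run all issues concurrently; results keep the input order."""
    return await asyncio.gather(*(run_one(key) for key in issue_keys))
```

Calling `asyncio.run(run_all([...]))` with a list of issue keys returns one `IssueMetrics` record per issue. Because the issues run concurrently, each wall-clock duration includes any time spent waiting on shared resources, which is worth keeping in mind when comparing durations across models.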

This is the first time we've done this kind of work and it was quite a learning experience. This blog post contains data from a single run with each model.

I already have ideas for improvement for a future run. We need to extend the scope to the backporting agent as well and have at least ten cases. Doing multiple runs per model with aggregation would also give us more grounded results.
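As a rough illustration of what aggregation over multiple runs could look like, the sketch below summarizes one metric per model with a mean and a sample standard deviation. The function name `aggregate` and the input shape (model name mapped to a list of per-run values) are assumptions made for this example, not part of the existing harness.

```python
from statistics import mean, stdev


def aggregate(runs: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Summarize one metric per model across repeated runs.

    `runs` maps a model name to the metric value from each run
    (hypothetical input shape chosen for this sketch).
    """
    summary: dict[str, dict[str, float]] = {}
    for model, values in runs.items():
        summary[model] = {
            "n": len(values),
            "mean": mean(values),
            # Sample standard deviation needs at least two data points.
            "stdev": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

With a single run per model, as in this post, the spread cannot be estimated at all, which is the main argument for repeating runs before drawing conclusions.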