We evaluated three Claude frontier models (Sonnet 4.6, Opus 4.6, and Opus 4.8) on a fixed set of five RHEL triage issues using our end-to-end test harness, which validates our triaging agent against known issues. Triage is the first and most consequential step in the workflow. The agent reads a Jira CVE issue and decides how it should be resolved: backport an upstream patch, rebase to a newer version, or mark it as not-affected. A wrong call here invalidates everything downstream, which is why it's the natural starting point for a model benchmark.
The 4.6 models ran with REASONING_EFFORT=high, which enables native extended thinking. Opus 4.8 ran without it: a LiteLLM/BeeAI provider ID mismatch caused the model to reject the thinking parameter, so we disabled it entirely (tracked here). The harness runs all five issues concurrently and captures per-issue metrics from the agent framework: wall-clock duration, tool call count, and total token usage.
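To make the measurement setup concrete, here is a minimal sketch of running the five issues concurrently and recording the three per-issue metrics. It is not our harness code: triage_issue, IssueMetrics, and run_all are hypothetical names, and the agent call is replaced by a short sleep that returns placeholder counts.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class IssueMetrics:
    issue: str
    duration_s: float
    tool_calls: int
    total_tokens: int


async def triage_issue(issue: str) -> tuple[int, int]:
    """Hypothetical stand-in for the real agent call.

    Returns (tool_calls, total_tokens); the values are placeholders.
    """
    await asyncio.sleep(0.01)  # simulate agent work
    return 0, 0


async def run_one(issue: str) -> IssueMetrics:
    # Wall-clock duration is measured around the whole agent call
    start = time.perf_counter()
    tool_calls, total_tokens = await triage_issue(issue)
    return IssueMetrics(issue, time.perf_counter() - start, tool_calls, total_tokens)


async def run_all(issues: list[str]) -> list[IssueMetrics]:
    # gather() runs all issues concurrently and preserves input order
    return await asyncio.gather(*(run_one(issue) for issue in issues))


ISSUES = ["RHEL-15216", "RHEL-112546", "RHEL-177992", "RHEL-114607", "RHEL-174694"]

if __name__ == "__main__":
    for m in asyncio.run(run_all(ISSUES)):
        print(f"{m.issue}: {m.duration_s:.2f}s, {m.tool_calls} tool calls, {m.total_tokens} tokens")
```

Because the issues run concurrently, per-issue wall-clock time can be influenced by the other issues competing for the same API rate limits, which is worth keeping in mind when reading the durations.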
This is the first time we've done this kind of work, and it was quite a learning experience. This blog post contains data from a single run with each model.
I already have ideas for improving a future run. We need to extend the scope to the backporting agent as well and cover at least ten cases. Doing multiple runs per model and aggregating the results would also give us more grounded numbers.
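As a rough idea of what aggregation over multiple runs could look like, the sketch below groups duration samples by model and issue and summarizes each group. The aggregate function and the sample values are assumptions for illustration only. The samples are made-up placeholders, not measured data.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def aggregate(runs):
    """Group (model, issue, duration_s) samples and summarize each group."""
    grouped = defaultdict(list)
    for model, issue, duration_s in runs:
        grouped[(model, issue)].append(duration_s)
    summary = {}
    for key, samples in grouped.items():
        summary[key] = {
            "n": len(samples),
            "mean": mean(samples),
            "median": median(samples),
            # stdev() needs at least two samples
            "stdev": stdev(samples) if len(samples) > 1 else 0.0,
        }
    return summary


# Placeholder samples, NOT measured data: three made-up runs of one model on one issue
example_runs = [
    ("model-a", "ISSUE-1", 100.0),
    ("model-a", "ISSUE-1", 110.0),
    ("model-a", "ISSUE-1", 120.0),
]
stats = aggregate(example_runs)[("model-a", "ISSUE-1")]
print(stats)
```

Reporting the median next to the mean would help with a task this non-deterministic, since a single slow run can pull the mean considerably.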
Analysis
Resolutions
All three models reached identical conclusions on all five issues. The triage decisions were unambiguous: three backports (RHEL-15216, RHEL-112546, RHEL-177992) and two not-affected (RHEL-114607, RHEL-174694). RHEL-114607 is worth highlighting — the issue concerns CVE-2025-59375 in expat, which only affects versions before 2.7.2. RHEL 10 already ships 2.7.3, so the models correctly concluded not-affected rather than recommending a rebase to 2.7.5.
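The not-affected reasoning for RHEL-114607 boils down to a version comparison, which can be sketched as follows. This is a simplified illustration and not how the agent decides: parse_version and is_affected are hypothetical helpers, and real RPM version handling (release tags, downstream backports) is more involved than comparing dotted numbers.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.7.3' into (2, 7, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_affected(shipped: str, first_fixed: str) -> bool:
    """A package is affected only if it ships a version older than the first fixed one."""
    return parse_version(shipped) < parse_version(first_fixed)


# Values from the text: the CVE affects expat before 2.7.2, RHEL 10 ships 2.7.3
print(is_affected("2.7.3", "2.7.2"))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "2.10.0" would sort before "2.7.2".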
Speed
Sonnet 4.6 was the fastest model on four of the five issues, sometimes by a large margin. RHEL-112546 illustrates the biggest spread: Sonnet resolved it in 98s with just 9 tool calls, while Opus 4.6 took 267s and 37 tool calls, and Opus 4.8 took 154s and 25 tool calls. Both Opus models invested significantly more effort on this libtiff CVE, exploring more patch URLs and performing deeper code analysis.
RHEL-174694 shows the opposite pattern: Opus 4.8 took 389s with 17 tool calls, more than twice Sonnet's 167s, despite using fewer tool calls. Since extended thinking was disabled for Opus 4.8 in this run, a thinking budget cannot explain the gap. The model appears to have spent far longer per step before settling on a conclusion that Sonnet reached more directly.
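One way to compare these runs is to divide duration by tool call count. The sketch below does that for the figures quoted above. The seconds_per_tool_call helper is a hypothetical name, and the ratio is only a rough indicator because it mixes model latency with tool execution time.

```python
# (issue, model) -> (duration in seconds, tool calls), as reported in the text
REPORTED = {
    ("RHEL-112546", "Sonnet 4.6"): (98, 9),
    ("RHEL-112546", "Opus 4.6"): (267, 37),
    ("RHEL-112546", "Opus 4.8"): (154, 25),
    ("RHEL-174694", "Opus 4.8"): (389, 17),
}


def seconds_per_tool_call(duration_s: float, tool_calls: int) -> float:
    """Average wall-clock time spent per tool call."""
    return duration_s / tool_calls


for (issue, model), (duration_s, tool_calls) in REPORTED.items():
    rate = seconds_per_tool_call(duration_s, tool_calls)
    print(f"{issue} {model}: {rate:.1f} s per tool call")
```

On these numbers, Opus 4.8 averaged roughly 23 seconds per tool call on RHEL-174694, compared with roughly 6 to 11 seconds for the three models on RHEL-112546.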
Opus 4.8 was faster than Opus 4.6 on every issue except RHEL-174694, which suggests that the improvements between the two generations translate into more efficient reasoning chains. The comparison is not like-for-like, though, since only Opus 4.6 ran with extended thinking.
The stark differences between the numbers are caused not just by model evolution but also by how non-deterministic the triage task is. We are also not using Opus 4.8's adaptive thinking.
Token usage and cost
The token numbers reveal a notable pattern: Opus 4.8 uses more input tokens than Opus 4.6 for the same issues, but far fewer output tokens. On RHEL-112546, Opus 4.6 produced 11,268 output tokens versus Opus 4.8's 5,864 — the newer model reasons more concisely even when it reads more context.
Cost differences are significant, but the concrete numbers depend heavily on the pricing plan in use.
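For readers who want to plug in their own rates, a token-based cost estimate can be sketched like this. The estimate_cost function and the price constant are assumptions: the price is a hypothetical placeholder rather than a real rate, both models are given the same rate purely for illustration, and input tokens are left out because this post does not report them.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Cost in the pricing currency, given prices per million tokens."""
    total = input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok
    return total / 1_000_000


# HYPOTHETICAL placeholder price, not a real rate (assumption for illustration)
HYPOTHETICAL_PRICE_OUT = 10.0  # currency units per million output tokens

# Output-token counts for RHEL-112546 come from the text
opus_46_output_cost = estimate_cost(0, 11_268, 0.0, HYPOTHETICAL_PRICE_OUT)
opus_48_output_cost = estimate_cost(0, 5_864, 0.0, HYPOTHETICAL_PRICE_OUT)
print(opus_46_output_cost / opus_48_output_cost)
```

With equal output prices the ratio depends only on the token counts, so it shows how much the output side alone contributes. A real comparison would need each model's actual input and output rates.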
Takeaway
For the triage workload we tested, Sonnet 4.6 offers the best price-performance ratio by a wide margin. Opus models invested more in investigation on hard issues but did not produce different conclusions. Opus 4.8's speed advantage over Opus 4.6 is real but does not close the cost gap.
That said, these results come from an evaluation harness, so we still need to make our own judgement in day-to-day work when processing real issues.
None of this analysis would be possible without the incredible work of the whole team, and especially Tomas Korbar, who authored the E2E test suite; Ondrej Pohorelsky, who contributed the initial support for Opus 4.8 to ai-workflows; Nikola Forro, author of our minimal trace-server; Laura Barcziova, whose scripts and Claude skills I used for this research; Matej Focko, for consulting with me all the time; and Maja Massarini, for the polished Makefile & compose setup that carried the test runs.
