We evaluated three Claude frontier models (Sonnet 4.6, Opus 4.6, and Opus 4.8) on a fixed set of five RHEL triage issues using our end-to-end test harness, which validates our triaging agent against known issues. Triage is the first and most consequential step in the workflow: the agent reads a Jira CVE issue and decides how it should be resolved, whether by backporting an upstream patch, rebasing to a newer version, or marking it as not-affected. A wrong call here invalidates everything downstream, which is why it's the natural starting point for a model benchmark.
The 4.6 models ran with REASONING_EFFORT=high, which enables native extended
thinking. Opus 4.8 ran without it: a LiteLLM/BeeAI provider ID mismatch caused
the model to reject the thinking parameter, so we disabled thinking entirely for
that model (tracked here). The harness runs all five issues concurrently and captures per-issue
metrics from the agent framework: wall-clock duration, tool call count, and
total token usage.
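To make the measurement setup concrete, here is a minimal sketch of running issues concurrently and recording one metrics record per issue. It is not our harness: `triage_issue`, `IssueMetrics`, the `ISSUE-n` keys, and the counter values are hypothetical placeholders, and in the real setup the tool call and token counts come from the agent framework rather than from a stub.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class IssueMetrics:
    issue_key: str
    duration_s: float   # wall-clock duration of the triage run
    tool_calls: int     # number of tool calls the agent made
    total_tokens: int   # total token usage for the run


async def triage_issue(issue_key: str) -> dict:
    """Hypothetical stand-in for the triage agent; returns placeholder counters."""
    await asyncio.sleep(0.01)  # simulate agent work
    return {"tool_calls": 3, "total_tokens": 1200}


async def run_one(issue_key: str) -> IssueMetrics:
    # Time each issue individually so concurrency does not blur per-issue durations.
    start = time.perf_counter()
    result = await triage_issue(issue_key)
    duration = time.perf_counter() - start
    return IssueMetrics(issue_key, duration, result["tool_calls"], result["total_tokens"])


async def run_all(issue_keys: list[str]) -> list[IssueMetrics]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(run_one(key) for key in issue_keys))


metrics = asyncio.run(run_all([f"ISSUE-{i}" for i in range(1, 6)]))
```

One caveat with this design: because all issues run at once, each issue's wall-clock duration includes any contention with the others, such as provider rate limits, so durations are best compared between runs that use the same concurrency.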
This is the first time we've run this kind of benchmark, and it was quite a learning experience. The data in this post comes from a single run per model, so treat the numbers as indicative rather than conclusive.
We already have ideas for improving a future run. We need to extend the scope to the backporting agent as well and cover at least ten cases. Doing multiple runs per model and aggregating the results would also give us more grounded numbers.
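As a rough illustration of what aggregating multiple runs could look like, the sketch below summarizes repeated measurements per model with a mean and sample standard deviation. The model names and durations are made-up placeholders, not results from our runs, and `aggregate` is a hypothetical helper.

```python
from statistics import mean, stdev


def aggregate(values: list[float]) -> dict:
    """Summarize repeated measurements of one metric for one model."""
    return {
        "n": len(values),
        "mean": mean(values),
        # Sample standard deviation needs at least two runs.
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


# Placeholder durations in seconds, one entry per run (not real data).
durations = {
    "model-a": [10.0, 12.0, 14.0],
    "model-b": [20.0],
}

summary = {model: aggregate(values) for model, values in durations.items()}
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two models is larger than the run-to-run noise.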
