AGENTIC AI CAPABILITIES TESTING Q1 2026
Benchmarking Leadership in Open and Proprietary Models
Evaluating Agentic AI Capabilities with the KAMI v0.1 Benchmark
Mitch Lewis
This report evaluates agentic AI capabilities with the Kamiwaza Agentic Merit Index (KAMI) Benchmark and analyzes the agentic AI landscape as of Q1 2026. The KAMI Benchmark measures AI model accuracy on enterprise-focused agentic workloads. This testing expands upon previous results from the KAMI v0.1 benchmark, outlined in the Signal65 report *Measured Leadership with Agentic AI on Open Models*. While that report focused primarily on popular open source LLMs, this Q1 update broadens the test set to include several prominent proprietary models.
Key Highlights
Key findings include:
- GPT-5 leads all models tested, with a mean accuracy of 95.7%.
- Top open source models are highly competitive, in many cases outperforming leading proprietary models.
  - GLM-4.6 achieved the highest overall score of any open source model, with 92.57% mean accuracy.
  - DeepSeek-v3.1 achieved the second highest open source score, with 92.19% mean accuracy.
  - Qwen3-Coder-480B-A35B-Instruct achieved the third highest open source score, with 91.88% mean accuracy.
  - Qwen3-Next-80B-A3B-Instruct achieved the highest score of any open model with fewer than 100B parameters, at 83.79% mean accuracy.
- Ongoing development of proprietary models shows inconsistent improvement for agentic use cases, with some newer models underperforming previous generations.
  - GPT-5 notably outperforms newer GPT models, including GPT-5.1 and GPT-5.2.
  - Claude-Haiku-3.5 significantly outperforms Claude-Haiku-4.5.
  - Similar discrepancies appear in open model families, including Qwen, Llama, and MiniMax.
- Some models achieved higher accuracy when run on AWS Bedrock than on on-premises hardware, indicating that infrastructure and configuration can impact agentic accuracy.
The top 10 performing models tested are shown below:
| Rank | Model | Mean Accuracy Score |
|---|---|---|
| 1 | GPT-5 (Medium Reasoning) | 95.7% |
| 2 | GLM-4.6 | 92.57% |
| 3 | DeepSeek-v3.1 | 92.19% |
| 4 | Qwen3-Coder-480B-A35B-Instruct | 91.88% |
| 5 | Qwen3-235B-A22B-Instruct-2507 | 90.37% |
| 6 | MiniMax-M2 | 89.89% |
| 7 | Claude-Sonnet-4.5 | 89.63% |
| 8 | GPT-5.2 (Medium Reasoning) | 89.08% |
| 9 | Qwen3-235B-A22B-Instruct-2507-FP8 | 88.75% |
| 10 | GLM-4.5 | 88.14% |
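
For readers aggregating results themselves, a minimal sketch of the mean-accuracy calculation is shown below. It assumes each benchmark test yields pass/fail outcomes and that the reported score is an unweighted average of per-test pass rates; the record structure and test names are illustrative, not KAMI's actual output format, and the real aggregation may differ.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    # Hypothetical record of one benchmark test: name, passes, attempts.
    name: str
    passed: int
    attempts: int

def mean_accuracy(results: list[TestResult]) -> float:
    """Average per-test pass rate, expressed as a percentage.

    Assumes 'mean accuracy' is an unweighted mean of per-test
    accuracies; the actual KAMI aggregation may differ.
    """
    per_test = [r.passed / r.attempts for r in results]
    return 100.0 * sum(per_test) / len(per_test)

# Illustrative numbers only, not actual KAMI results.
runs = [
    TestResult("sql_generation", passed=19, attempts=20),
    TestResult("tool_calling", passed=18, attempts=20),
    TestResult("data_extraction", passed=20, attempts=20),
]
print(f"{mean_accuracy(runs):.2f}%")  # 95.00%
```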

