Agentic AI Capabilities Testing Q3 2025

Measured Leadership with Agentic AI on Open Models

Evaluating Agentic AI Capabilities with the KAMI v0.1 Benchmark

Executive Summary

Over the past few years, AI has evolved from a speculative technology into a key priority for enterprise organizations. Rapid model development has produced larger, more complex, and more reliable LLMs. For enterprise use, however, it is agentic applications that offer real value, enabling AI to solve challenges and complete valuable business tasks with as little human intervention as possible.

While agentic AI is the focus of these enterprise efforts, evaluating how well LLMs complete agentic tasks has proven to be a challenge. Existing AI benchmarks primarily measure a model's reasoning ability rather than its ability to successfully complete enterprise-related tasks. In addition, static AI benchmarks often become incorporated into a model's training data, reducing the benchmark to a test of memorization.

Key Highlights

Qwen3-235B-A22B-Instruct-2507-FP8 is the top agentic AI performer with an 88.8% mean accuracy score

FP8 quantization affects accuracy by no more than ~3% for agentic workloads

Thinking models are up to 25% more accurate for agentic AI workloads

To overcome these challenges, Signal65 and Kamiwaza have collaborated to establish a new AI benchmark that measures model performance on enterprise-focused agentic tasks. This paper presents the first iteration of the Kamiwaza Agentic Merit Index (KAMI).
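To make the headline metric concrete, the sketch below shows one way a harness could compute a per-model mean accuracy score over a set of enterprise-style tasks. The task format, grader, and canned outputs are illustrative assumptions, not the actual KAMI v0.1 implementation.

```python
# A minimal, hypothetical harness sketch (not the actual KAMI v0.1 code).
from dataclasses import dataclass
from statistics import mean

@dataclass
class AgenticTask:
    prompt: str    # enterprise-style task given to the model
    expected: str  # ground-truth answer used for grading

def grade(task: AgenticTask, output: str) -> bool:
    """Placeholder grader; real harnesses often mix exact-match checks and LLM judges."""
    return task.expected.strip().lower() in output.strip().lower()

def mean_accuracy(tasks: list[AgenticTask], outputs: list[str]) -> float:
    """Fraction of tasks graded correct; 0.888 corresponds to the 88.8% reported above."""
    return mean(grade(t, o) for t, o in zip(tasks, outputs, strict=True))

# Toy usage with canned strings standing in for real model responses:
tasks = [AgenticTask("Extract the PO number from invoice A.", "PO-4471"),
         AgenticTask("Extract the PO number from invoice B.", "PO-9012")]
outputs = ["The PO number is PO-4471.", "I could not find a PO number."]
print(mean_accuracy(tasks, outputs))  # 0.5
```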

Key findings include the following:

  • Top Performer: Qwen3-235B-A22B-Instruct-2507, in both its FP8-quantized and full-weight versions, achieved the highest scores among the models tested, making it a top open source model to consider for agentic AI deployments.
  • Model Size: In general, accuracy improved with model size, with the highest scores achieved by very large models with over 100B parameters. Small models (<10B parameters) showed a clear deficiency across most agentic tasks. Some models in the 30B to 100B parameter range, however, such as Llama-3.1-70B-Instruct and Qwen3-30B-A3B (thinking mode), outperformed much larger models, offering compelling options for organizations with limited infrastructure.
  • Quantization: FP8 quantization does not appear to adversely affect agentic capabilities. Across the FP8-quantized and full-weight model pairs tested, the FP8 variants consistently achieved similar, and in some cases slightly higher, accuracy.
  • Thinking: Models with thinking capabilities were generally more accurate on agentic tasks than comparable non-thinking models. Non-thinking models, however, became highly competitive when provided with basic hints and context clues, offering a possible alternative to the high token usage and cost associated with thinking models (see the sketch after this list).
  • Agentic Benchmarking Disconnect: Several models that achieved high scores on other common AI benchmarks scored disproportionately low on the KAMI v0.1 benchmark, indicating a disconnect between traditional AI benchmarking and real-world application. Additionally, some older-generation models in both the Llama and Qwen families outperformed their newer-generation counterparts, which are typically considered more advanced according to traditional benchmark results.
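As a companion to the thinking-mode finding above, the following sketch shows how such a comparison might be run against a Qwen3 model served through an OpenAI-compatible endpoint such as vLLM. The endpoint URL, model ID, task text, and hint wording are assumptions for illustration; the `enable_thinking` toggle follows Qwen3's documented chat-template usage and may not apply to other model families.

```python
# Hypothetical thinking vs. non-thinking comparison against a Qwen3 model
# served by an OpenAI-compatible endpoint (e.g., vLLM). Endpoint, model ID,
# task, and hint are illustrative stand-ins.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(task: str, thinking: bool, hint: str | None = None) -> str:
    content = task if hint is None else f"{task}\n\nHint: {hint}"
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",
        messages=[{"role": "user", "content": content}],
        # Qwen3's documented toggle for thinking mode via the chat template.
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return resp.choices[0].message.content

# Thinking mode: generally more accurate, but emits many more output tokens.
with_thinking = ask("Which invoice line items exceed the approved budget?",
                    thinking=True)

# Non-thinking mode plus a basic hint: often competitive at much lower token cost.
hinted = ask("Which invoice line items exceed the approved budget?",
             thinking=False, hint="Compare each line item against the budget column.")
```

In practice, the hinted non-thinking run can return far fewer output tokens than the thinking run, which is the source of the cost advantage noted in the findings above.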