Benchmarking Information Retrieval and LLM Hallucination with RIKER

Information retrieval and contextual understanding are foundational components of enterprise AI workflows. By incorporating additional context through approaches such as retrieval-augmented generation (RAG), AI models can provide domain-specific and document-specific assistance. For enterprises deploying these systems, however, understanding how effectively models retrieve information, and when they are prone to hallucination, is increasingly critical.

As part of an ongoing collaboration between Signal65 and Kamiwaza, this paper introduces RIKER (Retrieval Intelligence and Knowledge Extraction Rating), a benchmark designed to evaluate LLM knowledge retrieval capabilities. RIKER builds upon and complements the KAMI benchmark, a previous Signal65 and Kamiwaza effort focused on evaluating agentic AI capability. Together, the RIKER and KAMI benchmarks provide insight into how LLMs perform across common enterprise AI workloads.

Key Takeaways:

  • 91 models benchmarked across real-world enterprise retrieval workloads
  • 27 models exceeded 95% accuracy at 32K context
  • Only 3 models remained above 95% accuracy at 200K context
  • Multi-document aggregation accuracy declined more than 2× faster than single-document retrieval
  • Thinking models improved retrieval performance by up to 64%

Key findings of the RIKER benchmark include:

  • Qwen3.5-397B-A17B (Thinking) achieved the strongest overall retrieval performance: It scored highest of all models tested and was among a handful that consistently maintained high accuracy across retrieval tasks and context sizes, while many other models degraded substantially as context length increased. Other top models include Gemma-4-31B-IT-KV-FP8 (Thinking), Qwen3.5-122B-A10B (Thinking), Kimi-K2.5 (Thinking), and GPT-5.4 (Medium Reasoning), all of which maintained greater than 94% overall accuracy at 200K context.
  • Long context windows are not equivalent to reliable retrieval: Although modern LLMs support increasingly large context sizes, retrieval accuracy consistently declined as context length expanded.
  • Aggregation tasks degrade faster than single-document retrieval: Models lost significantly more accuracy when required to aggregate or compare information across multiple documents than when retrieving information from a single source.
  • Effective long-context retrieval varies dramatically across models: At a 32K context size, 27 models achieved overall accuracy above 95%. At a 200K context size, that number fell to just 3 models. While many models perform similarly at moderate context lengths, performance divergence increases substantially as context size grows.
  • Thinking can improve information retrieval: The top-performing models at every context length were thinking models. When compared with non-thinking variants of the same models, thinking models typically demonstrated improved performance.
  • Hallucination behavior appears less sensitive to context length than retrieval accuracy: While information retrieval and aggregation performance declined substantially as context size increased, hallucination-probing tasks exhibited comparatively smaller changes. This suggests that retrieval failures and hallucination behavior may represent partially distinct failure modes in long-context LLM systems.
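
The arithmetic behind two of these findings, the count of models above an accuracy threshold at each context size and the relative degradation of aggregation versus single-document retrieval, can be sketched as follows. This is a minimal illustration and not the RIKER scoring code: the `Score` record, the function names, the token counts used for "32K" and "200K", the unweighted mean used for overall accuracy, and every accuracy value are hypothetical placeholders rather than benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One hypothetical result row; the field names are assumptions."""
    model: str
    context_tokens: int
    task: str        # "single_doc" or "aggregation"
    accuracy: float  # fraction in [0, 1]


def count_above(scores, context_tokens, threshold=0.95):
    """Count models whose mean accuracy across tasks exceeds the threshold."""
    per_model = {}
    for s in scores:
        if s.context_tokens == context_tokens:
            per_model.setdefault(s.model, []).append(s.accuracy)
    return sum(
        1 for accs in per_model.values() if sum(accs) / len(accs) > threshold
    )


def mean_drop(scores, task, short_ctx, long_ctx):
    """Mean accuracy lost between two context sizes for one task type."""
    short = {s.model: s.accuracy for s in scores
             if s.task == task and s.context_tokens == short_ctx}
    long_ = {s.model: s.accuracy for s in scores
             if s.task == task and s.context_tokens == long_ctx}
    common = short.keys() & long_.keys()
    return sum(short[m] - long_[m] for m in common) / len(common)


# Placeholder values only: NOT results from the benchmark.
SHORT, LONG = 32_000, 200_000
scores = [
    Score("model_a", SHORT, "single_doc", 0.99),
    Score("model_a", SHORT, "aggregation", 0.97),
    Score("model_a", LONG, "single_doc", 0.97),
    Score("model_a", LONG, "aggregation", 0.95),
    Score("model_b", SHORT, "single_doc", 0.98),
    Score("model_b", SHORT, "aggregation", 0.94),
    Score("model_b", LONG, "single_doc", 0.90),
    Score("model_b", LONG, "aggregation", 0.70),
]

above_short = count_above(scores, SHORT)
above_long = count_above(scores, LONG)
single_drop = mean_drop(scores, "single_doc", SHORT, LONG)
aggregation_drop = mean_drop(scores, "aggregation", SHORT, LONG)

print(f"models above 95%: {above_short} at short context, {above_long} at long")
print(f"aggregation degrades {aggregation_drop / single_drop:.1f}x faster")
```

Averaging over only the models that have scores at both context sizes keeps the comparison like-for-like when some models do not support the longer context.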

What does this mean for the AI analysis and performance testing industry?

  • The context window arms race is outpacing enterprise usability: vendors compete on token caps, but only three of 54 models tested sustained production-grade accuracy at 200K. Advertised context length is not a proxy for usable retrieval.
  • Public benchmarks are part of the problem: training contamination, LLM-as-judge subjectivity, and synthetic extraction tasks have produced leaderboards that don’t predict enterprise outcomes. RIKER’s inverted-generation and deterministic grading are a direct response.
  • RIKER is one pillar of a coordinated evaluation framework: paired with KAMI for agentic capability and our upcoming project PINNACLE, Signal65 and Kamiwaza are building the enterprise-grade measurement stack the field has lacked.

Research commissioned by: