Maximizing GPU Utilization with HPE Alletra Storage MP X10000

The rapid evolution of generative artificial intelligence has shifted from a focus on training models to widespread deployment at scale. Industrial-scale AI infrastructure has rapidly adopted optimizations that improve effective GPU utilization, and these techniques are now being adopted by smaller providers and enterprise AI deployments.

As large language models (LLMs) grow in complexity and demand for long context windows increases, the underlying hardware architectures face unprecedented pressure. Traditional compute-centric designs, in which Graphics Processing Units (GPUs) operate as isolated islands of memory and compute, expose the inefficiencies of AI inferencing. At the center of this challenge is the Key-Value (KV) cache, a massive, rapidly changing data structure that stores the intermediate states of every token in a session. Management of this cache has become the primary determinant of “effective utilization”: the degree to which a GPU is occupied generating new output tokens rather than redundantly reprocessing input prompts.
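To see why the KV cache grows so large, consider a back-of-the-envelope sizing. The sketch below uses parameters matching a Llama-3-70B-class model (80 layers, 8 grouped-query KV heads of dimension 128, 16-bit elements); these are illustrative assumptions, not the configuration tested in this paper.

```python
# Back-of-the-envelope KV-cache sizing (illustrative assumptions:
# parameters correspond to a Llama-3-70B-class model).
num_layers = 80     # transformer layers
num_kv_heads = 8    # grouped-query attention KV heads
head_dim = 128      # dimension per head
bytes_per_el = 2    # fp16/bf16 element size

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
print(f"{bytes_per_token / 2**10:.0f} KiB per token")    # ~320 KiB

context_tokens = 128 * 1024  # one long-context session
cache_bytes = bytes_per_token * context_tokens
print(f"{cache_bytes / 2**30:.0f} GiB per session")      # ~40 GiB
```

At roughly 40 GiB for a single 128K-token session, even a handful of concurrent long-context users can exhaust GPU memory, which is what motivates offloading the cache to a fast external tier.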

Signal65, working together with HPE and Kamiwaza, tested the HPE Alletra Storage MP X10000 as a target for maintaining a distributed KV cache. Testing focused on assessing the impact of using high-speed storage to hold KV data, measuring the resulting changes in output token generation rate and time to first token.
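Both metrics can be captured from a streaming inference client. The sketch below is a minimal, generic harness, assuming only an iterator of generated tokens (for example, from a streaming HTTP response); the function name and iterator are hypothetical stand-ins, not the harness used in this testing.

```python
import time
from typing import Iterator, Tuple

def measure_ttft_and_rate(token_stream: Iterator[str]) -> Tuple[float, float]:
    """Return (time-to-first-token in seconds, output tokens/sec) for a
    streaming token iterator from an inference endpoint."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first is None:
            first = now          # first output token observed
        count += 1
    end = time.perf_counter()
    if first is None:            # no tokens were generated
        return float("inf"), 0.0
    ttft = first - start
    # Generation rate is measured over the decode phase only,
    # i.e., after the first token arrives.
    rate = (count - 1) / (end - first) if end > first else float(count)
    return ttft, rate
```

KV-cache reuse primarily shortens the prefill phase, so its effect shows up directly in the TTFT term and, by freeing the GPU from redundant prefill work, in the sustained token rate.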

Signal65 worked with HPE partner Kamiwaza to run real-world, agentic workloads that realistically model the impact of using an X10000 system for KV-cache offload (see the tiered-lookup sketch after the results below). Our findings showed a massive improvement in the effective utilization rate of GPUs:

- Reduction in Time to First Token (TTFT) of up to 21.5x vs. not using any KV-Cache
- Increase in output token generation rate of up to 19.4x vs. no KV-Cache
- Compared with memory-only offload, a TTFT reduction of 5.6x and a token rate increase of 5.9x
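For intuition, the following minimal sketch shows the tiered lookup pattern that KV-cache offload implies: check GPU HBM first, then host DRAM, then a shared storage tier, promoting hits upward so repeated prompt prefixes skip redundant prefill. The class, tier names, and dict-backed tiers are hypothetical illustrations, not the Kamiwaza or HPE implementation.

```python
from typing import Dict, List, Optional, Tuple

class TieredKVCache:
    """Illustrative three-tier KV-cache lookup: GPU HBM, host DRAM,
    and a shared storage target. Names and structure are assumptions
    made for this sketch."""

    def __init__(self, hbm: Dict[str, bytes], dram: Dict[str, bytes],
                 storage: Dict[str, bytes]):
        # Ordered fastest to slowest.
        self.tiers: List[Tuple[str, Dict[str, bytes]]] = [
            ("hbm", hbm), ("dram", dram), ("storage", storage)]

    def get(self, prefix_hash: str) -> Optional[bytes]:
        for i, (_, tier) in enumerate(self.tiers):
            blob = tier.get(prefix_hash)
            if blob is not None:
                # Promote the hit into all faster tiers so the next
                # request for this prefix is served from memory.
                for _, faster in self.tiers[:i]:
                    faster[prefix_hash] = blob
                return blob
        return None  # miss: the prefix must be recomputed in prefill
```

A shared storage tier extends this hierarchy beyond a single server, so cached prefixes survive restarts and can be reused across the GPU fleet rather than recomputed per node.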

The results are presented as relative improvements, since absolute token generation rates and TTFT depend heavily on the specific prompts, the LLM used, and the GPU type. These relative improvements are broadly applicable across different LLM models and GPUs.

Research commissioned by:

HPE