MLPerf Client v0.5 Benchmark Brings MLCommons to PCs

In the rapidly evolving world of client AI, benchmarking tools are essential for understanding and comparing the performance of hardware. The new MLPerf Client benchmark tool marks a step forward for the industry, helping address a gap in the assessment of AI capabilities on consumer devices such as desktop PCs and laptops.

Benchmarking AI on client devices is challenging because AI workloads are complex and constantly changing. Unlike traditional computing tasks, where workloads stay fairly constant from month to month or even year to year, AI work is spread across diverse components such as CPUs, GPUs, and NPUs, each contributing to overall performance, and the underlying AI models and software infrastructure change seemingly every week. Despite the importance of such benchmarks, the industry has lacked tools capable of measuring client AI performance easily and consistently.
Enter MLPerf Client, a new benchmark developed by MLCommons, an industry consortium respected for its data center AI training and inference benchmarks. With over 125 members, including prominent names like Nvidia, Intel, Microsoft, and Qualcomm, MLCommons brings credibility to this new tool. MLPerf Client is designed to bring to consumer devices the same level of rigor its predecessors established in the enterprise domain.
This release, MLPerf Client v0.5, is intentionally labeled as an early-access version. This approach reflects a practical strategy by the consortium: to gather feedback, refine the tool, and build toward a more comprehensive v1.0 release. The current benchmark focuses on a single large language model (LLM), Meta’s Llama 2 7B, running on GPUs with a subset of supported configurations. While limited in scope, the benchmark offers four distinct testing scenarios based on varying input and output token sizes, simulating real-world LLM usage such as content generation and summarization.
Token-based evaluation is the core of the benchmark, with metrics like Time to First Token (TTFT) and Tokens Per Second (TPS) providing insights into latency and throughput. TTFT measures the delay between submitting a request and receiving the first token of output, a critical metric for interactive AI applications. TPS evaluates the average rate of token generation after that first token, excluding the initial latency, giving a view of sustained throughput.
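To make these two metrics concrete, here is a minimal sketch of how TTFT and TPS can be computed from per-token arrival timestamps. The function names and the assumption that times are recorded in seconds are ours for illustration; this is not code from the MLPerf Client tool itself.

```python
# Minimal sketch: deriving TTFT and TPS from per-token arrival times.
# Assumes we recorded when the request was submitted and when each
# generated token arrived, all in seconds. Helper names are illustrative,
# not part of MLPerf Client.

def time_to_first_token(request_time: float, token_times: list[float]) -> float:
    """Latency between submitting the prompt and receiving the first token."""
    return token_times[0] - request_time

def tokens_per_second(token_times: list[float]) -> float:
    """Sustained generation rate, excluding the initial latency (TTFT)."""
    tokens_after_first = len(token_times) - 1
    generation_window = token_times[-1] - token_times[0]
    return tokens_after_first / generation_window

# Example: a request submitted at t = 0.0 whose first token arrives at 1.2 s
# and whose remaining 255 tokens arrive every 50 ms after that.
request_time = 0.0
token_times = [1.2 + i * 0.05 for i in range(256)]
print(f"TTFT: {time_to_first_token(request_time, token_times):.2f} s")  # 1.20 s
print(f"TPS:  {tokens_per_second(token_times):.1f} tokens/s")           # 20.0 tokens/s
```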
| Category | Approximate Input Tokens | Approximate Expected Output Tokens |
|---|---|---|
| Content Generation | 128 | 256 |
| Creative Writing | 512 | 512 |
| Summarization, Light | 1024 | 128 |
| Summarization, Moderate | 1566 | 256 |
The MLPerf Client benchmark distinguishes itself through vendor-specific optimizations, a departure from the forced neutrality often seen in other benchmarks. Vendors are encouraged to modify the Llama 2 model to maximize performance on their hardware, provided the results meet minimum accuracy thresholds validated by the Massive Multitask Language Understanding (MMLU) benchmark. This balance ensures competitive performance without compromising integrity.
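As a rough illustration of how that accuracy gate works, the logic amounts to accepting a vendor-optimized model only if its MMLU score stays within an allowed margin of the unmodified reference model. The sketch below is hypothetical: the reference score and the tolerance are placeholders, not the values defined by the MLPerf Client working group.

```python
# Hypothetical sketch of the accuracy gate: a vendor-optimized model only
# counts as a valid result if its MMLU score does not fall below an allowed
# fraction of the reference model's score. Both numbers below are placeholders.

REFERENCE_MMLU_SCORE = 0.458   # illustrative score for the unmodified model
ALLOWED_FRACTION = 0.99        # placeholder tolerance, not the official value

def passes_accuracy_gate(optimized_mmlu_score: float) -> bool:
    """Return True if the optimized model retains enough MMLU accuracy."""
    return optimized_mmlu_score >= REFERENCE_MMLU_SCORE * ALLOWED_FRACTION

print(passes_accuracy_gate(0.455))  # True: within the allowed margin
print(passes_accuracy_gate(0.410))  # False: too much accuracy was traded away
```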
Standardized configuration files further enhance the benchmark's usability, enabling consistent comparisons across systems. Current implementations support ONNX Runtime GenAI with the DirectML execution provider for GPUs, as well as Intel's OpenVINO, each offering its own acceleration path. Notably, the models are quantized from 16-bit floating-point precision to int4, achieving significant performance gains while maintaining output quality.
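To give a sense of why the int4 quantization matters on memory-constrained client devices, here is a back-of-the-envelope estimate of the weight memory for a roughly 7-billion-parameter model at FP16 versus int4. The parameter count is approximate, and the calculation ignores the KV cache, activations, and quantization metadata such as scales and zero points.

```python
# Back-of-the-envelope weight-memory estimate for a ~7B-parameter model.
# Ignores the KV cache, activations, and quantization scales/zero points.

PARAMS = 7_000_000_000          # approximate parameter count for Llama 2 7B
BYTES_PER_FP16_PARAM = 2.0      # 16-bit floating point
BYTES_PER_INT4_PARAM = 0.5      # 4-bit integer (two weights per byte)

fp16_gb = PARAMS * BYTES_PER_FP16_PARAM / 1e9
int4_gb = PARAMS * BYTES_PER_INT4_PARAM / 1e9

print(f"FP16 weights: ~{fp16_gb:.1f} GB")            # ~14.0 GB
print(f"int4 weights: ~{int4_gb:.1f} GB")            # ~3.5 GB
print(f"Reduction:    ~{fp16_gb / int4_gb:.0f}x")    # ~4x
```

That roughly 4x reduction in weight footprint is a large part of why an int4 Llama 2 7B is far more practical on an integrated GPU sharing system memory than the FP16 original would be.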
Our initial testing highlights both the promise and the limitations of MLPerf Client in this iteration. Evaluating an Intel Core Ultra 7 258V laptop (using both the OpenVINO and ONNX RT paths), an AMD Ryzen AI 9 HX 370 laptop, and a desktop with a GeForce RTX 4090, we observed intriguing trends. For instance, the RTX 4090 dominated in TTFT, as expected for a high-power discrete GPU. Among the laptops, the OpenVINO implementation on the Intel Core Ultra demonstrated an advantage, achieving more than twice the speed of its ONNX RT counterpart. Meanwhile, the Ryzen AI system trailed by a wide margin, with first token latency as high as 9.4 seconds in the moderate summarization workload. For comparison, the Intel Core Ultra 7 258V's first token latency with OpenVINO was just 1.2 seconds in the same test.
Tokens per second (TPS) results revealed more consistency, with the Intel Core Ultra's OpenVINO path maintaining a modest lead over ONNX RT and the AMD system trailing by approximately 8-10%. The RTX 4090's supremacy was evident, albeit within a power envelope far exceeding the laptops' energy-efficient designs. These findings illustrate the benchmark's capability to differentiate performance across a range of systems while highlighting areas for improvement. (Note: I included a chart view of the TPS results without the RTX 4090 so we can get a better comparison of the integrated graphics solutions under test.)
Looking ahead, MLCommons has outlined an ambitious roadmap for MLPerf Client. Future updates aim to expand hardware and software support, introduce additional models and use cases, and enhance usability through a standard graphical interface. Support for Windows on Arm and macOS is also planned, signaling a commitment to inclusivity and broader applicability.
The MLPerf Client benchmark represents a promising addition to the toolkit for evaluating AI performance on consumer devices. While this v0.5 release is limited in scope, the benchmark's industry backing and clear trajectory position it as a valuable resource for analysts, vendors, and end users alike. As AI continues to permeate every facet of technology, tools like MLPerf Client will play an essential role in driving innovation and ensuring transparency in performance evaluation. We at Signal65 look forward to incorporating this benchmark into our analysis and watching it evolve.