MLPerf Storage Results Analysis

Announcement Overview

The MLCommons organization announced results for its industry-standard MLPerf® Storage v1.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner.  The initial release includes 55 submissions from 12 different companies, spanning multiple storage architectures and configurations.
The MLPerf Storage benchmark is an assessment tool designed to evaluate storage performance across a variety of AI/ML training scenarios, and it is freely available on GitHub under an Apache license.  A key design goal of the benchmark is the ability to recreate storage workloads while remaining independent of any particular AI accelerator.  By emulating accelerator processing time, the benchmark can accurately model storage access patterns without the need for actual training runs, making it accessible to a broad audience.
The benchmark specifically assesses a storage system’s capability to keep up by mandating that the simulated accelerators maintain a specified level of utilization.  This first release of the MLCommons storage benchmark incorporates three different AI models: U-Net3D, ResNet-50, and CosmoFlow, which span a wide range of sample sizes, from roughly one hundred kilobytes to well over one hundred megabytes.
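To illustrate the emulation approach in concrete terms, the sketch below shows the general idea: each simulated accelerator step performs real I/O through a data loader and then sleeps for a profiled compute time, so the measured utilization depends only on how quickly storage can deliver data.  This is a minimal, hypothetical Python sketch, not the actual benchmark code; the load_batch function and compute_time_s value are placeholders.

    import time

    def emulated_training_epoch(batches, load_batch, compute_time_s):
        # Minimal sketch of the simulated-accelerator concept (hypothetical,
        # not the MLPerf Storage implementation). load_batch performs the
        # real storage I/O; time.sleep stands in for accelerator compute.
        busy = 0.0    # time spent in simulated accelerator compute
        total = 0.0   # total wall-clock time for the loop
        for batch in batches:
            start = time.perf_counter()
            load_batch(batch)              # real data loading from storage
            time.sleep(compute_time_s)     # emulated accelerator busy time
            total += time.perf_counter() - start
            busy += compute_time_s
        return busy / total                # accelerator utilization (AU)

The returned utilization is the quantity the benchmark requires to stay above a workload-specific threshold, as described below.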
By using a common set of datasets, AI frameworks and access methods, run-time operations, and reporting requirements, the MLPerf Storage benchmark provides a consistent framework for evaluating storage performance for AI training workloads.

MLCommons Storage Benchmark Background

Currently, the storage benchmark suite supports benchmarking of three deep learning workloads:

  • Image segmentation using U-Net3D (approx. 140 MB/sample); I/O library: PyTorch
  • Image classification using ResNet-50 (approx. 150 KB/sample); I/O library: TensorFlow
  • Cosmology parameter prediction using the CosmoFlow model (approx. 2 MB/sample); I/O library: TensorFlow
All three of the current workloads are training workloads rather than inference workloads.  These models were chosen in part for their ubiquity (as is the case for ResNet-50) or for their broad applicability to AI training.  Importantly, the effects of host-side data caching were minimized or eliminated by both the benchmark’s design and the dataset sizes used.
The MLCommons organization has taken the approach that it is important to use the tools, libraries, and access methods that actual AI/ML workloads utilize during training.  Therefore, they chose to use Python-based tools for data loading, pre-processing, and other data operations.  This includes the use of PyTorch I/O libraries for the U-Net3D model and TensorFlow for the other two workloads.
With the current benchmark, the focus for training is on data loading, which implies a read-only or read-mostly workload.  For some AI workloads, the writes generated by creating model checkpoints can be a significant consideration.  In the current release, the U-Net3D workload does include a minor checkpoint operation, although it is not measured or reported independently.
One key criterion for a passing score is the ability to maintain a specified percentage of busy time for the chosen AI accelerator.  Thus, submissions reporting results with A100 accelerators must load data fast enough to keep the accelerators busy without falling below the requirement.  For both U-Net3D and ResNet-50, the minimum threshold was 90% busy; for CosmoFlow it was 70%.
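To make the thresholds concrete, the small sketch below works out the data-loading stall budget implied by a utilization floor, since utilization is compute time divided by compute time plus stall time.  The 100 ms compute time is a hypothetical figure used purely for illustration, not a value from the benchmark.

    def max_stall_per_batch(compute_time_s, min_utilization):
        # AU = compute / (compute + stall), so the average stall allowed
        # per batch is compute * (1/AU_min - 1).
        return compute_time_s * (1.0 / min_utilization - 1.0)

    # Hypothetical example: with 100 ms of emulated compute per batch,
    # a 90% floor leaves roughly 11 ms of average stall per batch,
    # while CosmoFlow's 70% floor leaves roughly 43 ms.
    print(max_stall_per_batch(0.100, 0.90))  # ~0.0111 s
    print(max_stall_per_batch(0.100, 0.70))  # ~0.0429 s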

Notes

Accelerators are not utilized during testing, although real accelerators were used beforehand to characterize the compute (sleep/delay) cycles for each workload.  These values are then modeled as a processing delay for different categories of NVIDIA accelerators, such as the A100 or H100.
The workloads use vastly different file/object sizes; for example, ResNet-50 samples tend to be about 150 KB, while the other workloads use sample/file sizes of 2 MB or larger.  Thus, reporting IOPS and/or latency may be important, particularly for ResNet-50, although there was no latency requirement and those values were not reported.
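As a rough, hypothetical illustration of why small samples shift the emphasis from bandwidth toward request rate, the arithmetic below converts an assumed aggregate throughput into samples per second; the 10 GB/s figure is an example, not a benchmark result.

    def requests_per_second(bandwidth_gb_s, sample_bytes):
        # Number of samples that must be delivered per second to sustain
        # the given aggregate bandwidth at the given sample size.
        return bandwidth_gb_s * 1e9 / sample_bytes

    print(requests_per_second(10, 150e3))   # ResNet-50-sized samples: ~66,700/s
    print(requests_per_second(10, 140e6))   # U-Net3D-sized samples:   ~71/s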
Also, because AI accelerators were not actually utilized during testing, it was not possible to measure or report GPUDirect Storage performance results, which may be an important consideration for both vendors and consumers.
The MLPerf Storage benchmark suite is freely available for download and use via GitHub at https://github.com/mlcommons/storage.

Performance Summary

Prior to the announcement, detailed results were not made available to press or analysts.  Therefore, at the current time only limited information is available for each submission, including the total throughput rate, the number of storage nodes required to achieve the result, and the number of accelerators supported.
One view of performance is to examine the total throughput rate, either in aggregate for each storage solution or on a per-node basis to understand how the solutions scale.  However, a more interesting way to view the results is to examine the number of NVIDIA H100 accelerators that each storage solution can support at 90% or greater utilization for the U-Net3D workload.  Figure 1 below shows each vendor’s maximum result in terms of the number of H100 accelerators supported.
Figure 1: MLPerf Storage Benchmark v1.0 - Vendors’ Top Submissions (Stacked Results)
Several of the participants have been active in the HPC storage market for many years and are well-known names, including DDN, HPE, WekaIO, and Hammerspace.  Additionally, Huawei’s results appear quite impressive, both in the number of H100 accelerators supported and in the bandwidth reported.
One of the more interesting entrants was Nutanix, which fared particularly well, with essentially a third-place result that put it significantly ahead of many of the better-known HPC storage vendors outlined previously.  While Nutanix may not typically be considered as a storage solution for AI training workloads, these results indicate that, at the very least, it should be a consideration for production AI inferencing workloads.

Takeaways

Currently, there are many unknowns within the realm of AI workloads, particularly how they use storage.  The MLCommons Storage working group, together with industry users and vendors, can help create a broader understanding of the tools and datasets that drive AI workloads.  This in turn will help vendors adapt their products, while also giving IT consumers greater insight into how to evaluate and assess their options.
The current MLCommons Storage v1.0 benchmark should be considered a starting point that will help vendors and IT consumers better understand the impact that various AI workloads have on storage subsystems.  There are many more workloads, datasets, and other metrics that can and should be added to the suite over time, but having an open tool that is representative of real-world workloads is beneficial for everyone.
By participating in MLCommons working groups, The Futurum Group is committed to helping vendors and IT consumers better understand how AI workloads utilize IT infrastructure, enabling consumers to make better purchasing decisions.