Intel® Gaudi® 3 AI Accelerator at Scale on IBM Cloud
Mitch Lewis
Over the last few years, generative AI has demonstrated its immense potential as a revolutionary technology. AI-powered applications have shown the ability to enhance automation, streamline workflows, and accelerate innovation. Further, the technology has proven to be broadly applicable, with opportunities to create new, intelligent applications across virtually every industry. While the value of generative AI is apparent, the powerful hardware required to run such applications is often a barrier. As AI moves from an experimental trend to the backbone of real-world applications, IT organizations are challenged to balance the necessary performance against the economics of AI hardware, and to do so at scale.
This paper outlines how Intel Gaudi 3 AI accelerators hosted on IBM Cloud can help organizations overcome these challenges, and evaluates the performance and economics of Gaudi 3 against other leading solutions available on IBM Cloud. To evaluate performance, Signal65 conducted comprehensive AI inference testing of multiple Large Language Models (LLMs) running on Intel Gaudi 3, NVIDIA H100, and NVIDIA H200 IBM Cloud instances. Key findings of this analysis include:
- Up to 43% more tokens per second than NVIDIA H200 when running IBM Granite-3.1-8B-Instruct for small AI workloads.
- Up to 20% more tokens per second than NVIDIA H200 when running Mixtral-8x7B-Instruct-v0.1 for balanced AI workloads.
- Up to 36% more tokens per second than NVIDIA H200 when running Llama-3.1-405B-Instruct-FP8 with large context sizes.
- Up to a 120% increase in tokens per dollar compared to NVIDIA H200 when running Mixtral-8x7B-Instruct-v0.1, and up to 92% more tokens per dollar than NVIDIA H200 when running Llama-3.1-405B-Instruct-FP8.
- Up to a 335% increase in tokens per dollar compared to NVIDIA H100 when running Llama-3.1-405B-Instruct-FP8.
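The cost-efficiency findings above combine measured inference throughput with the hourly price of each IBM Cloud instance. As a minimal illustration of how such metrics can be derived (the throughput figures and instance price below are placeholder assumptions for the sketch, not Signal65's measured results or IBM Cloud's actual pricing), the following Python snippet computes tokens per second and tokens per dollar:

```python
# Sketch of the throughput and cost-efficiency metrics referenced above.
# All numbers in the example run are illustrative placeholders.

def tokens_per_second(total_output_tokens: int, wall_clock_seconds: float) -> float:
    """Aggregate inference throughput across all concurrent requests."""
    return total_output_tokens / wall_clock_seconds

def tokens_per_dollar(tps: float, hourly_instance_price_usd: float) -> float:
    """Cost efficiency: output tokens generated per dollar of instance time."""
    tokens_per_hour = tps * 3600
    return tokens_per_hour / hourly_instance_price_usd

if __name__ == "__main__":
    # Hypothetical run: 1.2M output tokens generated in 10 minutes
    # on an instance billed at $60/hour (placeholder price).
    tps = tokens_per_second(total_output_tokens=1_200_000, wall_clock_seconds=600)
    tpd = tokens_per_dollar(tps, hourly_instance_price_usd=60.0)
    print(f"{tps:,.0f} tokens/s, {tpd:,.0f} tokens/$")
```

Comparing accelerators on tokens per dollar, rather than raw throughput alone, captures the balance of performance and economics that the analysis emphasizes.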
