New Geekbench AI 1.0 Benchmark Analysis and Early Results
- Ryan Shrout
While the tools for performance analysis of on-device AI computing continue to expand, Geekbench AI 1.0 is a new benchmark that attempts to make sense of it all. Signal65 dives into the benchmark itself, its pros and cons, and shares early performance data.
We often take for granted the ubiquity of tools and benchmarks to measure the performance of different systems, platforms, and chips. In the world of PCs and client devices like laptops and smartphones, you can choose from dozens of pre-built benchmarks or dive into the realm of “real world performance” by using real applications and workloads of your own choosing to compare the performance of any one CPU to another, generationally or across architectures.
Want to see how the Intel Meteor Lake CPU cores compare to AMD Hawk Point? Just run anything from Cinebench to Adobe Premiere Pro and then make some intelligent judgements. Want to see how the latest GeForce and Radeon graphics cards perform? Fire up your three most played game titles and put them through their paces, using frame-monitoring software if you want, and describe the experience of each.
But with the recent rise in importance of local, on-device AI computing in the PC space, the story gets more complicated. Easily accessible AI applications are still few and far between, and the software infrastructure connecting those apps, the silicon platforms that run them, and the models and frameworks in the middle is constantly shifting. Testing AI is not as simple and direct as measuring a CPU or GPU, and it’s unlikely that will change anytime soon.
You can go down the path of generating your own comparisons with ONNX or PyTorch or even use something like LM Studio. But performance analysis remains a tricky topic.
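For anyone curious what that DIY path looks like, here is a minimal sketch of a home-grown inference micro-benchmark using PyTorch; the model choice (torchvision’s MobileNetV2) and iteration counts are illustrative assumptions, not anything Geekbench AI actually does.

```python
# Minimal sketch of a DIY inference micro-benchmark with PyTorch.
# Model and iteration counts are arbitrary illustrations.
import time
import torch
import torchvision.models as models

model = models.mobilenet_v2(weights="DEFAULT").eval()
x = torch.randn(1, 3, 224, 224)  # one dummy 224x224 RGB image

with torch.inference_mode():
    for _ in range(10):          # warm-up runs to stabilize caches/clocks
        model(x)
    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    elapsed = time.perf_counter() - start

print(f"{iters / elapsed:.1f} inferences/sec")
```

Even this toy example hints at why DIY testing gets tricky: warm-up behavior, batch size, and thread counts all change the answer before you ever compare two chips.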
One of the most well-known and often-quoted benchmarks in the PC and mobile space is Geekbench from developer Primate Labs. I’ve used these tools for the better part of two decades across smartphones, desktop CPUs, discrete graphics cards, and more as one part of my testing suite: when I ran a hardware review website, when I created competitive collateral at Intel, and now in my role here at Signal65.
Geekbench ML (v0.5) was first released to the public in May of 2021, and then updated again to v0.6 in December 2023. Today, along with a name change to Geekbench AI, the company is releasing version 1.0.
I want to be very clear up front – Geekbench AI is not an “end all” benchmark for comparing platforms on AI compute. It is one tool among many that the industry and community will need as we continue the conversation about different platforms (hardware and software) across the AI application landscape. At Signal65 we will likely use this benchmark but also depend on other third-party tests while developing our own custom toolsets for comparing AI accelerators for some time to come.
Geekbench AI 1.0, like other Geekbench tests, runs a collection of workloads in a few different configurations, then distills everything into a single score meant to represent overall performance. This is an incredibly nebulous and difficult task; even just planting a flag on how to weight each test, how to value performance versus accuracy, and which models and frameworks to use requires a lot of work and partnership.
I won’t spend any more time on the complexity of the test’s development; you can read Primate Labs’ take on that side of things over on their website. Instead, I thought it would be helpful to simply share our quick thoughts on the test itself, the results we see, and how we’ll be using Geekbench AI 1.0 going forward.
Geekbench AI 1.0 takes big steps with support for new frameworks on AI accelerators and hardware. Intel OpenVINO is included for the first time, as is Qualcomm QNN, each meant to ensure the best possible performance on that vendor’s hardware. ONNX works on Windows and is a critical piece of the puzzle, as it allows cross-vendor comparisons and is the only way today to test AMD GPUs. For Android, Geekbench AI supports vendor-specific TensorFlow Lite delegates from Samsung, Arm, and Qualcomm (but notably not MediaTek yet).
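To illustrate why ONNX is the cross-vendor glue here, a short sketch of how the same model file can be pointed at different ONNX Runtime execution providers; the `model.onnx` path and input shape are placeholders, and `DmlExecutionProvider` assumes the onnxruntime-directml package on Windows.

```python
# Sketch: one ONNX model, multiple hardware back-ends via execution providers.
# This is what makes cross-vendor comparisons possible with a single model file.
import numpy as np
import onnxruntime as ort

# See what back-ends this install supports,
# e.g. ['DmlExecutionProvider', 'CPUExecutionProvider'].
print(ort.get_available_providers())

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path to any exported ONNX model
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # falls back in order
)
inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumes an image model
out = sess.run(None, {inp.name: x})
```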
There are quite a few models run as part of Geekbench AI, each representing a kind of consumer AI workload and using an appropriate model (a rough harness sketch follows the list).
- Image classification – MobileNetV1 – Classify the object in an image
- Image segmentation – DeepLabV3+ – Classify each group of pixels in an image
- Pose estimation – OpenPoseV2 – Estimate the pose of a person in an image
- Object detection – MobileNetV1 – Detect position and classification for all objects in an image
- Face detection – MobileNetV2 – Location and confidence of faces in an image
- Depth estimation – EfficientNet-lite3 – Create a depth map relative to camera focal point of an image
- Image super resolution – RFDN – 4x super resolution scaling of an image
- Style transfer – Image Transform Net – Modify one picture to adopt another picture’s style
- Text classification – BERT-Tiny – Map text to confidence scores for a category (positive or negative)
- Machine translation – Transformer – Language translation from English to French
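As a rough mental model of the suite’s structure, here is a hypothetical harness that loops every workload through every precision and records throughput; the workload table, file names, and `run_inference()` helper are all invented for illustration and are not Geekbench AI internals.

```python
# Hypothetical harness mirroring the suite's shape: each workload's model is
# run at each precision and throughput is recorded. All names are placeholders.
import time

WORKLOADS = {
    "image_classification": "mobilenet_v1.onnx",
    "image_segmentation": "deeplab_v3plus.onnx",
    "style_transfer": "image_transform_net.onnx",
    # ...the remaining workloads follow the same pattern
}
PRECISIONS = ["fp32", "fp16", "int8"]

def run_inference(model_path: str, precision: str) -> None:
    """Placeholder: load the model at the given precision and run one inference."""
    time.sleep(0.001)  # stand-in for real work

for name, model_path in WORKLOADS.items():
    for precision in PRECISIONS:
        start = time.perf_counter()
        for _ in range(50):
            run_inference(model_path, precision)
        rate = 50 / (time.perf_counter() - start)
        print(f"{name:24s} {precision}: {rate:.1f} inferences/sec")
```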
Each of these tests is run with three different data types that the benchmark calls Single Precision, Half Precision, and Quantized. Single precision and half precision run at FP32 and FP16 respectively, while the quantized result is generally labeled as INT8. I say ‘generally labeled’ because in my conversation with the developer it was noted that in some instances, with some frameworks, the exact precision can vary.

It is important to note that the quantization work for Geekbench AI is done by the Geekbench development team, not by the vendors directly. Vendors are allowed to see how the quantization was done and supply input to Primate Labs, but at the end of the day it is on the developer to determine the most fair and reasonable way to complete this step. It is also important to understand that most consumer applications of AI are moving to INT8 processing for performance and efficiency reasons, so those scores are likely the most important to consider.
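For readers unfamiliar with the quantization step itself, here is a minimal sketch using ONNX Runtime’s post-training dynamic quantization tooling; Primate Labs’ actual recipe is their own and may well differ, and the model paths are placeholders.

```python
# Sketch of post-training dynamic quantization with ONNX Runtime's tooling,
# just to show the FP32 -> INT8 step. Geekbench AI's own quantization recipe
# is done by Primate Labs and may differ. Paths are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,  # quantize weights to signed 8-bit integers
)
```

The design trade-off is exactly what the article describes: smaller weights and faster integer math in exchange for some accuracy loss, which is why an accuracy check matters.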
That variability in quantized precision is also why the addition of the accuracy portion to Geekbench AI 1.0 is so interesting. The accuracy score is based on comparison to a ground-truth result generated at full precision on an Intel Core i7 CPU. Geekbench AI assigns a different evaluation metric to each test, from pixel-accuracy comparisons (for image segmentation) to root mean square error (for depth estimation). The resulting accuracy percentage is then applied to the raw performance score for that AI test and can thus swing the results down if the accuracy is off.
Most of the time in our testing we have seen accuracy results of 97% or higher, but we have seen at least a couple of examples where the overall score is greatly impacted by a sub-30% accuracy rating.
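To make the mechanism concrete, here is a conceptual sketch of how an accuracy term derived from, say, root mean square error could scale a raw score; the exact metric mappings and weightings Geekbench AI uses are not public in this detail, so the formula and numbers below are assumptions for illustration only.

```python
# Conceptual sketch: an accuracy term scaling a raw throughput-derived score.
# The RMSE-to-accuracy mapping and all values are assumptions, not Geekbench's.
import numpy as np

def depth_accuracy(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Map RMSE against the full-precision ground truth to a 0..1 accuracy term."""
    rmse = np.sqrt(np.mean((predicted - ground_truth) ** 2))
    return 1.0 / (1.0 + rmse)  # assumed mapping: lower error -> closer to 1.0

raw_score = 11000.0                   # hypothetical raw performance score
truth = np.random.rand(240, 320)      # stand-in FP32 depth map
pred_good = truth + np.random.normal(0, 0.01, truth.shape)  # near-perfect output
pred_bad = np.random.rand(240, 320)   # uncorrelated output -> low accuracy

for label, pred in [("good", pred_good), ("bad", pred_bad)]:
    acc = depth_accuracy(pred, truth)
    print(f"{label}: accuracy={acc:.1%}, adjusted score={raw_score * acc:.0f}")
```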
What does this mean for benchmarking and performance comparisons? One thing to understand is that we still don’t have a universal test that allows us to compare ALL the currently available hardware accelerators for AI to each other. Why?
Despite AMD having one of the fastest NPUs on paper, Geekbench AI 1.0 still can’t measure the performance of the Ryzen AI NPU. When I asked Primate Labs about this, I was told the omission reflects where AMD stands today in terms of its consumer AI framework implementations, and that trying to integrate support through the Vitis software (a carryover from Xilinx) just wasn’t working out. Disappointing for sure, but it mirrors the fact that you cannot run the Procyon AI benchmark on AMD NPUs either. Hopefully we’ll have a solution from AMD on this soon.
Let’s look at a quick sample of results. Please keep in mind this is very early, and we aren’t putting these charts together for a detailed cross-device comparison; we only want to show some examples and see what interesting questions and conversations the new benchmark will help drive.
This first set of results compares an Intel Core Ultra 9 Meteor Lake laptop with one of the new Ryzen AI 9 systems, a Qualcomm X Elite Surface device, and the MacBook Air with the Apple M3 chip. We are looking at the best results for each CPU here, which means OpenVINO for Intel and AMD, ONNX for Qualcomm, and of course CoreML for Apple.
AMD is the standout winner here, scoring more than 60% higher than the Intel Core Ultra 9 CPU across all the result sets. Another observation: the scores for both Intel and AMD are essentially the same in single precision (FP32) and half precision (FP16), suggesting neither architecture is optimized more for FP16 than FP32. The Apple and Qualcomm results are more interesting; the Snapdragon score goes DOWN from FP32 to FP16 but spikes back up in the quantized results, surpassing the performance of the Apple M3.
Let’s look at another data set running on the CPU.
Here we have added scores for both the Intel and AMD processors running on ONNX, allowing a more apples-to-apples comparison with Snapdragon. What we find this time is that in both half precision and the quantized state, the X Elite is faster than both the Intel Core Ultra and Ryzen AI 9 CPUs when all three run ONNX. But because the OpenVINO framework appears better optimized for the two x86 parts, their resulting AI performance is still higher. Frameworks and software matter!
Also worth noting: all the results using ONNX on the CPU drop going from single precision to half precision, while the OpenVINO results for the same precision tests maintain performance. Something to keep an eye on.
Finally, we of course need to look at NPU performance.
There are a few interesting things to see in these early results. First, Intel performance from single to half to quantized precision doesn’t really scale on the Meteor Lake NPU. Both single and half precision score just over 7,100 points, while the quantized/INT8 result jumps by about 50% to 11,000. The Qualcomm Snapdragon X Elite NPU, on the other hand, increases at each step along the way, going from 2,177 to 11,069 (a 5x increase) to 21,549 (almost another 2x increase). And while the Intel NPU has the performance advantage in the FP32 results, it delivers just about half the performance of the Snapdragon part in the quantized state. If consumer applications are indeed focusing on INT8 for future AI integrations, then those scores should carry more weight than the FP32/FP16 results.
I mentioned above the accuracy component and its impact on the score and perceived “performance metric” of the Geekbench AI test. It turns out there is one example where the accuracy of a model’s output dragged down a subtest score far more than the others. The half precision (FP16) accuracy for the style transfer workload showed a 22% accuracy score – much lower than expected. To investigate, the Geekbench test can output and save the inference results and images for each test, and the difference was immediately obvious.
The Snapdragon X Elite NPU FP16 image looks very different from the other five images and, in our testing, from all the other CPU and GPU results we have seen so far. Running a similarity index measurement between that particular image (and all the FP16 results in this set) and the base reference image shows why the accuracy percentage would be low. As a result, the score for that subtest is just 2,771 points, compared to 44,000+ on the Intel NPU. It’s just one subtest of 10 in half precision testing, but it means the FP16 score for the X Elite is lower than we’d expect.
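For reference, here is the sort of similarity-index check described above, using SSIM from scikit-image; the article doesn’t specify which index was used, so SSIM is one common choice assumed here, and the file names are placeholders.

```python
# Sketch of a similarity-index check between an inference output and a
# reference image, using SSIM from scikit-image. File names are placeholders;
# the measurement method Signal65 actually used isn't specified.
from skimage import io
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity

ref = rgb2gray(io.imread("style_transfer_fp32_reference.png"))   # ground truth
test = rgb2gray(io.imread("style_transfer_fp16_npu.png"))        # NPU output

score = structural_similarity(ref, test, data_range=1.0)
print(f"SSIM vs. reference: {score:.3f}")  # ~1.0 means near-identical output
```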
It appears this low accuracy result is a known issue, and more of a software bug than any inherent problem with the platform. So even though the performance result for the Snapdragon X Elite is excellent in FP16 testing, it’s very likely we’ll see these scores increase with a Geekbench AI update in the future. And considering that the best scores we have seen with the new benchmark overall come from the Snapdragon X Elite in its quantized/INT8 mode, I’m not worried about Qualcomm’s position.
Hopefully this quick dive into the new Geekbench AI 1.0 announcement and early results is helpful and interesting for the industry. There is still a lot of work to be done in terms of analysis and testing, really putting a lens on the different AI integrations across the AI PC and Copilot+ PC segments, and delving into the impact of higher-powered devices like discrete GPUs. I expect this benchmark will be used by many enthusiasts, reviewers, and analysts in the months ahead as we continue to create and develop new ways of measuring and comparing on-device AI computing performance.