Apple Silicon M3 Pro competes with Nvidia RTX 4090 GPU in AI benchmark [u]

AppleInsider · December 13, 2023 6:34PM

In a recent test of Apple's MLX machine learning framework, a benchmark shows how the new Apple Silicon Macs compete with Nvidia's RTX 4090.

Whisper performance

Apple announced on December 6 the release of MLX, an open-source framework designed explicitly for Apple silicon. It's meant for AI developers to build upon, test, use, and enhance within their projects.

Developer Oliver Wehrens recently shared some benchmark results for the MLX framework on Apple's M1 Pro, M2, and M3 chips compared to Nvidia's RTX 4090 graphics card. It makes use of Whisper, OpenAI's speech recognition model.

Wehrens uses the Whisper model for transcribing speech and measures the time it takes to process a 10-minute audio file. Results show that the M1 Pro chip doesn't quite meet the Nvidia GPU's performance, taking 216 seconds to process the audio compared to the 4090's 186 seconds.

However, newer Apple chips have much better performance. For instance, a different person ran the same audio file on an M2 Ultra with 76 GPUs and an M3 Max featuring 40 GPUs and found that these chips transcribed the audio transcription in less time than the Nvidia GPU.

There is also a significant difference in power consumption between Apple's chips and Nvidia's offering. Specifically, when comparing the power usage of a PC with an Nvidia 4090 running versus its idle state, there's an increase of 242 watts.

In contrast, a MacBook with 16 M1 GPU cores shows a much smaller increase in power usage when active compared to its idle state, with a difference of just 38 watts.

The results highlight Apple's gains in AI and machine learning capabilities and could be the beginning of better capabilities for Apple products. With the MLX framework now open-source, it paves the way for broader application and innovation for developers.

Nvidia's 4090 GPU starts at $1,599 just for the card, without a PC. This is the same price as the M3 MacBook Pro from 2022 -- but prices increase rapidly for M3 Pro and M3 Max.

Using an optimized tool changed the results

An update to Wehrens' blog post changed the story. The M3 chips still performed well, but Nvidia cut its times by more than half when using appropriately optimized benchmark tools.

Updated Whisper performance

We're leaving the original story intact since it reflects how the non-optimized tool performed. However, Apple's processors still have a way to go to match Nvidia's when it comes to AI transcripts.

The most interesting factor to note is the difference in power consumption. That result didn't change -- Apple's chips performed well at a fraction of Nvidia's power draw.

With the new results from the Nvidia-optimized tool, the transcript is completed in 8 seconds. M1 Pro took 263 seconds, M2 Ultra took 95 seconds, and M3 Max took 100 seconds.

Apple's results were still impressive, given the power draw, but still didn't match Nvidia's. Apple Silicon still has some room to improve, but it's getting there.

Updated December 13, 6:50 p.m. ET: Original post used a non-optimized benchmark showing inaccurate results.

Read on AppleInsider

loopless · December 13, 2023 9:27PM

One of the big advantages of Apple's "unified" memory approach is that you do not need to be continually shuffling data to and from the GPU and worrying if the GPU has enough memory. That's why although a GPU might have a theoretical GFLOP rating that is much higher than say a M series chip, in the real world, not so much.

For example to multiply two matrices (C = A.B ) on a NVIDIA GPU the steps are
- allocate memory for matrix A, B, C on the GPU
- copy data from main memory A,B to A,B on the GPU
- call GPU to multiply and compute C
- Copy matrix C from GPU back to main memory

michelb76 · December 13, 2023 9:30PM

He updated the article with the nvidia optimized version of whisper, and no surprise, the 4090 blows aways the macs. Still a nice trajectory that apple is on, but also still a long way to go.

discountopinion · December 13, 2023 10:21PM

Super interesting and it is clear that Apple is making headway and that the MLX release was a great thing for the community.

Comparison, while entertaining, is a bit misleading tho as it is comparing one architecture that is optimised for power efficiency against a plugged-in no holds barred GPU architecture. To say that Apple has a long way to go missed the point a bit.

The power consumption factor should be highlighted even more. The Apple benchmark is likely similar or the same if running plugged in or on battery. All-in the PC guts power consumption is likely 10x (or higher) of the M3 Max SoC. So if doing an amateur Watts x seconds calculation then the optimised Nvidia model is still faster than the MLX M3Max, but not by a whole lot at all.

RTX 4090 has a TDP of 450W. Laptop RTX 4090 is 175W (?). M3Max is far from this and Nvidia GPU power draw is not counting the PC CPU, RAM and other circuitry that resides in the M3 chip.

Now I wonder if there is any option to do a MLX version of the Nvidia optimised model. Perhaps there are many other tweaks in the model besides tuning it for Nvidia cards?

Also I wonder what performance numbers we would see if Apple would go on a ragga tip and do an M3 chip with a TDP of 450-700W. Maybe Apple has tried it and it didn't meet expectations. Chips be hard.

Hopefully M4 will be tripling down on genAI performance.

mknelson · December 14, 2023 12:24AM

michelb76 said:

He updated the article with the nvidia optimized version of whisper, and no surprise, the 4090 blows aways the macs. Still a nice trajectory that apple is on, but also still a long way to go.

As the author notes in his update: the Mac version isn't optimized. That means it should improve.

mattinoz · December 14, 2023 1:05AM

Have I missed something?

The headline says M3 Pro competes with....
The article doesn't list the performance of the M3 Pro
M3 Max yes
M1 Pro yes
M2 Ultra yes

jdw · December 14, 2023 1:59AM

The article falls flat with the updated numbers because, despite the power consumption advantage, the 4090 is so much faster that difference matters nothing. In other words, if you can get THAT much more performance, it may be worth that much extra power draw.

And for those saying the Mac version isn't optimized, well, we can only comment on the data we have right now, which is, no optimized Mac version. When they optimized Mac version hits, then I can re-evaluate my remarks at that point in time. But for now, the crown is clearly on the head of the 4090, despite the massive increase in power consumption.

daven · December 14, 2023 2:54AM

Calm down people. There are advantages to both systems and, in the future, you can be assured that it will be a closer horse race in many aspects.

hreb · December 14, 2023 4:32AM

loopless said:

One of the big advantages of Apple's "unified" memory approach is that you do not need to be continually shuffling data to and from the GPU and worrying if the GPU has enough memory.

This is also why the RTX 4090 ships with 24GB of dedicated GDDR6X memory which runs at 1008 Gb/s. Compare with the unified memory of the m3 max which claims 409.6 GB/s. The m3 max's memory bandwidth is truly impressive -- but it doesn't match a high-end discrete GPU.

michelb76 · December 14, 2023 9:16AM

mknelson said:

michelb76 said:

He updated the article with the nvidia optimized version of whisper, and no surprise, the 4090 blows aways the macs. Still a nice trajectory that apple is on, but also still a long way to go.

As the author notes in his update: the Mac version isn't optimized. That means it should improve.

Definitely, and hopefully someone will write something that can be on par with the Cuda version. Still pretty cool, don't get me wrong, I'm loving my Studio for LLM work.

cheesefreeze · December 14, 2023 1:38PM

This article is a bit misleading in that it only provides a single benchmark and suggests (by how it is setup) to be a great choice for it.
A single benchmark is useless in the field of AI, unless you are just doing that single task.
In the field of AI (in which I'm active) there are many tasks that demand different things from an SoC.

For example, in order to run a local LLM or image generation model, a lot of memory is needed (next to specs such as CPU).
A.I devs need to go all the way up to an M3 Max with 128gb of memory to run these models. With these specs 128gb really surpass it because it has direct access to it.
For most cases however, the 4090 definitely has the edge from a raw power p.o.v, not just because of the CUDA software<>hardware ecosystem
But then again, the M3 Max runs at 65 watts and the Nividia at 380 watts(!), and gets close enough not to be bothered by the slower performance.

kurai_kage · December 14, 2023 4:29PM

Given the emphasis on power consumption of each of chipsets, I would have liked to have seen the actual energy usage results for each of the attempts.

chasm · December 14, 2023 6:25PM

So if I understand this correctly: If I have a bunch of audio transcriptions to do on my laptop, I can either use my PC laptop with a 4090 and feed it my ENTIRE battery, and get that done very quickly indeed, making it as warm as a small sun —or I can use my Mac laptop, get it done a little bit less quickly, and still have plenty of battery life and an unscorched genital area.

Hmmm. Tough call.

marvin · December 14, 2023 8:33PM

michelb76 said:

He updated the article with the nvidia optimized version of whisper, and no surprise, the 4090 blows aways the macs. Still a nice trajectory that apple is on, but also still a long way to go.

The optimized version says it uses parallel processing, default batch size is 24:

https://github.com/Vaibhavs10/insanely-fast-whisper

The other test is sequential. It also says that optimal one runs on Mac.

It's not a fair comparison to put parallel software against sequential to show a 10x gain. That's like running After Effects sequentially on one machine and doing batch image processing on another. Of course it's going to be faster in parallel.

AMD and Nvidia have been squabbling about this recently:

https://www.tomshardware.com/news/nvidia-h100-is-2x-faster-than-amd-m1300x

That one mentions using FP8 calculations vs FP16. They say the result precision isn't compromised but still not a fair comparison.

If these benchmarks are going to be fair, there needs to be a standard for what is comparable. Compare same batch sizes, same compute units otherwise we're not talking about the capability of the hardware.

It's still valid to mention faster results regardless of the route used because in the end the result is what matters but all these manufacturers are working on the same base principles and software. It's about transistor count, clock speed, memory bandwidth etc. Nvidia 4090 is around 4x the raw hardware of M3 Max and M3 Max is on a more advanced node. The 4090 should be expected to be around 3-4x faster in most tests. M3 Max equivalent is the desktop 3070/laptop 4070, not the desktop 4090.

danox · December 14, 2023 10:01PM

Apple is on the right path with Apple Silicon the M3 Studio Ultra (coming this year) with 256 gig of UMA memory and will be all right at 75-110 watts total at peak performance with optimized software (Apple must support these developers any way they can software wise particularly in networking support), imagine what the M4 and M5 will do in the next 2-3 years. Nvidia at 353 watts won't be mobile or efficient with giant cooling mechanisms. From an energy and performance standpoint Apple is in very a good position building from the ground up.

https://www.youtube.com/watch?v=jaM02mb6JFM

https://www.youtube.com/watch?v=5dhuxRF2c_w (Take a look at the power supply and plumbing) long term Apple is on the right path just takes a little longer, like replacing Intel.

A Extreme Mac Pro with 512 gig of UMA at 160 watts? Won't happen but the possibilities.......Apple as a vertical computer company are in a very good position.

edited December 2023

macpro · December 14, 2023 11:11PM

chasm said:

So if I understand this correctly: If I have a bunch of audio transcriptions to do on my laptop, I can either use my PC laptop with a 4090 and feed it my ENTIRE battery, and get that done very quickly indeed, making it as warm as a small sun —or I can use my Mac laptop, get it done a little bit less quickly, and still have plenty of battery life and an unscorched genital area.

Hmmm. Tough call.

Not sure if it's true but I was just told that Nvidia card usage contributes more C02 than the US rail system. I must go check on that. Nope, it's not true, I should have checked first. My bad.

edited December 2023

godofbiscuitssf · December 15, 2023 12:59AM

chasm said:

So if I understand this correctly: If I have a bunch of audio transcriptions to do on my laptop, I can either use my PC laptop with a 4090 and feed it my ENTIRE battery, and get that done very quickly indeed, making it as warm as a small sun —or I can use my Mac laptop, get it done a little bit less quickly, and still have plenty of battery life and an unscorched genital area.

Hmmm. Tough call.

No. You're not understanding it correctly. The author didn't do the math right. He has the total times. He has the power draw, but he only measured instantaneous power draw. He could have measured TOTAL power draw.

Using numbers from anantech, the M1 (not m2) Ultra Mac Studio drew 87 watts peak (if the the author of this article did the tests properly, we'd have better numbers), that's 87 watts * 95 secs = 8265 watts total.
for the 4090:
242 watts * 8 secs = 1936 watts total

meaning that even at that "inefficiency", the 4090 is using FOUR TIMES LESS ENERGY.

And if you insist on comparing your laptop to a desktop 4090, let's go back to the M1Pro chip, with MLX library, and why not? because the author does! his "38 wats" to the 4090's peak numbers:
38 watts * 216 secs = 8208 watts total.

Whoops. the 4090 still uses FOUR TIMES LESS ENERGY even than the M1Pro.

godofbiscuitssf · December 15, 2023 1:00AM

ffs, just rewrite this article. Redo the testing. Remeasure everything. You should be doing total energy usage for the task. Stop citing "peak" energy usage. It's meaningless in this context.

mike_galloway · December 15, 2023 10:36AM

godofbiscuitssf said:

ffs, just rewrite this article. Redo the testing. Remeasure everything. You should be doing total energy usage for the task. Stop citing "peak" energy usage. It's meaningless in this context.

Absolutely - Watts (Volts x Amps) are unless without definition eg Watt/hours. Unfortunately this will happen, as journalism is usually the rehashing of other articles without the knowledge to know if they are correct, and then made worse by us and our uninformed comments here. If I wanted to write this article I would have to do the tests myself otherwise I would not have confidence that it was right.

ollie802 · December 18, 2023 12:49PM

You can also tell the author knows nothing about machine learning or "AI" because this article only deals with the inference workload... which is a totally different ball game to training workload.

gatorguy · January 3, 2024 2:35PM

nimbuschild said:

I think that the benchmark using the Whisper model for speech transcription is a great way to compare these chips. It's impressive that the newer Apple chips, like the M2 Ultra and M3 Max, outperformed Nvidia's GPU in processing time. This shows that Apple is seriously upping its game in terms of processing power and efficiency.

If you read through to the bottom of the article you will see the author has revised his findings, as they were originally flawed. The new Apple chips did not outperform Nvidia's.

Apple Silicon M3 Pro competes with Nvidia RTX 4090 GPU in AI benchmark [u]

Using an optimized tool changed the results

Comments