programmer

About

Username
programmer
Joined
Visits
51
Last Active
Roles
member
Points
454
Badges
1
Posts
3,503
  • First Mac Studio M3 Ultra benchmarks significantly outpace the M2 Ultra

    I've been a Mac user since the original 1984 model … but benchmarks in recent years confuse me, given the nature of multi-core machines these days, not to mention performance vs. efficiency cores, etc. Consider these two new Mac models:

    • The Mac Studio with the base M4 Max processor has a 14-core CPU with 10 performance cores and 4 efficiency cores, plus a 32-core GPU.
    • The Mac Studio with the base M3 Ultra processor has a 28-core CPU with 20 performance cores and 8 efficiency cores, plus a 60-core GPU.

    I understand the idea that a lot of software basically gets its work done with the CPU and that only some software is written to get its work done with the GPU. I also understand the idea that each generation of processor does its work faster — thus, M4 processors will have higher single-core scores than comparable M3 processors.

    But unless those M3 processors are far, far slower than M4 processors (which isn't the case — we're not talking M1 versus M4 here), wouldn't the model with the M3 Ultra outperform the model with the M4 Max every time because the M3 Ultra has twice as many cores? I thought, perhaps mistakenly, that macOS more or less hides the number of cores from software — that is, an app sends instructions to the CPU once, and macOS takes care of giving that work to all of the cores available to it on a given machine.

    I have this image in my mind of horses pulling two wagon trains full of cargo (equal amounts in each train) across the plains. One wagon train has 14 horses, and they are younger and stronger. The other wagon train has 28 horses. They're a bit weaker and more tired … but even so, they're not that much weaker, and there are twice as many of them! Wouldn't the 28-horse team (the M3 Ultra) beat the 14-horse team (the M4 Max) every time? (I suppose it's not as simple as that.)

    My use case: I do a lot of editing in Final Cut Pro, mostly HD but some 4K, and some of the projects are 30 minutes long. Is it worth it for me to buy a Mac Studio with M3 Ultra? Twice as many horses which aren't that much weaker…

    Excellent questions.

    The short answer is "it's complicated".

    A slightly longer answer includes some of the following factors:
    • No software is 100% parallel.  There are always some parts which must run serially (i.e. the first result is required before the second can be computed, and so on).  Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl's_law) basically says that parallel hardware can only speed up the parallel portion of a workload, so even scaling to an infinite number of cores would still leave you waiting on the serial portion.  (There's a rough back-of-the-envelope calculation after this list.)
    • Parallel cores aren't entirely independent.  They must communicate (which is often the serial portion of the algorithm), and that communication introduces some slowdowns.  Even when they aren't explicitly communicating, they share resources (e.g. the connection to memory) and run into contention there, which slows them down a little.
    • Signals crossing between chips (even over Apple's UltraFusion connector) tend to be slower than on-chip signals.  This means those communication overheads get a little worse when crossing from one Max die to the other, and you can't always avoid that crossing (indeed, Apple's OS makes it mostly invisible to the software... but nobody would likely try to optimize for that anyhow).
    • Horse analogy:  one horse by itself doesn't contend with anything but pulling on its load and pushing on the ground.  Two horses have to deal with the connection between them, jostling from the other, etc.  28 horses would have a whole lot of tugging and jostling, and who knows, maybe some of them don't like each other, so there's kicking and biting happening too.  The digital equivalent of that does happen.  :)
    • The bottleneck in a computation might not be how fast the instructions execute.  It might be memory latency or bandwidth, I/O latency or bandwidth, or use of some special function hardware (encoders/decoders, neural units, etc).
    • GPUs are very, very parallel, but each parallel thread of work they can do is less general and less performant than a full-fledged CPU core.  So they aren't great for all tasks, and the software running on them pretty much has to be written specifically for them.
    • CPUs vary greatly, and the M-series chips have two kinds of core -- efficiency and performance.  The former are slower, and the OS needs to figure out where to run what.  It doesn't always get that right, at least not right away.  (Software can give it scheduling hints, though -- see the QoS sketch after this list.)
    • CPUs these days get a lot of their performance by executing multiple instructions at the same time from one sequence of instructions.  A lot of those instructions have to execute in the right order, and that limits how many can be done at once.  At an extremely detailed level this depends on the software being run.  Some software is carefully crafted to run as many non-intertwined instructions in parallel as possible, and then a CPU with a very "wide" dispatch and SIMD instructions can go very fast.  Most software is nowhere near that carefully crafted (sometimes it's just not possible, sometimes it's just not worth the effort, and sometimes there hasn't been the time or expertise available), so the in-core parallelism is only lightly utilized even though the CPUs are actively trying to re-order the instructions to go as fast as possible.
    • The slowest thing in most modern machines is the memory (well, the I/O is slower, but inside the computer...).  To deal with that, a hierarchy of memory caches is built into the chip.  These are (relatively) small, high speed memories that hold copies of data that has already been read from or written to the main memory.  Since it is very common to re-access a piece of data that was accessed recently, keeping it in a high speed cache close to the processor can help a lot with performance.  But it's not magic, and there are always tradeoffs.  Caches move data in fixed-size chunks (cache lines), and they are divided into levels (usually called L1, L2, L3) of varying size and speed and different amounts of sharing between cores.  This means they're not working with just what the program needs next; they're doing extra work here and there.  The sharing between cores means the cores are competing for this resource.  Plus a mix of software runs on the same cores (e.g. your workload, the UI, the file system, the networking, the browser you leave running in the background, etc), and it all wants different data in the same caches.  Optimizing for cache use is extremely challenging.  (The traversal-order sketch after this list shows how big the effect can be.)
    • ... and so on.  And on.  And on.  It really is very complicated.  
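
    To make the Amdahl's Law point concrete, here's a rough back-of-the-envelope sketch in Swift.  The numbers are made up for illustration -- a 90% parallel workload and a 15% per-core speed advantage for the chip with fewer cores -- not measurements of any real M-series part, and amdahlSpeedup is just a throwaway helper:

        import Foundation

        /// Speedup versus a single baseline core, per Amdahl's Law.
        /// - parallelFraction: share of the work that can use every core (0...1)
        /// - cores: number of cores
        /// - perCoreSpeed: speed of each core relative to the baseline core
        func amdahlSpeedup(parallelFraction p: Double, cores: Double, perCoreSpeed s: Double) -> Double {
            // The serial part runs on one core; the parallel part is spread across all of them.
            return 1.0 / ((1.0 - p) / s + p / (cores * s))
        }

        // Hypothetical: 90% of the workload parallelizes; the 14 cores are assumed ~15% faster each.
        let fewerFasterCores = amdahlSpeedup(parallelFraction: 0.9, cores: 14, perCoreSpeed: 1.15)
        let moreSlowerCores  = amdahlSpeedup(parallelFraction: 0.9, cores: 28, perCoreSpeed: 1.0)
        print(String(format: "14 faster cores: %.1fx, 28 slower cores: %.1fx", fewerFasterCores, moreSlowerCores))
        // Prints roughly "14 faster cores: 7.0x, 28 slower cores: 7.6x" -- nowhere near 2x apart.
        // Drop the parallel fraction to 0.7 and the 14 faster cores actually come out ahead.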
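
    On the efficiency-vs-performance core point: macOS doesn't read minds, but apps can label work with quality-of-service classes, and those hints influence whether it lands on the E cores or the P cores.  A minimal sketch (the print statements are just placeholders for real work):

        import Foundation

        let group = DispatchGroup()

        // Lower QoS: the scheduler prefers the efficiency cores for this kind of work.
        DispatchQueue.global(qos: .utility).async(group: group) {
            print("utility work (housekeeping, background exports)")
        }

        // Higher QoS: the scheduler prefers the performance cores so the user isn't kept waiting.
        DispatchQueue.global(qos: .userInteractive).async(group: group) {
            print("user-interactive work (the thing being waited on right now)")
        }

        group.wait()   // keep this little command-line sketch alive until both blocks finish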
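
    And to show how much the cache behaviour described above matters, here's a tiny Swift experiment.  The 4096 x 4096 size is arbitrary and the timings will vary by machine; both loops do exactly the same arithmetic, they just walk memory in different orders:

        import Foundation

        let n = 4_096
        let grid = [Float](repeating: 1.0, count: n * n)   // one flat block, stored row by row

        func timed(_ label: String, _ body: () -> Float) {
            let start = Date()
            let total = body()
            print(label, total, String(format: "%.3f s", Date().timeIntervalSince(start)))
        }

        // Walk memory in the order it is laid out: every cache line fetched gets fully used.
        timed("row-major   ") {
            var sum: Float = 0
            for row in 0..<n { for col in 0..<n { sum += grid[row * n + col] } }
            return sum
        }

        // Same arithmetic, but striding across rows: most of each cache line is wasted,
        // so the identical amount of "work" typically takes noticeably longer.
        timed("column-major") {
            var sum: Float = 0
            for col in 0..<n { for row in 0..<n { sum += grid[row * n + col] } }
            return sum
        }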
  • M5 Pro may separate out GPU and CPU for new server-grade performance

    Everyone is currently trying to come out with an SoC by the end of 2025 or 2026, but Apple is already one step ahead. Would this still share memory, though, or would it go back to dedicated RAM for the CPU and separate RAM for the GPU?
    This would almost certainly still be a unified memory architecture.  The chiplets will be interconnected via some kind of high speed in-package network or bus, much like current chips use an on-die interconnect.  This gives manufacturing flexibility and improves yields.  AMD has been aggressively using such techniques for years, and it has really just been a matter of time until Apple jumped on it as well.
  • Generation gaps: How much faster Apple Silicon gets with each release

    MacPro said:
    You are aware, I hope, that I was referring to the OP's comment about 40 years hence?  If you don't think that in forty years computing power will be over 1000 times more powerful, I am guessing you are young?  I started working for Apple in the late 70s, so I have a long perspective.
    Not as old as you, but not far off.  And I’m in the industry right now, and have been for decades with a good view of what is really going on.  I’m extremely familiar with how far we’ve come, and yes, it is millions of times more powerful than the earliest computers.  Could we see 1000x improvement in the next 40 years?  Yes, it’s possible.  

    My point is that we can’t take past progress as the metric for future progress.  This idea of continuous steady progress in process improvement is gone, and has been for quite a while.  Much of Moore’s original paper was about the economics of chip production.  Performance was kind of a side effect.  The problem is that each successive improvement costs more and more, and delivers less and less, and comes at higher and higher risk.  In this situation the economic model could break down, and put that 1000x in 40 years in jeopardy.  Nobody knows what that’s going to look like because the industry has never been in this position before.  New territory.  Makes predictions highly suspect.

  • Generation gaps: How much faster Apple Silicon gets with each release

    CPUs and GPUs actually complement each other in AI. CPUs handle tasks with lots of decision-making or data management, while GPUs jump in to power through the raw computation. It’s not just about one or the other; the best results come from using both for what each does best.

    As for energy efficiency, GPUs perform many tasks at a much lower power cost than CPUs, which is huge for AI developers who need high-speed processing without the power drain (or cost) that would come from only using CPUs.

    And on top of all that, new architectures are even starting to blend CPU and GPU functions—like Apple’s M-series chips, which let both CPU and GPU access the same memory to cut down on data transfer times and save power. Plus, with all the popular libraries like PyTorch, CUDA, and TensorFlow, it’s easier than ever to optimize code to leverage GPUs, so more developers can get the speed and efficiency benefits without diving deep into complex GPU programming.
    This is what the NPU is all about as well.  It is, at its core, a matrix multiplication unit.  Getting a GPU to multiply large matrices optimally is a tricky piece of code... so having dedicated matrix multiplication hardware which is purpose-built for the task makes a lot of sense.  If you're doing a lot of that.  Prior to the heavy adoption of deep learning it was almost unheard of for consumer machines to do large matrix multiplication.  That was usually the purview of high performance computing clusters.  With the advent of LLMs and generative models, however, things have changed and it is definitely worth having this hardware sitting on the SoC with the CPUs and GPUs.  Apple also appears to have added matrix hardware to their CPUs (in addition to conventional SIMD), so there are lots of options in an Apple Silicon SoC for where to do these matrix operations.  The NPU is very likely the most power efficient at that (by far), and may also have the highest throughput.  And if you're also doing graphics or other compute, now you don't have to worry about your GPU and CPUs being tied up with the ML calculations.  And the SoC's unified memory architecture lets all these units share their data very, very efficiently.
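
    To make that concrete, here's roughly what handing a matrix multiply to Apple's Accelerate framework looks like, using vDSP_mmul.  The sizes are arbitrary, and how Accelerate maps the work onto the CPU's SIMD/matrix hardware is its own business, not the caller's:

        import Accelerate

        let m = 256, n = 256, p = 256
        let a = [Float](repeating: 1.0, count: m * p)   // A is m x p
        let b = [Float](repeating: 2.0, count: p * n)   // B is p x n
        var c = [Float](repeating: 0.0, count: m * n)   // C is m x n

        // C = A * B.  The caller expresses the math; the library decides how to run it.
        // (The same job could instead go to the GPU via Metal, or to the NPU via Core ML.)
        vDSP_mmul(a, 1, b, 1, &c, 1, vDSP_Length(m), vDSP_Length(n), vDSP_Length(p))

        print(c[0])   // 512.0 here: 256 terms of 1.0 * 2.0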

  • Generation gaps: How much faster Apple Silicon gets with each release


    MacPro said:
    1der said:
    It seems Cook’s law is then about 4 years.  It's always fun to make lots of assumptions and project into the future.  In doing so, I imagine what seemingly miraculous AI feats could be accomplished in, say, 40 years, with the machine in your hand being 1000 times as powerful.
    Same here.  However, I bet your 1000-times increase is way short of the mark in terms of performance gain.
    This sort of extrapolation is based on a fallacy: that future progress will follow the same pattern as past progress.  Moore's "Law" broke down because that no longer holds.  Until about the mid-2000s, we were rapidly and steadily taking advantage of the relatively easy scaling offered by the available EM spectrum for exposing masks.  Since that time improvement has gotten slower, much harder, and much more expensive, because we've reached extreme frequencies which are hard to use, we've hit the power leakage problem at tiny feature sizes, and we've run into many more issues besides.  Each process node improvement is a slow, expensive victory with ever more diminishing returns.  For a lot of kinds of chips it's not worth the cost of going to a smaller process, and that means there is less demand to drive shrinking to the next node.  So it is not justified to look at the progress over M1 through M4 and extrapolate linearly.  We aren't at the end of the road, but getting to each successive process node is less appealing.