programmer
About
- Username: programmer
- Joined
- Visits: 51
- Last Active
- Roles: member
- Points: 454
- Badges: 1
- Posts: 3,503
Reactions
-
First Mac Studio M3 Ultra benchmarks significantly outpace the M2 Ultra
john-useless said: I've been a Mac user since the original 1984 model … but benchmarks in recent years confuse me, given the nature of multi-core machines these days, not to mention performance vs. efficiency cores, etc. Consider these two new Mac models:
- The Mac Studio with the base M4 Max processor has a 14-core CPU with 10 performance cores and 4 efficiency cores, plus a 32-core GPU.
- The Mac Studio with the base M3 Ultra processor has a 28-core CPU with 20 performance cores and 8 efficiency cores, plus a 60-core GPU.
I understand the idea that a lot of software basically gets its work done with the CPU and that only some software is written to get its work done with the GPU. I also understand the idea that each generation of processor does its work faster — thus, M4 processors will have higher single-core scores than comparable M3 processors.
But unless those M3 processors are far, far slower than M4 processors (which isn't the case — we're not talking M1 versus M4 here), wouldn't the model with the M3 Ultra outperform the model with the M4 Max every time because the M3 Ultra has twice as many cores? I thought, perhaps mistakenly, that macOS more or less hides the number of cores from software — that is, an app sends instructions to the CPU once, and macOS takes care of giving that work to all of the cores available to it on a given machine.
I have this image in my mind of horses pulling two wagon trains full of cargo (equal amounts in each train) across the plains. One wagon train has 14 horses, and they are younger and stronger. The other wagon train has 28 horses. They're a bit weaker and more tired … but even so, they're not that much weaker, and there are twice as many of them! Wouldn't the 28-horse team (the M3 Ultra) beat the 14-horse team (the M4 Max) every time? (I suppose it's not as simple as that.)
My use case: I do a lot of editing in Final Cut Pro, mostly HD but some 4K, and some of the projects are 30 minutes long. Is it worth it for me to buy a Mac Studio with M3 Ultra? Twice as many horses which aren't that much weaker…
Excellent questions.
The short answer is "it's complicated".
A slightly longer answer includes some of the following factors:
- No software is 100% parallel. There are always some components which run serially (i.e. the first result is required before the second can be computed, and so on). Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl's_law) basically says that parallel hardware can only speed up the parallel portion of a workload, so even scaling to an infinite number of cores still leaves you waiting on the serial portion. (There's a small worked example of this after the list.)
- Parallel cores aren't entirely independent. They must communicate (that communication is often the serial portion of the algorithm), and communication introduces some slowdowns. Even when they aren't explicitly communicating, they share resources (e.g. the connection to memory) and run into contention there, which slows them down a little. (See the split-and-combine sketch after the list.)
- Signals crossing between chips (even Apple's UltraFusion interconnect) tend to be slower than on-chip signals. This means those communication overheads get a little worse when crossing from one Max die to the other, and you can't always avoid that crossing (indeed, Apple's OS makes it mostly invisible to the software... but nobody would be likely to try to optimize for that anyhow).
- Horse analogy: one horse by itself doesn't contend with anything but pulling on its load and pushing on the ground. Two horses have to deal with the connection between them, jostling from the other, etc. 28 horses would have a whole lot of tugging and jostling, and who knows, maybe some of them don't like each other, so there's kicking and biting happening too. The digital equivalent of that does happen.
- The bottleneck in a computation might not be how fast the instructions execute. It might be memory latency or bandwidth, I/O latency or bandwidth, or use of some special function hardware (encoders/decoders, neural units, etc).
- GPUs are very, very parallel, but each parallel thread of work they can do is less general and less performant than a full-fledged CPU core. So they aren't great for all tasks, and the software running on them pretty much has to be written specifically for them.
- CPUs vary greatly, and the M-series chips have two kinds of core -- efficiency and performance. The former are slower, and the OS needs to figure out where to run what. It doesn't always get that right, at least not right away.
- CPUs these days get a lot of their performance by executing multiple instructions at the same time from one sequence of instructions. A lot of those instructions have to execute in the right order, and that limits how many can be done at once. At an extremely detailed level this depends on the software being run. Some software is carefully crafted to run as many non-intertwined instructions in parallel as possible, and then a CPU with a very "wide" dispatch and SIMD instructions can go very fast. Most software is nowhere near that carefully crafted (sometimes it's just not possible, sometimes it's just not worth the effort, and sometimes there hasn't been the time or expertise available), so the in-core parallelism is only lightly utilized even though the CPU is actively trying to re-order the instructions to go as fast as possible.
- The slowest thing in most modern machines is the memory (well, the I/O is slower, but inside the computer...). To deal with that, a hierarchy of memory caches is built into the chip. These are (relatively) small high-speed memories that hold copies of data that has already been read from or written to the main memory. Since it is very common to re-access a piece of data that was accessed recently, keeping it in a high-speed cache close to the processor can help a lot with performance. But it's not magic, and there are always tradeoffs. Caches work on chunks of data, and they are divided into levels (usually called L1, L2, L3) of varying size and speed, with different amounts of sharing between cores. That means they're not working with just what the program needs next; they're doing extra work here and there, and the sharing between cores means the cores compete for this resource. Plus a mix of software runs on the same cores (e.g. your workload, the UI, the file system, the networking, the browser you leave running in the background, etc.), and it all wants different data in the same caches. Optimizing for cache use is extremely challenging. (The cache-locality sketch after the list shows the effect.)
- ... and so on. And on. And on. It really is very complicated.
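To put a rough number on the Amdahl's Law point, here's a tiny back-of-the-envelope calculator (a minimal sketch; the 90% parallel fraction is just an assumed illustration, not a measurement of any real workload):

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup per Amdahl's Law: the serial part never gets faster."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assume a workload that is 90% parallelizable (purely illustrative).
for cores in (1, 14, 28, 1000):
    print(f"{cores:>4} cores -> {amdahl_speedup(0.9, cores):.2f}x")

# Prints roughly: 1 core -> 1.00x, 14 cores -> 6.09x, 28 cores -> 7.57x,
# 1000 cores -> 9.91x. Doubling the cores from 14 to 28 buys about 24%,
# not 2x, because the serial 10% never gets any faster.
```

Whether the M3 Ultra's extra cores pay off for a given Final Cut Pro export therefore depends entirely on how parallel that particular job really is.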
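And on the communication point: the OS doesn't quietly spread one stream of instructions across all the cores; the software has to split the work up itself and then pay to combine the results. Here's a rough sketch of that split / work / combine pattern (using Python's ProcessPoolExecutor as a stand-in for whatever threading machinery a real app would use; the workload and chunk count are arbitrary):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk: range) -> int:
    # The parallel portion: each worker grinds through its own slice independently.
    return sum(i * i for i in chunk)

def parallel_sum_of_squares(n: int, workers: int) -> int:
    step = n // workers
    chunks = [range(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # The serial portion: shipping the results back and combining them
    # doesn't get any faster no matter how many cores you add.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(10_000_000, workers=8))
```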
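The cache point can be demonstrated too: the exact same arithmetic runs at very different speeds depending on whether memory is touched in cache-friendly order. A rough sketch with NumPy (timings will vary a lot from machine to machine):

```python
import time
import numpy as np

a = np.ones((4096, 4096))  # C-ordered: each row sits contiguously in memory

start = time.perf_counter()
by_rows = sum(float(a[i, :].sum()) for i in range(a.shape[0]))  # walks contiguous memory
row_time = time.perf_counter() - start

start = time.perf_counter()
by_cols = sum(float(a[:, j].sum()) for j in range(a.shape[1]))  # strides across rows
col_time = time.perf_counter() - start

# Both loops do identical arithmetic and get the same answer, but the
# column-wise walk keeps missing the cache (every element it wants lives
# in a different cache line), so it typically comes out noticeably slower.
print(by_rows == by_cols, f"rows: {row_time:.3f}s  cols: {col_time:.3f}s")
```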
-
M5 Pro may separate out GPU and CPU for new server-grade performance
apple4thewin said: Everyone is currently trying to come out with an SoC chip by the end of 2025 or 2026, but Apple is already one step ahead. Although, would this still share memory, or will it go back to dedicated RAM for the CPU and separate RAM for the GPU?
-
Generation gaps: How much faster Apple Silicon gets with each release
MacPro said: You are aware, I hope, I was referring to the OP's comment about 40 years hence? If you don't think in forty years computing power will be over 1000 times more powerful, I am guessing you are young? I started working for Apple in the late 70s so have a long perspective.
Not as old as you, but not far off. And I’m in the industry right now, and have been for decades, with a good view of what is really going on. I’m extremely familiar with how far we’ve come, and yes, it is millions of times more powerful than the earliest computers. Could we see a 1000x improvement in the next 40 years? Yes, it’s possible.
My point is that we can’t take past progress as the metric for future progress. The idea of continuous, steady progress in process improvement is gone, and has been for quite a while. Much of Moore’s original paper was about the economics of chip production; performance was kind of a side effect. The problem is that each successive improvement costs more and more, delivers less and less, and comes at higher and higher risk. In this situation the economic model could break down and put that 1000x in 40 years in jeopardy. Nobody knows what that’s going to look like, because the industry has never been in this position before. New territory. Makes predictions highly suspect.
-
Generation gaps: How much faster Apple Silicon gets with each release
dope_ahmine said: CPUs and GPUs actually complement each other in AI. CPUs handle tasks with lots of decision-making or data management, while GPUs jump in to power through the raw computation. It’s not just about one or the other; the best results come from using both for what each does best.
As for energy efficiency, GPUs perform many tasks at a much lower power cost than CPUs, which is huge for AI developers who need high-speed processing without the power drain (or cost) that would come from only using CPUs.
And on top of all that, new architectures are even starting to blend CPU and GPU functions—like Apple’s M-series chips, which let both CPU and GPU access the same memory to cut down on data transfer times and save power. Plus, with all the popular libraries like PyTorch, CUDA, and TensorFlow, it’s easier than ever to optimize code to leverage GPUs, so more developers can get the speed and efficiency benefits without diving deep into complex GPU programming.
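As a concrete illustration of that last point, here's roughly what "leveraging the GPU" looks like from PyTorch on an M-series Mac (a minimal sketch, assuming a PyTorch build with MPS support; the matrix sizes are arbitrary):

```python
import torch

# Pick the Apple GPU (Metal Performance Shaders backend) if it's available,
# otherwise fall back to the CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Unified memory means there's no separate pool of VRAM to manage, but the
# tensors still have to be explicitly placed on the device that will compute on them.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b            # runs on the GPU when device is "mps"
print(device, c.sum().item())
```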
-
Generation gaps: How much faster Apple Silicon gets with each release
MacPro said: 1der said: It seems Cook’s law is then about 4 years. It's always fun to make lots of assumptions and project into the future. In doing so I imagine, in say 40 years, what seemingly miraculous AI feats could be accomplished with the machine in your hand being 1000 times as powerful