programmer
About
- Username: programmer
- Joined
- Visits: 51
- Last Active
- Roles: member
- Points: 454
- Badges: 1
- Posts: 3,503
Reactions
-
First Mac Studio M3 Ultra benchmarks significantly outpace the M2 Ultra
netrox said:
Mac Studio with M3 Ultra seems targeted toward users who use AI LLMs and scientific computing, where loading massive amounts of data into RAM can make a huge difference. LLMs often range from around 4GB at a minimum (and produce laughable outputs) to 128GB with better outputs. For AI to work efficiently, the model needs to be loaded into RAM. It may be cheaper to just use AI servers than to buy an M3 Ultra, but if a person truly wants everything "offline" then the M3 Ultra is suitable. You can tell that Apple is targeting them by offering massive 256GB or 512GB options and advertising running LLMs in RAM.
Yes. I don’t imagine that Apple expects to sell many of those models. And their margins on them are undoubtedly… generous. Important for Apple to have these machines in the market, though.
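For a rough sense of the memory math behind those RAM figures, here's a back-of-the-envelope sketch (the function and the overhead factor are my own illustration, not anything from the post): a model's weights have to fit in unified memory, and the footprint is roughly parameter count times bytes per weight, plus some working overhead.

```swift
// Rough LLM memory estimate (illustrative assumption: ~20% overhead
// for KV cache, activations, and runtime bookkeeping).
func estimatedRAMGB(billionsOfParameters: Double, bitsPerWeight: Double,
                    overhead: Double = 1.2) -> Double {
    let weightBytes = billionsOfParameters * 1e9 * (bitsPerWeight / 8.0)
    return weightBytes * overhead / 1e9
}

print(estimatedRAMGB(billionsOfParameters: 8, bitsPerWeight: 4))    // ≈ 4.8 GB
print(estimatedRAMGB(billionsOfParameters: 70, bitsPerWeight: 16))  // ≈ 168 GB
```

Which is roughly why the 256GB and 512GB options line up with running large models entirely in RAM.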
-
First Mac Studio M3 Ultra benchmarks significantly outpace the M2 Ultra
john-useless said:
I've been a Mac user since the original 1984 model … but benchmarks in recent years confuse me, given the nature of multi-core machines these days, not to mention performance vs. efficiency cores, etc. Consider these two new Mac models:
- The Mac Studio with the base M4 Max processor has a 14-core CPU with 10 performance cores and 4 efficiency cores, plus a 32-core GPU.
- The Mac Studio with the base M3 Ultra processor has a 28-core CPU with 20 performance cores and 8 efficiency cores, plus a 60-core GPU.
I understand the idea that a lot of software basically gets its work done with the CPU and that only some software is written to get its work done with the GPU. I also understand the idea that each generation of processor does its work faster — thus, M4 processors will have higher single-core scores than comparable M3 processors.
But unless those M3 processors are far, far slower than M4 processors (which isn't the case — we're not talking M1 versus M4 here), wouldn't the model with the M3 Ultra outperform the model with the M4 Max every time because the M3 Ultra has twice as many cores? I thought, perhaps mistakenly, that macOS more or less hides the number of cores from software — that is, an app sends instructions to the CPU once, and macOS takes care of giving that work to all of the cores available to it on a given machine.
I have this image in my mind of horses pulling two wagon trains full of cargo (equal amounts in each train) across the plains. One wagon train has 14 horses, and they are younger and stronger. The other wagon train has 28 horses. They're a bit weaker and more tired … but even so, they're not that much weaker, and there are twice as many of them! Wouldn't the 28-horse team (the M3 Ultra) beat the 14-horse team (the M4 Max) every time? (I suppose it's not as simple as that.)
My use case: I do a lot of editing in Final Cut Pro, mostly HD but some 4K, and some of the projects are 30 minutes long. Is it worth it for me to buy a Mac Studio with M3 Ultra? Twice as many horses which aren't that much weaker…
Excellent questions.
The short answer is "it's complicated".
A slightly longer answer includes some of the following factors (a few small sketches follow the list):
- No software is 100% parallel. There are always some components which run serially (i.e. the first result is required before the second can be computed, and so on). Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl's_law) basically says that parallel hardware can only speed up the parallel portion of a workload, so scaling to an infinite number of cores would still leave you limited by the serial portion; the first sketch after this list puts numbers on it.
- Parallel cores aren't entirely independent. They must communicate (which is often the serial portion of the algorithm), and that communication introduces some slowdowns. Even when they aren't explicitly communicating, they share resources (e.g. the connection to memory) and run into contention there, which slows them down a little.
- Signals crossing between chips (even Apple's UltraFusion connector) tend to be slower than on-chip signals. This means those communication overheads get a little worse when crossing from one Max die to the other, and you can't always avoid that crossing (indeed, Apple's OS makes it mostly invisible to the software... but nobody would likely try to optimize for that anyhow).
- Horse analogy: one horse by itself contends with nothing but pulling on its load and pushing on the ground. Two horses have to deal with the connection between them, jostling from the other, etc. 28 horses would have a whole lot of tugging and jostling, and who knows, maybe some of them don't like each other, so there's kicking and biting happening too. The digital equivalent of that does happen.
- The bottleneck in a computation might not be how fast the instructions execute. It might be memory latency or bandwidth, I/O latency or bandwidth, or use of some special function hardware (encoders/decoders, neural units, etc).
- GPUs are very, very parallel, but each parallel thread of work they can do is less general and less performant than in a full-fledged CPU. So they aren't great for all tasks, and the software running on them pretty much has to be written specifically for them.
- CPUs vary greatly, and the M-series chips have two kinds of cores: efficiency and performance. The former are slower, and the OS needs to figure out where to run what; it doesn't always get that right, at least not right away (the scheduling sketch after this list shows how software hints at this).
- CPUs these days get a lot of their performance by executing multiple instructions at the same time from one sequence of instructions. A lot of those instructions have to execute in the right order, and that limits how many can be done at once. At an extremely detailed level this depends on the software being run. Some software is carefully crafted to run as many non-intertwined instructions in parallel as possible, and then a CPU with a very "wide" dispatch and SIMD instructions can go very fast (the SIMD sketch after this list shows the idea). Most software is nowhere near that carefully crafted (sometimes it's just not possible, sometimes it's not worth the effort, and sometimes the time or expertise hasn't been available), so the in-core parallelism is only lightly utilized even though the CPU actively re-orders instructions to go as fast as possible.
- The slowest thing in most modern machines is the memory (well, the I/O is slower, but inside the computer...). To deal with that, a hierarchy of memory caches is built into the chip. These are (relatively) small high-speed memories that hold copies of data that has already been read from or written to main memory. Since it is very common to re-access a piece of data that was accessed recently, keeping it in a high-speed cache close to the processor can help a lot with performance. But it's not magic, and there are always tradeoffs. Caches work on chunks of data, and they are divided into levels (usually called L1, L2, L3) of varying size and speed, with different amounts of sharing between cores. This means they're not working with just what the program needs next; they're doing extra work here and there. The sharing between cores means the cores compete for this resource. Plus, a mix of software runs on the same cores (e.g. your workload, the UI, the file system, the networking, the browser you leave running in the background, etc.), and it all wants different data in the same caches. Optimizing for cache use is extremely challenging (the traversal sketch after this list shows how much the access pattern alone matters).
- ... and so on. And on. And on. It really is very complicated.
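To put numbers on the Amdahl's Law point, here's a minimal sketch (the 90% parallel fraction is an assumed figure, purely for illustration):

```swift
// Amdahl's Law: speedup on n cores when a fraction p of the work
// is parallelizable: speedup(n) = 1 / ((1 - p) + p / n)
func amdahlSpeedup(parallelFraction p: Double, cores n: Double) -> Double {
    1.0 / ((1.0 - p) + p / n)
}

// Assume a workload that is 90% parallel:
print(amdahlSpeedup(parallelFraction: 0.9, cores: 14))  // ≈ 6.1x
print(amdahlSpeedup(parallelFraction: 0.9, cores: 28))  // ≈ 7.6x
```

Doubling the horses buys you about 24% here, not 2x, which is why the 28-horse team doesn't simply win every race.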
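On the efficiency-vs-performance core point: software doesn't pick cores directly; it hints at priorities through quality-of-service classes, and the scheduler uses those hints when choosing between P- and E-cores. A sketch using Grand Central Dispatch (placement is ultimately up to the OS, not guaranteed):

```swift
import Dispatch

// High-QoS work: the scheduler favors performance cores for this.
DispatchQueue.global(qos: .userInteractive).async {
    // e.g. the frame or export step the user is actively waiting on
}

// Background QoS: the scheduler may park this on efficiency cores.
DispatchQueue.global(qos: .background).async {
    // e.g. indexing, prefetching, housekeeping
}
```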
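On the "wide dispatch and SIMD" point, here's the same reduction written two ways (using Swift's built-in SIMD8 type; the loop structure is my own sketch):

```swift
// Scalar sum: each add depends on the previous one, so the core's
// parallel execution units mostly sit idle.
func scalarSum(_ a: [Float]) -> Float {
    var total: Float = 0
    for x in a { total += x }
    return total
}

// SIMD sum: eight independent lanes per iteration, which a wide
// core can execute side by side.
func simdSum(_ a: [Float]) -> Float {
    var acc = SIMD8<Float>(repeating: 0)
    var i = 0
    while i + 8 <= a.count {
        var v = SIMD8<Float>(repeating: 0)
        for lane in 0..<8 { v[lane] = a[i + lane] }
        acc += v
        i += 8
    }
    var total: Float = 0
    for lane in 0..<8 { total += acc[lane] }     // horizontal sum of lanes
    while i < a.count { total += a[i]; i += 1 }  // leftover tail
    return total
}
```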
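And on caches: the access pattern alone can change performance dramatically, even when the instruction count is identical. A minimal sketch traversing the same row-major matrix two ways (the size is arbitrary):

```swift
let n = 2048
let matrix = [Float](repeating: 1, count: n * n)  // row-major storage

// Cache-friendly: walks memory contiguously, so every byte of each
// cache line fetched from RAM gets used.
var rowOrderSum: Float = 0
for r in 0..<n {
    for c in 0..<n { rowOrderSum += matrix[r * n + c] }
}

// Cache-hostile: strides n floats (8KB) per access, so most reads
// miss and each fetched line contributes a single element.
var columnOrderSum: Float = 0
for c in 0..<n {
    for r in 0..<n { columnOrderSum += matrix[r * n + c] }
}
```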
-
Apple says not every Apple Silicon generation will get an Ultra
ApplePoor said:
To cover their costs to create and get the M3 Ultra operational, they will need to make a lot of them.
I don’t think that’s true. I suspect the original M3 Max design had the UltraFusion connector, but they just masked it off to make the die a little smaller. Once they decided to start making Ultras they stopped cropping it, and voila! The amount of additional design work could be virtually nil. The markup on the Ultras is pretty significant, so that reduces the number of sales needed to profit. And if they are using Ultras in servers (which might be why we’ve not seen them until now), selling to users is just helping amortize dev costs.
-
Apple says not every Apple Silicon generation will get an Ultra
keithw said:
While it's nice that the M3 Ultra is now finally out, why did it take them over a year to release it? (The M3 line came out on October 30, 2023!) Why didn't they release the M4 Max Studio at the same time as the M4 Max MBP? If they had, I might have saved a few thousand dollars, since I got tired of waiting and bought the MBP. And is the single-core performance of the M3 Ultra the same as the M4 Max? Enquiring minds want to know... But I guess with the 512GB memory capacity and the 80 graphics cores on top of the 32 CPU cores, the M3 Ultra should be a killer LLM machine.
If I were to speculate wildly, I would suppose that the process node (N3B) used for the M3 series had some issues, and pretty much only Apple used it. So to address the issues, Apple moved faster on the M4 using the newer process (N3E), and eschewed the Ultra connector to get that line out faster. This may have freed up M3-capable capacity, which they can now use for the M3 Ultra… and IIRC (and this is even more speculative) the first process did have some advantages over the later one (they removed features from N3E to make it work better), which may play better to what high-end chips like the Ultra need. So rather than spending the time to make the Ultra connector work on N3E, they are probably focused on N3P, which is apparently what comes next.
-
M4 Mac mini review three months later: the perfect headless Mac
My M4 Pro Mac mini with 48GB RAM and a 1TB SSD (plus the external USB 3 storage I already had from previous machines) scores a 6.0/5.0 ... it is way beyond what I expected in terms of performance. Its performance even with Rosetta or other forms of virtualization/emulation is astonishing. Compile speeds are mind-blowing, and its compute capabilities are really amazing, often greatly outperforming the big-iron servers I use regularly. And the whole package was quite a bit cheaper than what I've spent on numerous previous computers over the decades.
Oh, and the power button is perfect. Can't hit it by mistake when fiddling with the ports on the back by feel, and I have pressed it exactly _once_ in over two months.