programmer

About

Username: programmer
Joined:
Visits: 48
Last Active:
Roles: member
Points: 452
Badges: 1
Posts: 3,500
  • First M3 benchmarks show big speed improvements over M2

    timmillea said:
    5nm/3nm = 1.6 recurring, suggesting a move from the 5nm process to the 3nm process would yield a 67% improvement in speed/power ratio. We are not seeing that.

     
    LOL... that's not how this works.  Never has, never will.  For starters, the process number represents the linear dimension of the smallest feature that the process can create.  It does not apply to everything on the chip, plus it is a single dimension whereas chips are 2-dimensional.  In theory that means this shrink ought to allow 2.8x as many devices on the chip (i.e. the transistor count that is often quoted).  But chips are far more than just transistors, and indeed Apple's numbers mention "only" a 37% increase in transistor count (M2 Max -> M3 Max).  The number of transistors does not relate linearly to performance either -- the reality is far more complex and nuanced.  Furthermore, performance is vastly more complex than just one number -- there are a mind-blowing number of factors, and it greatly depends on what software you need to run.  A benchmark gives only a vague snapshot of a computer's capability, unless what you plan to use it for is running that specific benchmark algorithm (which is virtually never the case).  Performance is a vast and complex topic, so thinking you can relate it to the process number is simply naive.
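    For what it's worth, here is the naive arithmetic being argued about as a tiny Swift sketch (purely illustrative and not a model of real chips; the node names are treated as literal dimensions only to show where the 67% figure comes from):

    ```swift
    import Foundation

    // Naive node-shrink arithmetic, taking the marketing numbers at face value.
    let oldNode = 5.0   // "5nm"
    let newNode = 3.0   // "3nm"

    let linearRatio = oldNode / newNode          // ~1.67x -- the "67% improvement" quoted above
    let arealRatio  = linearRatio * linearRatio  // ~2.78x -- ideal 2-D density gain from the same shrink

    print(String(format: "linear: %.2fx, areal: %.2fx", linearRatio, arealRatio))
    // Apple's own M2 Max -> M3 Max figures claim only ~1.37x the transistors,
    // nowhere near the ideal areal number -- which is the point being made above.
    ```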

    As for waiting for a particular process tech, that doesn't make much sense.  The continual, steady onward march of process tech ended over a decade ago, and now transitions happen in fits and starts.  They are enormously expensive, and bring diminishing returns or additional problems.  Predicting what is going to happen next year is difficult enough; further projections are worthless at this point.

    Your M1-based Mac ought to do you well for years.  When it makes sense to upgrade should depend on when it stops doing what you need, or when Apple starts shipping a machine which has a new capability that you need.  This has very little to do with the process technologies being used to create it.
  • M5 Pro may separate out GPU and CPU for new server-grade performance

    Everyone is currently trying to come out with an SoC by the end of 2025 or 2026, but Apple is already one step ahead. Although, would this still share memory, or will it go back to dedicated RAM for the CPU and separate RAM for the GPU?
    This would almost certainly still be a unified memory architecture.  The chiplets will be interconnected via some kind of high-speed in-package network or bus, much like current chips use an on-die interconnect.  This gives manufacturing flexibility and improves yields.  AMD has been aggressively using such techniques for years, and it has really just been a matter of time until Apple jumped on it as well.
  • Generation gaps: How much faster Apple Silicon gets with each release

    chasm said:
    netrox said:
    Exactly why do we need to keep adding more CPU cores when most creative oriented applications would benefit from having more GPU cores? 
    Not that I’m the last word on this topic, but to put this VERY simply CPUs do math and GPUs take that math and manipulate pixels. Graphics are created through math, so more CPUs enable GPUs to do their job better.

    More GPUs are needed when you have really really large screens/more screens. More CPUs are needed when you need more graphics.
    Sorry, but that is wrong.  GPUs excel at doing math at high memory bandwidths... but they basically need to be able to do the math in parallel, and the application has to be written specifically to use the GPU.  CPUs are the default place for code to run, and are generally better at doing complex logic with lots of decisions, randomly chasing through memory for data, and doing less "orderly" computations.  To leverage multiple CPUs, the application has to be written to do that and it isn't the default.  Code typically starts its existence on a single CPU, then the programmer improves it to take advantage of multiple CPUs, then they might improve it further to either use the CPU SIMD or matrix hardware, or re-write critical pieces to run on the GPU.  These days it is also quite common for application programmers to use libraries (often Apple's) which do things like leverage multiple cores, SIMD, matrix hardware, and GPUs.  Creative oriented applications are often graphics or audio heavy, and those things can usually take advantage of all this potential hardware parallelism as long as they are optimized to do so (and the good ones are).
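    To make that progression concrete, here is a rough Swift sketch of the same element-wise multiply written three ways (my illustration, not anyone's shipping code; it assumes Apple's Dispatch and Accelerate frameworks, including the vDSP Swift overlay):

    ```swift
    import Foundation
    import Accelerate

    let a = [Float](repeating: 1.5, count: 1_000_000)
    let b = [Float](repeating: 2.0, count: 1_000_000)

    // 1) Where code usually starts: a plain loop on a single CPU core.
    var serial = [Float](repeating: 0, count: a.count)
    for i in 0..<a.count { serial[i] = a[i] * b[i] }

    // 2) Restructured by the programmer to spread the work across CPU cores.
    var parallel = [Float](repeating: 0, count: a.count)
    let chunks = 8
    let chunkSize = a.count / chunks
    parallel.withUnsafeMutableBufferPointer { buf in
        DispatchQueue.concurrentPerform(iterations: chunks) { c in
            let lo = c * chunkSize
            let hi = (c == chunks - 1) ? a.count : lo + chunkSize
            for i in lo..<hi { buf[i] = a[i] * b[i] }
        }
    }

    // 3) Handed to an Apple library, which picks SIMD (and whatever else) for you.
    let vectorized = vDSP.multiply(a, b)

    print(serial[0], parallel[0], vectorized[0])  // all 3.0
    ```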

    The question of CPUs vs GPUs on the SoC is a complex one.  Many applications don't use the GPU at all except for the UI (which needs hardly any GPU), but are optimized for multiple CPUs... adding more GPU cores for those applications gets you nothing.  Even GPU-heavy applications can benefit from more CPUs in some cases.  Ultimately though, the GPUs tend to be memory-bandwidth limited, so scaling up the GPU beyond what the memory bandwidth can support gets us very little.
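    A back-of-envelope way to see that bandwidth ceiling (all numbers below are made up for illustration; this is just the standard roofline-style bound, not figures for any real chip):

    ```swift
    // Attainable throughput is capped by min(peak compute, bandwidth x arithmetic intensity).
    let peakGpuFlops    = 30.0e12   // hypothetical peak GPU compute, FLOP/s
    let memoryBandwidth = 400.0e9   // hypothetical memory bandwidth, bytes/s
    let flopsPerByte    = 2.0       // arithmetic intensity of a bandwidth-hungry kernel

    let attainable = min(peakGpuFlops, memoryBandwidth * flopsPerByte)
    print("attainable: \(attainable / 1e12) TFLOP/s")
    // 0.8 TFLOP/s here -- the kernel is bandwidth-bound, so doubling the GPU's
    // peak compute (more cores) would not make this workload any faster.
    ```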
  • When will Apple upgrade all of its Macs to M4?

    M4 is old news now. 

    Forget about it. 

    What?  It has only shipped in one product, just a few months ago, and they haven't introduced the pro/max/ultra flavors.  You're delusional if you think they're going to M5 in the near future.  Process migrations are massive transitions, and Apple hasn't even moved their entire lineup to the latest 3nm process.
  • Future Mac Pro may use Apple Silicon & PCI-E GPUs in parallel

    Neither the original post nor any of the subsequent posts mention what is almost certainly the biggest stumbling block to supporting non-Apple GPU hardware:  drivers.

    Drivers have always been the biggest issue with Apple GPU support, and they have always been a hot potato tossed back and forth between Apple's OS group and the 3rd-party HW vendor (including Intel for the integrated GPUs).  GPU drivers are terribly complex things, and Apple can't/doesn't use the drivers written by AMD/Intel/Nvidia... and those vendors aren't likely to put much effort into writing drivers for macOS even if Apple were to start shipping their GPUs in Apple products.  They never did before; the market is too small.  So will Apple write drivers for any 3rd-party devices?  Their current direction suggests that the answer is a resounding "no", but that's not definitive and could change.  They still have drivers that work on the Intel-based Macs, and porting to AArch64 may not be terribly difficult.  Keeping up with the moving target that is the latest AMD GPUs, though, is a lot of work -- on top of supporting Apple's own GPU designs.

    The Apple Silicon hardware is almost certainly compatible at the hardware level with most GPUs from other vendors, thanks to PCI-e / Thunderbolt being standardized in its various flavours.  So you can physically install any of those devices, but you need drivers to make them interoperate with macOS, and macOS needs to continue to expose the functionality required to do that (which conceivably it may not on Apple Silicon, since the macOS team may be taking advantage of detailed knowledge of the hardware).

  • Early M2 Max benchmarks may have just leaked online

    bulk001 said:
    So what’s it been? A year or two, and Apple is already basically at the same place Intel is, with incremental updates spread out over a period of years, unable to deliver on a predictable timetable. Yes, there are some battery life advantages and the initial jump of the v1 chip, but it is not very promising moving forward if this is accurate!
    ... while dealing with the massive disruption brought about by a global pandemic. I don't share your pessimism. Remember that Apple famously ships a great v1.0 and then iterates consistently; some people moan about the lack of regular "OMG" updates but over time the consistency of improvement leads to massive gains.

    Also, tenthousandthings pointed out that in 2014 (a mere 8 years ago!) TSMC was using a 20nm process node. I mentally used a swear word when I read that. Astonishing progress to be shipping at 5nm and imminently 3nm in that timeframe. Well done to everyone at TSMC, that is spectacular!
    A resetting of expectations is also required.  The reality of semiconductor fabrication is that since running into the power wall back in the mid-00s, things haven't been scaling smoothly like they had since humanity started building integrated circuits.  The time between nodes has increased, the risk of going to new nodes has increased, and the cost of going to new nodes has dramatically increased.  The free lunch ended almost two decades ago, and since then the semiconductor industry has been clawing out improvements with higher effort and lower rewards.  Many of the gains have come from doing things other than simply bumping CPU clock rates and cache sizes.  GPU advancements were the first "post-CPU" wave, SoCs happened in the mobile space first and then moved into the laptop/desktop space, and more recently ML-related hardware has become common.  Most of these things require software changes to make any use of at all, never mind actually optimizing for them.

    And that is why Apple needed to move to Apple Silicon -- not because they could build a better CPU than Intel, but because they needed to build the chips that Apple needs.  There is far more in the Axx/Mx chips than just the CPUs and GPUs: how they are interconnected, how they share cache/memory resources, the fixed-function hardware units, the mix of devices, how the various accelerators are tuned for the workloads running on Apple systems, etc.  Expect to see variants tuned for specific systems, perhaps variants at the packaging level... the same cores re-packaged and used in different configurations, etc.

    Just the fact that TSMC has so many variations on the 5nm process node ought to be a clue about how hard getting to the next level has become.  Intel being stuck for a long time at 10nm was a foreshadowing of the future.  And at each stage, the designers are going to have to work harder and innovate more with each process advancement to wring as much value from it as possible... because the next one is going to be even more horrendously expensive and risky (and likely bring diminishing returns, plus "interesting" problems).

  • Generation gaps: How much faster Apple Silicon gets with each release

    CPUs and GPUs actually complement each other in AI. CPUs handle tasks with lots of decision-making or data management, while GPUs jump in to power through the raw computation. It’s not just about one or the other; the best results come from using both for what each does best.

    As for energy efficiency, GPUs perform many tasks at a much lower power cost than CPUs, which is huge for AI developers who need high-speed processing without the power drain (or cost) that would come from only using CPUs.

    And on top of all that, new architectures are even starting to blend CPU and GPU functions—like Apple’s M-series chips, which let both CPU and GPU access the same memory to cut down on data transfer times and save power. Plus, with all the popular libraries like PyTorch, CUDA, and TensorFlow, it’s easier than ever to optimize code to leverage GPUs, so more developers can get the speed and efficiency benefits without diving deep into complex GPU programming.
    This is what the NPU is all about as well.  It is, at its core, a matrix multiplication unit.  Getting a GPU to multiply large matrices optimally is a tricky piece of code... so having dedicated matrix multiplication hardware which is purpose-built for the task makes a lot of sense -- if you're doing a lot of that.  Prior to the heavy adoption of deep learning it was almost unheard of for consumer machines to do large matrix multiplication; that was usually the purview of high-performance computing clusters.  With the advent of LLMs and generative models, however, things have changed and it is definitely worth having this hardware sitting on the SoC with the CPUs and GPUs.  Apple also appears to have added matrix hardware to their CPUs (in addition to conventional SIMD), so there are lots of options in an Apple Silicon SoC for where to do these matrix operations.  The NPU is very likely the most power-efficient option (by far), and may also have the highest throughput.  And if you're also doing graphics or other compute, now you don't have to worry about your GPU and CPUs being tied up with the ML calculations.  And the SoC's unified memory architecture lets all these units share their data very, very efficiently.
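    As a rough illustration of why the library route matters (my sketch, assuming Accelerate's vDSP_mmul; the NPU itself isn't programmed directly like this -- it is reached indirectly through higher-level frameworks such as Core ML):

    ```swift
    import Accelerate

    let m = 64, n = 64, p = 64
    let a = [Float](repeating: 1, count: m * p)   // m x p matrix
    let b = [Float](repeating: 2, count: p * n)   // p x n matrix

    // Naive triple loop: easy to write, hard to make fast on any of the hardware.
    var naive = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for k in 0..<p { sum += a[i * p + k] * b[k * n + j] }
            naive[i * n + j] = sum
        }
    }

    // Library call: let Accelerate choose the optimized path (SIMD, and reportedly
    // the CPUs' matrix hardware) instead of hand-tuning the kernel yourself.
    var fast = [Float](repeating: 0, count: m * n)
    vDSP_mmul(a, 1, b, 1, &fast, 1, vDSP_Length(m), vDSP_Length(n), vDSP_Length(p))

    print(naive[0], fast[0])  // both 128.0
    ```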

  • Apple Silicon Mac Pro could combine two M1 Ultra chips for speed

    My guess is that the Mac Pro will use the same M1 Ultra as the Mac Studio does.  The difference will be in the system around the SoC.  With a larger form factor, they have more cooling potential and could bump up the clock rates a little... but really, the M1 Ultra is a monster as it is (both in terms of size and performance).  I would just take what Ternus said at face value: this is already the last of the M1 series.  And I think we will see a Mac Pro that uses it.

    So what could differentiate the Mac Pro?  In a word:  expandability.

    1) PCIe slots.  The M1 Ultra seems to have plenty of I/O potential, and a fast PCIe bridge chip would easily enable a lot of expansion potential.

    2) Drive bays.  The Mac Pro would have the same built-in super fast SSD, but in a large case a whole lot of additional storage can be accommodated.

    3) RAM.  This is where it gets tricky.  The Apple Silicon approach is to use in-package memory, and there are real constraints on how much can be put into a single package.  Some Pros just need more than can fit into a single package, or more than is worth building in the TSMC production run.  So conventional DIMMs are needed to supplement the super fast in-package memory.  The question is, how does macOS use it?  Apple seems to want to keep the programming model simple (i.e. CPU/GPU shared memory with a flat/uniform 64-bit virtual address space), so having some fast vs slow areas of memory doesn't seem like the direction they want to go in (although they could, and just rely on the M1 Ultra's ENORMOUS caches).  They are already doing virtual memory paging to flash, however... so why not do virtual memory paging to the DIMMs instead?  Big DMA data transfers between in-package and on-DIMM memory across very fast PCIe 5.0 lanes would ensure that the available bandwidth is used as efficiently as possible, and the latency is masked by the big (page-sized) transfers.  A 128GB working memory (the in-package RAM) is huge, so doing VMM to get to the expanded pool is not as bad as you might think.  Such a memory scheme may even just sit on PCIe cards, so buyers only need to pay for the DIMM slots if they really need them.  Such "RAM disk" cards have been around for ages, but are usually hampered by lack of direct OS support... an issue Apple could fix easily in their kernel.
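    To make that paging idea a bit more concrete, here is a toy sketch of a two-tier pager (entirely my own illustration of the speculation above, with made-up names; nothing Apple actually ships):

    ```swift
    // Toy model: a small "fast" pool (standing in for in-package RAM) backed by a
    // larger "slow" pool (standing in for hypothetical DIMM/PCIe-card memory).
    struct TwoTierPager {
        let fastCapacity: Int          // how many pages fit in the fast tier
        var fastPool: [Int] = []       // resident pages, in LRU order (oldest first)
        var slowPool: Set<Int> = []    // pages evicted to the slow tier

        mutating func touch(page: Int) {
            if let idx = fastPool.firstIndex(of: page) {
                fastPool.remove(at: idx)     // already resident: refresh its LRU position
            } else if slowPool.contains(page) {
                slowPool.remove(page)        // "page in": one big DMA transfer from the slow tier
            }
            fastPool.append(page)
            if fastPool.count > fastCapacity {
                let victim = fastPool.removeFirst()
                slowPool.insert(victim)      // "page out": one big DMA transfer to the slow tier
            }
        }
    }

    var pager = TwoTierPager(fastCapacity: 4)
    for page in [1, 2, 3, 4, 5, 1, 6] { pager.touch(page: page) }
    print(pager.fastPool, pager.slowPool)    // [4, 5, 1, 6] resident; pages 2 and 3 paged out
    ```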

  • Apple working out how to use Mac in parallel with iPhone or iPad to process big jobs

    A few years ago I suggested that Apple Silicon & Mac Pro could be combined by creating an M-series chip-on-a-PCIe-board which could be inserted into a Mac Pro's chassis.  The problem with doing this is that it doesn't look (to software) like a traditional CPU/GPU/memory machine.  That is precisely what this article is about though -- how to distribute heavy computations to the available hardware.  The more computation Apple manages to offload from the local machine, the more it makes sense to have additional "headless" hardware available.  This would make the Mac Pro chassis a lot more compelling than it is currently, and the same ASi-on-PCIe boards could be deployed into servers in the cloud.

  • First Mac Studio M3 Ultra benchmarks significantly outpace the M2 Ultra

    I've been a Mac user since the original 1984 model … but benchmarks in recent years confuse me, given the nature of multi-core machines these days, not to mention performance vs. efficiency cores, etc. Consider these two new Mac models:

    • The Mac Studio with the base M4 Max processor has a 14-core CPU with 10 performance cores and 4 efficiency cores, plus a 32-core GPU.
    • The Mac Studio with the base M3 Ultra processor has a 28-core CPU with 20 performance cores and 8 efficiency cores, plus a 60-core GPU.

    I understand the idea that a lot of software basically gets its work done with the CPU and that only some software is written to get its work done with the GPU. I also understand the idea that each generation of processor does its work faster — thus, M4 processors will have higher single-core scores than comparable M3 processors.

    But unless those M3 processors are far, far slower than M4 processors (which isn't the case — we're not talking M1 versus M4 here), wouldn't the model with the M3 Ultra outperform the model with the M4 Max every time because the M3 Ultra has twice as many cores? I thought, perhaps mistakenly, that macOS more or less hides the number of cores from software — that is, an app sends instructions to the CPU once, and macOS takes care of giving that work to all of the cores available to it on a given machine.

    I have this image in my mind of horses pulling two wagon trains full of cargo (equal amounts in each train) across the plains. One wagon train has 14 horses, and they are younger and stronger. The other wagon train has 28 horses. They're a bit weaker and more tired … but even so, they're not that much weaker, and there are twice as many of them! Wouldn't the 28-horse team (the M3 Ultra) beat the 14-horse team (the M4 Max) every time? (I suppose it's not as simple as that.)

    My use case: I do a lot of editing in Final Cut Pro, mostly HD but some 4K, and some of the projects are 30 minutes long. Is it worth it for me to buy a Mac Studio with M3 Ultra? Twice as many horses which aren't that much weaker…

    Excellent questions.

    The short answer is "it's complicated".

    A slightly longer answer includes some of the following factors:
    • No software is 100% parallel.  There are always some components which run serially (i.e. the first result is required before the second can be computed, and so on).  Amdahl's Law (https://en.wikipedia.org/wiki/Amdahl's_law) basically says that parallel hardware can only speed up the parallel portion of a workload, so even scaling to an infinite number of cores still leaves you limited by the serial portion (there's a small worked example after this list).
    • Parallel cores aren't entirely independent.  They must communicate (which is often the serial portion of the algorithm), and that communication introduces some slowdowns.  Even if they aren't explicitly communicating, they are sharing resources (e.g. the connection to memory) and thus run into contention there, which slows them down a little.
    • Signals crossing between chips (even Apple's UltraFusion connector) tend to be slower than on-chip signals.  This means that those communication overheads get a little worse when crossing from one Max die to the other, and you can't always avoid that crossing (indeed, Apple's OS makes it mostly invisible to the software... but nobody would likely try to optimize for that anyhow).
    • Horse analogy:  one horse by itself doesn't contend with anything but pulling on its load and pushing on the ground.  Two horses have to deal with the connection between them, jostling from the other, etc.  28 horses would have a whole lot of tugging and jostling, and who knows, maybe some of them don't like each other so there's kicking and biting happening too.  The digital equivalent of that does happen.  :)
    • The bottleneck in a computation might not be how fast the instructions execute.  It might be memory latency or bandwidth, I/O latency or bandwidth, or use of some special function hardware (encoders/decoders, neural units, etc).
    • GPUs are very, very parallel, but each parallel thread of work they can do is less general and less performant than a thread on a full-fledged CPU.  So they aren't great for all tasks, and the software running on them has to pretty much be written specifically for them.
    • CPUs vary greatly, and the M-series chips have 2 kinds -- efficiency vs performance.  The former are slower, and the OS needs to figure out where to run what.  It doesn't always get that right, at least not right away.
    • CPUs these days get a lot of their performance by executing multiple instructions at the same time from one sequence of instructions.  A lot of those instructions have to execute in the right order, and that limits how many can be done at once.  At an extremely detailed level this depends on the software being run.  Some software is carefully crafted to run as many non-intertwined instructions in parallel as possible, and on a CPU with very "wide" dispatch and SIMD instructions it can go very fast.  Most software is nowhere near that carefully crafted (sometimes it's just not possible, sometimes it's just not worth the effort, and sometimes there hasn't been the time or expertise available), so the in-core parallelism is only lightly utilized even though the CPUs are actively trying to re-order the instructions to go as fast as possible.
    • The slowest thing in most modern machines is the memory (well, the I/O is slower, but inside the computer...).  To deal with that, a hierarchy of memory caches is built into the chip.  These are (relatively) small high-speed memories that hold copies of data that has already been read from or written to the main memory.  Since it is very common to re-access a piece of data that has been accessed recently, keeping it in a high-speed cache close to the processor can help a lot with performance.  But it's not magic, and there are always tradeoffs.  Caches work on fixed-size chunks of data (cache lines), and they are divided into levels (usually called L1, L2, L3) of varying size and speed and different amounts of sharing between cores.  This means they're not working with just what the program needs next; they're doing extra work here and there.  The sharing between cores means they're competing for this resource.  Plus a mix of software runs on the same cores (e.g. your workload, the UI, the file system, the networking, the browser you leave running in the background, etc), and it all wants different data in the same caches.  Optimizing for cache use is extremely challenging.
    • ... and so on.  And on.  And on.  It really is very complicated.  
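    Here is the Amdahl's Law arithmetic from the first bullet as a small Swift sketch (the 90% parallel fraction is a made-up number, not a measurement of Final Cut Pro):

    ```swift
    // Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
    // fraction of the work and n is the number of cores.
    func amdahlSpeedup(parallelFraction p: Double, cores n: Double) -> Double {
        1.0 / ((1.0 - p) + p / n)
    }

    let cores14 = amdahlSpeedup(parallelFraction: 0.9, cores: 14)   // ~6.1x
    let cores28 = amdahlSpeedup(parallelFraction: 0.9, cores: 28)   // ~7.6x
    print(cores14, cores28)
    // Twice the horses, nowhere near twice the speed -- and even with infinite
    // cores the limit for this workload would be only 10x.
    ```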