JustSomeGuy1

About
Status: Banned
Username: JustSomeGuy1
Visits: 60
Roles: member
Points: 1,172
Badges: 1
Posts: 330
  • Johny Srouji says Apple's hardware ambitions are limited only by physics

    docno42 said:
    GG1 said:
    On some other AI thread, a poster mentioned how expensive these M1 Pro/Max SoCs were to make, especially with the variable RAM amounts. So I'm wondering if the forthcoming MacPro may just use two or more M1 Max's on a single motherboard, similar to very high end dual Xeon boards.
    Better hope they never go dual socket. The compromises you have to make to have two sockets are numerous, and they're what held back the cheese grater and trash can Mac Pros. Heck, there are performance penalties even for "chiplet" packaging like AMD is using with Ryzen and Threadripper. I get why they're doing it - it's WAY more cost effective for a bunch of reasons. But if Apple can stomach producing a massive SoC - if you care about performance, that's the real ticket, and what I hope they stay focused on.
    mjtomlin said:
    I doubt they'll double up on SoCs. More than likely they'll go with much bigger SoC variants, "M# Ultra" or "M# Extreme". They would be so far ahead in performance that they'd only need to update them half as often as the rest of the M1 family (basically the same as they did with the "X" variants in the A-series).
    Doubling up SoCs simply won't work (at least, not well enough to be worth doing, assuming you're talking about the existing SoCs). There are all sorts of "gated by physics" issues there. But a single large SoC also seems impossible (at least if they stick with their current memory architecture). Whatever they do, it's going to be a lot more interesting. I wrote a bit more about this in the comments here: https://forums.appleinsider.com/discussion/224623/compared-new-14-inch-macbook-pro-versus-13-inch-m1-macbook-pro-versus-intel-13-inch-macbo/p2 - see comments 31, 32, 35, 36.

    This is a monumental engineering challenge! If it weren't, we'd already be seeing the new Mac Pros. I fully expect to see a new Pro next year, and that it will be amazing. Don't imagine for a second that these things are easy.
  • Johny Srouji says Apple's hardware ambitions are limited only by physics

    602warren said:
    KTR said:
    Will Google be selling the chips, or are they going to try to blend the Google OS and chip?
    My gut tells me they will behave like Samsung with phone screens and make incredible proprietary products for themselves, and sell off the ‘lesser’ versions to others for use in their devices. But it’s an interesting question and can’t wait to see what they do in the future. As long as someone is keeping up with, or trying to keep up with, A or M series chips, that will push innovation across the board in consumer devices.
    I dislike Samsung but that's not accurate. The OLED screens they sold Apple for use in iPhones have been better than the screens they've used on their own flagship phones. It's just a matter of who was willing to pay more and work more.
  • Compared: 14-inch MacBook Pro vs. 13-inch M1 MacBook Pro vs. Intel 13-inch MacBook Pro

    crowley said:
    Marvin said:
    crowley said:
    Does the integration of the memory and GPU on the M-series SoCs not create issues for multiple CPU architectures?  Seems like it might (I claim no expertise here, just guessing).
    [...]
    They can scale the processing units separately from the memory. This would allow them to sell higher-end units with lower amounts of RAM.

    I expect the 27" iMacs to offer M1 Pro, Max and Max Duo chips, starting at 16GB RAM and going up to 128GB RAM. The Mac Pro, if there is one, would be Max Duo and Quad, likely starting at 32GB and going up to 256GB - this config could easily go in an iMac too.
    No, Crowley's right. There's a BIG issue here, and how they solve it is going to fascinate (and possibly terrify!) a lot of people.

    The problem is that it's not easy to build massively parallel CPUs. One reason is the need for more memory bandwidth. Look at GPUs to see how this can be dealt with. Memory bandwidth is a huge factor in performance, which is why high-end ones all use GDDR5 (or 6), and they all have wide buses (like, 384 bits wide!)... except for the ones using HBM RAM, which is just that trend taken to an even further extreme (wider buses, though somewhat slower Gbps/pin). This is all very expensive, and so CPUs (with sometimes an order or two of magnitude more RAM) have been slower getting there - the largest have only 8 channels of DDR4 RAM.
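    To put rough numbers on that (a back-of-envelope sketch - the per-pin rates here are typical published figures for GDDR6 and DDR4-3200, not any specific product's spec):

```python
# Peak memory bandwidth = bus width (bits) * per-pin data rate (Gbps) / 8 bits per byte.
def bandwidth_gbs(bus_width_bits: float, gbps_per_pin: float) -> float:
    return bus_width_bits * gbps_per_pin / 8

# A 384-bit GDDR6 bus at a typical 14 Gbps/pin:
gddr6 = bandwidth_gbs(384, 14)        # 672.0 GB/s

# Eight 64-bit channels of DDR4-3200 (3.2 Gbps/pin):
ddr4 = bandwidth_gbs(8 * 64, 3.2)     # ~204.8 GB/s
```

    That roughly 3x gap is why GPUs pay for exotic memory while even the biggest CPUs lag behind.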

    The other big challenge is just getting all the cores to be able to talk to each other and to the RAM behind memory controllers that are part of a remote core's cluster/complex. Dealing with cache coherency (or deciding not to), how many layers of cache, etc., is all part of the deepest wizardry. And whether you build a giant mesh or several rings, or do something else, power is a huge issue. It's been estimated that the uncore in the biggest EPYCs can consume ~50% of the entire power budget of the chip.

    Building a Mac Pro with 4x M1 Maxes would be *very* tricky. In fact, I'll say right out that you can't just do that. You could start with the M1 Max and change it in various ways to get to where you need to go, but each choice you make comes with constraints and trade-offs.

    For example, say you decided to get as close as you can to just putting four M1 Maxes in a box. You already face an enormous challenge, which is to minimize latency between far cores (that is, a core on one Max chip trying to talk to another, or, more to the point, talking to the RAM managed by that other chip). There's also the issue of what you do about cache coherency. Do you build a giant LLC that sits in the middle, which all the SLCs talk to? Does all memory access go through that? (Hint: almost certainly not.) This would look somewhat like a hypertrophied Threadripper or EPYC. The chances that they could make this work are pretty poor, unless they bring in some VERY new technology - which is entirely possible. In particular, if they go all-in on TSMC "3DFabric" tech (InFO, CoWoS, etc.) and possibly work hard on some integrated cooling tech, then maybe this could be possible, mostly due to the ridiculously low power consumption they're hitting.

    But this leaves out a very important question. What do you do about that RAM, actually? How do you even get room for enough traces to simply talk to all that RAM? And if you want anything even remotely like the memory capacity the current Intel chips have, how are you going to accomplish that? It may be that only HBM can even get you close to that, and that is *extremely* expensive. Like, prohibitively so except for high-end Pro buyers.

    Fundamentally, the biggest issue is that the integrated close RAM that's such a big part of their performance magic is just not scalable. There's physically not enough room for it or its traces, since you need a lot of that room for inter-CPU links (which would push the RAM to... where?). You can't just fill up a larger diameter as speed-of-light issues will start to affect latency.
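    For a feel of the speed-of-light point, here's a toy estimate (the ~0.5c signal velocity is an assumed ballpark for on-package propagation, not a measured figure):

```python
# Signals in copper traces travel at very roughly half the speed of light in vacuum.
C_MM_PER_NS = 299.79                   # speed of light, mm per nanosecond
SIGNAL_MM_PER_NS = 0.5 * C_MM_PER_NS

def round_trip_ns(distance_mm: float) -> float:
    """Best-case wire round-trip time, ignoring all switching and protocol overhead."""
    return 2 * distance_mm / SIGNAL_MM_PER_NS

# Crossing 50 mm of a large package and back:
t = round_trip_ns(50)   # ~0.67 ns, already comparable to an L1 cache hit
```

    And that's the physics floor, before any controller, coherency protocol, or serialization costs are added on top.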

    But they have a plan. They *are* going to solve this. And whether that involves 3D stacking of some sort, order-of-magnitude larger caches, HBM... we don't know yet. It might involve radical cooling solutions. And there's always the possibility of them doing something really new, which is of course the most exciting possibility of all.

    Really, we don't even know if they're going to maintain the unified memory. It seems at least plausible that they won't, given the calls from some quarters for multiple GPUs. I wouldn't bet on it, though.

    So... in a few months to a year, we'll learn what the answer is. Don't think for a second they won't have a good answer. But it's going to be a surprise, whatever it is, and you're going to see idiots all over the net going "that'll never work, it's lies!" until the benchmarks come out. And whatever it is, it will be *damn* impressive. And 99% of the people using them will never understand that. Oh well. :-)
    Thanks, I figured that something like that would be a problem; it makes sense that multiple systems-on-a-chip in the same computer would present challenges. The integration of the SoC means that parallelisation would be more akin to Xgrid than Grand Central Dispatch, so they'd need some very beefy controller chips to manage everything without latency.
    It's not even that simple. "Beefy controllers" won't do the trick here, as you can't just add transistors and expect to cut latency.

    You could indeed build a system with multiple M1 Maxes. But performance in any task that spanned multiple M1s would be terrible, because interchip latency would be awful - they have no bus suitable for the task. It would be, as you said, much like four separate systems linked with Ethernet or InfiniBand. It would fail miserably compared to large chips like EPYCs or Xeons. No, Apple will be doing something else.

    At the very least, the chips (or chiplets) they use will have a high-bandwidth low-latency link just to talk to other chip(let)s. But that's just a start. There's some very serious engineering going on! Expect to be surprised.
  • Compared: 14-inch MacBook Pro vs. 13-inch M1 MacBook Pro vs. Intel 13-inch MacBook Pro

    Marvin said:
    What do you do about that RAM, actually? How do you even get room for enough traces to simply talk to all that RAM? And if you want anything even remotely like the memory capacity the current Intel chips have, how are you going to accomplish that? It may be that only HBM can even get you close to that, and that is *extremely* expensive. Like, prohibitively so except for high-end Pro buyers.

    Fundamentally, the biggest issue is that the integrated close RAM that's such a big part of their performance magic is just not scalable. There's physically not enough room for it or its traces, since you need a lot of that room for inter-CPU links (which would push the RAM to... where?). You can't just fill up a larger diameter as speed-of-light issues will start to affect latency.

    But they have a plan. They *are* going to solve this. And whether that involves 3D stacking of some sort, order-of-magnitude larger caches, HBM... we don't know yet. It might involve radical cooling solutions. And there's always the possibility of them doing something really new, which is of course the most exciting possibility of all.

    Really we don't even know if they're going to maintain the unified memory. It seems at least plausible that they won't given the call from some quarters for multiple GPUs. I wouldn't bet on it though.
    HBM would be expensive for iMac models, but the price is OK at the Mac Pro level. This estimates 16GB of HBM2 at $320:

    https://www.fudzilla.com/news/graphics/48019-radeon-vii-16gb-hbm-2-memory-cost-around-320

    $1280 for 64GB, $2560 for 128GB. That's not a lot of money for that much memory. An upcoming Radeon GPU is reported to offer up to 128GB HBM2E:

    https://www.tomshardware.com/news/amd-aldebaran-memory-subsystem-detailed

    Intel will use HBM too:

    https://www.tweaktown.com/news/80272/intel-confirms-sapphire-rapids-cpus-will-use-hbm-drops-in-late-2022/index.html

    It says here there will be a successor to HBM in late 2022:

    https://www.pcgamer.com/an-ultra-bandwidth-successor-to-hbm2e-memory-is-coming-but-not-until-2022/

    I don't think the links between chips are as important. Supercomputers are made up of separate machines. The separate GPUs in the current Mac Pro are connected by Infinity Fabric at 84GB/s. M1 Max has 400GB/s memory bandwidth.
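    (For scale, a quick comparison of the two figures just quoted - illustrative arithmetic only:)

```python
# Bandwidth figures quoted above.
infinity_fabric_gbs = 84     # GPU-to-GPU Infinity Fabric link in the 2019 Mac Pro
m1_max_memory_gbs = 400      # M1 Max unified memory bandwidth

# A cross-chip transfer over such a link would run at roughly
# one fifth the speed of a local memory access:
ratio = m1_max_memory_gbs / infinity_fabric_gbs   # ~4.8x
```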

    A lot of tasks that work in parallel can just be moved to the other chips - for example, processing separate frames in video software, or separate render buckets in 3D software.

    But as you say, we can assume they've planned it out. They employ experts in their field so they'll have a solution to scale things up. If it's good enough for Intel to do this in server chips and AMD to do it in their GPUs, it should be fine for Apple to do it too:

    https://www.nextplatform.com/2021/08/19/intel-finally-gets-chiplet-religion-with-server-chips/
    [graphics removed]
    You're missing several key points.

    SK Hynix announced their HBM3 for shipment next year just a couple days ago. Max stack height is 12, total stack capacity is 24GB. That requires *1024* pins for data. *Per stack*. You just can't get enough capacity using HBM (as I pointed out right after the post you quoted). So HBM might be a component of their solution, but it can't be the entire answer, unless they're willing to completely give up on large-memory configurations. That seems unlikely, but... no way to know for now.
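    A quick sketch of why the capacity math doesn't work out, using the HBM3 figures above and the 1.5 TB RAM ceiling of the 2019 Intel Mac Pro as the target:

```python
import math

# SK Hynix HBM3 figures quoted above.
GB_PER_STACK = 24             # 12-high stack
DATA_PINS_PER_STACK = 1024    # data pins per stack

def stacks_needed(total_gb: int) -> int:
    return math.ceil(total_gb / GB_PER_STACK)

# Matching the 1.5 TB (1536 GB) ceiling of the Intel Mac Pro:
stacks = stacks_needed(1536)               # 64 stacks
data_pins = stacks * DATA_PINS_PER_STACK   # 65,536 data pins
```

    Sixty-four stacks and tens of thousands of data pins on one package is simply not a thing.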

    Your larger error is in thinking interchip links aren't so important. They are fundamental to whatever solution Apple comes up with. The key to Apple's "secret sauce" right now is their unified memory. But if you think that the magic comes just from the CPU and GPU having equal access to the same memory pool, you're missing half the story. The other half is that memory bandwidth is huge. Oh, and there's a third "half" (heh), which is the astounding cache architecture (with very low latency, a big part of the overall picture). That sets them apart from everyone else.

    Interchip links are key not just because of bandwidth issues, but also latency. Your examples mix a bunch of different technologies that are appropriate for different applications, but not for linking cores in a large system. They also have dramatically different requirements than pure memory buses or supercomputer links (Ethernet or InfiniBand, most often). I glossed over a bunch of that briefly when I mentioned cache coherency issues.

    If you want to get a bit of a sense of why that all matters, Andrei over at Anandtech built a very cool benchmark that shows a grid with cross-core latency figures for large CPUs, which he uses when reviewing chips like EPYCs and Xeons. (I think he's used it in the Apple Ax reviews too.) You can see the impact the various technologies have.

    As for chiplets - yes, that's the obvious path forward generally, for everyone. But it's not at all obvious how you combine that idea with unified RAM. This is what I was talking about above - if you build a chiplet setup, you're committed to either some sort of NUMA setup, or to a central memory controller. (That is, the architecture of the first EPYCs, versus the architecture of the most recent generation.) Both have benefits and drawbacks, and both are problematic if you're trying to do what Apple has done with their memory architecture. You run into a variety of issues with physical layout and density. Among other things, this produces huge heat engineering challenges, and grave difficulties physically routing enough traces between CPUs and RAM.

    This is not something you can handwave away by pointing at other implementations. Apple is going to have to do something different. And when they do, it's going to be very exciting. Don't preemptively diminish it by likening it to existing chips. It won't be, unless they give up on close unified memory.