More power with less: Apple's A13 Bionic is faster and more power efficient

knowitall · September 13, 2019 9:49AM

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

knowitall said:

Really impressive especially the power savings and switching of active areas, Kudos to the design team.
The new Ad13 iMac will be a blast.
But Teslas FSD chip is currently 12 times faster (73 TOPS vs A13 6 TOPS?) which makes clear Tesla has a design team (only a few people I understood) that can easily match Apples.
So I expect much room for improvement for the A14 and its desktop version the Ad14 next year.
Exciting times, it must be difficult working at Intel now.

Edit: note the TOPS (not TFLOPS)

Tesla's chip is much faster than Apple's at running, say, resnet 50. But it's MUCH slower than Apple's at running any normal app.

You're failing to distinguish between different types of TOPS. Tesla's achievement is notable but not nearly in the same category as Apple's. In principle, it's not all that hard to add more TOPS, if you're talking about tensor/matrix ops, vectors, NNI, GPU, etc. Those ops (and OPS) are all easily parallelized. Further, you fail to recognize the hard limits placed by the power and cooling budgets for each chip. Tesla is entirely focused on video image recognition, so they need massive NN processing. They have the power budget of a car - not unlimited, by a long shot, but still... the battery in a Tesla is a little bigger than the battery in an iPhone! Whereas Apple is building a much more general-purpose chip, and specifically one with extraordinary traditional integer (and FP) OPS. That's a MUCH MUCH harder problem to solve, as it's extremely difficult to extract parallelism from conventional software (that is, pretty much every app that isn't doing AR or a few other very specific things).

So far, Apple has in the last couple of years kicked *everyone's* ass at that, inside their domain (low-power chips). Nobody even comes close. And if you look carefully at what they've done, you can build a fairly convincing case that they've already built every part necessary to beat Intel at their own game (fast high-power multicore chips), they just don't want to sell those yet.

The biggest open question is this: Can Apple build a ring/mesh/whatever connecting 8-16 high-performance cores in a reasonable power budget? As I've written previously about the A12X, they've *already done this*. So they can, right now today, build something competitive with Intel's best mainstream desktop CPU (the 8-core i9-9900). Whether or not they can actually bet it will depend on whether or not they can clock up. And we know more about that than we did a few months ago, as we can see AMD pushing the same process to around 4.3 GHz tops, about 4.1 comfortably. We still don't know if the A13's pipeline is long enough to sustain this sort of speed, or how easy it would be for Apple to change it enough, but the performance crown there seems easily within their grasp.

Going to more cores is the biggest question mark. The ring or mesh good enough to handle 8 cores really well may not be enough to handle 12 or 16 cores. But the only machines where that would matter is the iMac Pro and the desktop Mac Pro. And I don't think anyone expects those to transition to ARM as early as the laptops.

Apple has already gone to 4 cores in the A12x, 5 cores when counting the efficiency cores together. I don’t see why they can’t remove some unneeded sections from the chips that duplicate functions that don’t need to be duplicated, and run two of these, I suppose now, A13x chips together. Apple has the ability to do it however they think best, as they control the IP.

That's not how it works. You don't see why, because you don't design chips for a living. The story is both better and worse than you think.

About core count: Since the efficiency and fast cores can (and do!) all run simultaneously, Apple has with the A12X demonstrated that whatever they're using to connect all those cores (almost certainly a ring bus, but just possibly some sort of more complex mesh) is capable of handling not just 4 or 5, but 8 cores. "Efficiency" or not, the bus has to handle the same kind of work- all cores have to have cache coherency, equal access to main RAM, etc. Doing this at low enough power, with that many cores, is the big trick that will be key to winning on the desktop - and they've managed to do it well enough to work in an iPad. That's very impressive. We don't know if that architecture will extend to more cores than that, and that's an open question. It may be completely inappropriate for more than 8 cores. But still, 8 cores will get you a VERY long way today.

Now, you talked about trimming "unneeded sections" from the chips. That's not likely to help very much. Most of those "unneeded sections" probably do not participate on the bus/mesh on an equal basis with the scalar cores, because why would they bother? However, nobody knows for sure, because Apple doesn't tell, and Andrei over at AT (the only person I know of who's gained deep insight into the chips and published about them) only has so much time to go poking at the innards with clever software and more clever analysis. So it's vaguely possible that they already have a big-time mesh architecture or multi-ring-bus (like newer and older XCC Xeons, respectively) already, which would be amazing. But it's very unlikely, as the power draw would be incredibly difficult to deal with.

Lastly, it's not at all simple to "run two chips together". If you're thinking about 2S systems like typical Xeons... then you need significant logic to get them to play nicely together and with RAM and the rest of the system. And you'd need to do really major surgery on the A12/13/whatever. On the other hand, if you're thinking about a chiplet setup like AMD's... then it's the same deal with slightly different details. In both cases, you don't know that the secret sauce Apple's using will carry over well. For example, one of the biggest factors in the massive speedups seen in the A12 is apparently the cache architecture and the large L3. If you took that out and stuck it in a central chiplet (like AMD ZEN) you'd probably take a massive perf hit. This is all moderately wild speculation, but the point is, it's not a slam dunk. I personally believe that if the decide to do it, they will embarrass the crap out of everybody else. But I'm skeptical that they'll bother any time soon.

tl;dr: As I've said before, they *already* have shown the ability to go neck-and-neck with Intel's top mainstream chips. If they wanted to fight over the HEDT and Xeons, they could probably do a great job, but it seems unlikely that they'll bother in the near future. They've already got everything they need for every laptop segment, excepting only people who need x64 Windows (or Linux) virtualization at native speed.

Very good info! Thanks.
Busses connecting cpus have to be fast and will be in constant use and so cost a lot of energy.
Mesh networks can be a lot more energy efficient, but create other problems (elaborate chip layout).
Letting nodes run independently, by for example copying (or maybe ‘mapping’ virtual) read memory to local processor memory solves problems like cash coherency and main RAM access.
This type of parallelism is something that can be achieved by using GCD.

melgross · September 14, 2019 12:51AM

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

knowitall said:

Really impressive especially the power savings and switching of active areas, Kudos to the design team.
The new Ad13 iMac will be a blast.
But Teslas FSD chip is currently 12 times faster (73 TOPS vs A13 6 TOPS?) which makes clear Tesla has a design team (only a few people I understood) that can easily match Apples.
So I expect much room for improvement for the A14 and its desktop version the Ad14 next year.
Exciting times, it must be difficult working at Intel now.

Edit: note the TOPS (not TFLOPS)

Tesla's chip is much faster than Apple's at running, say, resnet 50. But it's MUCH slower than Apple's at running any normal app.

You're failing to distinguish between different types of TOPS. Tesla's achievement is notable but not nearly in the same category as Apple's. In principle, it's not all that hard to add more TOPS, if you're talking about tensor/matrix ops, vectors, NNI, GPU, etc. Those ops (and OPS) are all easily parallelized. Further, you fail to recognize the hard limits placed by the power and cooling budgets for each chip. Tesla is entirely focused on video image recognition, so they need massive NN processing. They have the power budget of a car - not unlimited, by a long shot, but still... the battery in a Tesla is a little bigger than the battery in an iPhone! Whereas Apple is building a much more general-purpose chip, and specifically one with extraordinary traditional integer (and FP) OPS. That's a MUCH MUCH harder problem to solve, as it's extremely difficult to extract parallelism from conventional software (that is, pretty much every app that isn't doing AR or a few other very specific things).

So far, Apple has in the last couple of years kicked *everyone's* ass at that, inside their domain (low-power chips). Nobody even comes close. And if you look carefully at what they've done, you can build a fairly convincing case that they've already built every part necessary to beat Intel at their own game (fast high-power multicore chips), they just don't want to sell those yet.

The biggest open question is this: Can Apple build a ring/mesh/whatever connecting 8-16 high-performance cores in a reasonable power budget? As I've written previously about the A12X, they've *already done this*. So they can, right now today, build something competitive with Intel's best mainstream desktop CPU (the 8-core i9-9900). Whether or not they can actually bet it will depend on whether or not they can clock up. And we know more about that than we did a few months ago, as we can see AMD pushing the same process to around 4.3 GHz tops, about 4.1 comfortably. We still don't know if the A13's pipeline is long enough to sustain this sort of speed, or how easy it would be for Apple to change it enough, but the performance crown there seems easily within their grasp.

Going to more cores is the biggest question mark. The ring or mesh good enough to handle 8 cores really well may not be enough to handle 12 or 16 cores. But the only machines where that would matter is the iMac Pro and the desktop Mac Pro. And I don't think anyone expects those to transition to ARM as early as the laptops.

Apple has already gone to 4 cores in the A12x, 5 cores when counting the efficiency cores together. I don’t see why they can’t remove some unneeded sections from the chips that duplicate functions that don’t need to be duplicated, and run two of these, I suppose now, A13x chips together. Apple has the ability to do it however they think best, as they control the IP.

That's not how it works. You don't see why, because you don't design chips for a living. The story is both better and worse than you think.

About core count: Since the efficiency and fast cores can (and do!) all run simultaneously, Apple has with the A12X demonstrated that whatever they're using to connect all those cores (almost certainly a ring bus, but just possibly some sort of more complex mesh) is capable of handling not just 4 or 5, but 8 cores. "Efficiency" or not, the bus has to handle the same kind of work- all cores have to have cache coherency, equal access to main RAM, etc. Doing this at low enough power, with that many cores, is the big trick that will be key to winning on the desktop - and they've managed to do it well enough to work in an iPad. That's very impressive. We don't know if that architecture will extend to more cores than that, and that's an open question. It may be completely inappropriate for more than 8 cores. But still, 8 cores will get you a VERY long way today.

Now, you talked about trimming "unneeded sections" from the chips. That's not likely to help very much. Most of those "unneeded sections" probably do not participate on the bus/mesh on an equal basis with the scalar cores, because why would they bother? However, nobody knows for sure, because Apple doesn't tell, and Andrei over at AT (the only person I know of who's gained deep insight into the chips and published about them) only has so much time to go poking at the innards with clever software and more clever analysis. So it's vaguely possible that they already have a big-time mesh architecture or multi-ring-bus (like newer and older XCC Xeons, respectively) already, which would be amazing. But it's very unlikely, as the power draw would be incredibly difficult to deal with.

Lastly, it's not at all simple to "run two chips together". If you're thinking about 2S systems like typical Xeons... then you need significant logic to get them to play nicely together and with RAM and the rest of the system. And you'd need to do really major surgery on the A12/13/whatever. On the other hand, if you're thinking about a chiplet setup like AMD's... then it's the same deal with slightly different details. In both cases, you don't know that the secret sauce Apple's using will carry over well. For example, one of the biggest factors in the massive speedups seen in the A12 is apparently the cache architecture and the large L3. If you took that out and stuck it in a central chiplet (like AMD ZEN) you'd probably take a massive perf hit. This is all moderately wild speculation, but the point is, it's not a slam dunk. I personally believe that if the decide to do it, they will embarrass the crap out of everybody else. But I'm skeptical that they'll bother any time soon.

tl;dr: As I've said before, they *already* have shown the ability to go neck-and-neck with Intel's top mainstream chips. If they wanted to fight over the HEDT and Xeons, they could probably do a great job, but it seems unlikely that they'll bother in the near future. They've already got everything they need for every laptop segment, excepting only people who need x64 Windows (or Linux) virtualization at native speed.

So, I assume that you do design chips for a living? Gee, you’re right, I don’t understand ANYTHING about any of this. And I guess you’re also right in essentially saying that Apple’s chip designers don’t know anything about this either.

you’re so smart, it’s killing me.

edited September 2019

justsomeguy1 · September 14, 2019 4:20AM

melgross said:

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

knowitall said:

Really impressive especially the power savings and switching of active areas, Kudos to the design team.
The new Ad13 iMac will be a blast.
But Teslas FSD chip is currently 12 times faster (73 TOPS vs A13 6 TOPS?) which makes clear Tesla has a design team (only a few people I understood) that can easily match Apples.
So I expect much room for improvement for the A14 and its desktop version the Ad14 next year.
Exciting times, it must be difficult working at Intel now.

Edit: note the TOPS (not TFLOPS)

Tesla's chip is much faster than Apple's at running, say, resnet 50. But it's MUCH slower than Apple's at running any normal app.

You're failing to distinguish between different types of TOPS. Tesla's achievement is notable but not nearly in the same category as Apple's. In principle, it's not all that hard to add more TOPS, if you're talking about tensor/matrix ops, vectors, NNI, GPU, etc. Those ops (and OPS) are all easily parallelized. Further, you fail to recognize the hard limits placed by the power and cooling budgets for each chip. Tesla is entirely focused on video image recognition, so they need massive NN processing. They have the power budget of a car - not unlimited, by a long shot, but still... the battery in a Tesla is a little bigger than the battery in an iPhone! Whereas Apple is building a much more general-purpose chip, and specifically one with extraordinary traditional integer (and FP) OPS. That's a MUCH MUCH harder problem to solve, as it's extremely difficult to extract parallelism from conventional software (that is, pretty much every app that isn't doing AR or a few other very specific things).

So far, Apple has in the last couple of years kicked *everyone's* ass at that, inside their domain (low-power chips). Nobody even comes close. And if you look carefully at what they've done, you can build a fairly convincing case that they've already built every part necessary to beat Intel at their own game (fast high-power multicore chips), they just don't want to sell those yet.

The biggest open question is this: Can Apple build a ring/mesh/whatever connecting 8-16 high-performance cores in a reasonable power budget? As I've written previously about the A12X, they've *already done this*. So they can, right now today, build something competitive with Intel's best mainstream desktop CPU (the 8-core i9-9900). Whether or not they can actually bet it will depend on whether or not they can clock up. And we know more about that than we did a few months ago, as we can see AMD pushing the same process to around 4.3 GHz tops, about 4.1 comfortably. We still don't know if the A13's pipeline is long enough to sustain this sort of speed, or how easy it would be for Apple to change it enough, but the performance crown there seems easily within their grasp.

Going to more cores is the biggest question mark. The ring or mesh good enough to handle 8 cores really well may not be enough to handle 12 or 16 cores. But the only machines where that would matter is the iMac Pro and the desktop Mac Pro. And I don't think anyone expects those to transition to ARM as early as the laptops.

Apple has already gone to 4 cores in the A12x, 5 cores when counting the efficiency cores together. I don’t see why they can’t remove some unneeded sections from the chips that duplicate functions that don’t need to be duplicated, and run two of these, I suppose now, A13x chips together. Apple has the ability to do it however they think best, as they control the IP.

That's not how it works. You don't see why, because you don't design chips for a living. The story is both better and worse than you think.

About core count: Since the efficiency and fast cores can (and do!) all run simultaneously, Apple has with the A12X demonstrated that whatever they're using to connect all those cores (almost certainly a ring bus, but just possibly some sort of more complex mesh) is capable of handling not just 4 or 5, but 8 cores. "Efficiency" or not, the bus has to handle the same kind of work- all cores have to have cache coherency, equal access to main RAM, etc. Doing this at low enough power, with that many cores, is the big trick that will be key to winning on the desktop - and they've managed to do it well enough to work in an iPad. That's very impressive. We don't know if that architecture will extend to more cores than that, and that's an open question. It may be completely inappropriate for more than 8 cores. But still, 8 cores will get you a VERY long way today.

Now, you talked about trimming "unneeded sections" from the chips. That's not likely to help very much. Most of those "unneeded sections" probably do not participate on the bus/mesh on an equal basis with the scalar cores, because why would they bother? However, nobody knows for sure, because Apple doesn't tell, and Andrei over at AT (the only person I know of who's gained deep insight into the chips and published about them) only has so much time to go poking at the innards with clever software and more clever analysis. So it's vaguely possible that they already have a big-time mesh architecture or multi-ring-bus (like newer and older XCC Xeons, respectively) already, which would be amazing. But it's very unlikely, as the power draw would be incredibly difficult to deal with.

Lastly, it's not at all simple to "run two chips together". If you're thinking about 2S systems like typical Xeons... then you need significant logic to get them to play nicely together and with RAM and the rest of the system. And you'd need to do really major surgery on the A12/13/whatever. On the other hand, if you're thinking about a chiplet setup like AMD's... then it's the same deal with slightly different details. In both cases, you don't know that the secret sauce Apple's using will carry over well. For example, one of the biggest factors in the massive speedups seen in the A12 is apparently the cache architecture and the large L3. If you took that out and stuck it in a central chiplet (like AMD ZEN) you'd probably take a massive perf hit. This is all moderately wild speculation, but the point is, it's not a slam dunk. I personally believe that if the decide to do it, they will embarrass the crap out of everybody else. But I'm skeptical that they'll bother any time soon.

tl;dr: As I've said before, they *already* have shown the ability to go neck-and-neck with Intel's top mainstream chips. If they wanted to fight over the HEDT and Xeons, they could probably do a great job, but it seems unlikely that they'll bother in the near future. They've already got everything they need for every laptop segment, excepting only people who need x64 Windows (or Linux) virtualization at native speed.

So, I assume that you do design chips for a living? Gee, you’re right, I don’t understand ANYTHING about any of this. And I guess you’re also right in essentially saying that Apple’s chip designers don’t know anything about this either.

Um... Did you even read what I wrote? In general the posts of yours that I've read have shown much better comprehension of a wide range of subjects than a lot of the ranting ...people who show up here. But there are some subtle details that you appeared to be missing, based on what you wrote earlier, and I tried to fill them in. Primarily, you seemed to be conflating adding more cores within a single chip, versus supporting multiple chips in a single system. Those are two *very* different things (though AMD is blurring the lines a bit). You also compared multiple efficiency cores to a single performance core, which is a mistake, since it's about as easy to plug a perf core into the chip's bus/mesh as it is an efficiency core. Relatedly, it's likely that other components (ISP, NNP, enclave, etc.) do not participate on an equal level since their needs for memory access, cache coherency, etc. are so different, so swapping them for regular cores is not a simple cut and paste.

As for Apple's silicon team: I don't think I've ever said anything negative about them, for the simple reason that they keep hitting it out of the park year after year. And in fact the key point I was making above is that they have *already completed* all the serious work they'd need to do to build chips that can beat everything the competition has to offer, from pads through laptops all the way through mainstream desktops. They know *everything* about this.

BTW, there are other things neither of us touched on that Apple has yet to do (or at least, show its hand on) that are necessary for desktops and even laptops. For example, I/O on the Axx chips is very limited, as is appropriate for such a SoC. But you're going to want a pile of PCIe lanes coming off the CPU for a laptop, and even more for a desktop. If you're Apple, designing for the future, I imagine you'd want at least 32 PCIe lanes: 16 for graphics, 4 for SSD, 8 for Thunderbolt, and 4 for other miscellany. More would be better, and necessary if you have near-term desktop ambitions, while you could cut it in half using PCIe4 instead of PCIe3 - though I don't see Apple doing that, as their homegrown hardware always goes big, and I can't imagine them investing more in PCIe3 this late in the game.

You know what I really want to see? How fast the Apple GPU design can get if you give it a giant pile of cores and 100 or 150 watts to play with. I don't think anyone (outside the Apple design team) has a clue how well it will scale. I'd be excited to find out, though.

justsomeguy1 · September 14, 2019 7:15PM

Oops... When I wrote above "you could cut it in half using PCIe4 instead of PCIe3 - though I don't see Apple doing that", I meant that I don't see Apple cutting the number of lanes in half, not that I don't see then moving to PCIe4. In fact, I would bet on PCIe4 for any laptop/desktop chip, though it's not a sure thing (power use for PCIe4 is higher, so you can argue that it's the wrong choice for a laptop). (Edit: In fact, it might be that PCIe4 with half the lanes is the best compromise between bandwidth and power, for a laptop. That's the sort of thing only Apple's team can determine.)

Sorry about the ambiguity.

edited September 2019

melgross · September 15, 2019 7:01PM

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

knowitall said:

Really impressive especially the power savings and switching of active areas, Kudos to the design team.
The new Ad13 iMac will be a blast.
But Teslas FSD chip is currently 12 times faster (73 TOPS vs A13 6 TOPS?) which makes clear Tesla has a design team (only a few people I understood) that can easily match Apples.
So I expect much room for improvement for the A14 and its desktop version the Ad14 next year.
Exciting times, it must be difficult working at Intel now.

Edit: note the TOPS (not TFLOPS)

Tesla's chip is much faster than Apple's at running, say, resnet 50. But it's MUCH slower than Apple's at running any normal app.

You're failing to distinguish between different types of TOPS. Tesla's achievement is notable but not nearly in the same category as Apple's. In principle, it's not all that hard to add more TOPS, if you're talking about tensor/matrix ops, vectors, NNI, GPU, etc. Those ops (and OPS) are all easily parallelized. Further, you fail to recognize the hard limits placed by the power and cooling budgets for each chip. Tesla is entirely focused on video image recognition, so they need massive NN processing. They have the power budget of a car - not unlimited, by a long shot, but still... the battery in a Tesla is a little bigger than the battery in an iPhone! Whereas Apple is building a much more general-purpose chip, and specifically one with extraordinary traditional integer (and FP) OPS. That's a MUCH MUCH harder problem to solve, as it's extremely difficult to extract parallelism from conventional software (that is, pretty much every app that isn't doing AR or a few other very specific things).

So far, Apple has in the last couple of years kicked *everyone's* ass at that, inside their domain (low-power chips). Nobody even comes close. And if you look carefully at what they've done, you can build a fairly convincing case that they've already built every part necessary to beat Intel at their own game (fast high-power multicore chips), they just don't want to sell those yet.

The biggest open question is this: Can Apple build a ring/mesh/whatever connecting 8-16 high-performance cores in a reasonable power budget? As I've written previously about the A12X, they've *already done this*. So they can, right now today, build something competitive with Intel's best mainstream desktop CPU (the 8-core i9-9900). Whether or not they can actually bet it will depend on whether or not they can clock up. And we know more about that than we did a few months ago, as we can see AMD pushing the same process to around 4.3 GHz tops, about 4.1 comfortably. We still don't know if the A13's pipeline is long enough to sustain this sort of speed, or how easy it would be for Apple to change it enough, but the performance crown there seems easily within their grasp.

Going to more cores is the biggest question mark. The ring or mesh good enough to handle 8 cores really well may not be enough to handle 12 or 16 cores. But the only machines where that would matter is the iMac Pro and the desktop Mac Pro. And I don't think anyone expects those to transition to ARM as early as the laptops.

Apple has already gone to 4 cores in the A12x, 5 cores when counting the efficiency cores together. I don’t see why they can’t remove some unneeded sections from the chips that duplicate functions that don’t need to be duplicated, and run two of these, I suppose now, A13x chips together. Apple has the ability to do it however they think best, as they control the IP.

That's not how it works. You don't see why, because you don't design chips for a living. The story is both better and worse than you think.

About core count: Since the efficiency and fast cores can (and do!) all run simultaneously, Apple has with the A12X demonstrated that whatever they're using to connect all those cores (almost certainly a ring bus, but just possibly some sort of more complex mesh) is capable of handling not just 4 or 5, but 8 cores. "Efficiency" or not, the bus has to handle the same kind of work- all cores have to have cache coherency, equal access to main RAM, etc. Doing this at low enough power, with that many cores, is the big trick that will be key to winning on the desktop - and they've managed to do it well enough to work in an iPad. That's very impressive. We don't know if that architecture will extend to more cores than that, and that's an open question. It may be completely inappropriate for more than 8 cores. But still, 8 cores will get you a VERY long way today.

Now, you talked about trimming "unneeded sections" from the chips. That's not likely to help very much. Most of those "unneeded sections" probably do not participate on the bus/mesh on an equal basis with the scalar cores, because why would they bother? However, nobody knows for sure, because Apple doesn't tell, and Andrei over at AT (the only person I know of who's gained deep insight into the chips and published about them) only has so much time to go poking at the innards with clever software and more clever analysis. So it's vaguely possible that they already have a big-time mesh architecture or multi-ring-bus (like newer and older XCC Xeons, respectively) already, which would be amazing. But it's very unlikely, as the power draw would be incredibly difficult to deal with.

Lastly, it's not at all simple to "run two chips together". If you're thinking about 2S systems like typical Xeons... then you need significant logic to get them to play nicely together and with RAM and the rest of the system. And you'd need to do really major surgery on the A12/13/whatever. On the other hand, if you're thinking about a chiplet setup like AMD's... then it's the same deal with slightly different details. In both cases, you don't know that the secret sauce Apple's using will carry over well. For example, one of the biggest factors in the massive speedups seen in the A12 is apparently the cache architecture and the large L3. If you took that out and stuck it in a central chiplet (like AMD ZEN) you'd probably take a massive perf hit. This is all moderately wild speculation, but the point is, it's not a slam dunk. I personally believe that if the decide to do it, they will embarrass the crap out of everybody else. But I'm skeptical that they'll bother any time soon.

tl;dr: As I've said before, they *already* have shown the ability to go neck-and-neck with Intel's top mainstream chips. If they wanted to fight over the HEDT and Xeons, they could probably do a great job, but it seems unlikely that they'll bother in the near future. They've already got everything they need for every laptop segment, excepting only people who need x64 Windows (or Linux) virtualization at native speed.

So, I assume that you do design chips for a living? Gee, you’re right, I don’t understand ANYTHING about any of this. And I guess you’re also right in essentially saying that Apple’s chip designers don’t know anything about this either.

Um... Did you even read what I wrote? In general the posts of yours that I've read have shown much better comprehension of a wide range of subjects than a lot of the ranting ...people who show up here. But there are some subtle details that you appeared to be missing, based on what you wrote earlier, and I tried to fill them in. Primarily, you seemed to be conflating adding more cores within a single chip, versus supporting multiple chips in a single system. Those are two *very* different things (though AMD is blurring the lines a bit). You also compared multiple efficiency cores to a single performance core, which is a mistake, since it's about as easy to plug a perf core into the chip's bus/mesh as it is an efficiency core. Relatedly, it's likely that other components (ISP, NNP, enclave, etc.) do not participate on an equal level since their needs for memory access, cache coherency, etc. are so different, so swapping them for regular cores is not a simple cut and paste.

As for Apple's silicon team: I don't think I've ever said anything negative about them, for the simple reason that they keep hitting it out of the park year after year. And in fact the key point I was making above is that they have *already completed* all the serious work they'd need to do to build chips that can beat everything the competition has to offer, from pads through laptops all the way through mainstream desktops. They know *everything* about this.

BTW, there are other things neither of us touched on that Apple has yet to do (or at least, show its hand on) that are necessary for desktops and even laptops. For example, I/O on the Axx chips is very limited, as is appropriate for such a SoC. But you're going to want a pile of PCIe lanes coming off the CPU for a laptop, and even more for a desktop. If you're Apple, designing for the future, I imagine you'd want at least 32 PCIe lanes: 16 for graphics, 4 for SSD, 8 for Thunderbolt, and 4 for other miscellany. More would be better, and necessary if you have near-term desktop ambitions, while you could cut it in half using PCIe4 instead of PCIe3 - though I don't see Apple doing that, as their homegrown hardware always goes big, and I can't imagine them investing more in PCIe3 this late in the game.

You know what I really want to see? How fast the Apple GPU design can get if you give it a giant pile of cores and 100 or 150 watts to play with. I don't think anyone (outside the Apple design team) has a clue how well it will scale. I'd be excited to find out, though.

I carefully read your entire post, twice, before responding. I didn’t like your tone any more than I agreed with your suppositions. I do happen to understand a good deal about chip design and deployment. You make the assumption that Apple doesn’t understand what is needed here, and that they care to sav Cyrus enough to design what they would need. I object to that.

yes, I know that more cores on one chip is different from the same number of cores on two chips. I assume that Apple knows that as well. The rest is fluff. Making that assumption that Apple does understand. This undermines your entire argument. Do you really believe that Apple’s teams can’t either design whatever they need to pull this off, or that they can’t hire the needed talent in that area? That would be a big mistake on your part.

justsomeguy1 · September 15, 2019 10:41PM

melgross said:

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

melgross said:

JustSomeGuy1 said:

knowitall said:

Really impressive especially the power savings and switching of active areas, Kudos to the design team.
The new Ad13 iMac will be a blast.
But Teslas FSD chip is currently 12 times faster (73 TOPS vs A13 6 TOPS?) which makes clear Tesla has a design team (only a few people I understood) that can easily match Apples.
So I expect much room for improvement for the A14 and its desktop version the Ad14 next year.
Exciting times, it must be difficult working at Intel now.

Edit: note the TOPS (not TFLOPS)

Tesla's chip is much faster than Apple's at running, say, resnet 50. But it's MUCH slower than Apple's at running any normal app.

You're failing to distinguish between different types of TOPS. Tesla's achievement is notable but not nearly in the same category as Apple's. In principle, it's not all that hard to add more TOPS, if you're talking about tensor/matrix ops, vectors, NNI, GPU, etc. Those ops (and OPS) are all easily parallelized. Further, you fail to recognize the hard limits placed by the power and cooling budgets for each chip. Tesla is entirely focused on video image recognition, so they need massive NN processing. They have the power budget of a car - not unlimited, by a long shot, but still... the battery in a Tesla is a little bigger than the battery in an iPhone! Whereas Apple is building a much more general-purpose chip, and specifically one with extraordinary traditional integer (and FP) OPS. That's a MUCH MUCH harder problem to solve, as it's extremely difficult to extract parallelism from conventional software (that is, pretty much every app that isn't doing AR or a few other very specific things).

So far, Apple has in the last couple of years kicked *everyone's* ass at that, inside their domain (low-power chips). Nobody even comes close. And if you look carefully at what they've done, you can build a fairly convincing case that they've already built every part necessary to beat Intel at their own game (fast high-power multicore chips), they just don't want to sell those yet.

The biggest open question is this: Can Apple build a ring/mesh/whatever connecting 8-16 high-performance cores in a reasonable power budget? As I've written previously about the A12X, they've *already done this*. So they can, right now today, build something competitive with Intel's best mainstream desktop CPU (the 8-core i9-9900). Whether or not they can actually bet it will depend on whether or not they can clock up. And we know more about that than we did a few months ago, as we can see AMD pushing the same process to around 4.3 GHz tops, about 4.1 comfortably. We still don't know if the A13's pipeline is long enough to sustain this sort of speed, or how easy it would be for Apple to change it enough, but the performance crown there seems easily within their grasp.

Going to more cores is the biggest question mark. The ring or mesh good enough to handle 8 cores really well may not be enough to handle 12 or 16 cores. But the only machines where that would matter is the iMac Pro and the desktop Mac Pro. And I don't think anyone expects those to transition to ARM as early as the laptops.

Apple has already gone to 4 cores in the A12x, 5 cores when counting the efficiency cores together. I don’t see why they can’t remove some unneeded sections from the chips that duplicate functions that don’t need to be duplicated, and run two of these, I suppose now, A13x chips together. Apple has the ability to do it however they think best, as they control the IP.

That's not how it works. You don't see why, because you don't design chips for a living. The story is both better and worse than you think.

About core count: Since the efficiency and fast cores can (and do!) all run simultaneously, Apple has with the A12X demonstrated that whatever they're using to connect all those cores (almost certainly a ring bus, but just possibly some sort of more complex mesh) is capable of handling not just 4 or 5, but 8 cores. "Efficiency" or not, the bus has to handle the same kind of work- all cores have to have cache coherency, equal access to main RAM, etc. Doing this at low enough power, with that many cores, is the big trick that will be key to winning on the desktop - and they've managed to do it well enough to work in an iPad. That's very impressive. We don't know if that architecture will extend to more cores than that, and that's an open question. It may be completely inappropriate for more than 8 cores. But still, 8 cores will get you a VERY long way today.

Now, you talked about trimming "unneeded sections" from the chips. That's not likely to help very much. Most of those "unneeded sections" probably do not participate on the bus/mesh on an equal basis with the scalar cores, because why would they bother? However, nobody knows for sure, because Apple doesn't tell, and Andrei over at AT (the only person I know of who's gained deep insight into the chips and published about them) only has so much time to go poking at the innards with clever software and more clever analysis. So it's vaguely possible that they already have a big-time mesh architecture or multi-ring-bus (like newer and older XCC Xeons, respectively) already, which would be amazing. But it's very unlikely, as the power draw would be incredibly difficult to deal with.

Lastly, it's not at all simple to "run two chips together". If you're thinking about 2S systems like typical Xeons... then you need significant logic to get them to play nicely together and with RAM and the rest of the system. And you'd need to do really major surgery on the A12/13/whatever. On the other hand, if you're thinking about a chiplet setup like AMD's... then it's the same deal with slightly different details. In both cases, you don't know that the secret sauce Apple's using will carry over well. For example, one of the biggest factors in the massive speedups seen in the A12 is apparently the cache architecture and the large L3. If you took that out and stuck it in a central chiplet (like AMD ZEN) you'd probably take a massive perf hit. This is all moderately wild speculation, but the point is, it's not a slam dunk. I personally believe that if the decide to do it, they will embarrass the crap out of everybody else. But I'm skeptical that they'll bother any time soon.

tl;dr: As I've said before, they *already* have shown the ability to go neck-and-neck with Intel's top mainstream chips. If they wanted to fight over the HEDT and Xeons, they could probably do a great job, but it seems unlikely that they'll bother in the near future. They've already got everything they need for every laptop segment, excepting only people who need x64 Windows (or Linux) virtualization at native speed.

So, I assume that you do design chips for a living? Gee, you’re right, I don’t understand ANYTHING about any of this. And I guess you’re also right in essentially saying that Apple’s chip designers don’t know anything about this either.

Um... Did you even read what I wrote? In general the posts of yours that I've read have shown much better comprehension of a wide range of subjects than a lot of the ranting ...people who show up here. But there are some subtle details that you appeared to be missing, based on what you wrote earlier, and I tried to fill them in. Primarily, you seemed to be conflating adding more cores within a single chip, versus supporting multiple chips in a single system. Those are two *very* different things (though AMD is blurring the lines a bit). You also compared multiple efficiency cores to a single performance core, which is a mistake, since it's about as easy to plug a perf core into the chip's bus/mesh as it is an efficiency core. Relatedly, it's likely that other components (ISP, NNP, enclave, etc.) do not participate on an equal level since their needs for memory access, cache coherency, etc. are so different, so swapping them for regular cores is not a simple cut and paste.

As for Apple's silicon team: I don't think I've ever said anything negative about them, for the simple reason that they keep hitting it out of the park year after year. And in fact the key point I was making above is that they have *already completed* all the serious work they'd need to do to build chips that can beat everything the competition has to offer, from pads through laptops all the way through mainstream desktops. They know *everything* about this.

BTW, there are other things neither of us touched on that Apple has yet to do (or at least, show its hand on) that are necessary for desktops and even laptops. For example, I/O on the Axx chips is very limited, as is appropriate for such a SoC. But you're going to want a pile of PCIe lanes coming off the CPU for a laptop, and even more for a desktop. If you're Apple, designing for the future, I imagine you'd want at least 32 PCIe lanes: 16 for graphics, 4 for SSD, 8 for Thunderbolt, and 4 for other miscellany. More would be better, and necessary if you have near-term desktop ambitions, while you could cut it in half using PCIe4 instead of PCIe3 - though I don't see Apple doing that, as their homegrown hardware always goes big, and I can't imagine them investing more in PCIe3 this late in the game.

You know what I really want to see? How fast the Apple GPU design can get if you give it a giant pile of cores and 100 or 150 watts to play with. I don't think anyone (outside the Apple design team) has a clue how well it will scale. I'd be excited to find out, though.

I carefully read your entire post, twice, before responding. I didn’t like your tone any more than I agreed with your suppositions. I do happen to understand a good deal about chip design and deployment. You make the assumption that Apple doesn’t understand what is needed here, and that they care to sav Cyrus enough to design what they would need. I object to that.

yes, I know that more cores on one chip is different from the same number of cores on two chips. I assume that Apple knows that as well. The rest is fluff. Making that assumption that Apple does understand. This undermines your entire argument. Do you really believe that Apple’s teams can’t either design whatever they need to pull this off, or that they can’t hire the needed talent in that area? That would be a big mistake on your part.

OK, I don't get it. I read your response twice, and I have no idea why you're arguing. You keep saying that I think Apple can't do... something, it's not clear what. I think I've made very clear that they have *already* done everything they need to do to address the entire market they're interested in, while blowing everyone else away. And I believe I've said directly, twice, that while they haven't yet built (or at least shown) all the tech they'd need for HEDT or XCC-Xeon type chips, that I expect they'd have no problem with that either, if they decided to do it.

You wrote that you object to my assumption that "they care to sav Cyrus enough to design what they would need". I have no idea what that means, but I assure you I'm making no assumptions about anything called "Cyrus" here.

Lastly... I don't know what you know about chip design. But the stuff you wrote earlier was mistaken in a couple of ways, which I've pointed out, because the correct information actually supports your position - which, as far as I can tell, is roughly the same as mine.

Anyway, I'm out. Feel free to have the last word.

More power with less: Apple's A13 Bionic is faster and more power efficient

Comments