Two versions of the 9xx would make a lot of sense -- one can be tailored for lower power output and higher integration (using HT and on-chip memory controller), whereas the other is the current one... designed for SMP/SMT, high-speed busses, and larger L2 caches. IBM has said this is going to be a family of chips. That doesn't just mean a series of generations, it can also mean siblings.
It's worth noting the PPC970 has some pretty horrid latency to main memory, so an L3 cache wouldn't hurt it. An on-die memory controller is definitely coming down the line for the PPC970, though.
Anyone want to chime in on how the additional registers will improve single threaded performance on the Power5?
It probably won't help at all unless they've added more execution units (which is probably a good bet). In code which doesn't have a lot of inter-dependencies this could let them increase the instruction dispatch capability. All by itself, however, it doesn't do much.
I recall that a few people have speculated that the 64-bit space could be used to run 2x32-bit streams in parallel, but that this idea has generally been pooh-poohed by the more knowledgeable members here.
I'm the first to disclaim any competence when talking about CPUs, but on the face of it, it seems to me that the logic needed to run SMT would take you most of the way to implementing this kind of 2x32-bit parallelism.
Am I way off track here?
Yes. Way off track.
BTW, I just read the blurb from the hot chips symposium that was posted on Ars. It looks like IBM decided to just add one additional thread and no new execution units. It turns out that their current utilization levels are typically only 20-25%!!! Adding an additional thread costs 24% extra transistors and delivers 40% better performance (on average), which is a reasonable tradeoff. Adding additional threads delivers diminishing returns.
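To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch in Python. The 24% and 40% figures are the ones quoted above; the function name and the idea that utilization scales linearly with throughput are my own assumptions for illustration, not anything IBM published.

```python
# Figures as quoted in the post above (relayed from the Hot Chips blurb).
EXTRA_TRANSISTORS = 0.24   # SMT is said to add ~24% more transistors
EXTRA_PERFORMANCE = 0.40   # and to deliver ~40% more performance on average


def perf_per_transistor_gain(extra_perf: float, extra_area: float) -> float:
    """Relative change in throughput per transistor versus the baseline core."""
    return (1.0 + extra_perf) / (1.0 + extra_area)


gain = perf_per_transistor_gain(EXTRA_PERFORMANCE, EXTRA_TRANSISTORS)
print(f"throughput per transistor: {gain:.3f}x baseline")

# ASSUMPTION: execution-unit utilization scales in proportion to throughput.
# Under that assumption the quoted 20-25% utilization would rise as follows.
for base_utilization in (0.20, 0.25):
    with_smt = base_utilization * (1.0 + EXTRA_PERFORMANCE)
    print(f"utilization {base_utilization:.0%} -> {with_smt:.0%}")
```

Even under that generous assumption the units would still sit idle most of the time, which fits the decision not to add more of them.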
I believe it was BadAndy or hobold over at Ars who was surprised by the low number of integer units in the 970, given that integer units are almost constantly being used for pointers, array indexes, and the like even in intense floating point code. SMT would reasonably demand at least one more integer unit - although, having said that, the fact that hardly anyone over there can figure out how the 970 does as well as it does given the integer units it has means that there's at least one variable in the equation that they aren't considering, perhaps because they can't consider it without substantially more information than IBM is willing to publish.
If IBM's extremely pessimistic assessment of efficiency in the POWER4 (that most execution units run at about 25% of theoretical efficiency) carries over, it might actually be possible for IBM to get full SMT going without adding all that much in the way of units - it would come down to a quality of implementation issue. We know, however, that SMT in the POWER5 core costs 24% more transistors than the POWER4 core uses, so it's still not cheap (although for the claimed 40% performance gain, it's definitely worth it).
The interesting thing about the latter-day IBM designs is that they seem to make decisions based heavily on analysis of collected performance data, which on the one hand results in great CPUs in real-world use, and on the other hand makes them maddeningly complicated to figure out from the descriptions given. I have a lurking feeling that, with the added variable of SMT, the rationales behind the design decisions made in the POWER5 core will be even harder to figure out without recourse to the designers' data. I note a few moves in the direction of the G4, though: increasing the associativity of the caches rather than enlarging them, for example (the 745x series have highly efficient and sophisticated caching on die).
The discussion was fairly clear -- no additional execution units. Given the low utilization, this isn't surprising. There's no point in adding units if the existing ones aren't working hard.
The interesting thing about IBM's analysis is that it is an implicit admission that most code is poorly optimized and that the best way to speed up overall machine performance is to make as much of the unoptimized code have as high utilization as possible. It is essentially the opposite of Intel's approach with EPIC. I'll leave it to your imagination as to which approach will result in faster code most of the time.
It probably won't help at all unless they've added more execution units (which is probably a good bet). In code which doesn't have a lot of inter-dependencies this could let them increase the instruction dispatch capability. All by itself, however, it doesn't do much.
So what they say here is wrong?
But actually disabling the multithreading option actually has an important side effect -- it can improve performance, Kalla said. Turning off a thread gives the single thread access to all 120 registers, affording the Power5 a significant instruction-per-clock (IPC) advantage compared with the Power4.
They are essentially thinking that the additional 40 registers will give the Power5, in single threaded mode, lower latencies? With 50% more registers, the chip won't have to dip into L1 or L2 as much anymore?
Keep in mind that these are new rename registers only -- the architected registers do not change. This means that cache usage is completely unaffected by this change because the use of memory is a function of the code... without changing the code you can't have the chip "dip into L1 or L2". The additional registers allow the second SMT thread to run without stalling on a lack of physical registers. As I said in my previous post, I was surprised by the really low typical utilization quoted by IBM and in that light it makes sense to add SMT without adding execution units. Rename registers were obviously the one resource they were short of, so that is the resource that they increased by 50%.
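For anyone who wants to see the rename-register point in action, here is a toy sketch in Python. It is not a model of the POWER4 or POWER5. The 80 and 120 register counts are inferred from the "additional 40 registers" and "50% more" figures above, 32 is the number of architected general-purpose registers PowerPC defines, and the dispatch width, latency and function name are hypothetical values picked for illustration.

```python
from collections import deque

ARCHITECTED = 32  # PowerPC GPRs visible to software; unchanged by the extra registers


def cycles_to_dispatch(physical: int, n_instr: int = 2000,
                       width: int = 4, latency: int = 30) -> int:
    """Cycles needed to dispatch n_instr register-writing instructions.

    Toy assumptions: every instruction takes one rename register at dispatch
    and hands it back `latency` cycles later; at most `width` instructions
    dispatch per cycle.
    """
    free = physical - ARCHITECTED   # registers left over for renaming
    in_flight = deque()             # cycle at which each register is returned
    dispatched = 0
    cycle = 0
    while dispatched < n_instr:
        # Return rename registers held by instructions that have finished.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            free += 1
        # Dispatch until the width limit or the rename registers run out.
        issued = 0
        while issued < width and free > 0 and dispatched < n_instr:
            free -= 1
            in_flight.append(cycle + latency)
            dispatched += 1
            issued += 1
        cycle += 1
    return cycle


small_pool = cycles_to_dispatch(80)    # 48 rename registers
large_pool = cycles_to_dispatch(120)   # 88 rename registers
print(small_pool, large_pool)
```

In this toy the larger pool finishes sooner only because dispatch keeps stalling on rename registers. Set `latency=10` and both pools finish in the same number of cycles, since dispatch width becomes the limit instead. Either way the program itself is identical, so the loads and stores it issues are identical too.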
Originally posted by Programmer
The additional registers allow the second SMT thread to run without stalling on a lack of physical registers.
But why are they saying that the additional registers will increase performance for single-threaded apps?