The new information regarding Power5, VMX and single core contradicts some of the earlier information we have of it.
Previously we have been told that Power5 will be dual core just like the Power4, and that it will have special circuitry to accelerate common stuff like TCP/IP-processing.
A VMX unit might do this as Motorola have shown.
But... I've never heard of a single core Power5 before and it wouldn't make sense to use a chip like that in a massive installation like the Blue World supercomputer. The first thing they do is bundle a couple of chips together to form a single processing unit, so it'd make more sense if they used a dual core chip in the first place to reduce complexity.
But... an ordinary dual core Power5 might be too expensive, so I'm leaning towards believing that this "Power5" might instead be an AltiVec equipped GRUL, aka 97x/980.
GRUL would be cheaper than a regular Power5 since it'd be manufactured in larger quantities (IBM blades, workstations, small servers and of course Macs), have reduced complexity and reduced power consumption, have a VMX unit, and be single core.
It fits the Blue World profile best, imho.
Why oh why are all these Opteron based supercomputers popping up now? Since they all run Linux, PPC970 must be the better choice as I see it. If they are going to do any scientific work which can be vectorised, AltiVec WILL blow ANY competition out of the water easily, as the NASA benchmarks clearly show.
FastPath is not VMX although there is no reason that a processor couldn't have both. VMX is not specialized hardware for handling specific OS tasks, it is a general purpose SIMD unit. What we do know about FastPath indicates that it is very specific to a few common OS functions and is designed to avoid interrupting the processor as often as is currently necessary -- VMX does nothing to reduce processor interrupts.
I don't know if the POWER5 will be single or dual core... or both... or more. Given a modular design and their automated design process it's entirely possible that IBM could generate a couple of chips with different numbers of cores, aimed at different markets. They could also do what they've recently done with the POWER4 and sell dual core chips which have a fatal flaw in one core as single core chips.
Originally posted by Henriok
The new information regarding Power5, VMX and single core contradicts some of the earlier information we have of it.
Previously we have been told that Power5 will be dual core just like the Power4, and that it will have special circuitry to accelerate common stuff like TCP/IP-processing.
I don't think the info we have contradicts the dual core Power5 or special circuitry for TCP/IP in addition to the VMX. It just says that there will be a single core version and there might be VMX. And as Programmer pointed out, VMX may not fit the FastPath description (without some possible modifications?).
Quote:
But... I've never heard of a single core Power5 before and it wouldn't make sense to use a chip like that in a massive installation like the Blue World supercomputer. The first thing they do is bundle a couple of chips together to form a single processing unit, so it'd make more sense if they used a dual core chip in the first place to reduce complexity.
As I mentioned above, they want to use the single core version in the Blue Planet so that each processor core has use of the full off-chip bandwidth. They are trying to maximise the system's sustained performance and not its peak performance, so ensuring sufficient data flow to the chips is important. They still bundle together 4 single core Power5s into a multi chip module, put 2 MCMs in a node and stack up the nodes to the moon.
Quote:
But... an ordinary dual core Power5 might be too expensive, so I'm leaning towards believing that this "Power5" might instead be an AltiVec equipped GRUL, aka 97x/980.
GRUL would be cheaper than a regular Power5 since it'd be manufactured in larger quantities (IBM blades, workstations, small servers and of course Macs), have reduced complexity and reduced power consumption, have a VMX unit, and be single core.
It fits the Blue World profile best, imho.
There will still be a dual core Power5 to replace the Power4+. The single core version will just be a dual core chip that had a fault in one of its cores during manufacture, like the current single core Power4 chips (as Programmer also mentioned). So instead of throwing away a defective dual core chip, they sell it as a single core, turning a loss into a profit.
Quote:
Why oh why are all these Opteron based supercomputers popping up now? Since they all run Linux, PPC970 must be the better choice as I see it. If they are going to do any scientific work which can be vectorised, AltiVec WILL blow ANY competition out of the water easily, as the NASA benchmarks clearly show.
It depends on whether the requirement is for single or double precision floating point. For single precision vectorisable code the 970 would win, but for double precision the performance would be much closer and probably greater for the Opteron. Off-chip bandwidth is just as good on the Opteron, with an on-chip dual channel DDR400 memory controller (6.4 GB/s) and HyperTransport to the system chips. Also, with so many processors in a supercomputer, even though I'm sure they are designed to be fault tolerant, the higher the reliability the better. So I see the single core Power chips fitting in better than the 970, which traded reliability for speed.
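For anyone who wants to check the DDR400 number, here is a quick back-of-the-envelope sketch. It assumes the usual DDR400 figures (400 million transfers per second on a 64-bit channel) and two channels; the function and constant names are made up for illustration. It gives theoretical peak only, not sustained bandwidth, and the result is in gigabytes per second.

```python
# Theoretical peak bandwidth of a dual channel DDR400 memory interface.
# Assumed figures: 400 million transfers/s per channel, 64-bit (8 byte) channels.
TRANSFERS_PER_SECOND = 400_000_000
BYTES_PER_TRANSFER = 8
CHANNELS = 2


def peak_bandwidth_bytes(transfers_per_second, bytes_per_transfer, channels):
    """Return peak bytes per second across all memory channels."""
    return transfers_per_second * bytes_per_transfer * channels


peak = peak_bandwidth_bytes(TRANSFERS_PER_SECOND, BYTES_PER_TRANSFER, CHANNELS)
print(f"{peak / 1e9:.1f} GB/s")  # prints "6.4 GB/s"
```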
Originally posted by MartianMatt
VMX may not fit the FastPath description (without some possible modifications?).
Perhaps my statement wasn't strong enough: FastPath is not VMX.
Quote:
For single precision vectorisable code the 970 would win, but for double precision the performance would be much closer and probably greater for the Opteron.
Highly debatable -- I think I'd put my money on the 970 at anything close to the same clock rate, as long as you have a good compiler and are allowed to use the fused multiply-add instructions. 32 registers, better instruction set, etc.
Originally posted by Programmer
Perhaps my statement wasn't strong enough: FastPath is not VMX.
I hear you now. But... could IBM leverage any of VMX's features for use by the FastPath engine or are they totally incompatible? (I don't understand the interrupts that you mentioned earlier - I'm only a 'user'.)
Quote:
Highly debatable -- I think I'd put my money on the 970 at anything close to the same clock rate, as long as you have a good compiler and are allowed to use the fused multiply-add instructions. 32 registers, better instruction set, etc.
Well, like the Athlon, the Opteron has 3 floating point units and performs very well in the many FP heavy tests it's been put to. I agree - it is highly debatable and won't be settled now. I'm looking forward to some really in-depth cross platform tests by reputable sites like AnandTech. (What are the Mac equivalents? I only lurk here and at Ars and am planning to switch with my next purchase, whenever that is.) I hope the 970 will come out on top but I think the competition will be stiff.
Originally posted by MartianMatt
I hear you now. But... could IBM leverage any of VMX's features for use by the FastPath engine or are they totally incompatible? (I don't understand the interrupts that you mentioned earlier - I'm only a 'user'.)
Totally independent is a better description. FastPath is intended to handle stuff so that the main CPU doesn't have to. The things it will handle look like things which come at the processor from the outside world and need to be dealt with immediately -- in a processor that is called an "interrupt" because whatever the processor was doing is interrupted and put on hold while this new thing is dealt with. Networking events like a packet arriving are of this nature. The big problem with interrupts like this is that they are fairly unpredictable and require a "context switch" in the processor where at some random time it has to save what it was doing, load the operating system's context, deal with the source of the interruption, save the operating system's context, and then restore what it was doing. From what I've read about FastPath, I believe it is a specialized piece of hardware that handles the part of the interrupt that must be done immediately and any remaining work that has to be done can be scheduled for a time when the processor is going to be doing a context switch anyhow. This ties in beautifully to the SMT features that IBM is also including in POWER5. In a system dealing with a lot of network activity this could streamline a processor's performance quite significantly ... especially in a deeply pipelined processor.
Keep in mind that most of this is supposition, except the part about it not being VMX.
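To put rough numbers on the context switch argument, here is a toy model, and only that. It assumes each plain interrupt costs one switch into the OS and one back out, and that a hypothetical offload engine (standing in for the supposition about FastPath above, not for anything IBM has documented) can queue packets until the next regularly scheduled switch. All names, times and the timeslice are invented for illustration.

```python
from collections import deque


def switches_with_plain_interrupts(packet_times, timeslice, total_time):
    """Toy model: every packet interrupts the CPU immediately."""
    scheduled = total_time // timeslice
    # Each interrupt costs a switch into the OS and a switch back out.
    return scheduled + 2 * len(packet_times)


def switches_with_deferred_work(packet_times, timeslice, total_time):
    """Toy model: a hypothetical offload engine queues each packet, and the
    OS drains the queue only at timeslice boundaries it would reach anyway."""
    pending = deque(sorted(packet_times))
    switches = 0
    handled = 0
    for tick in range(timeslice, total_time + 1, timeslice):
        switches += 1  # the regularly scheduled context switch
        while pending and pending[0] <= tick:
            pending.popleft()
            handled += 1
    return switches, handled


packets = [3, 7, 8, 15, 22, 29]  # made-up packet arrival times
plain = switches_with_plain_interrupts(packets, timeslice=10, total_time=30)
deferred, handled = switches_with_deferred_work(packets, timeslice=10, total_time=30)
print(plain, deferred, handled)  # prints "15 3 6"
```

In this model all six packets still get handled, but the switch count stays at the three scheduled ones instead of growing with every packet. Real hardware would of course have latency limits on how long work can be deferred, which the sketch ignores.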
Quote:
Well, like the Athlon, the Opteron has 3 floating point units and performs very well in the many FP heavy tests it's been put to.
I might be off the mark here, but I don't think those 3 FPUs are all equivalent -- they are somewhat specialized in the same way that the G4's various integer units and vector units are specialized. The 970's two FPUs can each execute all FPU instructions, and they operate completely in parallel with the load/store units, integer units, and branch unit. It's been some time since I looked at the AMD offering's tech specs but I seem to recall its arrangement was a little more limiting. You're correct though -- they are close enough that it's pretty much a wash and other factors are more important in determining which is "better".
Correct, Programmer, the AMD FPUs are indeed not equivalent. You have FMUL, FADD, and FSTORE, and they all have to do MMX, plus the FMUL and FADD units need to handle 3DNow! too. Disadvantageous compared to the much more flexible G5.
Originally posted by Zapchud
Correct, Programmer, the AMD FPUs are indeed not equivalent. You have FMUL, FADD, and FSTORE, and they all have to do MMX, plus the FMUL and FADD units need to handle 3DNow! too. Disadvantageous compared to the much more flexible G5.
FSTORE counts as a floating point unit? Heh, well then counted like that the G5 has 4 FPUs! No, it's more accurate to say that the G5 has 2, and the AMD chips have 2 and only 1 floating point load/store unit. Having only 1 load/store unit is even worse on x86 because there are so few registers that you often have to do many more load/stores than on the PowerPC (which has 32 floating point registers). And each of the G5's units can do as much work per clock cycle as both the AMD units together (i.e. a fused multiply-add instruction). This has an even bigger impact when you consider that the result doesn't need to spend time going from the FMUL to the FADD unit, but instead both operations are done at once.
And then there are the separate vector units with their own 32 registers (instead of sharing 8 registers with the FPU like x86 MMX does).
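The unit counting above can be turned into a rough peak figure. This is a sketch under simplifying assumptions: every G5 FPU instruction is a fused multiply-add (2 floating point operations), the AMD FMUL and FADD pipes each retire 1 operation per cycle, and load/store limits, latencies and the vector units are ignored. The function name is made up for illustration.

```python
def peak_flops_per_cycle(units, flops_per_unit):
    """Theoretical peak scalar floating point operations per clock cycle."""
    return units * flops_per_unit


# G5 / 970: 2 FPUs, each able to issue a fused multiply-add (multiply + add).
g5_peak = peak_flops_per_cycle(units=2, flops_per_unit=2)

# AMD: one FMUL pipe and one FADD pipe, 1 operation each per cycle.
amd_peak = peak_flops_per_cycle(units=2, flops_per_unit=1)

print(g5_peak, amd_peak)  # prints "4 2"
```

Real code rarely consists of nothing but multiply-adds, so treat this as an upper bound per clock rather than a benchmark prediction.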
Originally posted by smalM
IBM stated the Power5 will be 4 times as fast as the Power4 when it was introduced. The statement related to a 2GHz Power5 and a 1GHz Power4.
Yes, that was exactly what I was trying to say
The G5 kicks ass, man!