Heh, good comment about SPECmarks. Intel does a lot of shady things with the compiler and source code to ensure that the Pentium4's numbers come out looking rosy. I'm sure IBM does a few things as well, but not as many as Intel.
Comments
<strong>[code]
         INT base/peak   FP base/peak
P4:      882/896         861/873
POWER4:  804/839         1202/1266
</pre><hr></blockquote>
</strong><hr></blockquote>
Hang on!
You are comparing the very latest Pentium IV, fabbed in the state-of-the-art .13 micron process, against a processor that is fabbed in an older .18 micron process. Put that Power 4 beast on IBM's .13 micron SOI process and you should see at least a 30% increase in clock speed. Try to compare processors in like processes.
<strong>[code]
               INT base   FP base
P4:            882        861
POWER4(+30%):  >1040      >1500
</pre><hr></blockquote>
</strong>
Suddenly it is not so close, and I believe this is conservative.
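For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the SPEC score scales linearly with clock speed, which is the best case and not something the thread has established. The dictionary layout and the `project` helper are just names made up for the example.

```python
# SPEC base scores as quoted earlier in the thread.
scores = {
    "P4": {"int_base": 882, "fp_base": 861},
    "POWER4": {"int_base": 804, "fp_base": 1202},
}

# Assumption: a 30% clock increase yields a 30% score increase (ideal linear scaling).
CLOCK_GAIN = 0.30


def project(score, gain=CLOCK_GAIN):
    """Scale a benchmark score linearly with clock speed."""
    return score * (1 + gain)


projected = {name: project(value) for name, value in scores["POWER4"].items()}

for name, value in projected.items():
    print(f"POWER4 {name}: {value:.0f} vs P4 {scores['P4'][name]}")
```

Under that assumption the projected figures come out a little above the ">1040" and ">1500" lower bounds in the table. Real scaling would be lower if memory latency does not improve along with the clock.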
SPECmarks also aren't real work, and knowing IBM the POWER4 has been carefully tuned to be a terrific server -- especially its SMP performance (which is not measured by SPECmarks). The POWER4 is also scaling very well from 1 GHz -> 1.3 GHz and could probably scale almost linearly for a while more yet, whereas the Pentium4 isn't scaling so well at its already lofty clock rates. Not entirely fair, however, since the POWER4's bus is no doubt far more expensive than what Intel delivers to the desktop.
Found this interesting blurb while looking for benchmarks:
[quote]<strong>
According to this thread at 3DGPU regarding an interview with Nvidia's CEO in Wired magazine, Nvidia's next generation NV30 GPU is said to consist of 120 million transistors. This contrasts with the 76 million transistor 3DLabs P10 and the 80 million transistor Matrox Parhelia-512. Given this 120 million figure, it is extremely likely that the NV30 will be produced on a 0.13µ process, whereas the 3DLabs and Matrox solutions are on 0.15µ processes. It has been rumored that the Parhelia-512 will clock at around 220 MHz.
Additionally, The Inquirer is reporting that ATi is planning on introducing its next-generation R300 GPU sometime late this summer. The article suggests that the R300 will sport 107 million transistors on a 0.15µ process and will feature 8 pixel pipelines, 4 vertex shader pipelines, and a 256-bit DDR SDRAM memory interface. The R300 was recently used at E3 to demonstrate iD Software's upcoming DOOM III.
</strong><hr></blockquote>
Besides giving a hint at what I've been saying about the coming ATI/nVidia GPUs, this gives a good indication of the transistor counts possible on a 0.13 micron process. The POWER4 is 170 million transistors on a 0.18 process (for a twin core), with a very large die size, I would guess. This would seem to indicate that a single core POWER4-like processor should be doable in 0.13 with a reasonable die size (important for good yields).
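A rough way to put a number on the shrink, as a sketch only: it assumes an ideal linear shrink where die area goes with the square of the feature size. Real processes rarely hit that, and the function name is made up for the example.

```python
def ideal_area_scale(old_nm, new_nm):
    """Area ratio for an ideal linear shrink: area goes with feature size squared."""
    return (new_nm / old_nm) ** 2


# 0.18 micron -> 0.13 micron, expressed in nanometres.
scale = ideal_area_scale(180, 130)

# Same die area holds this many times more transistors after the shrink.
density_gain = 1 / scale

print(f"same design needs about {scale:.0%} of the original die area")
print(f"same area holds about {density_gain:.2f}x the transistors")
```

So under the ideal-shrink assumption the same design would take roughly half the area at 0.13, which is why a cut-down POWER4 derivative looks plausible on die size alone.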
<strong>I say keep the dual core and cut down on the cache. Unless you want a really small core+cache+memcontroller combo and use IBM's POWER4bus to connect other cores. You'd need a bus that fast to have the memory controllers stay efficient and not degrade too much performance.</strong><hr></blockquote>
I dunno ... I just don't see it being practical for Apple to use such a wide/fast bus on their desktop motherboards. The future of the personal computer is in narrow/fast packet switched interconnects like RapidIO, or simple chip-chip connections like HyperTransport.
They might be able to stay multi-core and keep most of the cache on a 120+ million transistor chip if they can share execution units (a la HyperThreading). IBM says that the POWER5 will do this and be able to run both threads at full tilt -- as opposed to Intel's 10-30% performance improvement. If you have enough execution units, I don't see why not. This is almost multi-core, but better because it's more flexible -- if one thread is doing floating point and the other integer (or perhaps FP vs VP), then they get all of the execution units out of the shared pool and thus go faster than if the cores were completely separate and half the units were sitting idle on each core.
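To make the shared-pool argument concrete, here is a toy model. Everything in it is an assumption for illustration: the unit counts are made up, every operation is independent and takes one cycle, and scheduling overhead is ignored.

```python
import math


def cycles(ops, units):
    """Cycles to retire `ops` independent operations on `units` single-cycle units."""
    return math.ceil(ops / units)


# Hypothetical workload: one thread is all floating point, the other all integer.
fp_ops, int_ops = 1000, 1000

# Two separate cores, each with 2 integer and 2 FP units.
# The FP thread can only use its own core's 2 FP units, and likewise for integer.
separate = max(cycles(fp_ops, 2), cycles(int_ops, 2))

# One core with the same total hardware in a shared pool: 4 integer and 4 FP units.
shared = max(cycles(fp_ops, 4), cycles(int_ops, 4))

print(f"separate cores: {separate} cycles, shared pool: {shared} cycles")
```

With this lopsided workload the shared pool finishes in half the time, because no unit sits idle. With two threads that both want the same kind of unit, the two designs would come out even in this model.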
As Intel scales to 3GHz by the end of this year, those numbers will start to look even poorer for IBM, who probably won't be able to scale their design much.
<hr></blockquote>
Of course, it is good to remember that a Power4 has two cores per chip, and four chips per multi-chip module. Even if its overall CPU score is inflated by the fact that the speed trials were running on one CPU (with effectively bottomless motherboard bandwidth), the machine will still beat the living daylights out of any PC server. Of course, they are in different markets and all that, so it really is only sensible to compare a Power4 to an eight-way Intel server (or perhaps a 4-way Hyperthreaded Intel box). I would bet some serious $ that the Power4 would beat the living daylights out of such an Intel box; it was designed to be able to feed its processors and to keep them as busy as possible.
The Power4 would at best make a great server CPU for Apple. I don't know if it would sell well in Apple's market if it was going to be priced anywhere near where IBM prices these things. You could buy something like 70 Xserves for the price of a Power4.
<strong>The Power4 would at best make a great server CPU for Apple. I don't know if it would sell well in Apple's market if it was going to be priced anywhere near where IBM prices these things. You could buy something like 70 Xserves for the price of a Power4.</strong><hr></blockquote>
People keep saying this, but nobody has said that they think Apple will just drop an existing POWER4 into a PowerMac and ship it out the door. You're absolutely right that it wouldn't make sense, and the price would be right out of the market. But the point is that IBM could take the POWER4 technology and use it to build a PowerPC that does make sense for Apple to use. Given the 0.13 process technology the resulting machine should still be able to beat up Intel's desktop machines.
So here is Apple's ideal processor for desktop use:
BIG BROTHER:
Single core
6 fixed point units and 3 floating point units
AltiVec
12 stage pipeline with strong branch prediction
at least 1MB of L2 cache
64KB-I and 128KB-D L1 cache
DDR-I/DDR-II combo memory controller connected to core 128bit wide at half core speed
40 bit memory addressing
2 of the int units have 64bit extensions
RapidIO 16bit to motherboard
LITTLE BROTHER:
Single core
4 fixed point units and 1 floating point unit
AltiVec
12 stage pipeline with strong branch prediction
at least 512KB of L2 cache
64KB-I and 128KB-D L1 cache
DDR-I/DDR-II combo memory controller connected to core 128bit wide at half core speed
40 bit memory addressing
1 of the int units has 64bit extensions
RapidIO 16bit to motherboard
2 different parts. One for Pro use and one for Consumer/portable use. They are modular and very compatible so the code doesn't know the difference.
<strong>So here is Apple's ideal processor for desktop use:
BIG BROTHER:
Single core
6 fixed point units and 3 floating point units
AltiVec
12 stage pipeline with strong branch prediction
at least 1MB of L2 cache
64KB-I and 128KB-D L1 cache
DDR-I/DDR-II combo memory controller connected to core 128bit wide at half core speed
40 bit memory addressing
2 of the int units have 64bit extensions
RapidIO 16bit to motherboard
LITTLE BROTHER:
Single core
4 fixed point units and 1 floating point unit
AltiVec
12 stage pipeline with strong branch prediction
at least 512KB of L2 cache
64KB-I and 128KB-D L1 cache
DDR-I/DDR-II combo memory controller connected to core 128bit wide at half core speed
40 bit memory addressing
1 of the int units has 64bit extensions
RapidIO 16bit to motherboard
2 different parts. One for Pro use and one for Consumer/portable use. They are modular and very compatible so the code doesn't know the difference.</strong><hr></blockquote>
That's if they have more than one version. Probably not, because it's easier to differentiate with clock rate than with features (you can just sell the parts that fail at high speed as the slower version). The G4 is better for the consumer/notebook machines.
Also, if they have 64-bit execution units they will all be 64-bit execution units. No sense in mixing them. The addressing will be 64-bit in that case, although the memory controller might only support ~40 as you suggest to reduce the pin count.
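For a sense of scale on the 40-bit versus 64-bit point, here is a quick sketch that just computes the byte-addressable range for each width. The helper name is made up for the example.

```python
def addressable_bytes(bits):
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 2 ** bits


GIB = 2 ** 30

for bits in (32, 40, 64):
    print(f"{bits}-bit: {addressable_bytes(bits) // GIB:,} GiB")
```

Forty physical address bits already reach 1,024 GiB, far more than any desktop board could hold, so limiting the memory controller to about 40 bits costs nothing in practice while the registers and virtual addresses stay 64-bit.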
The on-chip memory controller will be asynchronously connected to memory because you don't want to tie your clock rate to your memory speed.
Four FPUs might let you avoid having a vector floating point unit (which might let you have more of other stuff), since they are cracking instructions anyhow. More integer units would be fun. The pipelines on the POWER4 are 17 stages deep. If you have all of those execution units you may as well hyper-thread the thing to get "fake" dual core.
All purely fanciful speculation, of course.
That's true, I forgot about that. Wouldn't that mean internal PLLs like the 750FX has? RIO would need its own PLL too I guess.
Another idea: how about beefing up the AltiVec unit so it can do the job of multiple FPUs, designed in a way that is transparent to the code? When FP instructions come through, the AltiVec unit is switched to FPU mode and acts like 3-4 FPUs. When vector instructions come in, the AltiVec unit processes them normally. If done right there won't even be a wasted clock tick. Then the processor would have 6 integer units and one beefy AltiVec unit for vector and FP instructions.
I think Intel was trying to convince developers to use SSE2 for FP instructions so they could design the next Pentium without FPUs. But if you make it transparent then this becomes a non-issue.
<strong>The on-chip memory controller will be asynchronously connected to memory because you don't want to tie your clock rate to your memory speed.
That's true, I forgot about that. Wouldn't that mean internal PLLs like the 750FX has? RIO would need its own PLL too I guess.
Another idea: how about beefing up the AltiVec unit so it can do the job of multiple FPUs, designed in a way that is transparent to the code? When FP instructions come through, the AltiVec unit is switched to FPU mode and acts like 3-4 FPUs. When vector instructions come in, the AltiVec unit processes them normally. If done right there won't even be a wasted clock tick. Then the processor would have 6 integer units and one beefy AltiVec unit for vector and FP instructions.
I think Intel was trying to convince developers to use SSE2 for FP instructions so they could design the next Pentium without FPUs. But if you make it transparent then this becomes a non-issue.</strong><hr></blockquote>
I would think this would happen in reverse -- put lots of floating point units in and when an AltiVec instruction arrives, crack it apart and send each part to a different floating point unit.
Intel is trying to convince developers to use the VPU because their floating point instruction set sucks.
I think it would be interchangeable. It's 4 FP units that can do vector instructions AND it's 1 vector unit that can do multiple FP instructions.
<strong>I would think this would happen in reverse -- put lots of floating point units in and when an AltiVec instruction arrives, crack it apart and send each part to a different floating point unit.
I think it would be interchangeable. It's 4 FP units that can do vector instructions AND it's 1 vector unit that can do multiple FP instructions.</strong><hr></blockquote>
Except that a vector unit has to do its instructions in an inherently synchronized manner, which would cause headaches for the instruction scheduler/decoder. I'm not a hardware guy, but if I were building a software system to handle this, it would be a lot easier to just crack vector instructions apart than to try to build them up from scalar ones.
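Since I think of it as a software problem anyway, here is a minimal software sketch of the cracking direction. It is purely illustrative and not how any real decoder works: `ScalarOp`, `crack`, `execute` and `run_vector` are made-up names, and only two operations are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarOp:
    """One lane of a cracked vector instruction."""
    op: str
    lane: int
    a: float
    b: float


def crack(op, vec_a, vec_b):
    """Split one vector instruction into one scalar op per lane."""
    if len(vec_a) != len(vec_b):
        raise ValueError("operand vectors must have the same number of lanes")
    return [ScalarOp(op, lane, a, b) for lane, (a, b) in enumerate(zip(vec_a, vec_b))]


def execute(scalar_op):
    """Stand-in for a scalar FPU: handles one lane at a time."""
    if scalar_op.op == "add":
        return scalar_op.a + scalar_op.b
    if scalar_op.op == "mul":
        return scalar_op.a * scalar_op.b
    raise ValueError(f"unsupported op: {scalar_op.op}")


def run_vector(op, vec_a, vec_b):
    """Crack, execute each piece independently, and reassemble by lane."""
    micro_ops = crack(op, vec_a, vec_b)
    results = [0.0] * len(micro_ops)
    for micro in micro_ops:
        # Each piece carries its lane number, so completion order does not matter.
        results[micro.lane] = execute(micro)
    return results


print(run_vector("add", [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
```

The pieces are independent once cracked, so each can go to whichever FPU is free. Going the other way means finding scalar ops that happen to line up and issuing them in lockstep, which is the part that looks painful for the scheduler.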