Second Ars PPC 970 Article

curiousuburb · May 15, 2003 1:06AM

no no... got to bring enough for the whole class

costique · May 15, 2003 1:28AM

In fact, Hannibal disappointed me to some extent. To be honest, I expected more impressive integer performance; but that's my problem. Correct me if I'm wrong to think that all the file system, Finder and IO subsystems are built around integers. I wonder what impact on the overall performance slow integer units may have.

We can only hope that those little unknowns do their job to boost the CPU. Waiting for a miracle to come...

All in all, everything seems to depend on the success of PPC970. If they perform and sell well enough, we have a good chance that further incarnations of PPC9xx are designed with VMX in mind from the ground up and integer units become more aggressive.

drboar · May 15, 2003 2:33AM

The nagging about slow integer performance has to be seen in context.

A 1.6 GHz 970 has the integer SPEC mark of a G4 @ 2.8 GHz, and when IBM ramps it up to 2.5 GHz that is as fast as a 4.5 GHz G4. So while a single 1.6 GHz 970, hopefully out in July, will not beat a P4/3.6, a dual will 8) At least sometimes.

The G4 was lame, and the surrounding systems as well. Since the arrival of the 100 MHz bus in January 1999 with the B&W G3, the bus speed has increased about 15 MHz a year. IBM sure delivers a CPU vastly superior to the G4; now it is up to Apple to deliver a motherboard vastly superior to the current one to go with the 970.

krassy · May 15, 2003 5:10AM

http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/A2CE393ABF2CE99787256D21006AE8A2

costique · May 15, 2003 5:52AM

Well, we'll see. I must have misinterpreted some architectural details. Having read the Ars article, I even managed to cool down to such an extent that I came to think that PPC970 is not a miracle.

jamm · May 15, 2003 8:16AM

Another small indication of the arrival of the 970 - GCC 3.3 adds support for the Power 4 processor (since the 970 core is derived from it). Could this appear in the next dev tools (accompanying 10.3 and the new powermacs)? Check it out:

http://gcc.gnu.org/gcc-3.3/changes.html

programmer · May 15, 2003 8:23AM

Quote:

Originally posted by costique

In fact, Hannibal disappointed me to some extent. To be honest, I expected more impressive integer performance; but that's my problem. Correct me if I'm wrong to think that all the file system, Finder and IO subsystems are built around integers. I wonder what impact on the overall performance slow integer units may have.

We can only hope that those little unknowns do their job to boost the CPU. Waiting for a miracle to come...

All in all, everything seems to depend on the success of PPC970. If they perform and sell well enough, we have a good chance that further incarnations of PPC9xx are designed with VMX in mind from the ground up and integer units become more aggressive.

You should have already known the 970's integer performance, Hannibal was just explaining it. The IBM estimated SPECint should be reasonably accurate (and they claim conservative). The integer units are not slow, they can dispatch just as fast as the floating point units. Most general code (e.g. file system, Finder and IO subsystems) has a fairly even balance between integer, load/store, with a few branches thrown in, and this is exactly what IBM designed this processor to do. Keep in mind that many of their primary markets are servers which are doing exactly this kind of work.

While everything depends on the 970, it doesn't depend on it achieving exactly "N" SPECint. What really matters is that IBM delivers the processor on schedule, with good yields and at a low cost. This will allow Apple to deliver new competitive machines in a reasonable timeframe and at a good pricepoint. Rumour has it that the yields are better than expected, and that is really good news.

rickag · May 15, 2003 9:20AM

Does anyone know the cost of L3 cache? Just curious what 2 MB of backside cache adds to the cost of the towers? I was under the impression that this backside cache, DDRsram, was very expensive.

The die size of IBM's 970 isn't much greater than the current G4 (due to its 0.18µm design). Assuming yields are the same for IBM's 970 and Motorola's 7455, the cost of the CPU should be comparable, but the system w/ the 970 should cost less due to the lack of L3 cache and more than compensate w/ a modern FSB running @ 1/2 processor speed.

Quote:

Programmer

The FSB is fast enough that it might be sensible to put the L3 on the other side of the bus, in the chipset. There it would cover stalls and contention in the memory system itself without complicating the processor with additional pins and .....

Would this arrangement use DDRsram like the current L3 backside cache?

Oh, and would it need to be up to 2Mb, like the current backside cache?

programmer · May 15, 2003 9:40AM

Quote:

Originally posted by rickag

Would this arrangement use DDRsram like the current L3 backside cache?

Oh, and would it need to be up to 2Mb, like the current backside cache?

SRAM is expensive, so I wouldn't expect Apple to use it as an L3 cache in new systems. The more likely approach would be to take advantage of VLSI and embed it into their memory controller. The cost of adding this to the chipset might not be too high, and is certainly less than adding all of the pins plus an external SRAM chip. The size of the cache would be whatever they could fit economically, but you'd want it to be larger per-processor than the processor's built-in 512K L2 cache, so in a dual system there probably isn't much point in less than 2 MB.

rickag · May 15, 2003 10:02AM

Quote:

Originally posted by Programmer

.... but you'd want it to be larger per-processor than the processor's built-in 512K L2 cache so in a dual system there probably isn't much point in less than 2 MB.

Thank you for the response.

Speaking of duals, if the L3 cache is on the controller, would it have to be per processor or could there be one L3 cache shared by each processor? This might save cost, but would it add extra steps? Also, for those users still hoping for quad machines maybe a shared L3 would be better in the long run?

tht · May 15, 2003 10:45AM

I really don't get why there is so much discussion on L3 cache for PPC 970 systems. I predict that all shipping PPC 970 systems will not have any L3 cache whatsoever. Now, I may waver on quad systems, but quad systems may not even ship!

One reason is that I don't think it'll buy a system much of anything. The PPC 970 bus for a 1.8 GHz 970 is limited to a ~3.2 GByte/s bidirectional processor bus. The maximum it could read would be 3.2 GByte/s. Bidirectional traffic would be ~6.4 GByte/s, yes, but large memory tech can handle that with dual PC3200 DDR SDRAM channels, dual PC1600 DRDRAM channels or even quad PC800 DRDRAM channel memory systems. There is plenty of memory bandwidth to go around, while the only thing L3 cache memory, be it SRAM, embedded DRAM, et al, will buy the system is a reduction in initial read latency, but it is still limited to 3.2 GByte/s read and write bandwidth, the same as a properly implemented main memory system would be at.
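The peak figures in the paragraph above follow from channel width times transfer rate. Here is a minimal sketch, assuming a 64-bit channel at 400 MT/s for PC3200 DDR and a 16-bit channel at 800 MT/s for PC800 DRDRAM; the function name and the table are illustrative and not from the thread:

```python
def peak_gb_per_s(width_bits, mega_transfers_per_s, channels=1):
    """Theoretical peak bandwidth in GB/s, with 1 GB = 10**9 bytes."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000

# Channel widths and transfer rates are assumptions stated in the lead-in.
configs = {
    "dual PC3200 DDR SDRAM": peak_gb_per_s(64, 400, channels=2),
    "quad PC800 DRDRAM": peak_gb_per_s(16, 800, channels=4),
}

# Read plus write figures for the processor bus, as quoted in the post.
BUS_AGGREGATE_GB_S = 3.2 + 3.2

for name, bandwidth in configs.items():
    print(f"{name}: {bandwidth:.1f} GB/s peak "
          f"vs {BUS_AGGREGATE_GB_S:.1f} GB/s bus aggregate")
```

The dual PC1600 DRDRAM configuration is left out because the post does not give its channel width. These are theoretical peaks only; sustained rates would be lower.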

SMP PPC 970 systems would have to share L3 cache in the suggested L3 scheme, (L3 cache embedded in the system ASIC chip). Now the POWER4 has L3 cache. If IBM wanted L3 cache in the PPC 970, one would think they would have left the L3 cache and main memory support circuitry in the PPC 970, but they didn't. So I think that means no L3 cache for PPC 970 systems.

Last time I checked, SRAM costs about $100/MByte. The higher the clock rate, the more expensive it got.

rickag · May 15, 2003 1:17PM

Thanks for the information THT. So an IBM 970 w/out L3 cache would save about $200 over the current high end G4s, which have 2 MB of expensive L3 DDRsram.

And not need it to boot.

Here's hoping the cost of Apple's towers won't go up from their current price points and may even go down.

edit: er, um, make that $200 PER G4 in the high end duals, that's $400 if my math is right.

amorph · May 15, 2003 3:19PM

Exactly. Less the pricey L3 cache, and the pricey bus to get to the L3 cache, and with the motherboard built on a RapidIO fabric (which is designed to be inexpensive to implement), I really don't see Apple having to increase costs or prices at all in order to make the 970 work.

Hannibal's explanation (in the forums) for his blaming Apple for the bus speeds ('they designed the implementation and the memory controller') is uncharacteristically weak for him. He admits that he was venting frustration when he said that; he should just remove it from the article. Within the constraints Motorola imposed, their implementation is actually very nice. It's just that the constraints suck.

Otherwise, I look forward to part III, where he actually has access to cache latency and a finalized, public design for the CPU to work from. The man's impressively thorough and careful.

programmer · May 15, 2003 9:41PM

I respectfully disagree with your sentiments about the L3. Sure, it's not necessary, but it would add value to the system, and if embedded in the chipset ASIC the additional cost could be minimal. A dual channel DDR400 memory system will provide 6.4 GB/sec of bandwidth theoretically, less in practice. A single 970 could consume this much bandwidth, a dual could consume double, and adding the AGP / PCI / FireWire / Ethernet just adds to the demands on memory. If Moki's hints about vector processors in the chipset are correct, that could be a HUGE additional load. I could easily see Apple's next machine capable of demanding >>16 GB/sec of bandwidth from the memory, which is unlikely to be able to even reach 6 GB/sec in practice. A couple of megs of L3 cache embedded in the chipset ASIC at only marginal additional cost could provide a substantial performance improvement.

IBM could do this too (obviously), and the fact that the 970 has a fast FSB isn't a reason not to do it... it is a reason to do it! An L3 like this isn't a hack to get around a slow FSB, it's an optimization of the memory subsystem. The fast FSB lets them get away without having to build such a thing into the 970, leaving it to the system designer.
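The supply-versus-demand argument above can be put into numbers. This is a rough sketch using only figures quoted in the post; the constant and function names are made up for illustration, and 6.0 GB/sec is treated as an optimistic ceiling for what memory delivers in practice:

```python
PRACTICAL_MEMORY_GB_S = 6.0  # the post expects memory not to reach even this
PER_CPU_DEMAND_GB_S = 6.4    # what the post says a single 970 could consume
CLAIMED_SYSTEM_DEMAND_GB_S = 16.0  # the post's ">>16 GB/sec", taken as a floor

def cpu_demand(num_cpus):
    """Worst-case CPU demand on memory, ignoring I/O and any chipset load."""
    return num_cpus * PER_CPU_DEMAND_GB_S

def oversubscription(demand_gb_s, supply_gb_s=PRACTICAL_MEMORY_GB_S):
    """How many times the demand exceeds what memory can deliver."""
    return demand_gb_s / supply_gb_s

dual = cpu_demand(2)
print(f"dual 970 demand: {dual:.1f} GB/s, "
      f"{oversubscription(dual):.2f}x the practical supply")
print(f"whole system: {CLAIMED_SYSTEM_DEMAND_GB_S:.1f} GB/s, "
      f"{oversubscription(CLAIMED_SYSTEM_DEMAND_GB_S):.2f}x the practical supply")
```

Any traffic the L3 absorbs comes off the demand side of this ratio, which is the case being made for putting the cache in the chipset.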

amorph · May 15, 2003 11:28PM

Just out of curiosity, since I do like the way the RAM currently has enough bandwidth to saturate all the busses that request data from its banks, can anyone think of a non-exorbitant way to get that 16GB/s of memory throughput? I can't.

If not, it seems we're back to a memory bottleneck, and in that case an L3 might help. But I think it'd help if it were something significantly cheaper than the G4's L3. I'm not sure that Apple's paying $200 a box for the stuff, but it can't be helping any.

programmer · May 16, 2003 12:01AM

Quote:

Originally posted by Amorph

Just out of curiosity, since I do like the way the RAM currently has enough bandwidth to saturate all the busses that request data from its banks, can anyone think of a non-exorbitant way to get that 16GB/s of memory throughput? I can't.

If not, it seems we're back to a memory bottleneck, and in that case an L3 might help. But I think it'd help if it were something significantly cheaper than the G4's L3. I'm not sure that Apple's paying $200 a box for the stuff, but it can't be helping any.

ATI Radeon9700 == 19 GB/sec

nVidia geForceFX 9600 (?) == 27 GB/sec

tht · May 16, 2003 12:10AM

Quote:

Originally posted by Programmer

I respectfully disagree with your sentiments about the L3. Sure, it's not necessary, but it would add value to the system, and if embedded in the chipset ASIC the additional cost could be minimal. ... A single 970 could consume this much bandwidth, a dual could consume double, and adding the AGP / PCI / FireWire / Ethernet just adds to the demands on memory. If Moki's hints about vector processors in the chipset are correct, that could be a HUGE additional load. I could easily see Apple's next machine capable of demanding >>16 GB/sec of bandwidth from the memory, which is unlikely to be able to even reach 6 GB/sec in practice.

Remember the architecture proposed here:

Code:

        -----------------
       |                 |
       | 1.8 GHz PPC 970 |
       |                 |
        -----------------
           |        /|\
           |         |
          3.2       3.2
          GB/s      GB/s
           |         |
           |         |
          \|/        |
        -----------------
       |  .............  |
       |  . L3 cache  .  |----- AGP
       |  .............  |----- PCI ----- South bridge I/O
       |                 |----- SATA
       |   System ASIC   |----- Firewire, Ethernet
       |                 |
        -----------------
          /|\       /|\
           |         |
          3.2       3.2
          GB/s      GB/s
           |         |
           |         |
          \|/       \|/
        -------   -------
       |  PC   | |  PC   |
       | 3200  | | 3200  |
        -------   -------

Data to the PPC 970 cannot be any faster than the bandwidth of the processor bus. So, how does an inline L3 cache feed more than 6.4 GB/s to the PPC 970 when it's limited to 3.2 GB/s reads and 3.2 GB/s writes (for an aggregate of 6.4 GB/s)? Year 2003 memory technology can saturate this bus, so any hypothetical inline L3 cache can only improve performance through reduced read latencies.

Now, yes, actual data rates for a dual channel PC3200 solution will probably be in the 3.2 GB/s to 4 GB/s range. But an L3 cache is still limited to a theoretical ~6.4 GB/s, and the bandwidth improvement won't be all that much. That's if L3 cache can have near 100% bus utilization.

If there is an L3 cache implementation, I would hope that IBM would implement a backside cache or at least let the bus be clocked at higher rates. Like 2x900 MHz for a 1.8 GHz CPU for 1.8 GHz bandwidth rates, so that an inline cache could at least have twice as much bandwidth as main memory.

Quote:

A couple of megs of L3 cache embedded in the chipset ASIC at only marginal additional cost could provide a substantial performance improvement.

Generally, backside cache only improves performance by 5 to 10%. An inline cache, especially with a high performance main memory, would improve performance less than that.

Embedded memory in the ASIC could be used as a buffer for all the various subsystems and improve smoothness and reduce contention here and there, but in terms of system performance, I don't think the proposed architecture would do much to help.
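The two halves of this argument, a bandwidth ceiling set by the processor bus and a gain that comes only from latency, can be sketched as below. The 3.2 GB/s read limit is from the post; the latencies and hit rates are hypothetical values chosen only to show the shape of the trade-off, and the model ignores the cost of checking the L3 on a miss:

```python
def deliverable_read_gb_s(source_gb_s, bus_read_gb_s=3.2):
    """Reads reach the CPU no faster than the slower of source and bus."""
    return min(source_gb_s, bus_read_gb_s)

def average_read_latency_ns(l3_hit_rate, l3_ns, dram_ns):
    """Simple average: hits are served by the L3, misses by main memory."""
    return l3_hit_rate * l3_ns + (1.0 - l3_hit_rate) * dram_ns

# Hypothetical latencies, not measurements of any real system.
L3_NS = 40.0
DRAM_NS = 120.0

# However fast the inline cache is, the bus caps what the CPU can read.
print(deliverable_read_gb_s(source_gb_s=20.0))

# The benefit shows up as lower average latency as the hit rate rises.
for hit_rate in (0.0, 0.25, 0.5, 0.75):
    latency = average_read_latency_ns(hit_rate, L3_NS, DRAM_NS)
    print(f"hit rate {hit_rate:.2f}: {latency:.0f} ns average read latency")
```

How much this matters depends on the real hit rate and on how well the 970 hides memory latency, neither of which the thread can settle.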

shaktai · May 16, 2003 12:13AM

Quote:

Originally posted by Amorph

If not, it seems we're back to a memory bottleneck, and in that case an L3 might help. But I think it'd help if it were something significantly cheaper than the G4's L3. I'm not sure that Apple's paying $200 a box for the stuff, but it can't be helping any.

Wow! A quantum leap from slow bottleneck to fast bottleneck.

Excuse a non programmer/techie, but do I understand this correctly? The 970 is powerful enough in theory that the limitation now actually becomes the memory bus or the speed of the memory? Hence, the L3 cache becomes not a necessity to keep the CPU fed, as it was with the G4, but rather a method whereby the load on the primary memory can be reduced, since the memory bus is now the limitation rather than the CPU bus?

tht · May 16, 2003 12:25AM

Quote:

Originally posted by Amorph

Just out of curiosity, since I do like the way the RAM currently has enough bandwidth to saturate all the busses that request data from its banks, can anyone think of a non-exorbitant way to get that 16GB/s of memory throughput? I can't.

I don't know if the graphics card memory solutions Programmer mentioned are non-exorbitant, but Rambus has a roadmap that provides said bandwidths, as main memory.

Quad channel PC1066 Rambus has 8.4 GB/s bandwidth. PC1200 DRDRAM would have 9.6 GB/s bandwidth. If more is needed, add 2 or 4 more channels. This is year 2003 tech. Gigantic bus improvements can be had in the future with Yellowstone and Redwood tech (50+ GB/s).
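To see how many channels the 16 GB/s target would take, here is a small sketch assuming 16-bit DRDRAM channels and nominal transfer rates of 1066 and 1200 MT/s (helper names are illustrative). At a nominal 1066 MT/s the quad-channel peak works out to about 8.5 GB/s, so the 8.4 figure above presumably rounds the per-channel rate down to 2.1 GB/s:

```python
import math

def rdram_channel_gb_s(mega_transfers_per_s, width_bits=16):
    """Theoretical peak of one channel in GB/s (16-bit width is assumed)."""
    return width_bits / 8 * mega_transfers_per_s / 1000

def channels_needed(target_gb_s, per_channel_gb_s):
    """Smallest whole number of channels whose combined peak meets the target."""
    return math.ceil(target_gb_s / per_channel_gb_s)

TARGET_GB_S = 16.0  # the throughput asked about earlier in the thread

for name, rate in (("PC1066", 1066), ("PC1200", 1200)):
    per_channel = rdram_channel_gb_s(rate)
    print(f"{name}: quad channel {4 * per_channel:.2f} GB/s, "
          f"{channels_needed(TARGET_GB_S, per_channel)} channels "
          f"for {TARGET_GB_S:.0f} GB/s")
```

On these assumptions the target falls between the "2 or 4 more channels" suggested above, at least for theoretical peak rates.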

drboar · May 16, 2003 2:13AM

The L3 jabber is premature! No matter how you design your system, something will be the weakest link! Adding an L3 at quite substantial cost only makes sense in some specific situations. For all we know (and that is not much), it appears that

1. The 970 will be an improvement over the G4 performance wise

2. The 970 is quite competitive with Intel/AMD CPUs

So why shouldn't Apple keep it simple and use the 970 as is?

Having an L3 option is a good thing, whether for adding it to top of the line products like quad blade servers for rendering farms and such things, or as a fallback: if there is a long stall in the clocking up of the 970, it can be compensated by an L3 being grafted onto the 970.

Apple will get the 970 out of the door and then perhaps do an L3, if it is useful at all. The fact that the L3 helps the G4 means nothing for the 970! Different CPUs, different rules! The first Pentium had a severe bus speed dependence, where a 166 MHz CPU on a 2x83 MHz bus was faster than a 200 MHz CPU on a 3x66 MHz bus (described on Tom's Hardware Guide); the contemporary 601 and 604 did not show such behaviour.
