CONFIRMED IBM Power PC 970

barto · October 23, 2002 5:17PM

[quote]Originally posted by Outsider:



If it is locked at 4x how will this satisfy the need for Apple to have a processor that can scale to speeds beyond 1.8GHz? IBM has stated that the bus works at speeds of up to 900MHz. This would indicate that initially they are limited to a speed of 1.8GHz off the bat and I don't think they can scale their bus any faster at the moment. Although I see no problem in down scaling it (700MHz bus for a 1.4GHz part for example) they will have a harder time in upscaling it. It would make sense that they have flexible CPU/bus ratios. Do you know of any limitations with their bus that would prevent such a thing?<hr></blockquote>

I disagree. IBM will increase clock speed and bus speed together, like with the Power4.

This chip won't be like the Pentium4 or G4, with lots of little speed increases.

Barto

outsider · October 23, 2002 8:55PM

[quote]Originally posted by Barto:



I disagree. IBM will increase clock speed and bus speed together, like with the Power4.

This chip won't be like the Pentium4 or G4, with lots of little speed increases.

Barto<hr></blockquote>

How does that negate what I said? Instead of a 112.5MHz bus they could use a 125MHz bus (1000MHz) and get 2GHz, so the bus and CPU scale linearly. Or a 133MHz base bus would allow for a 2.13GHz CPU.
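The scaling argument above can be checked with a few lines of arithmetic. This is only a sketch: the 8x step from base clock to bus rate and the 2:1 CPU-to-bus ratio are inferred from the figures in this thread (112.5MHz -> 900MHz -> 1.8GHz), not taken from IBM documentation, and the function name is made up for illustration.

```python
# Assumed ratios, inferred from the thread's 112.5 MHz -> 900 MHz -> 1.8 GHz example.
BUS_MULTIPLIER = 8      # base clock -> effective bus rate (assumption)
CPU_TO_BUS_RATIO = 2    # CPU clock is twice the bus rate (assumption)


def scale_from_base(base_mhz):
    """Return (bus_mhz, cpu_mhz) for a given base clock in MHz."""
    bus_mhz = base_mhz * BUS_MULTIPLIER
    cpu_mhz = bus_mhz * CPU_TO_BUS_RATIO
    return bus_mhz, cpu_mhz


for base in (112.5, 125.0, 133.0):
    bus, cpu = scale_from_base(base)
    print(f"base {base:6.1f} MHz -> bus {bus:6.0f} MHz -> CPU {cpu / 1000:.2f} GHz")
```

With these assumed ratios a 133MHz base works out to a 1064MHz bus and roughly a 2.13GHz part, so bus and CPU do move in lockstep.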

majormatt · October 23, 2002 9:24PM

I wonder how much 970 chips will cost.

snoopy · October 24, 2002 12:22AM

From the discussions about the bus, it is not clear whether the bus comes out at half speed or whether there is a divider for lower bus speeds. In any case, it looks like a chip may be needed between the PPC processor and the motherboard. If so, IBM may supply that chip. IBM likely has plans to use the 970 in some Linux workstations and low end servers, so they need the chip. If IBM makes that chip versatile, it makes the 970 much easier to market to other companies.

ed m. · October 24, 2002 12:54AM

About the 970's bus...

Here is what a friend and reliable source told me...

[[[ IBM's bus is actually 2 32-bit unidirectional buses. Their claim of 6.4 GB/s

is actually 3.2 GB/s of read bandwidth, and 3.2 GB/s of write bandwidth. This

is an important distinction since many applications (with notable exception of

certain mem-copy type stuff) typically stress read bandwidth much more than

write. Don't get me wrong, 6.4 GB/s is a fair peak claim, but it does overstate

what most apps will actually be able to use.]]] - source.
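To see why the read/write split matters, here is a small sketch of the combined bandwidth an application could reach for a given mix of reads and writes. It assumes only the 3.2 GB/s per-direction figure quoted above and that the two directions are fully independent; the function name is hypothetical.

```python
PER_DIRECTION_GBS = 3.2  # per-direction peak quoted in the thread


def usable_bandwidth(read_fraction, per_direction=PER_DIRECTION_GBS):
    """Peak combined GB/s when `read_fraction` of the traffic is reads.

    Reads and writes travel on separate links, so whichever direction
    carries the larger share saturates first and caps the total.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    busiest_share = max(read_fraction, 1.0 - read_fraction)
    return per_direction / busiest_share


for mix in (0.5, 0.8, 1.0):
    print(f"{mix:.0%} reads -> {usable_bandwidth(mix):.1f} GB/s combined")
```

Only a perfectly balanced 50/50 mix (a mem-copy, say) reaches the full 6.4 GB/s; an 80% read workload tops out around 4.0 GB/s, and a pure read stream at 3.2 GB/s.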

As far as the AltiVec/VMX compatibility goes, he had this to say:

[[[This description does not imply ANY incompatibility. AltiVec "dst" instructions

currently define 4 >software< definable prefetch streams. Hardware prefetch

streams are a separate issue from software prefetch streams and typically

invisible to software.

A good microarchitect might try to reuse some of the logic structures

between the two, with the >possible< result that when software was

using all 4 of its prefetch streams, the automatic hardware prefetcher

would only have access to 4 other prefetch streams instead of all 8. This is

just speculation, however ... ]]] - source

Programmer, what do *you* think?

Anyone else?

--

Ed M.

programmer · October 24, 2002 8:58AM

[quote]Originally posted by Ed M.:

[[[ IBM's bus is actually 2 32-bit unidirectional buses. Their claim of 6.4 GB/s

is actually 3.2 GB/s of read bandwidth, and 3.2 GB/s of write bandwidth. This

is an important distinction since many applications (with notable exception of

certain mem-copy type stuff) typically stress read bandwidth much more than

write. Don't get me wrong, 6.4 GB/s is a fair peak claim, but it does overstate

what most apps will actually be able to use.]]] <hr></blockquote>

I agree... in fact I already pointed this out a few days back in some thread or another.

[quote]

[[[This description does not imply ANY incompatibility. AltiVec "dst" instructions

currently define 4 >software< definable prefetch streams. Hardware prefetch

streams are a separate issue from software prefetch streams and typically

invisible to software.

A good microarchitect might try to reuse some of the logic structures

between the two, with the >possible< result that when software was

using all 4 of its prefetch streams, the automatic hardware prefetcher

would only have access to 4 other prefetch streams instead of all 8. This is

just speculation, however ... ]]]<hr></blockquote>

That's a good point. Based on the context I had jumped to a conclusion that may not be valid. The software controlled prefetch stream instructions, however, have two reserved bits at the top of the stream index field which indicates that it would be very simple to extend the instructions to handle up to 8 or 16 streams. Things like this in ISA design are rarely coincidence so it was probably deliberately planned from inception and IBM is just executing on it. Your source is correct that these instructions are just hints, however, and that IBM is free to associate software "logical" prefetch streams with actual hardware in whatever fashion they see fit. Indeed, the POWER4 (which doesn't have the VMX stream control instructions) would use its prefetch hardware in an automatic fashion based on the thread's memory access pattern. We can hope that the 970 will allow the software to control the 8 streams directly, but will automatically use the unused ones in the POWER4 fashion. Until IBM documentation arrives, however, we can only speculate.
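The two ideas in this exchange can be put into a toy model. It is speculation in code form, like the discussion itself: the counts (4 software streams, 8 hardware streams) come from the thread, while the sharing policy and both function names are assumptions, not anything IBM has documented.

```python
HARDWARE_STREAMS = 8   # total prefetch streams mentioned in the thread
SOFTWARE_STREAMS = 4   # streams addressable by today's AltiVec "dst" hints


def addressable_streams(index_bits):
    """Number of streams a stream-index field of `index_bits` bits can name."""
    return 2 ** index_bits


def streams_left_for_hardware(software_in_use,
                              total=HARDWARE_STREAMS,
                              software_max=SOFTWARE_STREAMS):
    """Streams the automatic prefetcher could still use (assumed policy:
    software hints claim hardware streams first, the rest run automatically)."""
    if not 0 <= software_in_use <= software_max:
        raise ValueError("software_in_use out of range")
    return total - software_in_use


# A 2-bit index names 4 streams; one or two extra bits give 8 or 16.
print([addressable_streams(bits) for bits in (2, 3, 4)])
# With all 4 software streams active, 4 remain for the hardware prefetcher.
print(streams_left_for_hardware(4))
```

Widening the index by the two reserved bits is what would take the instructions from 4 streams to 8 or 16, without changing how existing code behaves.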

mmicist · October 24, 2002 10:16AM

quote:

Originally posted by Ed M.:

[[[ IBM's bus is actually 2 32-bit unidirectional buses. Their claim of 6.4 GB/s

is actually 3.2 GB/s of read bandwidth, and 3.2 GB/s of write bandwidth. This

is an important distinction since many applications (with notable exception of

certain mem-copy type stuff) typically stress read bandwidth much more than

write. Don't get me wrong, 6.4 GB/s is a fair peak claim, but it does overstate

what most apps will actually be able to use.]]]

quote:

originally posted by programmer:

[[I agree... in fact I already pointed this out a few days back in some thread or another.]]

However, there is a lot of overhead on a standard bidirectional bus when you don't have a continuous stream of data in one direction. If you have a single bus that must reverse the data direction continually when requesting and receiving data, this slows it down a lot. Look at the STREAM results for the P4 and Athlon: they get nowhere near the theoretical bandwidth. If you have independent unidirectional busses, one bus can be sending the addresses of required data whilst the other continuously streams the data, so you may well get close to 3GB/s actual data transfer. The removal of bus contention should also allow much better queue management on the bus.

Personally, I'm intrigued by IBM's use of the term "elastic bus". Does this mean it is elastic in terms of frequency, as it can use an integer divisor of the chip frequency and scale with the chip clock as well, or does it imply its width is variable, as with the Athlon version of Hammer, which has a HyperTransport bus that can function as a single 16-bit wide bus or two 8-bit wide busses?

[ 10-24-2002: Message edited by: mmicist ]

programmer · October 24, 2002 11:12AM

Absolutely -- there are probably all sorts of reasons why a pair of uni-directional busses have advantages over a single bi-directional bus.

Virtually all busses work on bursts of data, usually 32-bytes in size since that's the size of a typical cacheline, but potentially much longer. A single burst is an atomic transaction and cannot be interrupted so if you want to send something in the other direction you have to wait for a burst to end. This means you want to keep bursts short enough that you don't interfere with traffic in the other direction. A dual uni-directional bus doesn't have this problem so they can go to longer bursts. Happily these can be generated by the hardware prefetch engines. Each burst has overhead (i.e. synchronization, what address to send, and how much to send) so increasing burst size makes the burst more efficient... send twice as much and the overhead is half the %age of the data.
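The burst-size point is easy to put numbers on. In this sketch the 8-byte per-burst overhead is a made-up figure purely for illustration (the real cost of synchronization and addressing on this bus is not public), and the function name is hypothetical.

```python
def burst_efficiency(payload_bytes, overhead_bytes):
    """Fraction of the bytes moved in one burst that are actual data."""
    return payload_bytes / (payload_bytes + overhead_bytes)


ASSUMED_OVERHEAD = 8  # bytes of sync/address/length per burst (assumption)

for payload in (32, 64, 128):
    eff = burst_efficiency(payload, ASSUMED_OVERHEAD)
    relative_overhead = ASSUMED_OVERHEAD / payload
    print(f"{payload:3d}-byte burst: {eff:.1%} efficient, "
          f"overhead is {relative_overhead:.1%} of the data")
```

Each doubling of the burst halves the overhead relative to the data, which is why longer bursts pay off once nothing has to wait for the bus to turn around.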

At the signaling level there are advantages too, I'd expect. If you know all your signals are going in the same direction you can predict the effects of cross-talk between the signal lines better. The signal strengths are all equal as well because they all originate from the same place and drop off at the same rate.

There is also no negotiation for who gets to send on the bus... each direction is owned by exactly one chip, which can therefore send at will and manage data flow across the bus.

None of which changes the fact that the GPUL maxes out at 3.2 GB/sec in either direction as opposed to having a flexible 6.4 GB/sec.

tjm · October 24, 2002 12:55PM

[quote]Originally posted by mmicist:

Personally, I'm intrigued by IBM's use of the term "elastic bus". Does this mean it is elastic in terms of frequency, as it can use an integer divisor of the chip frequency and scale with the chip clock as well, or does it imply its width is variable, as with the Athlon version of Hammer, which has a HyperTransport bus that can function as a single 16-bit wide bus or two 8-bit wide busses?

<hr></blockquote>

Perhaps this "elastic bus" goes with their "Switched fabric" - I've got it! The IBM PPC970 is made out of Spandex!!

OK, I'll shut up and go away now... [Chilling]

mmicist · October 24, 2002 1:21PM

[quote]Originally posted by Programmer:



None of which changes the fact that the GPUL maxes out at 3.2 GB/sec in either direction as opposed to having a flexible 6.4 GB/sec.

<hr></blockquote>

Absolutely. But frankly, I would do evil things to get 3.2GB/s feed to my processor, especially when that processor can do double precision FP calculations at quite such a spectacular rate.

I've just ordered a dual 2.6GHz Xeon because I need that FP performance *now* to check/run my code (and maybe a little bit to sell it too). In that regard the G4 really doesn't cut it, although I write the code on a B+W G3 because I prefer it.

michael

kupan787 · October 24, 2002 1:46PM

[quote]Originally posted by Programmer:

None of which changes the fact that the GPUL maxes out at 3.2 GB/sec in either direction as opposed to having a flexible 6.4 GB/sec.

<hr></blockquote>

But is the 3.2 GB/sec the base, or could it be double pumped (I have no idea if this wording is even correct)? All the talk has changed from MHz bus speeds to bandwidth, so the whole double/quad pumped bus deal has confused me.

So is it 2 unidirectional 450 MHz buses (they just add them to 900 MHz for marketing), or is it 2 unidirectional 450MHz busses that are both double pumped so they are effectively BOTH 900MHz busses?

Also, the 3.2 GB/sec number, that is after subtracting overhead, yes? I remember hearing that it was 7.2, but after factoring in overhead, it was 6.4 (or something along those lines).

What does the current G4 bus get (after overhead)?

What could it get if it was a true ddr bus?

What does the P4 get (with its quad pumped bus)?

[ 10-24-2002: Message edited by: kupan787 ]

mmicist · October 24, 2002 2:20PM

[quote]Originally posted by kupan787:



But is the 3.2 GB/sec the base, or could it be double pumped (I have no idea if this wording is even correct)? All the talk has changed from MHz bus speeds to bandwidth, so the whole double/quad pumped bus deal has confused me.

So is it 2 unidirectional 450 MHz buses (they just add them to 900 MHz for marketing), or is it 2 unidirectional 450MHz busses that are both double pumped so they are effectively BOTH 900MHz busses?

<hr></blockquote>

It's 2 unidirectional 450MHz busses that are both double pumped to 900 MHz.

[quote]



Also, the 3.2 GB/sec number, that is after subtracting overhead, yes? I remember hearing that it was 7.2, but after factoring in overhead, it was 6.4 (or something along those lines).

What does the current G4 bus get (after overhead)?

What could it get if it was a true ddr bus?

What does the P4 get (with its quad pumped bus)?

<hr></blockquote>

Correct. As for the actual throughput figures, I'll see what I can find; I don't have anything to hand at the moment.

michael

[ 10-24-2002: Message edited by: mmicist ]
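The 7.2 versus 6.4 figures fall straight out of the numbers in this exchange. A quick sketch, assuming a 450MHz clock, two transfers per clock, and 32-bit (4-byte) links in each direction as described in the thread; treating the whole gap between raw and quoted bandwidth as protocol overhead is an assumption, and the helper name is hypothetical.

```python
def peak_gbs(clock_mhz, transfers_per_clock, width_bytes):
    """Raw peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return clock_mhz * transfers_per_clock * width_bytes / 1000.0


raw_per_direction = peak_gbs(450, 2, 4)   # double pumped, 32 bits wide
raw_total = 2 * raw_per_direction         # one link each way
quoted_total = 6.4                        # IBM's headline figure

# Share of the raw rate not available for data (assumed to be overhead).
overhead_fraction = 1 - quoted_total / raw_total

print(f"raw per direction: {raw_per_direction:.1f} GB/s")
print(f"raw total:         {raw_total:.1f} GB/s")
print(f"implied overhead:  {overhead_fraction:.1%}")
```

So the raw rate is 3.6 GB/s each way, and the quoted 3.2 GB/s per direction implies roughly a ninth of the bus time going to something other than data.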

mmicist · October 24, 2002 7:03PM

[quote]Originally posted by mmicist:



Correct. As for the actual throughput figures, I'll see what I can find; I don't have anything to hand at the moment.

michael<hr></blockquote>

Some *very approximate* figures, averaged over various reported measurements:

Athlon 2100+ (DDR 266 bus) ~1000MB/s

P4 2.0GHz (QDR 400 bus) ~1500MB/s

PowerMac G4 867 (SDR 133 bus) ~700MB/s

but

POWER4 ~22000MB/s

(I think the whole thing fits into the POWER4's L3 cache, though)

michael
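For scale, the measured figures above can be set against each bus's theoretical peak. This sketch assumes a 64-bit (8-byte) data path on all three front-side buses and treats the "266", "400" and "133" labels as millions of transfers per second; the measured numbers are the approximate ones from the post, and the names are hypothetical.

```python
BUS_WIDTH_BYTES = 8  # assumed 64-bit data path for all three buses

# name -> (effective transfer rate in MT/s, measured MB/s from the post)
MEASURED = {
    "Athlon 2100+ (DDR 266)": (266, 1000),
    "P4 2.0GHz (QDR 400)": (400, 1500),
    "PowerMac G4 867 (SDR 133)": (133, 700),
}


def efficiency(rate_mts, measured_mbs, width_bytes=BUS_WIDTH_BYTES):
    """Measured throughput as a fraction of the theoretical bus peak."""
    peak_mbs = rate_mts * width_bytes
    return measured_mbs / peak_mbs


for name, (rate, measured) in MEASURED.items():
    print(f"{name:28s} peak {rate * BUS_WIDTH_BYTES:5d} MB/s, "
          f"measured {measured:5d} MB/s, {efficiency(rate, measured):.0%}")
```

Under these assumptions the two x86 machines deliver a bit under half of their bus peak, while the G4 delivers about two thirds of a much smaller peak.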

programmer · October 24, 2002 9:42PM

[quote]Originally posted by mmicist:



Some *very approximate* figures, averaged over various reported measurements:

Athlon 2100+ (DDR 266 bus) ~1000MB/s

P4 2.0GHz (QDR 400 bus) ~1500MB/s

PowerMac G4 867 (SDR 133 bus) ~700MB/s

but

POWER4 ~22000MB/s

(I think the whole thing fits into the POWER4's L3 cache, though)

michael<hr></blockquote>

The PIV and Athlon rates are probably memory limited, as opposed to bus limited. The G4, on the other hand, is bus limited. The new MPX 167 MHz bus gets about 880 MB/sec. Those G4 numbers are without using the AltiVec streaming instructions -- with those I've seen rates as high as 840 and 1000 MB/sec for highly optimized pieces of code. Unfortunately the G4 shares the bus between processors. I don't think the PIV Xeon does..?

bigc · October 25, 2002 12:30AM

[quote]Originally posted by mmicist:



Some *very approximate* figures, averaged over various reported measurements:

Athlon 2100+ (DDR 266 bus) ~1000MB/s

P4 2.0GHz (QDR 400 bus) ~1500MB/s

PowerMac G4 867 (SDR 133 bus) ~700MB/s

but

POWER4 ~22000MB/s

(I think the whole thing fits into the POWER4's L3 cache, though)

michael<hr></blockquote>

22,000MB/s (22 GB/s) for Power4, or 2,200MB/s?

moki · October 25, 2002 1:29AM

[quote]Originally posted by Programmer:



No, the Motorola compilers are actually pretty good. AFAIK, SPEC runs on very large blocks of data with poor cache coherency and this hurts the G4 worse than most other chips because of its relatively slow MPX bus. If the benchmarks emphasized a heavily used working set of 1 MB then the G4 would, I'm sure, stack up much better.<hr></blockquote>

SPEC is a set of demi-real world tasks -- have a look:

http://www.spec.org/osg/cpu2000/papers/COMPUTER_200007-abstract.JLH.html

the specific benchmarks for SPEC INT: http://www.spec.org/osg/cpu2000/CINT2000/

the specific benchmarks for SPEC FP: http://www.spec.org/osg/cpu2000/CFP2000/

mmicist · October 25, 2002 8:47AM

[quote]Originally posted by Bigc:



22,000MB/s (22 GB/s) for Power4, or 2,200MB/s?<hr></blockquote>

22 GB/s

michael

mmicist · October 25, 2002 8:50AM

[quote]Originally posted by Programmer:



The PIV and Athlon rates are probably memory limited, as opposed to bus limited. The G4, on the other hand, is bus limited. The new MPX 167 MHz bus gets about 880 MB/sec. Those G4 numbers are without using the AltiVec streaming instructions -- with those I've seen rates as high as 840 and 1000 MB/sec for highly optimized pieces of code. Unfortunately the G4 shares the bus between processors. I don't think the PIV Xeon does..?<hr></blockquote>

Yes, I know, I just wanted to give some ballpark figures. The G4 figure I gave was for an old SDR memory machine.

The P4 Xeon does use a shared bus, the Athlon MP uses point-to-point.

michael

programmer · October 25, 2002 10:15AM

[quote]Originally posted by moki:



SPEC is a set of demi-real world tasks -- have a look:

http://www.spec.org/osg/cpu2000/papers/COMPUTER_200007-abstract.JLH.html

the specific benchmarks for SPEC INT: http://www.spec.org/osg/cpu2000/CINT2000/

the specific benchmarks for SPEC FP: http://www.spec.org/osg/cpu2000/CFP2000/<hr></blockquote>

I've seen that before; unfortunately it doesn't really say how large the data sets in question are. The G4 SPEC marks tossed about make no sense relative to the other processors, even accounting for a 50% clock rate deficit. I know from first hand experience that the G4 isn't that much slower. The difference must be bandwidth, and possibly the use of lame compilers...? Motorola has some good compilers though, so I'd be surprised if they didn't use them.

moki · October 25, 2002 10:54AM

[quote]Originally posted by Programmer:



I've seen that before, unfortunately it doesn't really say how large the data sets in question are. <hr></blockquote>

they do list what they use; here's the gzip test for instance:

.....

164.gzip's reference workload has five components: a large TIFF image, a webserver log, a program binary, random data, and a source tar file. With the exception of the random data, these components were selected as a reasonably representative set of things that gzip might be most often used on. The random data is present to test gzip's worst-case behavior.

.....

...and you can download the reference data files to run the tests yourself.

As for why the G4 scores so poorly, I honestly have no idea -- it isn't _that_ slow of a processor.
