Second Ars PPC 970 Article


Comments

  • Reply 61 of 143
shaktai Posts: 157, member
    DrBoar,



    I agree that a lot of the talk is premature, but a little speculation does no harm and is actually fun.



I think we will more likely see (based upon known statements) that one of Apple's primary goals will be to keep the price down. I am doubtful that they would implement any type of L3 cache scheme unless there were a very good reason for it; the benefit of implementation must outweigh the cost. For initial marketing purposes, I suspect that given a choice between a 5% performance boost and being able to sell a box for $1999 instead of $2299, they will opt for the lower cost, since the 970 will already bring a dramatic improvement in performance. Still, if Programmer's concept, or a similar one, could be implemented without raising the cost dramatically and yet yield a performance increase of at least 10%, it would seem like a viable option.



A scenario: if a high-end dual-processor model can "overall" match a P4 3.06's performance without the L3 cache, but could obtain at least a 10% performance gain with it (equivalent to a 3.3 GHz P4), wouldn't it be prudent to do so if it provided a marketing and performance edge?



    Of course this is all speculation. We will have to wait and see as too little about the real performance is known yet. Once real world performance is known, then a better evaluation can be made.



    Still the possibilities are there.
  • Reply 62 of 143
thai moof Posts: 76, member
    Quote:

    Originally posted by Amorph

    ...

    Hannibal's explanation (in the forums) for his blaming Apple for the bus speeds ('they designed the implementation and the memory controller') is uncharacteristically weak for him. He admits that he was venting frustration when he said that; he should just remove it from the article. Within the constraints Motorola imposed, their implementation is actually very nice. It's just that the constraints suck.







I second this thought. In fact, it somewhat soured my whole feeling about the article that something so obviously wrong was stated at the start. I realize he was using it as a prop for his 'now I know what Apple has been doing the last two years' line, but it still is not appropriate for a technical article to play so fast and loose with the facts. It gave me the impression it was an op-ed piece, and not a technical paper. Too bad, since he obviously spent a great deal of time on it.



    Quote:

    Originally posted by Amorph

    ...

    Otherwise, I look forward to part III, where he actually has access to cache latency and a finalized, public design for the CPU to work from. The man's impressively thorough and careful.





    As do I. Though I hope I will be using the chip while reading the article!
  • Reply 63 of 143
mmicist Posts: 214, member
    Quote:

    Originally posted by THT





Data to the PPC 970 cannot arrive any faster than the bandwidth of the processor bus. So, how does an inline L3 cache feed more than 6.4 GB/s to the PPC 970 when it's limited to 3.2 GB/s reads and 3.2 GB/s writes (for an aggregate of 6.4 GB/s)? Year-2003 memory technology can saturate this bus, so any hypothetical inline L3 cache can only improve performance through reduced read latencies.



Now, yes, actual data rates for a dual-channel PC3200 solution will probably be in the 3.2 GB/s to 4 GB/s range. But an L3 cache is still limited to a theoretical ~6.4 GB/s, and the bandwidth improvement won't be all that much. And that's only if the L3 cache can achieve near-100% bus utilization.



If there is an L3 cache implementation, I would hope that IBM would implement a backside cache, or at least let the bus be clocked at higher rates, like 2x900 MHz for a 1.8 GHz CPU, so that an inline cache could have at least twice as much bandwidth as main memory.







Generally, a backside cache only improves performance by 5 to 10%. An inline cache, especially with high-performance main memory, would improve performance less than that.



    Embedded memory in the ASIC could be used as a buffer for all the various subsystems and improve smoothness and reduce contention here and there, but in terms of system performance, I don't think the proposed architecture would do much to help.



There are a number of points to note about Programmer's proposed L3 solution, which have been discussed over at Ars.



1) Consider dual-processor systems. Apple *may* use dual-channel PC3200, but will not use quad-channel, so there will be a big shortfall in memory bandwidth (the dual processors have a combined bidirectional bandwidth of 12.8 GB/s, twice the memory bandwidth).



    2) Non-CPU memory access (AGP, DMA storage, etc.)



3) There is a huge disparity in architecture between the 970's bus and the memory bus. The memory runs at a different speed, uses a single bidirectional bus (as opposed to the 970's dual unidirectional busses), and uses non-packetised transmission as well. The memory bus takes a long time to switch from reading to writing (and vice-versa), so interlacing reads and writes dramatically reduces the available bandwidth. These facts indicate that, at a minimum, a fairly considerable buffer will be required on the chip; extending this to an L3 would provide further benefits, especially if it includes a decent prefetch mechanism (as on the nVidia nForce series). As an example, consider a CPU writing out a buffer to memory at the same time that a disk is reading a block from memory using DMA. Without the L3 they keep interrupting each other and switching the direction of the memory bus; with the L3, the CPU write is held in the L3 until the DMA access is over and then streamed to memory as a large block, so both accesses get the full memory bandwidth with no penalty.



    4) The L3 is embedded DRAM, the cost should not be more than about $20 added to the controller chip.



    I would expect a better than 10% improvement for the majority of uses.
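The read/write turnaround argument in point 3 can be sketched as a toy model. The cycle counts here are invented purely for illustration, not taken from any real memory datasheet:

```python
# Toy model of memory-bus turnaround cost (all cycle counts are
# hypothetical). Each op moves one block; switching between reads and
# writes costs extra cycles, so interleaved traffic wastes bandwidth
# that a buffer (e.g. an eDRAM L3 batching the CPU's writes) recovers.

def bus_cycles(ops, turnaround=8, transfer=4):
    """Total cycles to service a sequence of 'R'/'W' operations."""
    cycles, last = 0, None
    for op in ops:
        if last is not None and op != last:
            cycles += turnaround      # bus direction-switch penalty
        cycles += transfer            # the block transfer itself
        last = op
    return cycles

interleaved = ['R', 'W'] * 8          # CPU writes fighting a DMA read
batched     = ['R'] * 8 + ['W'] * 8   # same work, buffered into runs

print(bus_cycles(interleaved))        # 184 cycles: 15 turnarounds
print(bus_cycles(batched))            # 72 cycles: 1 turnaround
```

With these made-up numbers the interleaved stream spends most of its time turning the bus around, while the batched stream pays the penalty once, which is the effect the buffer/L3 is meant to exploit.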



    michael
  • Reply 64 of 143
outsider Posts: 6,008, member
Using embedded DRAM can be useful not only for its low latency (the clock cycles spent waiting on fetches from the DIMMs would be cut drastically) but also because you can add more of it than we have had in the past. 8MB or 16MB would be minimal, and 32-64MB would not be out of the question.
  • Reply 65 of 143
amorph Posts: 7,112, member
    Quote:

    Originally posted by Programmer

    ATI Radeon9700 == 19 GB/sec

    nVidia geForceFX 9600 (?) == 27 GB/sec




    Yeah, but if I can't hop over to Crucial.com and pick it up in desktop quantities (512MB+) for a sane price Apple isn't going to use it.



    The key word here is "non-exorbitant."



THT: As for RDRAM, the fact that performance degrades as the number of RIMMs increases is bad news for a 64-bit platform. It seems to me that, as you've said earlier, it's best suited to small, multi-channel implementations like consoles. (And, of course, I don't want Apple to have to deal with them at all until their lawyers have been humbled within an inch of their lives.)
  • Reply 66 of 143
mmicist Posts: 214, member
    Quote:

    Originally posted by Outsider

Using embedded DRAM can be useful not only for its low latency (the clock cycles spent waiting on fetches from the DIMMs would be cut drastically) but also because you can add more of it than we have had in the past. 8MB or 16MB would be minimal, and 32-64MB would not be out of the question.



Unfortunately, the fact that DRAM cells on an embedded-logic process are much larger than cells on a dedicated memory process, as well as the necessity of holding the L3 tags etc., means that if the L3 is embedded on the memory controller chip, anything larger than 8MB (about 80 million transistors for the whole cache) would start getting *very* expensive. I would expect to see somewhere between 4MB and 8MB.
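The ~80 million transistor figure roughly follows from DRAM being one transistor per bit. A back-of-the-envelope check, assuming a hypothetical ~20% overhead for tags, sense amps and control logic:

```python
# Sanity-check of "8 MB of embedded DRAM ~ 80 million transistors".
# A 1T1C DRAM cell uses one transistor per bit; the 20% overhead for
# tags and peripheral logic is an assumed, illustrative number.

MB = 1 << 20
data_bits = 8 * MB * 8            # 8 MB of data, 1 transistor per bit
overhead = 0.2                    # assumed tag/control overhead
total = data_bits * (1 + overhead)
print(f"{total / 1e6:.0f} million transistors")   # -> 81 million
```

So the quoted estimate is consistent with one-transistor-per-bit DRAM plus a modest overhead allowance.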



    michael
  • Reply 67 of 143
tht Posts: 5,444, member
    Quote:

    Originally posted by Amorph

    THT: As for RDRAM, the fact that performance degrades as the number of RIMMs increases is bad news for a 64-bit platform.



Rambus memory is used in one of the highest-performance 64-bit platforms available today, HP AlphaServer systems:



    Rambus RDRAM Memory Interface Ships In Award-Winning HP AlphaServer Systems


    RDRAM® Memory Interface Provides 16 Gigabytes/sec of Memory Performance for EV7 AlphaServer Family



    ... the Alpha EV7 processor achieves higher performance and scalability through the use of dual integrated memory controllers. Each of the two memory controllers on the EV7 supports four 18-bit RDRAM channels and offers each processor 12.8 Gigabytes/sec of raw bandwidth. To enhance the reliability of these systems, two more channels of RDRAM are available in a memory RAID (Redundant Array of Independent Disks) configuration as backup for fault tolerance purposes. In total, each HP AlphaServer system can accommodate ten 800MHz RDRAM channels that offer 16 Gigabytes/sec memory bandwidth combined...




    Quote:

It seems to me that, as you've said earlier, it's best suited to small, multi-channel implementations like consoles.



My proposals have been for a backside cache implementation using a 4-channel, single-chip-per-channel Rambus architecture for high-performance desktop CPUs. The dream is pretty much dead. But the mind wonders if a single RDRAM chip of, say, 32 MBytes could clock up to 1 GHz reliably (for 4 GB/s of bandwidth; with 4 channels, it would have 16 GB/s).



    Quote:

    (And, of course, I don't want Apple to have to deal with them at all until their lawyers have been humbled within an inch of their lives.)



There's always more than one side to a story. Yes, Rambus, Inc.'s behavior hasn't been that great, but the company has some of the best memory and bus engineers in the world. If Apple wants high-performance computer architectures, Rambus is a good place to go.
  • Reply 68 of 143
outsider Posts: 6,008, member
    Quote:

    Originally posted by mmicist

Unfortunately, the fact that DRAM cells on an embedded-logic process are much larger than cells on a dedicated memory process, as well as the necessity of holding the L3 tags etc., means that if the L3 is embedded on the memory controller chip, anything larger than 8MB (about 80 million transistors for the whole cache) would start getting *very* expensive. I would expect to see somewhere between 4MB and 8MB.



    michael




Then at those sizes, for dual and any possible quad systems, it would not increase performance by anything significant.
  • Reply 69 of 143
tht Posts: 5,444, member
    Quote:

    Originally posted by mmicist

1) Consider dual-processor systems. Apple *may* use dual-channel PC3200, but will not use quad-channel, so there will be a big shortfall in memory bandwidth (the dual processors have a combined bidirectional bandwidth of 12.8 GB/s, twice the memory bandwidth).



I said I may waver on quad systems in regard to L3 cache, but I think Apple will live with the performance degradation on dual SMP systems as long as they have a memory tech that can deliver the theoretical 6.4 GB/s, or even one that comes close (>3.2 GB/s?).



    If Apple saddles a PPC 970 with single channel PC2700 or less, there would be a need for L3 cache. I don't think there would be an argument about that, and everyone would be disappointed if Apple did so.



    Quote:

    2) Non-CPU memory access (AGP, DMA storage, etc.)



    I've actually conceded this point. But the architecture of the system ASIC comes more into play here, and large buffers will probably be needed in any case.



    Quote:

    These facts indicate that a fairly considerable buffer will be required as a minimum on the chip and extending this to a L3 would provide further benefits, especially if it includes a decent prefetch mechanism (as on the nVidia nForce series).



    You mean replacing the system ASIC buffers with an L3? Or in addition to?



    Quote:

    I would expect a better than 10% improvement for the majority of uses.



    Only if the system is memory starved and if the processor bus is able to feed it to the processor.
  • Reply 70 of 143
    Quote:

    Originally posted by Tomb of the Unknown

    Uhm, no. If Apple cuts their margins in half they will make half as much money.



    What you are thinking about is how to maximize revenue. That is, at what point does the price begin to reduce demand for the product? That's the sweet spot to find. But merely reducing the price won't necessarily allow you to sell enough additional units to make up the difference in lost revenue.



    Look at it another way. If Apple could easily sell twice as many units at the current price, then they should raise the price until they are selling as many as they have at as high a price as the market will allow.






OK, as a correction for the mathematically challenged... if Apple sells twice as many computers at half the profit, profit dollars remain constant.
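The arithmetic here is simple enough to write down. The unit counts and per-unit profits below are hypothetical, chosen only to show that halving the per-unit profit while doubling the volume leaves total profit dollars unchanged:

```python
# "Half the per-unit profit, twice the units" keeps profit dollars
# constant. All figures are hypothetical, for illustration only.

def total_profit(units, profit_per_unit):
    return units * profit_per_unit

base = total_profit(100_000, 300)     # current price, current volume
cut  = total_profit(200_000, 150)     # half the margin, double the sales
print(base, cut, base == cut)         # 30000000 30000000 True
```

Of course, as later posts point out, nothing guarantees that halving the margin actually doubles the volume; that is the whole debate.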



Alright, now that is out of my system. You are correct that in an ideal situation Apple would want to raise prices while keeping sales constant to maximize profits. However, the situation does not appear ideal, at least to me. Apple has a lot to gain from increasing market share (music sales, iPod sales, software sales, etc., to name a few). The case can be made either way. We are not likely to see a price drop as drastic as halving the profit margin, but a modest 5-10% drop? Not out of the realm of possibility...



  • Reply 71 of 143
outsider Posts: 6,008, member
THT brings up a good point. If Apple chooses to implement an L3 cache of, say, 4-8MB instead of developing an advanced and probably expensive multi-channel memory system, it would make sense to put the L3 cache on the ASIC and reduce the ASIC's complexity elsewhere. Especially if going with a single-channel RAMBUS solution like the newish 1066MHz, 4.2GBps RAMBUS modules out now. Coupling RAMBUS with a big, low-latency L3 cache seems like the ideal solution without going to a more exotic dual-channel design.
  • Reply 72 of 143
amorph Posts: 7,112, member
    Hmmm. I can see how an EV7/RDRAM combo would be quite potent as long as the Alpha spent a bare minimum of time waiting for the RDRAM to get around to answering a request. Very useful for server work, and computations over large datasets. Obviously my "not good for 64 bit" comment wasn't that well thought through.



I'm still not entirely convinced that it's a good desktop solution, though, unless there were a moderate-sized DRAM cache in between, just to handle those tiny requests that are the worst case for RDRAM. Oh, and the company sucks. (I'm sure they get along great with Fiorina, though. Birds of a feather flock together.)



    The obvious (ha!) alternative for dual procs is NUMA, which was discussed at some length last fall. There could be a single RapidIO bus linking the two, which would be something of a bottleneck in some cases, but the scheduler might be able to work around those. The bugbear then becomes DMA access. Hrm.



    Who'd have thought that a nice fat pipe to the CPU would be such a problem? Bring back MaxBus!
  • Reply 73 of 143
amorph Posts: 7,112, member
    Quote:

    Originally posted by ebolazaire

OK, as a correction for the mathematically challenged... if Apple sells twice as many computers at half the profit, profit dollars remain constant.



    Alright, now that is out of my system. You are correct that in an ideal situation Apple would want to raise prices keeping sales constant to maximize profits.




    That's not what TotU said, though. You want to find the place where you sell the most units at the highest price - where, if you raise the price any higher, unit sales would start to go down. Both the price and the (projected) number of units sold are variables.



    There is not a linear correlation between price and sales, and it's not a given that Apple halving their profit margin would double their sales. In fact, unless you factor in other variables (such as the PowerMac suddenly becoming a fire-breathing, Pentium-crushing monster of a machine) it's extremely unlikely that anything like that would happen.



    Note that Apple slashed PowerMac prices in January, by a good deal more than 10% on some models. Sales went down. So, I say again: Price is not the PowerMac's problem. Any model that only considers price will fail to address the problem. The price decrease you want will not boost sales. The PowerMac needs power, so that Apple's professional prices represent good value for money again - especially in this economy.
  • Reply 74 of 143
smalM Posts: 677, member
    Quote:

    Originally posted by Programmer

    ATI Radeon9700 == 19 GB/sec

    nVidia geForceFX 9600 (?) == 27 GB/sec




That's with a 256-bit connection. You're talking about a monster chip for a main controller.



We're talking about a desktop machine, not a high-end server. Adding an L3 solution that costs $400/CPU and gives you 10% more power to a $100,000 server is fine. Adding it to a $1500-$3000 desktop machine is a waste of money.
  • Reply 75 of 143
programmer Posts: 3,458, member
    Quote:

    Originally posted by THT

    Remember the architecture proposed here:





    Aaaaeeeeiiii!!!! Did you even bother to read what I wrote?



Your diagram completely misses the important link -- off to the left of the system ASIC should be the connection to RAM. This will likely be no more than a 128-bit-wide DDR400 connection. Impressive, to be sure, but still probably less than 6 GB/sec. ~16 GB/sec worth of demands on a 6 GB/sec supply. Hmmmm, might there be room for about 10 GB/sec of improvement?
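The shortfall arithmetic can be checked directly. The peak figure for a 128-bit DDR400 channel follows from the bus width and transfer rate; the 16 GB/sec demand figure is the rough estimate quoted above, not a measured number:

```python
# Rough bandwidth arithmetic for the claimed ~10 GB/s shortfall.
# A 128-bit DDR400 interface moves 16 bytes per transfer at 400 MT/s;
# the demand figure is the post's rough estimate (dual CPUs + AGP + DMA).

bus_width_bytes = 128 // 8                 # 16 bytes per transfer
transfers_per_sec = 400e6                  # DDR400: 400 mega-transfers/s
memory_bw = bus_width_bytes * transfers_per_sec / 1e9
print(f"memory peak: {memory_bw:.1f} GB/s")            # 6.4 GB/s

demand = 16.0                              # GB/s, rough estimate from above
print(f"shortfall: {demand - memory_bw:.1f} GB/s")     # 9.6 GB/s
```

Note the 6.4 GB/sec is a theoretical peak; sustained rates would be lower, which is why the post rounds it down to "less than 6 GB/sec".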



    Embedded DRAM is one option, but it is limited by the need to avoid inflating the ASIC cost too much. 2-4 MB should be doable, and ought to be reasonably effective. The notion of an RDRAM port for L3 is potentially interesting because it makes the cache optional so that Apple could use one ASIC in all its machines but toss in the cache chip where more performance is needed. The increase in pincount on the ASIC is a bit painful, and the ASIC will still need buffering internally between the various buses.



An even more wild option (for those thinking about how to get lots of bandwidth) would be to build in a GPU-like memory controller and 128 MB of soldered high-speed DDR-II memory on a 256-bit-wide bus, in addition to the "normal" memory. This would be an extra ~$200-400, judging by the price of video cards that do this. The system could then use the virtual memory pager to DMA memory pages between the "normal" DIMMs and the fast soldered stuff.





    Don't get me wrong, I'm not saying Apple will do any of these things... but there are some options that could be used to supply bandwidth to the voracious consumers that will be in the next Apple system. A dual channel DDR400 system will not fully satisfy their hunger in anything but a low-end single 1.4 GHz processor system. It is very likely that Apple will say that a 5x improvement (in processor bandwidth) is "enough for now" and just ship exactly that.
  • Reply 76 of 143
programmer Posts: 3,458, member
    Quote:

    Originally posted by smalM

That's with a 256-bit connection. You're talking about a monster chip for a main controller.



We're talking about a desktop machine, not a high-end server. Adding an L3 solution that costs $400/CPU and gives you 10% more power to a $100,000 server is fine. Adding it to a $1500-$3000 desktop machine is a waste of money.




    It doesn't have to be one chip -- HyperTransport (or even the 970's FSB) could be used in the "wild" scheme outlined in my previous post to split the memory controllers into two simpler chips.



You're right that this scheme wouldn't be used in the current price range of PowerMacs, but the 970 is opening up options at the top end for Apple. A $5000 quad processor from Apple is finally a real possibility, and it will require more ingenious solutions to the memory architecture.
  • Reply 77 of 143
smalM Posts: 677, member
    Quote:

    Originally posted by Programmer

    It is very likely that Apple will say that a 5x improvement (in processor bandwidth) is "enough for now" and just ship exactly that.



This is exactly what I think they will do: just waiting for the next step the industry takes after DDR400. I hope Apple will never again use a RAM solution which nobody else wants to use - thinking of all the money I had to pay for that.
  • Reply 78 of 143
programmer Posts: 3,458, member
    Quote:

    Originally posted by smalM

This is exactly what I think they will do: just waiting for the next step the industry takes after DDR400. I hope Apple will never again use a RAM solution which nobody else wants to use - thinking of all the money I had to pay for that.



This is an advantage of a soldered solution ... it doesn't affect the user. They should stick to standard DIMMs and disks. This doesn't prohibit the use of more advanced things soldered to the board any more than it prohibits their use in high-end video cards.
  • Reply 79 of 143
amorph Posts: 7,112, member
    Quote:

    Originally posted by Programmer

This is an advantage of a soldered solution ... it doesn't affect the user. They should stick to standard DIMMs and disks. This doesn't prohibit the use of more advanced things soldered to the board any more than it prohibits their use in high-end video cards.



    This is what they already do - nobody has to buy SRAM to upgrade their machines. (This is more a note to others than a reply to you, Programmer).



    It sounds like nVIDIA's able to offer a robust dual-channel DDR SDRAM solution inexpensively, so maybe dual-DDR333? It wouldn't completely saturate the bus (and dual processors would require a sizeable intermediate cache to avoid getting starved) but at least it's common and standard.



    Isn't DDR II close to being finalized? That'll do nicely.
  • Reply 80 of 143
arkangel Posts: 25, member
    Maybe Programmer and Amorph or someone can help me understand or establish a better context for what appears to be a discussion on Apple's potential implementation of the 970.



Before I state my need for clarification, let me say first that I feel Apple does need to more clearly differentiate the workstation/pro segments vs. the consumer segments of its offerings, and at the very least make sure that each is comparable to its equivalent in the PC market in performance first and foremost, and then in price. Obviously, AMD has the Opteron (server/high end), soon enough the Athlon64 (mid and/or low, I don't really know), and finally the XP (which has got to be the low end in the new AMD world order). I won't mention Intel because without the "Yamhill" to A64 comparison it's difficult to draw direct correlations.



Anyway, the crux of my difficulty is this: Apple pretty much only has the 970. I don't see it maintaining the G4 in its offerings because of the heretofore-mentioned Motorola-induced limitations on the G4's cache and bus sizes and speeds. Someone mentioned IBM's Gobi earlier, but I don't know/recall what the cache sizes are on it.



IF my understanding of the chips involved is correct, it would appear that for Apple to attain/maintain some competitive parity across the PC spectrum, it's going to have to take one chip, the 970, and differentiate its offerings based on chip speed, mobos, bus topologies, slots and other stuff. Or at least that's how I think Apple should resolve this.



I've said all this to say that in reading the discussion, I really only see references (or what I infer to be references) to Apple's potential high end as having more than two processors, and I guess the consumer line as two or fewer, which appears to imply that most think one or two mobos, with the lines varied based on chip speed and number of chips. In addition, there seems to be some feeling that the price points need to stay roughly where they are now. Yet most of what I've seen above appears to reference the maximum capabilities the 970 architecture will allow.



    I propose:



    Top of the Line

    970 X-Serve

    970 X-StationBlade (X-Serve form factor, but highend workstation specs.)

    970 X-StationTower (varying internals, but tower form factor. I am partial to the Lian-Li/Directron brushed metal look myself.)



    Mid-range

970 ConsumerTower (BTO options allow specs up to an X-Station-class machine, or an Apple-spec'ed midrange)



    I don't care about the low-end, so I won't get into that.



Notebook-wise, I think Apple's categories are pretty much what they should be, but I don't know what the specs should really look like for next year, or rather with the 970.



My questions [please don't flame me], despite our pent-up desire to have the fastest possible machines on the market, are:



    1. If the actual performance of the 970 outstrips our wildest expectations, should Apple restrain itself from putting all the bells and whistles onboard, allowing itself some performance cushion to respond to Intel/AMD as they modify their offerings, when necessary?



2. Should L3, whether embedded in the chipset or otherwise, only be available in the high end?



3. If the mid-range (dual or SMP) trounces the best of Intel/AMD, what is the God-Box, i.e. the fastest CPU(s) available, fastest memory, most memory, most bandwidth (just to say you own it), really worth to you?



4. Given that Apple badly needs a loss leader, does the current line with ridiculous price drops serve that purpose?



5. Should Apple take the first batch of available speeds, sell them fully spec'ed, and reserve the 2.0 GHz and above for workstation machines, selling that way until the 980 arrives in 2H2004, if rumor serves correctly? (This might actually be a previous question asked in a different manner.)



    What are your thoughts?????