To be clear: I am not suggesting that Apple will go with some form of Cell before the 970MP appears. I would be very surprised if Apple got a Cell-based chip before the PS3 ships, and I expect to see an Apple 970MP machine before then.
Quote:
Originally posted by rickag
Wouldn't adding an SPE, even one, to a 970-derived PPE be too hot? I would think that Apple/IBM will have to significantly modify the core of a 970 in order to add any SPEs, to the point that it might not even be recognizable as derived from a 970.
I mean, the current 970 has what, 8 execution units capable of retiring 5 instructions/cycle, and can manage somewhere around 216 in-flight instructions. It almost seems that in the effort to execute instructions, the 970 was designed from the beginning to blow a lot of bubbles internally in order to keep the execution units fed.
These two designs seem so at odds I doubt we'll see any SPEs attached to a 970-derived core.
But then again, I'm really, really ignorant of the technical issues involved. But I do thoroughly enjoy both THT's and Programmer's posts.
The thing that most people seem to miss is that the Cell is about the bus (the on-chip one). It is an architecture for moving large volumes of data between computing resources. The 970 (as it exists) has a bus interface unit that connects it to its FSB. The 970MP will have a modified bus interface unit that connects 2 970s to one FSB -- this implies some degree of separation/isolation, which is usually done in a good design. To build a Cell you don't "add SPEs to a 970"; you take a Cell chip (primarily the super-fast ring bus) and add devices which interface to this bus: Power core(s), SPE cores, memory controller, I/O interfaces, and who knows what else in the future.
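(A toy sketch of that way of thinking, in C -- the names and the configuration here are made up for illustration, not taken from IBM's design files. The point is that the ring bus is the constant; the cores are just interchangeable nodes attached to it.)
[code]
/* Hypothetical model of a Cell-style chip: the Element Interconnect
 * Bus is the fixed part; everything else is a node hanging off it. */
enum node_kind { POWER_CORE, SPE_CORE, MEM_CONTROLLER, IO_INTERFACE };

struct bus_node {
    enum node_kind kind;
    const char    *name;
};

/* One possible configuration -- e.g. the "two 970s + four SPEs"
 * variant speculated about below. Swapping cores in and out never
 * touches the ring bus itself. */
static const struct bus_node chip[] = {
    { POWER_CORE,     "970 core w/ Cell bus interface #0" },
    { POWER_CORE,     "970 core w/ Cell bus interface #1" },
    { SPE_CORE,       "SPE 0" }, { SPE_CORE, "SPE 1" },
    { SPE_CORE,       "SPE 2" }, { SPE_CORE, "SPE 3" },
    { MEM_CONTROLLER, "XDR memory controller" },
    { IO_INTERFACE,   "FlexIO" },
};
[/code]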
If Apple wanted a Cell with better single-threaded performance on generic code they could ask IBM to graft the Cell's bus interface node onto the 970 core in place of its existing FSB bus interface node, and then drop one or two of them onto a Cell chip. The 970FX is 62 sq mm compared to the first Cell's whopping 221 sq mm (SPEs are ~15 sq mm each). That means that in the same die size as the first Cell at 90nm, Apple could put at least 2 970s + 4 SPEs... and achieve roughly 160 GFLOPS @ 4 GHz (the 970s at half that speed), compared to the 970MP's 48 GFLOPS @ 3 GHz (using a lot more power/heat).
Of course I don't think Apple/IBM would build what I outlined here -- more likely they'll have a new core (call it the 980) with SMT thrown in and other improvements (we hope... maybe longer pipes for higher clocks and lower power), designed for 65 nm and aiming for late 2006 delivery. IBM says +25% die size for SMT; add more cache et al... and we have a core that is ~90M transistors and a die size of something like 50 sq mm on 65nm. SPEs are down to 8 sq mm, so they could probably fit 2 980 cores, memory controller, I/O, and 8 SPEs in an area of less than 200 sq mm (smaller than the first Cell @ 90nm). The first Cell clocks at 4-5 GHz, so this thing could probably manage 5 GHz in production at decent power levels with the 980 cores running at 2.5 GHz... for a peak of about 360 GFLOPS (assuming no VMX improvements in the 980, although they could probably double its throughput on the budget I'm allowing, for a total of 400 GFLOPS). Best of all, they could use Sony's trick of using chips with failed cores in lower-end machines... a whole range of 1-2 980s + 1-8 SPEs + the normal range of clock rates, allowing lots of variation from the Mac mini through the top-end PowerMac. For even more fun, Apple (or IBM) could connect multiple Cell chips together for machines with teraflop-level performance, NUMA architectures, and crazy amounts of bandwidth.
This Cell technology stuff has a lot of potential.
**note: I'm throwing around peak GFLOPS numbers in the same spirit as Microsoft and Sony are. These are theoretical peak rates, not achievable sustained speeds on real code. Nonetheless this is a useful way of comparing these chips, especially since they're all using more-or-less the same ISA.
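(For anyone wanting to check those peaks: VMX and the SPEs can each issue a 4-wide single-precision multiply-add per cycle, i.e. 8 flops/cycle per core, so every figure above falls out of peak = cores x 8 x clock. A quick throwaway program reproducing the arithmetic:)
[code]
#include <stdio.h>

/* peak GFLOPS = cores x 8 SP flops/cycle x clock in GHz */
static double peak(int cores, double ghz) { return cores * 8 * ghz; }

int main(void)
{
    /* 970MP: 2 VMX-equipped cores @ 3 GHz   -> 48  */
    printf("970MP @ 3 GHz : %3.0f GFLOPS\n", peak(2, 3.0));
    /* 2 970s @ 2 GHz + 4 SPEs @ 4 GHz       -> 160 */
    printf("2x970 + 4 SPE : %3.0f GFLOPS\n", peak(2, 2.0) + peak(4, 4.0));
    /* 2 980s @ 2.5 GHz + 8 SPEs @ 5 GHz     -> 360 */
    printf("2x980 + 8 SPE : %3.0f GFLOPS\n", peak(2, 2.5) + peak(8, 5.0));
    return 0;
}
[/code]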
Yeah, where did 3.2 GHz come from? It is staggering that the PS3 runs at 2 TFLOPS and the Cell has 7 cores. This is a console that will be out next year in the spring. This could make for some exciting surprises by Apple before the PS3 is released. If not, I would love it if you could get a Linux kit for the PS3 like the one that was available for the PS2.
Take that 2 TFLOPS number with a very big grain of salt -- they are claiming only 218 GFLOPS for the Cell; the rest is their GPU... and measuring the GFLOPS of a GPU is a whole other kind of joke. Very inventive marketing. Yes, it's a fast GPU, but they are really stretching the point when they rate it like that...
Quote:
Originally posted by Programmer
Take that 2 TFLOPS number with a very big grain of salt -- they are claiming only 218 GFLOPS for the Cell; the rest is their GPU... and measuring the GFLOPS of a GPU is a whole other kind of joke. Very inventive marketing. Yes, it's a fast GPU, but they are really stretching the point when they rate it like that...
well we can safely bury Moore's law at this stage, huh?
well maybe not ~ but it looks like it can be dependent on how you measure computing power, a whole 'nother issue...
towards the end of this decade, the mhz myth, ghz myth, gflops myth and teraflops myth will be alive and well,
selling computers will be about inventive marketing.
http://www.anandtech.com/tradeshows/...spx?i=2420&p=5
Games at E3 are running on PowerMac G5 and the console itself.
Quote:
Originally posted by fieldor
http://www.anandtech.com/tradeshows/...spx?i=2420&p=5
Games at E3 are running on PowerMac G5 and the console itself.
i seriously doubt the "Ruby Demo" is running on an "actual" xbox360 console. there's a powermac g5 hidden somewhere behind there
http://www.anandtech.com/tradeshows/...spx?i=2420&p=6
the xbox360 lighted console is a dummy box, IMHO.
Quote:
Originally posted by Programmer
If Apple wanted a Cell with better single-threaded performance on generic code they could ask IBM to graft the Cell's bus interface node onto the 970 core in place of its existing FSB bus interface node, and then drop one or two of them onto a Cell chip. The 970FX is 62 sq mm compared to the first Cell's whopping 221 sq mm (SPEs are ~15 sq mm each). That means that in the same die size as the first Cell at 90nm, Apple could put at least 2 970s + 4 SPEs... and achieve roughly 160 GFLOPS @ 4 GHz (the 970s at half that speed), compared to the 970MP's 48 GFLOPS @ 3 GHz (using a lot more power/heat).
Wild speculation begins here. I wonder how easy it would be for IBM to take the VMX units, treat them like SPEs connected to the chip bus, and give them a memory controller and a large cache. They are built more for general processing than the SPEs, but are very powerful. I wonder what the performance of the VMX units would be if they had memory controllers and large caches? Each 970MP could have two 970 cores minus the VMX unit, and then have four VMX units attached to the chip bus, along with a memory controller and large cache.
Quote:
Originally posted by Brendon
Wild speculation begins here. I wonder how easy it would be for IBM to take the VMX units, treat them like SPEs connected to the chip bus, and give them a memory controller and a large cache. They are built more for general processing than the SPEs, but are very powerful. I wonder what the performance of the VMX units would be if they had memory controllers and large caches? Each 970MP could have two 970 cores minus the VMX unit, and then have four VMX units attached to the chip bus, along with a memory controller and large cache.
That is essentially what the SPEs are -- VMX units extended with enough general-purpose instructions to operate alone, and mated to 256K of on-chip memory with their own bus interface & DMA engine. Local memory instead of a cache is more useful because it gives finer control, and it is smaller/faster than a cache with all its associative mechanisms.
The VMX unit closely tied to the Power core is still useful in many cases, so having both is better than just having the separate vector units (and it doesn't really cost you that much in today's terms).
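(What a software-managed local store buys you in practice: the code schedules its own data movement instead of letting cache hardware guess. A minimal sketch of the classic double-buffering loop -- dma_get/dma_wait are hypothetical stand-ins, not the real Cell SDK calls, and total is assumed to be a multiple of CHUNK:)
[code]
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16384   /* bytes per transfer; both buffers fit easily in 256K */

/* hypothetical stand-ins for the SPE's DMA primitives */
extern void dma_get(void *local, uint64_t main_addr, size_t n, int tag);
extern void dma_wait(int tag);

static float buf[2][CHUNK / sizeof(float)];   /* lives in the local store */

void process_stream(uint64_t src, size_t total,
                    void (*kernel)(float *, size_t))
{
    size_t n = total / CHUNK;
    dma_get(buf[0], src, CHUNK, 0);              /* prime the first buffer */
    for (size_t i = 0; i < n; i++) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        if (i + 1 < n)                           /* start fetching chunk i+1 */
            dma_get(buf[nxt], src + (i + 1) * CHUNK, CHUNK, nxt);
        dma_wait(cur);                           /* wait for chunk i */
        kernel(buf[cur], CHUNK / sizeof(float)); /* compute overlaps the DMA */
    }
}
[/code]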
That is essentially what the SPEs are -- VMX units extended with enough general purpose instructions to operate alone, and mated to 256K of on-chip memory with their own bus interface & DMA engine. The local memory instead of a cache is more useful because it gives finer control and is smaller/faster than having a cache and all its associative mechanisms.
The VMX unit closely tied to the Power core is still useful in many cases, so having both is better than just having the seperate vector units (and it doesn't really cost you that much in today's terms).
Wow. So it would be no problem to have two 970 cores with VMX and then two or four more VMX cores with the added instructions, bus interface and DMA engine. I would think that this would not require lots of power/heat to run, and the benefits would be huge. Is it probable that Apple may go in this direction?
EDIT
Do you have any idea what the performance would be for something like that?
Quote:
Originally posted by sunilraman
well we can safely bury Moore's law at this stage, huh?
well maybe not ~ but it looks like it can be dependent on how you measure computing power, a whole 'nother issue...
towards the end of this decade, the mhz myth, ghz myth, gflops myth and teraflops myth will be alive and well,
selling computers will be about inventive marketing.
Don't bury Moore's Law yet. It is very alive and kicking.
Moore's Law only applies to the number of transistors per unit area; it has been bastardized and confounded with its corollaries relating to the processing throughput or Hz ratings of the parts.
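(Stated as a formula -- transistor density doubling roughly every couple of years, with T as the doubling period -- note that clock rate appears nowhere in it:)
[code]
N(t) \approx N_0 \cdot 2^{t/T}, \qquad T \approx 2 \text{ years}
[/code]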
Quote:
Originally posted by THT
I'm sure that many applications today could be made to be properly multithreaded, but that goes against economics.
It also goes against economics to develop your own windowing environment and system services. In some ways multithreading can be seen as a way to deliver services without overloading a single processor.
Quote:
Developers will do the minimum necessary to get an application running with a minimum of bugs. [That's the way it is with hardware too.] I'm not quite sure what the amount of increased development time needed for making a multithreaded application is, but I wouldn't be surprised if the development time increased more than linearly per thread.
The only time the extra effort to thread an application fully is acceptable is when there is some sort of payoff. Generally that payoff is keeping ahead of the competition. Certainly some apps don't need to be threaded and some are difficult to thread, but that isn't where the heavy competition will be. When a developer can leverage parallel processing, that is when its usage will come into play.
ONE thing that is clear is that there are many unexplored possibilities here.
Quote:
So processors with good single-threaded performance offer a high reward for poor development practices. That will be difficult to overcome.
If we are to pit a dual 2.5 GHz 970mp (4 total cores, 4 threads) versus a dual 4-core 4 GHz PPE system (8 total cores, 16 threads) given equal memory systems, I think the 970mp system wins for most usages. 4 threads are enough (maybe too many?) and single-thread performance will be better.
See, this is exactly the sort of thing that makes no sense to declare. One, we don't yet have viable systems to judge performance on. The other is context. Context makes all the difference and has many faces; just how would one manage the described systems in a portable, for example?
Quote:
If there is a killer multithreaded application like something greater than HD encoding/decoding that becomes prominent in everyday computing, perhaps it can tip the scales. Then again, Apple can just move to PPE-derived systems by management fiat too.
One should not focus too closely on a single application; it is the capability of the system to deliver that makes all the difference. The reality is that in the near future we will simply not have enough compute horsepower to handle some applications no matter what we do. Having several processors per machine just makes handling these sorts of issues possible without overloading a desktop CPU. More specifically, multiprocessor machines give PCs the ability to handle the growing workloads of users gracefully.
Quote:
Terrific yields for FSG. We've been waiting on low-k and DSL for a long time now, which should push the 970 to 3 GHz and make the 970mp power consumption reasonable. Maybe Cell and Xenon are only at 3.2 GHz because Fishkill is only using FSG?
I suspect that Cell and Xenon are at 3.2 GHz for the same reason that the original game boxes had middle-of-the-road processors for their time. It is hard to say how fast a quad-core PPE-based machine might run in one of Apple's water-cooled machines. In any event my preference would be for more cores rather than high clock rates anyway.
Quote:
I suppose it could also be because they need to be conservative since the Cell is also to be fabbed on Sony and Toshiba's unproven 65 nm fab.
Is compilation time important to you? If so, can compilers be made multithreaded? (It would seem a very difficult problem, and I don't think there is a mass market compiler available that is parallelized.)
I'm not sure why the issue of compilers always comes up. Multithreading has never been the path to follow with compilers. The trick is to compile in parallel via compilers running in separate processes. In fact this is a good example of where multiprocessors can have a dramatic effect on application performance (build times) without the need to thread the code itself. The application in this sense is the build process, not specifically the compiler. (A quick sketch of this follows below.)
Now if we start to talk about RADs and interpreted environments, that is another thing. To the best of my knowledge there is not a good environment to do this in besides Java, and that really isn't a RAD or strictly an interpreted environment.
Cell, at least with the PPEs attached, has focused potential. Potential that is huge by any standards, but still focused on specific areas.
Take a little of Cell and add several PPE cores and you have a much more interesting general-purpose processor. This is the route I think Apple is going: possibly a 4-core unit with enhanced vector processing in each core.
Dave
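(The build-as-the-application idea above, as a bare-bones POSIX sketch -- file names are illustrative and error handling is omitted. One single-threaded compiler per translation unit, run as separate processes; this is essentially what `make -j` automates:)
[code]
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *files[] = { "a.c", "b.c", "c.c", "d.c" };
    int n = (int)(sizeof files / sizeof *files);

    for (int i = 0; i < n; i++) {
        if (fork() == 0) {                    /* child: compile one file */
            execlp("cc", "cc", "-c", files[i], (char *)NULL);
            _exit(127);                       /* only reached if exec fails */
        }
    }
    for (int i = 0; i < n; i++)
        wait(NULL);                           /* collect all four compilers */
    return 0;
}
[/code]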
Quote:
Originally posted by Programmer
This Cell technology stuff has a lot of potential.
Quote:
Originally posted by Brendon
Wow. So it would be no problem to have two 970 cores with VMX and then two or four more VMX cores with the added instructions, bus interface and DMA engine. I would think that this would not require lots of power/heat to run, and the benefits would be huge. Is it probable that Apple may go in this direction?
Call them SPEs, it's much easier.
Quote:
EDIT
Do you have any idea what the performance would be for something like that?
Read my big post, 2nd from top of this page.
Quote:
Originally posted by wizard69
Cell, at least with the PPEs attached, has focused potential. Potential that is huge by any standards, but still focused on specific areas.
Take a little of Cell and add several PPE cores and you have a much more interesting general-purpose processor. This is the route I think Apple is going: possibly a 4-core unit with enhanced vector processing in each core.
General-purpose cores are much less interesting than specialized ones. The SPEs were carefully designed to deliver very good performance and high clock rates in a small footprint with low power consumption. The PPE is mildly interesting, as is an improved 970... but they have fairly traditional performance characteristics and so aren't terribly compelling. You need one or two of them, but things really get interesting when you toss in a handful of specialized cores.
Quote:
Originally posted by Programmer
General-purpose cores are much less interesting than specialized ones. The SPEs were carefully designed to deliver very good performance and high clock rates in a small footprint with low power consumption. The PPE is mildly interesting, as is an improved 970... but they have fairly traditional performance characteristics and so aren't terribly compelling. You need one or two of them, but things really get interesting when you toss in a handful of specialized cores.
this is all well and good, i just know that when the PS3 or xbox360 comes out, there'll be tons of mac-bashers just waiting to go,
well, my 3.2ghz xbox360 sh1ts all over your crappy 1.5ghz apple G4 piece of expensive crap... and it does xxxxx fps on xxxxx game...!!!111!! suck3rS1!!
Quote:
Originally posted by sunilraman
this is all well and good, i just know that when the PS3 or xbox360 comes out, there'll be tons of mac-bashers just waiting to go,
well, my 3.2ghz xbox360 sh1ts all over your crappy 1.5ghz apple G4 piece of expensive crap... and it does xxxxx fps on xxxxx game...!!!111!! suck3rS1!!
Heh, well that's a given no matter what happens.
You see, every now and then along comes a thread that makes wading through all the other crap worthwhile. Thanks guys.
Quote:
Originally posted by wizard69
It also goes against economics to develop your own windowing environment and system services. In some ways multithreading can be seen as a way to deliver services without overloading a single processor.
Not sure what you're trying to say. Part of Apple's business is to get developers for their system. Providing a windowing environment and system services is economically necessary for them. At least that's the theory. Not sure if it really is that way in reality, at least for the general personal computer market.
Quote:
The only time the extra effort to thread an application fully is acceptable is when there is some sort of payoff.
I certainly agree with this. My argument however is "how many times" is that going to be, and "how much effort will it take." Making an app run in parallel is hard to do, and increased single-threaded performance rewards developers who don't make the effort to do so. The eventual tradeoff is between a high-performance single-threaded machine (970mp), be it multi-core, multiprocessor, or both, and a high-performance multithreaded machine (PPE-derived) with average single-thread performance.
So, the big question is, how many threads can developers make use of?
Quote:
One should not focus [too] closely on a single application; it is the capability of the system to deliver that makes all the difference.
Not to a user. Generally all they, we, care about is working with one app at a time. I certainly buy into and like the fact that multiprocessor machines are very smooth and all, better for the user overall.
Quote:
The reality is that in the near future we will simply not have enough compute horsepower to handle some applications no matter what we do.
Wait and see. It will really depend on how many and what market those applications are going to be in.
Quote:
It is hard to say how fast a quad-core PPE-based machine might run in one of Apple's water-cooled machines.
It's hard to say anything about PPE machines right now. I'm still waiting for the other shoe to drop about what sort of performance penalties they have.
Quote:
I'm not sure why the issue of compilers always comes up.
Just trying to think of an example where a developer would require more single-threaded performance.
Quote:
Originally posted by THT
....Generally all they, we, care about is working with one app at a time. ....
eh? people nowadays will have email open, dashboard running, bittorrent downloading, surfing pr0n, word in another window, photoshop somewhere in the background they forgot to quit out of...
Quote:
Originally posted by sunilraman
eh? people nowadays will have email open, dashboard running, bittorrent downloading, surfing pr0n, word in another window, photoshop somewhere in the background they forgot to quit out of...
Not to mention that very few of these things are actually processor-bound in application code. Some are just waiting for the user, some are waiting for the network, some are waiting for Quartz/OpenGL/QuickDraw to draw something, etc. In all of these cases you could speed up the application's code 100x and the machine wouldn't run any faster. Conversely, if you did nothing to the application but sped up the GUI, the network, the drawing, and other parts of the system, you would notice an improvement. The Cell's SPEs provide a mechanism to speed up all those other things, possibly so much that you could cut the speed of the actual application code and the system would still feel faster. Of course then the occasional application that is compute-bound would feel slower...
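(That 100x remark is just Amdahl's law with a tiny compute-bound fraction. If only a fraction p of wall-clock time is the application's own code, and that code gets s times faster, the whole-machine speedup is:)
[code]
S = \frac{1}{(1 - p) + p/s}, \qquad
p = 0.02,\; s = 100 \;\Rightarrow\; S = \frac{1}{0.98 + 0.0002} \approx 1.02
[/code]
(With an illustrative p of 2%, a 100x faster app buys you a machine that feels about 2% faster -- which is exactly the point.)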