PowerBook G5

Comments

  • Reply 221 of 375
    Quote:

    Originally posted by Amorph

    Once you scale the design down to something built around 440s, you either have to be content to run applications that are happy executing on a 440, or you have to start targeting the architecture specifically. The ratio of the power of each core to the demands of each application is a lot lower.



    Despite Nr9's insistence that Cell == clustering, keep in mind that future iterations of this technology will likely embody technologies that address this issue and may well change how we look at the problem.



    OK, the following is entirely blue skying and may not be practical at all, but it's why I don't think Cell is "just another clustering implementation".



    Don't think of each core or dual-core chip as a standalone node with its own FPU, L2, and VMX unit. Instead think of each as a cell on the fabric, or better yet as each unit being built out of smaller, more generic execution units. Need some SIMD lovin' for your PS project? No problem, just dedicate a dozen or so cells (execution units) as a VMX unit and start decoding and processing instructions. Need some scalar FPU muscle? Just rededicate those VMX cells to DP math and rock and roll.



    I think you might need some kind of magical load/prefetch units to keep this thing fed, but assuming that's possible, you can keep transistor counts down by making them do double and triple duty. (Obviously, some things, like cache would be more or less dedicated, but if your bus is fast and your memory controller does really smart prefetches then you can keep cache sizes down.)



    Eh, it's a pipe dream.

  • Reply 222 of 375
    wizard69wizard69 Posts: 13,377member
    It is always good to be skeptical when presented with new ideas. But do realize that we have now had several implementations of the PPC in our Macs, all of them using a different number of transistors and delivering different performance.



    If you look at this whole concept as a way to deliver low power and high performance to the portable market, then things look different. The nice thing is that you can get documentation for the 440, and a couple of chips it has been implemented in, off IBM's web site. This is very much a low-power device; some of the low power comes from giving up functionality found in the desktop processors. How everything would work and perform when glued together is an open question.



    There are a number of things to like about this machine even if it doesn't exist in reality. It is best to have a little fun with this discussion and wait for more signals from Apple. It has become apparent that Apple is working on something for the portable market. As time passes I have less and less belief that the 970 will function well in a laptop no matter how much it is shrunk. So time will tell.



    Dave







    Quote:

    Originally posted by rickag

    My view is obviously oversimplified, due to lack of knowledge, but going back to the G4 comparison: the G4 has 1 integer, 1 floating point and 4 AltiVec units (I think?), is using a 7-stage pipeline and is now at 52 million transistors.



    The proposed 440 derivative uses a 7-stage pipeline, tacks on a floating point and AltiVec unit, and the claim is that 2 dual-core 440s, each with AltiVec and added floating point units, will be 10 million transistors. I just can't wrap my head around this at all. Plus the fact that each core will have to have additional systems for communicating with the others to make any parallelism work efficiently.



    While the Power4 is not an ideal example, it uses a ring topology (if that's the right terminology), has up to four dual-core CPUs, and runs at a relatively pokey speed compared to desktops, but uses up to 132MB of L3 cache per processor in order to perform its intended job, which I presume to be massive parallelism, multi-tasking and multi-threading on a huge scale. Scaling this philosophy down to a desktop using 10 million transistors seems, well, uh, mm, quite difficult, let alone having to cajole developers into optimizing software for this design.



    For the technologically impaired like me, please don't put too much stock in what I say, but I'm still skeptical of Nr9's claims.




  • Reply 223 of 375
    Ok, you've basically ignored my arguments or ignored their context and I've really no interest in correcting your assumptions beyond addressing this bit:



    Quote:

    Originally posted by wizard69

    Just look around at all of the clusters operating in the world using standard off-the-shelf processors. I don't have to do the convincing; prior art exists. Even then, if they do extend the instruction set it doesn't mean anything to the user at all.



    Of course there are PPC clusters out there. The #3 in the world is VT's "X". So what?



    Quote:

    Did all of the user apps suddenly fail to work when AltiVec was introduced? That one addition added more capability and instructions to the PPC than we are ever likely to see from hardware additions to support message passing. On top of that is the reality that all of the above could be done without adding any new instructions at all to the PPC base. But again, this has been done again and again to the PPC programming model without breaking user apps, with DSP and vector instructions.



    Point to one PPC cluster that will run Word for Mac OS X. There aren't any. Even VT's "Big Mac" is not running an OS X image. Can you run Word for Mac OS X on one of VT's nodes? Yes, of course you can. But you can't dispatch the job to it from a head node and you'll have a bit of a problem with scrolling through your documents since there's no monitor, but sure you can.



    As I see it, the problem you have is that you don't seem to understand what it is that a cluster does, and like Nr9, you have confused clustering with Cell.
  • Reply 224 of 375
    nr9nr9 Posts: 182member
    Quote:

    Originally posted by Tomb of the Unknown

    Despite Nr9's insistence that Cell == clustering, keep in mind that future iterations of this technology will likely embody technologies that address this issue and may well change how we look at the problem.



    OK, the following is entirely blue skying and may not be practical at all, but it's why I don't think Cell is "just another clustering implementation".



    Don't think of each core or dual-core chip as a standalone node with its own FPU, L2, and VMX unit. Instead think of each as a cell on the fabric, or better yet as each unit being built out of smaller, more generic execution units. Need some SIMD lovin' for your PS project? No problem, just dedicate a dozen or so cells (execution units) as a VMX unit and start decoding and processing instructions. Need some scalar FPU muscle? Just rededicate those VMX cells to DP math and rock and roll.



    I think you might need some kind of magical load/prefetch units to keep this thing fed, but assuming that's possible, you can keep transistor counts down by making them do double and triple duty. (Obviously, some things, like cache would be more or less dedicated, but if your bus is fast and your memory controller does really smart prefetches then you can keep cache sizes down.)



    Eh, it's a pipe dream.





    Eh, no, that would require something magical. Cell is more likely to be a really high-bandwidth MPI cluster of low-power cores. Programs written for Cell are likely to have to be MPI-threaded. There is no way you can take a single thread and run it across these "cells".
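    A toy sketch of the message-passing model Nr9 describes: independent workers that share no memory and cooperate only through explicit messages. This is a hypothetical Python illustration using processes and pipes in place of a real MPI library; nothing in it comes from the actual Cell design.

```python
# Toy message-passing "cluster": worker processes share no memory and
# communicate only by explicit messages, as MPI ranks would.
from multiprocessing import Process, Pipe

def worker(conn, rank, size, n):
    # Each "cell" sums its own slice of 0..n-1 ...
    partial = sum(range(rank, n, size))
    # ... and sends the partial result back as a message.
    conn.send(partial)
    conn.close()

if __name__ == "__main__":
    size, n = 4, 1000
    pipes = [Pipe() for _ in range(size)]
    procs = [Process(target=worker, args=(child, rank, size, n))
             for rank, (parent, child) in enumerate(pipes)]
    for p in procs:
        p.start()
    # The "head node" combines results by receiving messages, not by
    # reading shared memory.
    total = sum(parent.recv() for parent, _ in pipes)
    for p in procs:
        p.join()
    print(total)  # 0 + 1 + ... + 999 = 499500
```

    The point of the shape, not the arithmetic: no worker ever touches another worker's data, which is exactly the property that lets real MPI jobs span thousands of nodes.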
  • Reply 225 of 375
    nr9nr9 Posts: 182member
    Quote:

    Originally posted by Tomb of the Unknown

    Ok, you've basically ignored my arguments or ignored their context and I've really no interest in correcting your assumptions beyond addressing this bit:





    Of course there are PPC clusters out there. The #3 in the world is VT's "X". So what?





    Point to one PPC cluster that will run Word for Mac OS X. There aren't any. Even VT's "Big Mac" is not running an OS X image. Can you run Word for Mac OS X on one of VT's nodes? Yes, of course you can. But you can't dispatch the job to it from a head node and you'll have a bit of a problem with scrolling through your documents since there's no monitor, but sure you can.



    As I see it, the problem you have is that you don't seem to understand what it is that a cluster does, and like Nr9, you have confused clustering with Cell.




    There is no confusion; I think you are confused. There is no MPI version of Word, and Word doesn't need one.



    The cell concept is not just a hardware thing. It will also define how software is to be written.
  • Reply 226 of 375
    dfryerdfryer Posts: 140member
    Would it be possible that, despite being a message-passing architecture, main memory is shared? This would eliminate some of the advantage that SMP has over MPI, but I'm not sure that it would be all that easy to implement, and it might still be visible at the application level.

    If it's possible to make MPI look a lot like SMP, this approach would be a whole lot more feasible.

    Perhaps it is possible to modify Darwin/Mach to control allocation of physical memory among various CPUs, and hand them off at the appropriate time, i.e. the kernel manages some large-scale locking of pages. Would this be any more efficient than NUMA? (I'm assuming that the strict copy-all-the-data-over-the-connection style used in clusters isn't being considered; it just seems kinda slooow.)



    Just some random and not well thought out thoughts.



  • Reply 227 of 375
    nr9nr9 Posts: 182member
    The L2 and L3 cache are coherent among pairs of processors, and yes, the main memory is shared.



    Directory-based sharing is perfectly feasible for a small number of processors.



    I do not believe sharing memory will be feasible on a large scale, however, i.e. thousands of processors.
  • Reply 228 of 375
    snoopysnoopy Posts: 1,901member
    Quote:

    Originally posted by Amorph





    . . . That's not to say that Apple isn't considering a major revision to OS X that would separate Mach back out into its own thing with its own address space. It's a better design generally (you ain't seen uptime yet), and the power and bandwidth of forthcoming hardware platforms might make the performance hit forgivable. However, this is not a project to take lightly, at all, and that's with only one microkernel running. The sort of system Nr9 is describing would keep Apple's engineers up nights for a good long time. (Avie might relish the challenge, though.)






    Ah, thank you for an explanation of the Darwin kernel. All this time I believed OS X had a microkernel because it is based on Mach.



    I'm guessing that cell computing must work the way Nr9 says, with each cell being a microcomputer passing messages, since cells can be extended over a network. With all the work involved to turn such a chip into a real product, maybe the processor Nr9 refers to is an engineering experiment or prototype.



    On the other hand, companies often start working on promising technologies far ahead of time. With IBM involved in the cell project, maybe Apple has been working on it for a year or two already. Apple may have most of the details worked out and a prototype OS X ready to test. (Well, maybe that is unrealistic considering the amount of engineering that went into Jaguar and Panther.)
  • Reply 229 of 375
    amorphamorph Posts: 7,112member
    My thoughts keep bouncing around on this subject. Oh well. Another wait for a build, another post.



    A point of trivia: Software honcho Avie Tevanian is the father of the Mach kernel. He worked on it for years as part of his graduate and postgraduate work (NeXTStep ran on Mach, as I recall). I doubt that he gets his hands dirty with that kind of plumbing work anymore (a shame, really, since good systems programmers are hard to come by), but he's doubtless familiar with the code and with the issues around it. And, of course, they have BSD guru Jordan Hubbard on board. So if there's any company that has the knowledge and the staff to go playing with microkernels in general and Mach in particular, it's Apple. (Darwin, in addition to rolling in Mach, apparently rolled in some of the work on NuKernel, the basis of the doomed Copland/Gershwin project, so they've already been playing.)



    I'll have to do some more reading tonight.
  • Reply 230 of 375
    Quote:

    Originally posted by Nr9

    There is no way you can take a single thread and run it across these "cells"



    No, not currently. But give STI a few years and let's see. Those are some smart folks with deep pockets; if anyone can come up with a next-generation architecture, they can. Of course, they're just looking for the next gaming chip, so we probably won't see anything quite as dramatic as I outlined. But I could see special-purpose cores knitted together by a fast bus matrix of some kind. Almost like exploding out the execution units of a CPU.



    By the way, if MS Word doesn't need MPI, why would Cell define how the software is written? Aren't these statements contradictory?



    But as far as your idea of clustering being the next big thing: let's try this on for size. Let's say Apple rewrites their OS so it can run as a single image on the VT cluster. Then MS rewrites Office for this version of the OS.



    Would you be able to scroll through a large Word document any faster than you can now? After all, you have 2200 G5 CPUs to throw at it. Shouldn't you be able to scroll to the end of the document before your finger even clicks the mouse?



    No.



    Why? Because it's essentially still only running on one CPU. You can't split the job of scrolling through a document up into 2200 pieces, do it, then reassemble it for display. If anything, it may even happen slower, because the head node has to find a free CPU, dispatch the task to it, and then return the results to you, so you actually have more overhead than running the same task on your machine at home.



    Messaging is not the holy grail, folks. And clustering is not the answer to every computing problem.
  • Reply 231 of 375
    nr9nr9 Posts: 182member
    Quote:

    Originally posted by Tomb of the Unknown

    No, not currently. But give STI a few years and let's see. Those are some smart folks with deep pockets; if anyone can come up with a next-generation architecture, they can. Of course, they're just looking for the next gaming chip, so we probably won't see anything quite as dramatic as I outlined. But I could see special-purpose cores knitted together by a fast bus matrix of some kind. Almost like exploding out the execution units of a CPU.



    By the way, if MS Word doesn't need MPI, why would Cell define how the software is written? Aren't these statements contradictory?



    But as far as your idea of clustering being the next big thing: let's try this on for size. Let's say Apple rewrites their OS so it can run as a single image on the VT cluster. Then MS rewrites Office for this version of the OS.



    Would you be able to scroll through a large Word document any faster than you can now? After all, you have 2200 G5 CPUs to throw at it. Shouldn't you be able to scroll to the end of the document before your finger even clicks the mouse?



    No.



    Why? Because it's essentially still only running on one CPU. You can't split the job of scrolling through a document up into 2200 pieces, do it, then reassemble it for display. If anything, it may even happen slower, because the head node has to find a free CPU, dispatch the task to it, and then return the results to you, so you actually have more overhead than running the same task on your machine at home.



    Messaging is not the holy grail, folks. And clustering is not the answer to every computing problem.




    My point is that MS Word is not an example of a future application. Why the hell do you want it to scroll fast anyway? For any applications that require computing power, there will be a way to apply clustering.



    Single-threaded word processing applications aren't likely to run faster on Cell either.
  • Reply 232 of 375
    amorphamorph Posts: 7,112member
    I hate talking about Word, just because it's such a bletcherous pile of code, but:



    Word is - or at least, acts - multithreaded in a number of ways (I might just fire it up and watch it for threads, actually, but not now). If one CPU is paginating a document while others are running the check-as-you-type formatting and spelling and grammar services, and one more controls access to the document itself, then the thread dedicated to the view of the document will be able to scroll smoothly and responsively.



    The VT cluster example is a bit silly because of bandwidth and latency issues between machines that would not be a problem on an MCM or even on a common motherboard. But, in fact, a group of small processors just might be able to make Word run more responsively, if Word trusts all those worker tasks to threads.
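    A rough sketch of the division of labour Amorph describes: background workers handle pagination and spell checking while the main ("view") thread stays free to respond. Hypothetical Python only; the paginate and spellcheck functions are trivial stand-ins, and real Word obviously works nothing like this.

```python
# Background workers (pagination, spell check) run off the main thread,
# so the "view" thread never blocks waiting for them.
import queue
import threading

work = queue.Queue()
results = []  # list.append is atomic in CPython, so no lock needed here

def background_worker():
    while True:
        task = work.get()
        if task is None:          # shutdown sentinel
            break
        name, func, arg = task
        results.append((name, func(arg)))
        work.task_done()

def paginate(text):               # stand-in for real pagination
    return len(text.split()) // 5 + 1

def spellcheck(text):             # stand-in for real spell checking
    known = {"the", "quick", "brown", "fox"}
    return [w for w in text.split() if w not in known]

doc = "the quick brown fx"
workers = [threading.Thread(target=background_worker) for _ in range(2)]
for t in workers:
    t.start()
work.put(("pages", paginate, doc))
work.put(("typos", spellcheck, doc))
work.join()                       # the "view" thread could scroll meanwhile
for t in workers:
    work.put(None)
for t in workers:
    t.join()
print(dict(results))              # pages: 1, typos: ['fx']
```

    The scrolling thread only ever asks the queue for finished results, which is why it can stay responsive even while the heavy services grind away.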
  • Reply 233 of 375
    stoostoo Posts: 1,490member
    Quote:

    For any applications that require computing power, there will be a way to apply clustering.



    Not true: not all applications are inherently (sensibly) parallelisable.
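    Stoo's point is essentially Amdahl's law: if a fraction s of a job is inherently serial, the speedup from n processors is 1 / (s + (1 - s) / n), capped at 1/s no matter how many processors you add. A quick back-of-the-envelope sketch (hypothetical Python; the numbers are chosen purely for illustration):

```python
# Amdahl's law: the speedup from n processors when a fraction
# `serial` of the work cannot be parallelised.
def amdahl_speedup(serial, n):
    return 1.0 / (serial + (1.0 - serial) / n)

# A 50%-serial task barely benefits from 2200 G5s (the VT cluster):
print(round(amdahl_speedup(0.5, 2200), 3))  # 1.999, capped near 1/0.5 = 2

# A mostly parallel (95%) task does much better even on 4 cores:
print(round(amdahl_speedup(0.05, 4), 2))    # 3.48
```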
  • Reply 234 of 375
    tjmtjm Posts: 367member
    Here's a handy related development:



    "Engineered Intelligence Corp. (EI) on Thursday announced the release of its Mac OS X-compatible version of CxC. Aimed at research labs and other markets that require high-performance computing, CxC is parallel programming software designed to simplify the process of writing code that can run on clusters of computers."



    Here's the link (MacCentral): http://maccentral.macworld.com/news/...=1069342314000



    Apparently, EI thinks that there is a market for multi-threading apps in the Mac world.
  • Reply 235 of 375
    nr9nr9 Posts: 182member
    Quote:

    Originally posted by Stoo

    Not true: not all applications are inherently (sensibly) parallelisable.



    Most applications that require computing power will be to some degree parallelisable.



    You would have to rethink your whole programming model, i.e. what problems should be solved and what problems shouldn't be. We can live in a computing world where most applications are parallel; it's just that we are used to a different programming paradigm today.
  • Reply 236 of 375
    Quote:

    Originally posted by TJM

    Apparently, EI thinks that there is a market for multi-threading apps in the Mac world.



    There is:

    Quote:

    Application areas include: cellular automata, artificial neural networks, fluid dynamics, particle dynamics and other numerical applications.



    Folks involved in supercomputing applications all over the country are interested in building systems like VT's Big Mac because for the first time ever really serious computing power is available at relatively low cost. (It's unheard of for a system that breaks 10 TFlops to cost as little as VT's.) EI apparently realized it would be trivial to port their products to OS X and it would open up a significant market for them.



    Amorph:

    Yes the example was rather extreme.



    But sometimes you have to go to extremes to make a point.
  • Reply 237 of 375
    Quote:

    Originally posted by Nr9

    We can live in a computing world where most applications are parallel; it's just that we are used to a different programming paradigm today.



    What do you mean we, white man?



    In the world I live in, saying it doesn't make it so.
  • Reply 238 of 375
    wizard69wizard69 Posts: 13,377member
    Quote:

    Originally posted by Tomb of the Unknown

    Ok, you've basically ignored my arguments or ignored their context and I've really no interest in correcting your assumptions beyond addressing this bit:







    It is not a question of ignoring your arguments; the problem is that your arguments are invalid and apparently not based on contemporary knowledge in the field.

    Quote:





    Of course there are PPC clusters out there. The #3 in the world is VT's "X". So what?



    The point is that you have been arguing that the described technology is impossible because a new instruction set is required, or new hardware is required, or a new operating system is required. Clusters in general, and the PPC cluster in particular, should demonstrate clearly that it is possible to implement the discussed technology WITHOUT the need for instruction set modifications or special processors.



    Quote:





    Point to one PPC cluster that will run Word for Mac OS X. There aren't any. Even VT's "Big Mac" is not running an OS X image. Can you run Word for Mac OS X on one of VT's nodes? Yes, of course you can. But you can't dispatch the job to it from a head node and you'll have a bit of a problem with scrolling through your documents since there's no monitor, but sure you can.



    First off, tcfslides.pdf states clearly on page 13 that the cluster is running OS X. This comes right from VT's web site, http://computing.vt.edu/research_computing/terascale/ if you will. So the only thing you would need to run Word, besides a licensed copy, is a keyboard and video screen. Well, that and getting by the system administrator! So at least we agree that the nodes are capable of running conventional code while being part of a cluster. There is no new instruction set to be dealt with to support legacy applications. At the same time you have support for message-passing programs.

    Quote:



    As I see it, the problem you have is that you don't seem to understand what it is that a cluster does, and like Nr9, you have confused clustering with Cell.



    At this point I apparently have a rather good understanding of what a cluster does; it is not a mystery. If you really wanted to, you could build one in your basement. How you would make use of it is another matter.



    It has become apparent that Nr9 is not confused at all; whether he is just pulling our legs or actually has a line on good information is yet to be seen. Nothing he has described is impossible, and it could very well be a future path of development.



    I do know a couple of things, though. One is that Apple will have a hard time stuffing a 970 into a laptop even with a die shrink. The second is that Apple is probably in the best position of any company in the computing business to be able to successfully launch such a machine. This is largely a result of previous efforts to support the G4 in dual-processor configurations.

  • Reply 239 of 375
    I think one problem with introducing a (hypothetical) "cluster-on-a-motherboard" laptop is that if the individual processors are slow, that will severely limit the performance of single-threaded tasks. Saying that people will have to change their apps is all well and good, but would anyone spend money on a product whose performance depends on app developers switching to a new paradigm?

    Switching from SMP-style multithreading to an MPI-style multiple-process model is probably not trivial for most applications.

    I'm not saying it's impossible, but I would find the appearance of the technology in a shipping Apple laptop in the next year surprising.

    I see few advantages to MPI as opposed to NUMA SMP, unless we're talking about completely separate address spaces (i.e. separate machines networked together). We will probably see an MPI API push by Apple (for people who want their apps to be clusterable) before we see cluster-on-a-board computers.
  • Reply 240 of 375
    bartobarto Posts: 2,246member
    Quote:

    Originally posted by dfryer

    Saying that people will have to change their apps is all well and good, but would anyone spend money on a product whose performance depends on app developers switching to a new paradigm?



    That's the thing. This design is the future, but Apple may not have the luxury of continuing along their current path. Look at the introduction of the dual processor G4s (in fact the introduction of the G4 in the first place). Apple had no other choice.



    Quote:

    Originally posted by dfryer

    Switching from SMP-style multithreading to an MPI-style multiple-process model is probably not trivial for most applications.



    The great thing about MPI is that it can be adopted gradually. You can slowly port an entire application to MPI; you don't have to do it all at once (unlike, say, porting an application to Carbon).
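    Barto's gradual-porting point can be sketched like this: one hot routine grows an optional parallel path while the rest of the application stays ordinary serial code. Hypothetical Python, with a process pool standing in for MPI ranks; the function names are made up for illustration.

```python
# Gradual porting: one hot spot gets a parallel path; everything else
# stays untouched serial code. (A Pool stands in for MPI ranks here.)
from multiprocessing import Pool

def square(x):                    # the unit of work, picklable for the pool
    return x * x

def hot_loop(data, pool=None):    # the one routine that was "ported"
    if pool is None:              # serial fallback: the old behaviour
        return [square(x) for x in data]
    return pool.map(square, data) # parallel path, added incrementally

def render_page(page):            # untouched serial code elsewhere
    return "page-%d" % page

if __name__ == "__main__":
    print(hot_loop([1, 2, 3]))            # works with no pool at all
    with Pool(2) as pool:
        print(hot_loop([1, 2, 3], pool))  # same answer, spread across workers
```

    Because the serial fallback keeps working, the port can ship at any stage, which is exactly the property that an all-at-once API migration like Carbon lacked.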



    Quote:

    Originally posted by dfryer

    I see few advantages to MPI as opposed to NUMA SMP, unless we're talking about completely separate address spaces (i.e. separate machines networked together) We will probably see a MPI API push by apple (for people who want their apps to be clusterable) before we see cluster-on-a-board computers.



    Again, what if Apple doesn't have a choice in this? What if the (standard) G4 has topped out at 1.3GHz, and by the time Crolles has ramped up, Motorola no longer cares about the G4? And you answered your own question.



    That said, I do have problems with much of what Nr9 is saying, specifically about the inflexibility of the design. I would be shocked if there was any incompatibility with current applications, including the inability to gain performance from multithreading applications. But Cell-style architecture is the future. Even if it doesn't appear next year, it will appear in future Macs.



    Barto