Massively parallel PowerMac

(This message is pure speculation; I have absolutely no information about prototypes built or projects under development at Apple.)



With IBM/Sony developing Cell, a CPU designed for use in massively parallel configurations, I was wondering if Apple could use such an architecture in future computers.



After all:

- Mac OS X is an SMP-aware, UNIX-based OS, and effort has gone into smart kernel locking techniques to boost performance on SMP/SMT configurations.

- Apple has some experience with SMP, since it has been shipping dual-processor configurations for quite a long time.

- IBM should be able to build cheap Cell-like chips for Apple: basically, you'd just have to rework parts of the PPC750 to boost I/O and SMP performance, add an AltiVec unit, and you're done. Such a CPU could sell pretty well in the embedded market too, which would be a good point for IBM.

- This would help solve heat problems, since it's easier to cool several warm chips than one very hot chip.

- This would make Apple less dependent on the arrival of new chips: if IBM doesn't manage to manufacture faster chips, Apple could just increase the number of chips on the mobo as their price comes down.



Of course:

- The mobo design would be a little trickier... Nothing impossible though, and those designs would be more flexible than the current ones (laptops and desktops could share most of their mobo designs).

- Not every Mac OS X application is multithreaded, though it does not seem impossible to build compilers that insert OpenMP directives where necessary (basically, that's the same kind of job as writing a compiler that vectorizes source code, and such compilers already exist - Intel's ICC and pre-release versions of GCC).
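To make the OpenMP point concrete, something like the loop below is all it takes on a compiler that supports OpenMP (a minimal sketch; the array and the arithmetic are invented purely for illustration):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

static float samples[N], out[N];

int main(void)
{
    /* Fill the input with some dummy data. */
    for (int i = 0; i < N; i++)
        samples[i] = (float)i * 0.001f;

    /* A single directive lets the compiler spread this loop across
       however many processors (or cores) the machine has. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = samples[i] * samples[i] + 1.0f;

    printf("out[42] = %f (up to %d threads)\n",
           out[42], omp_get_max_threads());
    return 0;
}
```

On a GCC that supports OpenMP this builds with a switch like -fopenmp; the point is that the source stays a plain serial loop and the directive carries the parallelism.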



What do you think?

Comments

  • Reply 1 of 43
    Tadow!



    SLI, RAID and dual everything!



    WooHoo!!!!



    I only wish. Not that it is impossible, though. This IS Apple.



    Alienware is already heading that way. We've got dual CPUs; RAID hard drives and SLI video as standard would be tremendous for a top-of-the-line animal.
  • Reply 2 of 43
    I figured so.



    I just wanted to rant about something that is possible currently. Just a dream list.



    The "cell" concept is very promising, but, in order to accomplish what computers do for the next two years will be more expensive than having dedicated hardware for various operations vs. a microprocessor pool that is a jack of all trades, albeit with a better ability to allocate resources where needed.



    I just don't see that for at least two years. The dual everything concept I think is more probable.
  • Reply 3 of 43
    Quote:

    Originally posted by 9secondko

    I figured so.



    I just wanted to rant about something that is possible currently. Just a dream list.



    The "cell" concept is very promising, but, in order to accomplish what computers do for the next two years will be more expensive than having dedicated hardware for various operations vs. a microprocessor pool that is a jack of all trades, albeit with a better ability to allocate resources where needed.



    I just don't see that for at least two years. The dual everything concept I think is more probable.




    I wasn't talking about replacing all the chips in the computer with standard CPUs; I was only talking about replacing the main CPU (or the two CPUs in duals) with 16, 32, or 64 very cheap chips designed to work well together.



    You're right about doing the same kind of things to GPUs and other components.



    The fact is that tons of embedded systems use such configurations. Personal computers don't use them because developers would have to rewrite most of their software to use all of the power of a massive SMP configuration, but I'm sure we'll be there in a few years.
  • Reply 4 of 43
    trtam Posts: 111 member
    Could you do multiple processors?
  • Reply 5 of 43
    Quote:

    Originally posted by trtam

    Could you do multiple processors?



    Not sure I understand the question... Do you mean "Is it possible to build configurations with multiple processors?" If so, the answer is yes, of course!
  • Reply 6 of 43
    rhumgod Posts: 1,289 member
    I think Apple's direction is software - xGrid.
  • Reply 7 of 43
    synp Posts: 248 member
    Quote:

    Originally posted by The One to Rescue

    I wasn't talking about replacing all the chips in the computer with standard CPUs; I was only talking about replacing the main CPU (or the two CPUs in duals) with 16, 32, or 64 very cheap chips designed to work well together.



    How about something a little different: Keep the 1-2 big CPUs (1.8-2.5 GHz) and add a bank of 16-64 weaker CPUs (500-1000 MHz). These weaker CPUs will be simpler, will still have L1 and L2 cache, and may not have Altivec.



    What the operating system does is schedule threads on processors. Threads would be sorted into two types: regular threads, which get scheduled only on the strong processors (this is especially good for foreground applications), and weak threads, which can be scheduled on any processor with a preference for a weak one (i.e. if a regular thread is ready to run, a weak thread will not get a strong processor).



    This will allow a responsive GUI. Applications that can thread easily will create their threads as weak (and be able to run on many processors at once). Applications that don't need threading will run on the main processor - there is no good reason to have multiple executing threads in a mail application.
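    Purely to illustrate the policy being described, here is a toy user-space sketch of the two-tier placement decision (the thread names, CPU counts, and the pick_cpu() helper are all invented for illustration; a real scheduler lives in the kernel and is far more involved):

```c
#include <stdio.h>

/* Toy model: "regular" threads may only run on strong CPUs, "weak"
   threads can run anywhere but prefer the weak CPUs. */

enum kind { REGULAR, WEAK };

struct toy_thread { const char *name; enum kind kind; };

struct cpu_pool {
    int strong_total, strong_used;
    int weak_total,   weak_used;
};

/* Return the CPU index the thread should run on, or -1 to keep it on
   the run queue. Strong CPUs are 0..strong_total-1, weak CPUs follow.
   (A real scheduler would also check whether a regular thread is
   waiting before letting a weak thread take a strong CPU.) */
static int pick_cpu(const struct toy_thread *t, struct cpu_pool *p)
{
    if (t->kind == WEAK && p->weak_used < p->weak_total)
        return p->strong_total + p->weak_used++;
    if (p->strong_used < p->strong_total)
        return p->strong_used++;
    return -1;
}

int main(void)
{
    struct toy_thread threads[] = {
        { "ui-main",    REGULAR },
        { "render-1",   WEAK    },
        { "render-2",   WEAK    },
        { "mail-fetch", WEAK    },
    };
    struct cpu_pool pool = { 2, 0, 16, 0 };   /* 2 strong, 16 weak */

    for (unsigned i = 0; i < sizeof threads / sizeof threads[0]; i++)
        printf("%-10s -> cpu %d\n", threads[i].name,
               pick_cpu(&threads[i], &pool));
    return 0;
}
```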
  • Reply 8 of 43
    More chips == more complexity == more cost. Highly parallel systems that use many chips like you describe are very hard to build and program for in an efficient manner, which is why you don't see any of them in common use. A massively parallel system has hundreds or thousands of small processors, and typically only runs highly specialized scientific problems. We're not going to see such a thing from Apple.



    What we will see is that as processor transistor counts grow, more and more cores will be put on a single chip. Systems built around a single-chip (or dual-chip) configuration will not be any more complex than today's systems from a hardware perspective. The software will have to adapt, but in the 2-4 core range individual programs don't necessarily have to be multi-threaded for overall system performance to benefit -- Apple can take advantage of the hardware at the OS level. At higher numbers of processors we'll need more threaded applications, and those will start to show up in larger numbers as the tools for building them, and the market advantages of supporting that kind of hardware, improve.



    xGrid is primarily for distributed computations, but it may play a role on a multiprocessor as well.
  • Reply 9 of 43
    Quote:

    Originally posted by Programmer

    More chips == more complexity == more cost. Highly parallel systems that use many chips like you describe are very hard to build and program for in an efficient manner, which is why you don't see any of them in common use.



    I'm not talking about now; I'm talking about 3-5 years from now. Moore's law won't be eternal; you've got to find other ways to improve performance than doubling the number of transistors on a chip.



    I think the idea of 2 big CPUs with several small CPUs around them is good, too... That's what most game consoles use/used, and that worked pretty well.
  • Reply 10 of 43
    Quote:

    Originally posted by The One to Rescue

    ...you've got to find other ways to improve performance than doubling the number of transistors on a chip.



    Why?



    The economics of putting more chips on a board haven't really changed since they started doing that. If anything the cost has been increasing because of the bus speed requirements between the chips (more layers, better quality).



    We might see PentiumPro-like packaging of multiple chips, but even that is pretty expensive.



    Quote:

    I think the idea of 2 big CPUs with several small CPUs around them is good, too... That's what most game consoles use/used, and that worked pretty well.



    Not really -- most game consoles until now have given the developers a single CPU to use, and a graphics processor. The PS2 had an IO processor and an oddly slaved vector processor in addition to that, and the Dreamcast was just plain odd. The problem with a non-symmetrical configuration like that is that the developer typically needs to know and codify that knowledge into the software, which has the side effect of limiting future hardware configurations to those the software can support. A symmetric arrangement simplifies life in many ways. We are seeing pools of symmetric specialized processors of multiple kinds in a single system, with the OS (or driver) managing the mapping of work to processor (GPUs are a current example, Cell is a future example). These are multiple "cores" per chip, and the hardware is chip-level, not system-level, design. We will likely see Apple using this kind of technology eventually, but not trying to assemble it themselves out of separate CPU chips.
  • Reply 11 of 43
    Quote:

    Originally posted by Programmer

    Why?



    The economics of putting more chips on a board haven't really changed since they started doing that. If anything the cost has been increasing because of the bus speed requirements between the chips (more layers, better quality).



    We might see PentiumPro-like packaging of multiple chips, but even that is pretty expensive.







    Why? Because there will be a need to do that in the not-so-near future, IMO. And it's still expensive because research on CPUs is far more active than research on buses (and trust me, I know some stuff about that)... when CPUs can no longer evolve so quickly, there'll be a need to invest in better solutions for communication between chips.



    Quote:





    Not really -- most game consoles until now have given the developers a single CPU to use, and a graphics processor. The PS2 had an IO processor and an oddly slaved vector processor in addition to that, and the Dreamcast was just plain odd. The problem with a non-symmetrical configuration like that is that the developer typically needs to know and codify that knowledge into the software, which has the side effect of limiting future hardware configurations to those the software can support. A symmetric arrangement simplifies life in many ways. We are seeing pools of symmetric specialized processors of multiple kinds in a single system, with the OS (or driver) managing the mapping of work to processor (GPUs are a current example, Cell is a future example). These are multiple "cores" per chip, and the hardware is chip-level, not system-level, design. We will likely see Apple using this kind of technology eventually, but not trying to assemble it themselves out of separate CPU chips.




    Game consoles achieve performance in some cases comparable to personal computers, even though the hardware is damn cheap and old. Of course, it's a pain in the ass writing software for such systems. But compiler research has been very active, and some of the optimization and work-partitioning will be done automatically in the near future. So why not?
  • Reply 12 of 43
    wizard69 Posts: 13,377 member
    Quote:

    Originally posted by Programmer

    Why?



    The economics of putting more chips on a board haven't really changed since they started doing that. If anything the cost has been increasing because of the bus speed requirements between the chips (more layers, better quality).



    Actually, it is the economics that make such systems plausible. As we move closer and closer to systems contained entirely on one chip, it becomes much easier economically to throw another processor, or a chip of processors, on the board. Now, the arrangement of these additional processors is another issue.



    With today's technology it would be very possible to tack four 603-class processors on a chip - that is, a PPC processor without an outstanding implementation of AltiVec, but multiprocessor-enabled, understanding that any core implemented today would likely be a performance improvement over the 603. Four cores like that running under 2 GHz would provide a significant boost to many systems. Granted, not all would benefit, but a large number of systems would.



    One of my bigger concerns with such systems would be the likely need to go to 64-bit addressing, or extended addressing on 32-bit systems. A 64-bit approach, while a nice mental exercise, is currently stymied by the lack of good 64-bit processors. Maybe the rumored efforts toward a 64-bit laptop chip will produce a core that is power- and size-optimized and could lead to such hardware.

    Quote:



    We might see PentiumPro-like packaging of multiple chips, but even that is pretty expensive.



    I'm not convinced that one would have to go that far. All you really need is high integration on the chips themselves and high-speed communication between chips. We are nearing the stage where it is possible to contain an entire computing system on a die, so for many applications where communication does not have a huge impact, these are very valuable approaches. Sure, this isn't the cure for every possible computing illness, but it will work well for many. AMD is one to watch here, as they are close to implementing what I would consider ideal systems for the future.



    It is not that I would want to see these simple CPUs give up the instruction set we currently have. I would just rather see many cores on a chip and take a tiny performance hit, as opposed to having only two. Thus my statement about a 603-class chip without a super AltiVec unit, containing instead a simpler implementation of the AltiVec instructions. I believe the instruction set needs to be balanced more so than processor performance.



    There is a whole host of system tasks that could just as well run on lower-performance cores. In effect, that is what we will get with SMT: one thread will likely have a priority advantage over another, and the thread running at reduced priority will, in effect, be running on a lower-performance processor. So why not take the approach that many of the threads simply get a lower-performance core?

    Quote:

    ...The problem with a non-symmetrical configuration like that is that the developer typically needs to know and codify that knowledge into the software, which has the side effect of limiting future hardware configurations to those the software can support. A symmetric arrangement simplifies life in many ways.



    While your statements are not wrong as stated, I'm not sure I'd take the same meaning from them that you do. Many common implementations today are not symmetric at all when looked at from the standpoint of thread performance. SMT and all of the other threaded-processor approaches do apply priorities to threads, so threads really are not running with symmetric capabilities. Same with multiprocessor systems, where in some cases one processor runs at a disadvantage to another. Today's operating systems have to take knowledge about thread performance into account when scheduling, so the complexity issue is already there. Frankly, putting as many processors as possible on one chip brings some symmetry back into the equation (at least for the processors on that chip).

    Quote:

    We are seeing pools of symmetric specialized processors of multiple kinds in a single system, with the OS (or driver) managing the mapping of work to processor (GPUs are a current example, Cell is a future example). These are multiple "cores" per chip, and the hardware is chip-level, not system-level, design. We will likely see Apple using this kind of technology eventually, but not trying to assemble it themselves out of separate CPU chips.



    Apple could get a market advantage if they could convince IBM to go along by implementing many cores on a chip. It may very well be possible to implement as many as four cores on one chip while the competition is trying to cram two of their cores on a chip. If such a chip were to exist tomorrow, Apple would be able to use it well immediately with the current system software they have. I think you would see very few complaints about systems with 4 or 8 processors at today's prices.



    Remember: if you build the ball field, people will come.



    Dave
  • Reply 13 of 43
    Quote:

    there is no good reason to have multiple executing threads in a mail application.



    I think many people underestimate how much you can really benefit from threading.



    A mail application can use threads in tons of places. Every time it goes on the network to send or receive mail, that work should be done in a thread (this is especially apparent if you're running on a slow or unreliable network). Any kind of scheduled task, whether user-defined or application-standard, could be handled by one or more threads. Spell-checking should be handled in a thread. Any time the user searches, the search should run in a thread. And future email apps will probably add a bunch of new features and eye candy that belong in their own threads too.



    These are just the things I'm thinking of off the top of my head. There are probably many other things that could be threaded right now, and there will certainly be many more in the future. Basically anything that makes the user wait (or at least has the potential to make the user wait) can be spun off into another thread so that the app remains completely responsive at all times.



    In fact, lots of current apps do fancy coding to make a single thread appear to be doing multiple things at once. A lot of code complexity (and therefore, potential bugginess) can be eliminated by just writing the app with real threads, and as an added bonus, the app automatically takes advantage of multiple processors.
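    To make the pattern concrete, here is roughly what spinning the slow work off looks like with POSIX threads (a minimal sketch; send_message(), on_send_clicked(), and the recipient address are made up, and a real app would return to its event loop rather than sleeping):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the slow, blocking part of sending a message.  In a
   real mail client this would talk SMTP over the network. */
static void send_message(const char *to)
{
    printf("sending mail to %s...\n", to);
    /* ...network I/O happens here, possibly for many seconds... */
}

static void *send_worker(void *arg)
{
    char *recipient = arg;
    send_message(recipient);
    free(recipient);
    return NULL;
}

/* Called from the UI when the user hits "Send".  It hands the slow
   work to a detached worker thread and returns immediately, so the
   event loop (and the inbox) stays responsive. */
static void on_send_clicked(const char *recipient)
{
    pthread_t worker;
    char *copy = strdup(recipient);

    if (pthread_create(&worker, NULL, send_worker, copy) == 0)
        pthread_detach(worker);
    else
        free(copy);        /* fall back: could send synchronously */
}

int main(void)
{
    on_send_clicked("someone@example.com");
    /* A real app would go back to its event loop here; a short sleep
       just gives the worker time to run in this demo. */
    sleep(1);
    return 0;
}
```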



    I'm stuck using Outlook 97 at work on a slow network, on an overworked computer... you have no idea how often Outlook completely freezes up on me just because it's busy doing something. For instance, I can't look at messages in my inbox if Outlook is sending an email with large attachments.
  • Reply 14 of 43
    synp Posts: 248 member
    Quote:

    Originally posted by TrevorD

    I think many people underestimate how much you can really benefit from threading.





    ... and also many people underestimate the problems caused by threading. Multiple threads access the same data. Unlike event-driven programs that can rely on a single execution thread, multi-threaded programs have to constantly worry about re-entrancy, resource serialization and stuff like that.



    To handle this, they have to allocate multiple buffers instead of using static memory, and they have to use serialization primitives like semaphores or mutexes. All of this adds a lot of complexity to every function. OTOH, everything needed to be event-driven can be concentrated into a nice, well-debugged library.
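    This is the kind of serialization being described, sketched with a POSIX mutex (the shared counter is just a stand-in for whatever data the threads really touch):

```c
#include <pthread.h>
#include <stdio.h>

/* Shared state: without the mutex, two threads incrementing the
   counter at the same time can lose updates. */
static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);   /* enter critical region */
        counter++;
        pthread_mutex_unlock(&counter_lock); /* and leave it quickly */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```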



    Unless you need very high performance, and unless your job is very amenable to being split up into chunks, it is far better to use an event model rather than multiple threads.



    That's why I'm proposing non-symmetric multiprocessing. Only apps that can really benefit (think Photoshop and rendering apps) will use the high parallelism. Others can run in a single thread. There's really no need to make a thread and have it sleep waiting for a network event. Even high-performance servers like Apache use only a few threads.
  • Reply 15 of 43
    hiro Posts: 2,663 member
    Yes, threading can add complexity. But the lion's share of things that can be threaded actually don't need constant multiple references to shared data; they merely separate the code likely to block from code that should never, ever be blocked for that reason. Granted, figuring it out isn't for newbies or the faint of heart, but well-designed threaded code will almost always outperform less-threaded code, except in the design corner not requiring shared resources or I/O. Even on a single-CPU machine.
  • Reply 16 of 43
    Quote:

    Originally posted by The One to Rescue

    Of course:

    - The mobo design would be a little trickier... Nothing impossible though, and those designs would be more flexible than the current ones (laptops and desktops could share most of their mobo designs).

    - Not every Mac OS X application is multithreaded, though it does not seem impossible to build compilers that insert OpenMP directives where necessary (basically, that's the same kind of job as writing a compiler that vectorizes source code, and such compilers already exist - Intel's ICC and pre-release versions of GCC).



    What do you think?






    Auto-vectorization doesn't help any with SMP. Vectorization usually involves unrolling loops so that you can take advantage of SIMD parallelism like AltiVec. Auto-threading would be a much, much more complicated task, because the compiler would have to either synchronize access to shared data or analyze the code and prove that synchronization isn't necessary. I hesitate to say that's impossible, but if it isn't, it's close.
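    To illustrate the difference (a made-up example, not from any real codebase): the first loop below is the easy case a vectorizer or auto-threader can split safely, while the second updates shared state on every iteration and needs either locking or a proof that it can be rewritten as per-thread partial sums (a reduction) before it may be parallelized.

```c
#include <stdio.h>

#define N 1000000
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Easy case: each iteration writes a different element and reads
       nothing written by another iteration.  A vectorizer can turn
       this into SIMD code, and an auto-threading compiler could split
       the index range across CPUs without any locking. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Harder case: every iteration updates the same variable.  Run
       naively on several CPUs this races; the compiler must prove it
       can be restructured as a reduction before parallelizing it. */
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += c[i];

    printf("sum = %f\n", sum);
    return 0;
}
```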



    I think the first step we'll see in this direction is the dual-core G5 and possibly the FreeScale dual-core G4. It's likely that when dual-core G5s arrive Apple will have at least one dual dual-core offering which will appear to the OS as four processors. The kernel will need a little tweaking because even though OS X can already scale to more than 2 processors, communication between two threads on the same module through the shared cache is going to be orders of magnitude faster than communication between threads on different chips. Tightly linked threads will gain a lot of performance by running on the same chip.



    The next step will be the PPC derivative of the POWER5. That will introduce SMT and suddenly the dual dual-core setup will appear as 8 processors to the OS.



    It looks like this will start happening next year. I'm not sure that moving to a cell architecture is necessary. The cell idea is nice for economics and throughput, but I think it might suffer on latency.
  • Reply 17 of 43
    Quote:

    Originally posted by Hiro

    well-designed threaded code will almost always outperform less-threaded code, except in the design corner not requiring shared resources or I/O.



    I think you got that backwards.



    There are two broad classes of uses for threads -- those which exist to wait for things to happen (and wake up and do something when it does), and those which exist to compute something (typically something big which you can hopefully divide up into separately solvable parts). I/O usually falls into the first category, since the I/O device is slow and the processor has to wait for it to do its thing (during which the thread sleeps). Shared resources will typically hurt the performance of multi-threaded programs because the threads must synchronize with each other to access the shared resource, and that can slow them down... a lot. Computational threads with no shared resources on a multi-processor machine are where we typically see the big performance wins over single-threaded programs; unfortunately, running the same set of threads on a single-processor machine will result in worse performance than a single-threaded version because of the cost of switching between threads. Fortunately it is relatively easy to create a thread per available hardware thread and thus scale nicely with the available hardware resources. The user interface is a form of I/O and thus can sleep most of the time, allowing the computation to run while the user goes about his business in typical human super-slow-motion.
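    For what it's worth, dividing a computation across one worker per available processor looks something like this (a rough pthreads sketch; the _SC_NPROCESSORS_ONLN query is a common but not universal sysconf name, and the summing workload is purely illustrative):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define N 8000000
static float data[N];

struct slice { long begin, end; double partial; };

/* Each compute thread sums its own slice; no shared writes, so no
   locking is needed until the tiny final combine step. */
static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (long i = s->begin; i < s->end; i++)
        acc += data[i];
    s->partial = acc;
    return NULL;
}

int main(void)
{
    for (long i = 0; i < N; i++)
        data[i] = 1.0f;

    /* One worker per available processor (hardware thread). */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1) ncpu = 1;

    pthread_t *tids = malloc(ncpu * sizeof *tids);
    struct slice *slices = malloc(ncpu * sizeof *slices);
    long chunk = N / ncpu;

    for (long t = 0; t < ncpu; t++) {
        slices[t].begin = t * chunk;
        slices[t].end   = (t == ncpu - 1) ? N : (t + 1) * chunk;
        pthread_create(&tids[t], NULL, sum_slice, &slices[t]);
    }

    double total = 0.0;
    for (long t = 0; t < ncpu; t++) {
        pthread_join(tids[t], NULL);
        total += slices[t].partial;
    }
    printf("%ld workers, total = %f\n", ncpu, total);
    free(tids);
    free(slices);
    return 0;
}
```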
  • Reply 18 of 43
    hiro Posts: 2,663 member
    I think you underestimate threading. Every negative example you give falls into the poorly-designed category; those contentions are the things a well-designed program minimizes. Hardly a convincing argument that threading an application will lower its performance. The sentence you quoted is a tad imprecise and open to misinterpretation, I suppose; maybe this will tighten it up a bit.



    Many changes in threading use happened over the last ten years in avionics design, especially with digital flight control systems replacing push-rod bellcranks and CRT/LCD multi-function displays replacing old-fashioned analog steam gauges and gyro repeaters. You would probably be stunned by how many threads these systems use. Ada may have a couple of language constructs that make threading a tad easier to implement, but the concepts of what to thread when are independent of language. Never let a thread block if it can still do work (e.g. split off I/O and support worker threads); give a thread A (singular) job (this prevents all kinds of potential performance inhibitors during execution); make the critical-region code as small as possible (this prevents unnecessary performance hits to other threads). Also like Ada, POSIX-based threads incur relatively small performance hits when they belong to the same process; there is no expensive full context switch. The first block (extremely expensive) the process avoids pays the bill for many of these cheap in-process thread switches. The more thread execution resources are available, the more the overall performance skews in favor of the well-designed threading.



    That "give a thread A (singular) job" is something most coders have a really hard time getting their mind around and time in the design process to make work. The coding is relatively straight forward if the design is right. The fact most projects have poor and impatient design teams if any at all, which either under-thread or hamfist the threading out of poor understanding doesn't change the fact that properly designed and threaded code will ALWAYS outperform poorly threadded code. "Properly designed and threaded" is the toughest part to get right, but doesn't take magic. And yes sometimes the proper design may be a single thread, but that is becoming increasingly rare.
  • Reply 19 of 43
    Quote:

    Originally posted by synp

    ... and also many people underestimate the problems caused by threading. Multiple threads access the same data. Unlike event-driven programs that can rely on a single execution thread, multi-threaded programs have to constantly worry about re-entrancy, resource serialization and stuff like that.





    I'm not too sure why you give event-driven programs as an example of a good place to use a single thread. Event-driven programs usually benefit the most from a multi-threaded design, even on one processor.



    UI is probably the most common example. If the user pushes a button that starts a time-consuming process, why should the whole interface block until the process is over? Launch a worker thread and immediately return to your event loop, allowing the UI to keep working.



    I think the best example, though, is something like a web server. That's an event-driven program; the HTTP requests are events. Could you imagine a web server that's NOT threaded? Only one request could be serviced at a time.
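    The classic thread-per-connection shape looks roughly like this (a toy sketch with sockets and POSIX threads, hard-coded to port 8080 and a canned reply, with next to no error handling; real servers are considerably more careful):

```c
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One thread per connection: while one request is being read and
   answered, the main thread is already back in accept() waiting for
   the next client.  handle_request() here is just a dummy responder. */
static void *handle_request(void *arg)
{
    int fd = (int)(long)arg;
    const char *reply =
        "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nhello\n";
    char buf[1024];

    if (read(fd, buf, sizeof buf) >= 0)   /* read (and ignore) the request */
        write(fd, reply, strlen(reply));
    close(fd);
    return NULL;
}

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    bind(listener, (struct sockaddr *)&addr, sizeof addr);
    listen(listener, 16);

    for (;;) {
        int client = accept(listener, NULL, NULL);
        if (client < 0)
            continue;
        pthread_t t;
        if (pthread_create(&t, NULL, handle_request,
                           (void *)(long)client) == 0)
            pthread_detach(t);
        else
            close(client);
    }
}
```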



    Massive multi-processing may one day have a role in desktop computing, but if it does, it's a long way out. One area where many less powerful processors could help is heavily graphical user interfaces - especially those with lots of widgets. Take Mail.app for example. When resizing its window, the GUI updates "live". The sizes and positions of all the widgets in the lists, along with the wrapping of the text in the visible mail, have to be recomputed with every mouse event. None of these calculations is that expensive, but multiply them by hundreds or thousands, then repeat for every mouse-drag event, and you can see why even on a fast machine live resizing can be a bit jerky. If you launched a bunch of threads running on separate cores and divided up the work (one thread per paragraph for re-wrapping, one thread per list column for size updating), you could make things much snappier. It's a pretty mundane use of multi-processing, and probably not worth it really, but who knows how complex GUIs will get? The GPU can't do everything.
  • Reply 20 of 43
    Quote:

    Originally posted by Hiro

    I think you underestimate threading. Every negative example you give falls into the poorly-designed category; those contentions are the things a well-designed program minimizes. Hardly a convincing argument that threading an application will lower its performance.



    I'm not under-estimating threading -- I think you're over-estimating how many good software designers there are, and how many problem domains can be cleanly factored as you describe.



    Quote:

    Many changes in threading use happened over the last ten years in avionics design, especially with digital flight control systems replacing push-rod bellcranks and CRT/LCD multi-function displays replacing old-fashioned analog steam gauges and gyro repeaters. You would probably be stunned by how many threads these systems use.



    Not at all, I'm fully aware of what modern heavily threaded real-time embedded systems look like. That is only one set of problem domains, however. It also doesn't cover the full gamut of modern hardware, either.



    Quote:

    Also like Ada, POSIX-based threads incur relatively small performance hits when they belong to the same process; there is no expensive full context switch. The first block (extremely expensive) the process avoids pays the bill for many of these cheap in-process thread switches. The more thread execution resources are available, the more the overall performance skews in favor of the well-designed threading.



    We are talking about threading so I wasn't even considering the cost of a process context switch. A threaded context switch can be fairly expensive on chips like the 970, especially when you calculate the cost in terms of the amount of computation that could have been performed in the same time as the switch.



    Quote:

    The fact that most projects have poor and impatient design teams, if any at all, which either under-thread or ham-fist the threading out of poor understanding, doesn't change the fact that properly designed and threaded code will ALWAYS outperform poorly threaded code. "Properly designed and threaded" is the toughest part to get right, but it doesn't take magic. And yes, sometimes the proper design may be a single thread, but that is becoming increasingly rare.



    I would further qualify your statement to say: "threaded code will ALWAYS outperform poorly threaded code..." in a known hardware/software environment. Unfortunately the PC/Mac environment is rarely known, and the configuration can impact your "proper" design decisions. Getting these decisions right in all cases is not just difficult; it can be impossible. In an embedded system (or a game console) the hardware is cast in stone, but on the PC there is a wide variety of core types, varying numbers of threads per core, varying numbers of cores, configured on different buses, with different memory subsystems and different operating systems. In some cases you might be able to come up with a winning design that is always going to do better than others... but in my experience the performance tradeoffs are manifold.