Simultaneous Multithreading

Posted in Future Apple Hardware, edited January 2014
The Power5 is going to have SMT, which is probably one of the reasons it's up to 4X faster than a Power4 in some functions. That's exciting. Intel is already delivering Hyperthreading to consumers.



When will we get to benefit from this technology? Is this something that can be added to a 90nm 970+ in the next revision, or would we most likely be waiting until a derivative of the Power5 (PPC 980?) hits in 2005?



I would love to see a progression that goes something like



Initial Launch



Single and Dual processors 130nm



Next Revision



90nm Single and Dual Cores



Final Revision before PPC 980



90nm Single & Dual Cores in an SMP config with SMT





This would give the capability to have the "functional" equivalent of an eight-way processor as seen by applications. Very powerful and flexible as well, since you're still only utilizing two physical processors.



Are there any limitations to a scenario such as this?

Comments

  • Reply 1 of 33
    amorph Posts: 7,112 member
    I don't think we'll see a multithreaded PPC before the POWER5 derivative, unless that's one of the "surprises" left to disclose. But seeing as IBM has provided rather detailed diagrams of the 970, I think those have more to do with the technologies, applications and timelines surrounding the CPU proper.



    I've argued that the scenario you lay out is actually a good one for Apple to pursue, because it grants the advantages of an increasingly SMP platform without ever making the motherboard more complicated than a dual processor board would be: All the busses and scheduling and context switching and cache coherence logic ends up on the CPU dies. That will go a long way toward helping Apple to keep costs down.



    Of course, should they decide to, they could set up their dual (physical) CPU board like a blade, with some sort of connector linked to a HyperTransport bus, and then you could build a fabric of linked dual-CPU boards, and scale up even higher. The less cost-conscious IBM could link more than two CPUs together with its own higher-bandwidth busses for real high-end performance.



    I see a great deal of promise in this approach. Two CPUs, each with two cores, each capable of running two threads simultaneously, would be formidable indeed.
  • Reply 2 of 33
    hmurchison Posts: 12,423 member
    Hannibal from Arstechnica wrote a great article on SMT <a href="http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html" target="_blank">Here</a>





    [quote] I see a great deal of promise in this approach. Two CPUs, each with two cores, each capable of running two threads simultaneously, would be formidable indeed <hr></blockquote>



    Yes, and I was shocked to read that there may only be a die-size increase of 5% to support SMT. That's not bad at all.



    IBM is sure to have minimized any Cache Conflicts by the time they add this to the PPC 9xx processor.



    SMT-enabled systems seem to be naturals for throughput. I read that Sun is working on 4-core SPARCs with SMT in each core.



    I think we'll be there right when we need to be. However, since Intel is shipping P4s with Hyperthreading already, Apple will have to ensure that they are aggressive in keeping the Powermacs... "Powered".



    Thanks for the tidbits, Amorph. I think I will start my warchant for dual dual-core SMT systems!
  • Reply 3 of 33
    amorph Posts: 7,112 member
    Keep in mind that the HT P4 is as much a workaround for the P4's lack of support for SMP as it is anything else. And although the performance boost on threaded apps is definitely worth the transistor cost, it's not hard to do better.



    A hyperthreaded (or SMT, if you prefer) successor to the 970 would simply use it as another way to get parallelism, in concert with (not instead of) multicore and multi-CPU configurations. IBM's approach is more scalable and likely to be more powerful.
  • Reply 4 of 33
    powerdoc Posts: 8,123 member
    What is the difference between HT and MT? I have read about it somewhere, but I did not understand the difference very well.
  • Reply 5 of 33
    whisper Posts: 735 member
    [quote]Originally posted by Powerdoc:

    <strong>What is the difference between HT and MT, i have read somewhere, but i did not understand very well the difference.</strong><hr></blockquote>

    Normally, a CPU can only actually do one thing at a time (ignoring pipelining). Hyperthreading is Intel's marketing name for a way to let a single CPU work on two or more (just two in Intel's case) threads at the same time. If their implementation didn't appear to suck so badly, it would be a really big deal. Someone here quoted someone else as saying there was a 10% or so speedup. I think I remember reading that IBM was expecting an 80% boost when they release their version, and the implementation that Sun is supposed to be working on will let a single CPU execute either 4 or 8 (can't remember which) threads simultaneously. Multithreading just refers to having multiple threads in an application. Make sense?
  • Reply 6 of 33
    @homenow Posts: 998 member
    [quote]Originally posted by Whisper:

    <strong>

    Normally, a CPU can only actually do one thing at a time (ignoring pipelining). Hyperthreading is Intel's marketing name for a way to let a single CPU work on two or more (just two in Intel's case) threads at the same time. If their implementation didn't appear to suck so badly, it would be a really big deal. Someone here quoted someone else as saying there was a 10% or so speedup. I think I remember reading that IBM was expecting an 80% boost when they release their version, and the implementation that Sun is supposed to be working on will let a single CPU execute either 4 or 8 (can't remember which) threads simultaneously. Multithreading just refers to having multiple threads in an application. Make sense?</strong><hr></blockquote>



    The software has to be multithreaded to take full advantage of this. OS X is fully capable of taking advantage of MT and MP. Ideally, all the software that could take advantage of MT would be written to do so, so that tasks could be sent to as many processors/cores/thread execution nodes (whatever the name is) as possible to get the work done.
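
    A minimal sketch of what that looks like in practice, using Python's standard library (the worker function and chunking scheme are hypothetical, purely for illustration). One worker is spawned per logical processor: os.cpu_count() counts hardware threads, not physical chips, so an SMT machine reports more workers than it has cores. Note that CPython's interpreter lock means pure-Python threads won't actually scale like native ones, so the point here is only the structure:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def checksum(chunk):
    # Hypothetical work unit -- stands in for any task that can run
    # independently on its own slice of the data.
    return sum(chunk)

def parallel_checksum(data, workers=None):
    # One worker per logical processor: a dual-CPU, dual-core box with
    # two-way SMT would report 8 here.
    workers = workers or os.cpu_count() or 1
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(checksum, chunks))
```

    The same structure carries over to native threads, where the OS scheduler is free to place each worker on a different core or hardware thread.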
  • Reply 7 of 33
    kupan787 Posts: 586 member
    Quick question. I am wondering how the performance on a multithreaded task breaks down. Say you had the following four systems:



    single processor - 4.0 GHz

    single processor with HT - 3.0 GHz

    single processor with dual cores - each core is rated at 2.0 GHz

    dual processor - each proc is rated at 2.0 GHz



    Obviously the single-processor machine would be the worst performer on a multithreaded task, but I am a bit confused about the next three. Wouldn't the dual-processor machine perform the best? I would think that it would go just as I listed it (from worst to best performer). If I am wrong (which I may be), could someone explain why?



    And this is only referring to multithreaded tasks. If we were talking about single threads, that would be a whole different story (actually the reverse of the order I listed, right?)
  • Reply 8 of 33
    hmurchison Posts: 12,423 member
    From what I've read I don't think so.



    When you have dual cores, much of the hassle of cache coherency seems to be handled in silicon, and very efficiently at that.



    A dual-core SMT processor should be able to maximize SMT, if what I've read is correct (or if I'm understanding it correctly).



    Please anyone with much more knowledge shed some light.
  • Reply 9 of 33
    programmer Posts: 3,458 member
    A single core has to pay more for context switching, but some software is written to fall back on a non-threaded version of the computation when there is only one processor.



    A multi-chip SMP system has to communicate between chips about which memory each chip is modifying. This can introduce a fair bit of overhead, but it depends on the algorithms in use. For the G4 it's even worse, because they share a bus.



    A multi-threaded processor (hyperthreading is Intel's name for it) has to share its execution units between the threads. If each thread is getting good utilization of all the execution units then running multiple threads isn't going to speed you up much because the threads are trying to use the same units. If you have a lot of bubbles or stalls in your pipelines then multi-threaded hardware is a big win. It starts to make a lot of sense to add more execution units in this situation as well.



    A multi-core processor (like POWER4) has efficient cache-level communication and doesn't have to share execution units. It's also expensive from a transistor-budget point of view.



    In the future I think we'll see multi-core chips with hardware multi-threading in each core and larger numbers of execution units. Probably a big shared on-chip L2, plus per core L1's as well.
  • Reply 10 of 33
    kupan787 Posts: 586 member
    [quote]Originally posted by Programmer:

    <strong>In the future I think we'll see multi-core chips with hardware multi-threading in each core and larger numbers of execution units. Probably a big shared on-chip L2, plus per core L1's as well.</strong><hr></blockquote>



    So a multi-core chip is better off than a dual-processor system (two single-core chips)? Is it possible to make quad-core chips? Or would it be more likely that we see dual dual-core chips (2 dual-core chips)?
  • Reply 11 of 33
    wmf Posts: 1,164 member
    SiByte is doing quad-core and Sun is doing octo-core.
  • Reply 12 of 33
    powerdoc Posts: 8,123 member
    [quote]Originally posted by Whisper:

    <strong>

    Normally, a CPU can only actually do one thing at a time (ignoring pipelining). Hyperthreading is Intel's marketing name for a way to let a single CPU work on two or more (just two in Intel's case) threads at the same time. If their implementation didn't appear to suck so badly, it would be a really big deal. Someone here quoted someone else as saying there was a 10% or so speedup. I think I remember reading that IBM was expecting an 80% boost when they release their version, and the implementation that Sun is supposed to be working on will let a single CPU execute either 4 or 8 (can't remember which) threads simultaneously. Multithreading just refers to having multiple threads in an application. Make sense?</strong><hr></blockquote>

    Thanks, I understand: HT is HypersuckingMT

  • Reply 13 of 33
    the swan Posts: 82 member
    I don't know the details of this, but HT or SMT is only going to speed you up as much as you've got pipeline bubbles to accommodate additional execution. If you had one thread executing, and all branch predictions were correct, and all memory accesses were L1 cache hits, then you're not going to be able to cram in execution of another thread. HT and SMT are people finally saying, "Look, we can't do perfect branch prediction, so let's try to stop wasting cycles." So how much faster a processor is with SMT or HT is somewhat inversely proportional to how good your branch prediction and cache algorithms are.



    J
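
    To make the "wasting cycles" point concrete, here is a hedged sketch (hypothetical function names; in Python the cost is invisible, since the mispredict penalty only shows up in native code, but the transformation is the same). The first version takes a data-dependent branch every iteration, which a predictor gets wrong about half the time on random input; the second computes the same answer with nothing to mispredict:

```python
def branchy_count(xs, threshold):
    # Data-dependent branch: on random input the hardware predictor
    # guesses wrong roughly half the time, and each miss flushes the
    # pipeline -- exactly the kind of bubbles SMT exists to fill.
    n = 0
    for x in xs:
        if x > threshold:
            n += 1
    return n

def branchless_count(xs, threshold):
    # Same answer with the comparison used as a 0/1 value instead of
    # a branch, so there is nothing left to mispredict.
    return sum(x > threshold for x in xs)
```

    The better your code (or compiler) is at avoiding the first pattern, the less headroom SMT has to exploit, which is the swan's inverse-proportionality point.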
  • Reply 14 of 33
    ast3r3x Posts: 5,012 member
    Man, I love reading stuff like this. You learn so much.
  • Reply 15 of 33
    airsluf Posts: 1,861 member
  • Reply 16 of 33
    eskimo Posts: 474 member
    [quote]Originally posted by kupan787:

    <strong>



    So a multi-core chip is better off than a dual-processor system (two single-core chips)? Is it possible to make quad-core chips? Or would it be more likely that we see dual dual-core chips (2 dual-core chips)?</strong><hr></blockquote>



    Yes, it's better from the standpoint that the two cores can communicate with each other at speeds on the order of the core clock (GHz) instead of at typical bus speeds (MHz). The limit to the number of cores you can fit on a single die is a function of manufacturability. There is a limit to the amount of silicon one can devote to a single chip and still expect yields and cost variables to allow one to sell it for the prices demanded at the consumer PC level. Also, as you increase the number of cores on a single die, you vastly increase the complexity of the packaging technology needed, which again adds to your costs. In the consumer space you will not see dual-core solutions before the 90nm technology node, and more probably 65nm.
  • Reply 17 of 33
    programmer Posts: 3,458 member
    It's also not just an issue of branch misprediction. There are many kinds of stalls between instructions of one instruction stream, and a full multi-threaded hardware implementation ought to be able to fill in most of these stalls as well.



    Consider the case of a sequence of instructions where each instruction uses the output of the previous instruction. Normally the dependent instruction(s) waits until the previous one generates the required value, and this waiting introduces a bubble into the pipeline. If another thread were running, its instructions could be inserted into the pipeline between the dependent ones.



    Execution units are generally not created equal: there are simple integer units, complex integer units, floating point units, simple vector units, complex vector units, floating point vector units, vector permute units, branch units, load/store units... and who knows what else in the future. By analysing which units are the least likely to have bubbles, the processor designer could add more of just those, which would increase the idle capacity and thus make "more room" for the multiple threads to run. This will also benefit some algorithms that run in a single thread and can use the extra execution units.



    There are other resources in question too -- a multithreading scheme can use the shared rename register pool, which makes it even more effective to increase that pool's size. Normally there is a point of diminishing returns for the average algorithm, where adding extra rename registers just doesn't buy you much... but there will always be some algorithms which do benefit from more. In a multi-threaded processor you can use many more rename registers because, on average, you will consume that many via two threads... but when an algorithm that uses a lot of them runs, it can hog them from the other thread and it will run faster. This applies to all sorts of things like caches, branch history tables, lookaside tables, etc. Basically it shifts all the points of diminishing returns, and since those things are all algorithm dependent, some of your single-threaded algorithms will benefit.
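
    The dependency-chain case above can be sketched in code (illustrative only; in Python the win is invisible, but in compiled code one accumulator forms a single serial chain, while two accumulators give the hardware two independent chains to overlap -- the same slack a second SMT thread would otherwise soak up):

```python
def chained_sum(xs):
    # One accumulator: every addition depends on the previous result,
    # so the pipeline stalls waiting for each value -- the bubble-heavy
    # pattern described above.
    total = 0.0
    for x in xs:
        total += x
    return total

def interleaved_sum(xs):
    # Two accumulators: the two chains are independent, so the hardware
    # can overlap them instead of leaving execution slots idle.
    a = b = 0.0
    for i in range(0, len(xs) - 1, 2):
        a += xs[i]
        b += xs[i + 1]
    if len(xs) % 2:
        a += xs[-1]
    return a + b
```

    SMT achieves the second shape automatically: instead of the programmer splitting one chain in two, the core interleaves instructions from a second, independent thread.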
  • Reply 18 of 33
    g::masta Posts: 121 member
    [quote] It's also not just an issue of branch misprediction. There are many kinds of stalls between instructions of one instruction stream, and a full multi-threaded hardware implementation ought to be able to fill in most of these stalls as well.



    Consider the case of a sequence of instructions where each instruction uses the output of the previous instruction. Normally the dependent instruction(s) waits until the previous one generates the required value, and this waiting introduces a bubble into the pipeline. If another thread were running, its instructions could be inserted into the pipeline between the dependent ones.



    Execution units are generally not created equal: there are simple integer units, complex integer units, floating point units, simple vector units, complex vector units, floating point vector units, vector permute units, branch units, load/store units... and who knows what else in the future. By analysing which units are the least likely to have bubbles, the processor designer could add more of just those, which would increase the idle capacity and thus make "more room" for the multiple threads to run. This will also benefit some algorithms that run in a single thread and can use the extra execution units.



    There are other resources in question too -- a multithreading scheme can use the shared rename register pool, which makes it even more effective to increase that pool's size. Normally there is a point of diminishing returns for the average algorithm, where adding extra rename registers just doesn't buy you much... but there will always be some algorithms which do benefit from more. In a multi-threaded processor you can use many more rename registers because, on average, you will consume that many via two threads... but when an algorithm that uses a lot of them runs, it can hog them from the other thread and it will run faster. This applies to all sorts of things like caches, branch history tables, lookaside tables, etc. Basically it shifts all the points of diminishing returns, and since those things are all algorithm dependent, some of your single-threaded algorithms will benefit. <hr></blockquote>



    OK... now in English, please.

  • Reply 19 of 33
    the swan Posts: 82 member
    Programmer, thanks for taking the time to reply.



    Airsluf, if you have a million functional units and one pipeline, you can only fetch so many instructions through it. Your number of functional units is irrelevant if you can only fetch one instruction per clock. If one thread keeps the pipe full (which is unlikely to actually happen, for reasons that Programmer went through), then no one else gets to execute at the same time.



    J
  • Reply 20 of 33
    mmicist Posts: 214 member
    [quote]Originally posted by Programmer:

    <strong>

    There are other resources in question too -- a multithreading scheme can use the shared rename register pool, which makes it even more effective to increase that pool's size. Normally there is a point of diminishing returns for the average algorithm, where adding extra rename registers just doesn't buy you much... but there will always be some algorithms which do benefit from more. In a multi-threaded processor you can use many more rename registers because, on average, you will consume that many via two threads... but when an algorithm that uses a lot of them runs, it can hog them from the other thread and it will run faster. This applies to all sorts of things like caches, branch history tables, lookaside tables, etc. Basically it shifts all the points of diminishing returns, and since those things are all algorithm dependent, some of your single-threaded algorithms will benefit.</strong><hr></blockquote>



    Rename registers aren't visible to the program; architectural registers are (for PPC: 32 integer, 32 floating point, 32 vector, and a few odds and ends). Multithreading adds a second logical set of independent architectural registers for the second thread to use, but this doesn't mean you have to add more rename registers. In an out-of-order execution processor, usually all registers are rename registers, that is, registers which can be associated with a particular architectural register at a particular point in the instruction stream. The number of rename registers should be equal to the maximum possible number of in-flight instructions, so if you can have a maximum of 100 integer instructions in flight, each of which can write to at most one register, you need 100 integer rename registers, regardless of the number of threads executing. Each rename register in a multithreaded processor must carry with it more state information, however, to indicate to which thread the register belongs.

    You only need to increase the number of rename registers if you lengthen the pipeline or add functional units.



    michael