Why not .13??

amyklai · March 4, 2002 12:01PM

Actually, SMT helps extract parallelism (by feeding instructions from multiple threads to the processor instead of taking instructions from only one thread at a time), so it is a good tech for wide architectures, regardless of pipeline depth.

I don't see how it should affect long pipelines more than short pipelines, cause it won't help with branch prediction and won't prevent flushes and such.

[ 03-04-2002: Message edited by: amyklai ]

amyklai · March 4, 2002 12:11PM

[quote] I wonder how this would be possible, given that even with SMT, there's still just a single pipeline that can only fetch and retire a given number of instructions per clock. <hr></blockquote>

The EV8 has a lot of parallel pipes (all modern microprocessors have multiple pipelines, and the Alpha is probably one of the widest designs), so it's crucial to keep all of them busy if you want to get max performance out of the chip.

That's where SMT comes into play, because you can only extract so much parallelism from one thread (this is exactly the problem that Intel tried to solve with EPIC, btw).

For a super-wide architecture, SMT can result in huge speed gains, because without it, there's a good chance that a lot of the pipelines will be instruction-starved most of the time.

[ 03-04-2002: Message edited by: amyklai ]
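A rough way to see the wide-machine argument is a toy model of issue slots. This is only a sketch under stated assumptions: the per-cycle "ready" counts and the widths of 8 and 2 are made up for illustration, not measurements of the EV8 or any real chip, and the model ignores that an instruction left unissued in one cycle would still be ready in the next.

```python
# Toy issue-slot model: each thread offers some number of independent
# (ready) instructions per cycle; the core issues at most `width` per cycle.
# All numbers below are hypothetical and chosen only for illustration.

def issued_per_cycle(width, ready_counts):
    """Instructions issued in one cycle: total offered, capped by issue width."""
    return min(width, sum(ready_counts))

def total_issued(width, threads):
    """Sum of issued instructions over all cycles for the given threads."""
    cycles = len(threads[0])
    return sum(
        issued_per_cycle(width, [t[c] for t in threads])
        for c in range(cycles)
    )

def utilization(width, threads):
    """Fraction of issue slots actually used."""
    return total_issued(width, threads) / (width * len(threads[0]))

# Hypothetical per-cycle ready counts (limited by dependencies, stalls, ...).
thread_a = [2, 1, 3, 0, 2, 1, 0, 3]
thread_b = [1, 3, 0, 2, 2, 0, 3, 1]

wide_single = total_issued(8, [thread_a])
wide_smt = total_issued(8, [thread_a, thread_b])
narrow_single = total_issued(2, [thread_a])
narrow_smt = total_issued(2, [thread_a, thread_b])

print("8-wide:", wide_single, "->", wide_smt)
print("2-wide:", narrow_single, "->", narrow_smt)
print("8-wide utilization:", utilization(8, [thread_a]),
      "->", utilization(8, [thread_a, thread_b]))
```

With this made-up trace the 8-wide core goes from 12 to 24 issued instructions (2x), while the 2-wide core goes from 10 to 15 (1.5x), because the narrow core was less starved to begin with.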

razzfazz · March 4, 2002 4:15PM

[quote]Originally posted by amyklai:

Actually, SMT helps extract parallelism (by feeding instructions from multiple threads to the processor instead of taking instructions from only one thread at a time), so it is a good tech for wide architectures, regardless of pipeline depth.<hr></blockquote>

Well, again, if you can keep your EUs reasonably fed without SMT, adding it doesn't help much. Now, looking at <a href="http://arstechnica.com/cpu/01q2/p4andg4e/figure3.jpg"; target="_blank">the G4's architecture</a>, it seems to me that a single integer-heavy thread shouldn't have much difficulty keeping the respective two EUs busy. SMT would only really help here if the second thread primarily used other EUs, such as the FPU. So, to be really effective, you'd always have to pair threads from two different groups (integer, FP, AltiVec). While this is entirely possible, it's not the common case.

Now, take the P4's integer units, for example. Here we have a complex IEU, and two identical double-pumped simple IEUs which offer the same functionality each. As such, a single thread is not likely to continually use both of them at once because of data dependencies and such. But since they both offer the same functionality, they can much more easily be shared among similar threads.
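The pairing argument in the two paragraphs above can be sketched with a tiny per-unit-type model. Everything here is an assumption for illustration: the unit counts, the per-cycle demand of each thread, and the function name `throughput` are hypothetical and not taken from the G4 or P4 documentation.

```python
# Toy model: a core has some number of execution units per type, and each
# thread demands a fixed number of units of each type per cycle.
# Unit counts and demands are hypothetical, for illustration only.

def throughput(units, demands):
    """Instructions issued per cycle: per unit type, demand capped by unit count."""
    total = 0
    for kind, count in units.items():
        wanted = sum(d.get(kind, 0) for d in demands)
        total += min(count, wanted)
    return total

# "G4-like" case: one integer-heavy thread already fills both integer units.
g4_like = {"int": 2, "fp": 1, "vec": 1}
int_thread = {"int": 2}
fp_thread = {"fp": 1}

alone = throughput(g4_like, [int_thread])
two_int = throughput(g4_like, [int_thread, int_thread])
mixed = throughput(g4_like, [int_thread, fp_thread])

# "P4-like" case: two identical simple integer units, but dependencies let
# a single thread use only one of them per cycle.
p4_like = {"int": 2}
dep_thread = {"int": 1}

dep_alone = throughput(p4_like, [dep_thread])
dep_pair = throughput(p4_like, [dep_thread, dep_thread])

print(alone, two_int, mixed)   # integer-only pairing gains nothing here
print(dep_alone, dep_pair)     # similar threads can share identical units
```

In this sketch two integer-heavy threads gain nothing over one when a single thread already saturates the integer units, while two dependency-limited threads can each take one of two identical units.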

[quote]I don't see how it should affect long pipelines more than short pipelines, cause it won't help with branch prediction and won't prevent flushes and such.<hr></blockquote>

That quote ("memory stalls, branch misprediction, instruction dependencies etc inherent in single thread execution") was from what Mr. DeMone posted over at the AT forums. Also, if one instruction in the pipeline stalls, it might be possible to have an instruction from another thread (and independent of the first one) "overtake" it. Likewise, the instructions from other threads wouldn't necessarily have to be flushed after a branch misprediction, again because they are not affected by what the first thread does or where it branches.

Bye,

RazzFazz

razzfazz · March 4, 2002 4:28PM

[quote]Originally posted by amyklai:

The EV8 has a lot of parallel pipes (all modern microprocessors have multiple pipelines, and the Alpha is probably one of the widest designs)

<hr></blockquote>

Ignoring multi-cores, current processors only have one main pipeline for fetching, decoding and finally completing instructions. They do however have multiple "sub-pipelines" for the execution stage. This can be seen <a href="http://arstechnica.com/cpu/01q2/p4andg4e/figure9b.jpg"; target="_blank">here</a> and especially <a href="http://arstechnica.com/cpu/01q2/p4andg4e/g4e_anim1.gif"; target="_blank">here</a>.

[quote]That's where SMT comes into play, because you can only extract so much parallelism from one thread (this is exactly the problem that Intel tried to solve with EPIC, btw).<hr></blockquote>

How does EPIC change how much parallelism you can get out of a thread? From what I understand, it's all about moving instruction scheduling to the compiler in order to be able to save the silicon for doing it (plus out-of-order-execution etc.) in hardware.

[quote]For a super-wide architecture, SMT can result in huge speed gains, because without it, there's a good chance that a lot of the pipelines will be instruction-starved most of the time.<hr></blockquote>

Well, again, this assumes that the two threads' instructions would go to different execution pipelines in the first place.

Bye,

RazzFazz

amyklai · March 5, 2002 2:34AM

- Well, you'll be surprised: even at the decoding stage, x86 procs can work on multiple instructions at the same time (PII had two iirc). It's not like parallel execution only exists at certain pipeline stages.

- The whole idea of EPIC (Explicitly Parallel Instruction Computing, IIRC) is to extract parallelism at the compiler stage and group the instructions accordingly. Since there is more time available for code analysis at compile time, Intel thought that this would be the way to extract the most parallelism and save some silicon at the same time, because the chip doesn't have to extract parallelism at run time.

We know that things didn't work quite as well as Intel expected, but the idea was to improve IPC by extracting lots of parallelism at the compiler stage.

- Yes, with SMT, instructions of different threads will go to different execution units at the same time (instead of only executing instructions from 1 thread at a time).

[ 03-05-2002: Message edited by: amyklai ]
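To make the "group the instructions at compile time" idea concrete, here is a minimal sketch of a greedy, in-order grouping pass. It is a toy under loud assumptions: the `(dest, sources)` instruction format, the group size of 3, and the function name `group_independent` are all hypothetical, it only checks read-after-write and write-after-write conflicts, and it is not the real IA-64 bundling rules (a real compiler would also reorder instructions).

```python
# Toy "compile-time" grouping: walk the instructions in program order and
# start a new group whenever the next instruction depends on a register
# written earlier in the current group, or the group is full.
# Instruction format (dest, sources) and max_group are hypothetical.

def group_independent(instrs, max_group=3):
    groups, current, written = [], [], set()
    for dest, srcs in instrs:
        # Conflict: reads or rewrites a register produced in this group.
        conflict = dest in written or any(s in written for s in srcs)
        if conflict or len(current) == max_group:
            groups.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),   # needs r1 and r4, so it starts a new group
    ("r8", ("r9",)),
    ("r10", ("r7",)),       # needs r7, so it starts another group
]

groups = group_independent(program)
group_sizes = [len(g) for g in groups]
print(group_sizes)
```

The point of the sketch is only that the dependency analysis happens before the program runs, so the hardware can issue each group without having to rediscover which instructions are independent.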

nicsta · March 5, 2002 5:43AM

[quote]Originally posted by RazzFazz:



Well, guess I shouldn't have said "narrow", but it's certainly true for "deep". As Mr. DeMone says, the situations where SMT is useful are caused by "memory stalls, branch misprediction, instruction dependencies etc inherent in single thread execution", which have much more impact (and as such present much more opportunity to benefit from SMT) in deeply pipelined architectures, whereas architectures with shorter pipelines can keep their execution units occupied much better (and SMT only works if you have EUs that are not in use in the first place).

Also, he claims that on the EV8, the speedup would have been "100% or more" - I wonder how this would be possible, given that even with SMT, there's still just a single pipeline that can only fetch and retire a given number of instructions per clock. Is he assuming the EV8 couldn't possibly keep half of its EUs busy without SMT?

Bye,

RazzFazz

BTW: Who is that Paul DeMone? Anyone I should know?<hr></blockquote>

Not really, Paul just knows his way around semiconductors, he writes about them <a href="http://www.realworldtech.com/listing.cfm?section=columns&subject=insider"; target="_blank">here.</a>

I think this particular <a href="http://www.realworldtech.com/page.cfm?section=columns&AID=RWT121300000000&p=3"; target="_blank">link</a> might answer some of your questions.
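The "100% or more" question quoted above can be framed as a simple bound. This is a back-of-the-envelope sketch, not something from Paul's article; the symbols are assumptions introduced here: W is the maximum number of instructions the single front end can fetch and retire per clock, IPC_1 is the sustained throughput of one thread on its own, and IPC_SMT is the combined throughput with SMT.

$$
S = \frac{\mathrm{IPC}_{\mathrm{SMT}}}{\mathrm{IPC}_{1}} \le \frac{W}{\mathrm{IPC}_{1}},
\qquad
S \ge 2 \;\Rightarrow\; \mathrm{IPC}_{1} \le \frac{W}{2}
$$

So a throughput speedup of 100% or more is only possible if a single thread sustains at most half of the machine's peak width, which is exactly the assumption being questioned in the quote.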

grad student · March 5, 2002 10:44AM

"grad student"?

oh right, like in Good Will Hunting!!!

more like as in USC bitch... you don't know me, you don't know where I am from...

stoo · March 6, 2002 6:58PM

[quote]Two 1 GHz G4s almost eat up as much power as a single 2.2 GHz P4.<hr></blockquote>

Out of interest, is that with or without the L3 cache?

altivec_2.0 · March 6, 2002 7:23PM

It's with the L3 cache. It's amazing that twin 1GHz G4 processors eat up as much power as a 2.2GHz Pentium 4. The Pentium uses more power because of the size of the chip.

G4

_________________________

Pentium

____________________________________

Sizes for both processors are to scale.
