Dumb question Re: Pipelines & the "Myth"
Not so long ago, Apple was spouting about the "Megahertz Myth" and how PPC processors had (forgive my ignorance) a shorter pipeline (?) and could get more things done per cycle... etc. At least that was how I understood it.
Now... with a move to Intel, and rumors of a modified Pentium known as the Pentium M that runs on less power... will it also reflect a shorter pipeline...? Or is Apple going to concede some truth to the idea that "SOMETIMES", a faster cycle speed IS faster...?
Just curious about that end.
Since I am not too savvy on anything hardware related, it was easy for me to "buy into" the "MHz-Myth" story... just curious if (A) it was true to begin with, and if so (B) what is different now...? Why is it acceptable to have a longer pipeline now...?
(Not trying to be inflammatory... I really am curious!)
Comments
Keep in mind that Intel is scrapping the whole Pentium 4 NetBurst architecture in favor of the Pentium M. Next year will see the last of the Pentium 4 CPUs on the desktop and the introduction of its replacement in the form of a revamped Pentium M tweaked for desktop use.
Again, I'm not sure about the numbers, but I think the Pentium M is close to a G5 in terms of stages.
What's kinda interesting about all this is that the M is sorta-kinda an application of PPC principles to the Pentium architecture. Shorter pipelines, higher IPC, emphasis on power consumption, etc. The direction of the G3/G4 was the right one, long-term; much more so than the PIV. But I guess Mot/IBM just didn't have the resources to keep delivering.
The revamped G4+ core upped it to 7 stages.
The G5 has over 18, I believe, and maybe as many as 25. They don't list it definitively on the spec sheets for the G5.
Intel is in fact moving away from highly clocked, hot-running parts to a more efficient architecture based on shorter pipes. Also note that because of shorter pipes it becomes much harder to add threading capabilities like Hyper-Threading. Thus it'll be a while before we are likely to see a Hyper-Threading Banias core.
No problem to me... just add another actual core or two and I'll be happy.
Originally posted by Scott Finlayson
Not so long ago, Apple was spouting about the "Megahertz Myth" and how PPC processors had (forgive my ignorance) a shorter pipeline (?) and could get more things done per cycle... etc. At least that was how I understood it.
This is still true. It is even true between IBM PowerPC architectures! (Let alone between IBM and Motorola PowerPC processors.)
Now... with a move to Intel, and rumors of a modified Pentium known as the Pentium M that runs on less power... will it also reflect a shorter pipeline...? Or is Apple going to concede some truth to the idea that "SOMETIMES", a faster cycle speed IS faster...?
The Pentium M has existed for over 2 years now. It's not a rumor. Banias was the 1st gen chip, fabbed at 130 nm. Dothan is the 2nd generation 90 nm chip in current laptops. Yonah will be a 3rd generation 65 nm chip.
Merom is a new microarchitecture, likely not based on the Banias microarchitecture, but it follows many of the same design tradeoffs. It's 64-bit, supposedly 4-issue wide, and optimised for low power. If anything, it was the Merom architecture that convinced Apple to switch to Intel.
If Intel wants to stick with "Pentium" branding, Merom will likely be called a Pentium M as well, even though it will be very different.
Going into the future, Apple is likely only going to use Yonah and Merom processors for laptops and Conroe (a desktop version of Merom) for desktops. Since these processors are, and will remain, clocked lower than today's Pentium 4s, Apple will concede nothing and simply stop talking about the MHz-Myth, because the P4 is an Intel processor.
The answer to your question is complex and will involve a many-page thread.
Since I am not too savvy on anything hardware related, it was easy for me to "buy into" the "MHz-Myth" story... just curious if (A) it was true to begin with, and if so (B) what is different now...? Why is it acceptable to have a longer pipeline now...?
It's true, but it's much more complicated. The MHz-Myth refers to the fact that clock speed is not the sole indicator of the performance of a CPU. It is merely one of many.
The PPC 970 and the PPE core in Cell and Xenon are a good pair of examples for illustration. Performance can be gained in a variety of ways. One way is by increasing the clock rate. Another is by making the processor compute more instructions at the same clock rate. In order to execute more instructions per cycle, though, the processor has to have more complexity. But the more complex a processor is, the harder it is to clock it higher.
The PPE core in Cell and Xenon follows a set of design tradeoffs to get performance by simplifying the design, but clocking it very high. It has 1 integer execution unit (among other things), but has a very deep pipeline allowing it to clock in a range from 3 to 4.5 GHz on a 90 nm fab.
The PPC 970 follows a set of design tradeoffs to get performance by making the processor complex, but as a result it can't be clocked very high. It has 2 integer execution units (among other things), but a middling-depth pipeline, allowing it to clock in a range from 1.6 to 2.7 GHz on a 90 nm fab.
So a 4 GHz PPE core can only execute 1 integer instruction per clock cycle, but can execute 4 billion of them a second. A 2 GHz PPC 970 can execute 2 integer instructions per clock cycle, but can only execute them at a rate of 2 billion a second. In the end, for this very simplified example, a 4 GHz PPE and a 2 GHz 970 have the same performance, even though one is clocked twice as high as the other.
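Here's that arithmetic as a trivial sketch (the IPC and clock numbers are the simplified ones from the example above, not measured figures):

# Toy throughput model: instructions/second = clock rate x instructions per cycle.
def integer_throughput(clock_hz, int_ipc):
    # peak integer instructions per second in this simplified model
    return clock_hz * int_ipc

ppe = integer_throughput(4e9, 1)  # 4 GHz, 1 integer unit
g5 = integer_throughput(2e9, 2)   # 2 GHz, 2 integer units

print(f"PPE: {ppe:.1e} integer instructions/s")  # 4.0e+09
print(f"G5:  {g5:.1e} integer instructions/s")   # 4.0e+09 -- same peak rate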
The nuance of the MHz-Myth between the Pentium 4 and MPC 74xx (G4 processors) is a tradeoff involving CPU pipeline depth and branch prediction. One factor among many in performance is the ability for a processor to be executing an instruction every clock cycle. A pipeline in a processor is like an assembly line. One stage of the assembly line does one task and passes it off to the next stage of assembly line to do its task. The basic stages of a CPU pipeline can be boiled down to 1) fetching an instruction, 2) dispatching an instruction to the execution unit, 3) executing the instruction, and 4) sending the results back.
Remember, the more complex the circuitry, the more difficult it is to clock. What can be done to increase the clock rate, and thereby performance, is to split those stages into many smaller ones. How many stages is a design tradeoff; I made a list below. By splitting the stages into smaller ones, each stage's circuitry becomes less complex, making it easier to clock at higher rates. Voila, higher clock rate means more performance! Not precisely.
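To see why more stages allow a higher clock, here's a toy model (the delay numbers are made up purely for illustration; real timing budgets are far more complicated):

# Toy pipelining model: the total logic delay is divided across N stages,
# plus a fixed latch/register overhead added per stage.
LOGIC_DELAY_NS = 10.0    # hypothetical total combinational delay of the work
LATCH_OVERHEAD_NS = 0.1  # hypothetical per-stage register overhead

def max_clock_ghz(n_stages):
    cycle_ns = LOGIC_DELAY_NS / n_stages + LATCH_OVERHEAD_NS
    return 1.0 / cycle_ns

for n in (4, 7, 16, 21, 31):
    print(f"{n:2d} stages -> ~{max_clock_ghz(n):.2f} GHz")
# More stages -> less logic per stage -> shorter cycle -> higher clock,
# with diminishing returns as the latch overhead starts to dominate.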
One of the things that prevents a higher clock rate from automatically translating into more performance is the "if" questions in the instruction stream a processor executes. An "if" question is a serially dependent situation. A pipelined processor can have a different instruction at each stage of the pipeline: a 20-stage pipeline can have 20 instructions in transit, one at each stage. The "if" question is a situation where the processor does not know which instruction to execute next, because it has to wait for the result.
A 4-stage pipeline CPU has to wait 4 clock cycles. A 20-stage pipeline has to wait 20 clock cycles. No instructions are being executed during this time. A 1 GHz 4-stage pipeline CPU has to wait 4 nanoseconds before it can start executing another instruction. A 2 GHz 20-stage pipeline CPU has to wait 10 nanoseconds before it can execute another instruction. That's two and a half times as long! A lower-clocked processor can actually perform faster than a higher-clocked processor because of this.
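The same arithmetic as a quick sketch, using the numbers from the paragraph above:

# Stall after an unresolved branch: the pipeline must refill,
# so (stages) / (clock in GHz) = nanoseconds of no useful work.
def refill_time_ns(stages, clock_ghz):
    return stages / clock_ghz  # cycles divided by cycles-per-nanosecond

print(f"1 GHz, 4 stages:  {refill_time_ns(4, 1.0):.0f} ns")   # 4 ns
print(f"2 GHz, 20 stages: {refill_time_ns(20, 2.0):.0f} ns")  # 10 ns
# The chip clocked twice as high waits 2.5x longer in wall-clock time.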
To negate that, designers use branch prediction. It's a scheme that predicts the result of the "if" question and submits the predicted next instruction into the pipeline right after the "if" instruction, keeping the pipeline full so that the processor executes instructions every clock cycle.
As things are in life, branch prediction is not perfect, so the tradeoff remains. A deeply pipelined processor pays a big penalty in time if it has a branch mis-prediction while a shallowly pipelined processor pays a small penalty in time.
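For the curious, here's a minimal sketch of the classic textbook scheme, a 2-bit saturating counter; the real predictors in these chips are far more elaborate:

# Minimal 2-bit saturating counter predictor. States 0-1 predict
# not-taken, states 2-3 predict taken; each outcome nudges the counter.
class TwoBitPredictor:
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:  # a loop branch: taken 9x, then falls through
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(f"{mispredicts} mispredictions out of 10")  # 2: one warming up, one at loop exit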
That pipeline-depth tradeoff was the MHz-Myth Apple was talking about. AMD had an MHz-Myth it needed to dispel versus the Pentium 4 as well, but that was more of a combination of execution-unit differences (like the PowerPC example above) and pipeline-depth penalty differences between the Athlon and the Pentium 4.
Pipeline lengths for various processors (not accurate, but close):
CPU                       Stages (minimum)
------------------------  ----------------
PPC 750 (G3)              4
PPC 604                   6
MPC 7400 (G4)             4
MPC 7450 (G4+)            7
PPC 970 (G5)              16
IBM PPE (Cell, Xenon)     21
IBM POWER4                14
Pentium Pro/II/III        12
Pentium 4 (Northwood)     21
Pentium 4 (Prescott)      31 (?)
Pentium M (Banias)        14 (?)
Intel Merom/Conroe        16 (?)
Itanium                   8
Athlon (K7)               10
Athlon 64 (K8)            12
Thanks for all the replies... I guess it's clear(er) now.
-sf
Originally posted by hmurchison
Also note that because of shorter pipes it becomes much harder to add threading capabilities like Hyper-Threading. Thus it'll be a while before we are likely to see a Hyper-Threading Banias core.
Pipeline depth is not truly an indicator for difficulty in SMT. It's correlative, but not causal.
SMT is a scheme to keep a processor executing, to have as much of the processor's execution resources busy as possible. The difficulty in implementing SMT is directly proportional to the amount of execution resources.
Deeply pipelined processors have "execution resources" in the form of lots of stages (20+) of execution rather than lots of execution units (ALU, FPU, SIMD, L/S). They are "good" candidates because of the nature of a deeply pipelined CPU: branch mispredictions cause lots of bubbles, or empty pipeline stages, and since those stages are empty, they could be used to execute another instruction stream. Hence, the Pentium 4 and IBM PPE at 20+ stages are "good" candidates for extracting thread-level parallelism.
The PPE much more so, because the P4 has out-of-order execution while the PPE does not. OOOE fills more of the pipeline stages and therefore negates the usefulness of SMT. Really good OOOE and really good branch prediction pretty much reduce SMT to trivial performance increases, and this is part of the reason the Pentium 4 isn't a "great" candidate for implementing SMT on.
On the other hand, the PPE is an in-order machine, with unknown branch prediction (BPU) capability. If the BPU isn't very good, SMT on the PPE could result in a dual-processor like speedup: ~60% on average compared to the poor speedups seen for the P4 (<15%) on single-thread to multi-threaded apps.
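A toy utilization model makes the point; the utilization figures below are illustrative guesses I picked to roughly reproduce those speedups, not measured numbers:

# Toy SMT model: one thread keeps the core busy a fraction u of the time;
# a second thread fills the idle slots, but total throughput is capped at 1.
def smt_speedup(u):
    one_thread = u
    two_threads = min(1.0, 2 * u)
    return two_threads / one_thread

print(f"P4-like (good OOOE/BP, u = 0.90): {smt_speedup(0.90):.2f}x")   # ~1.11x
print(f"PPE-like (in-order, u = 0.625):   {smt_speedup(0.625):.2f}x")  # 1.60x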
An example of a CPU that had SMT and a short pipeline? The Alpha 21464 [EV8] processor. Planned but never shipped after Compaq destroyed Alpha, which coincidentally resulted in Alpha CPU engineers going to Intel, and SMT appearing on the P4 a couple of years later. EV8 had an 8-stage pipeline. Only eight. But it was 8-issue wide with 8 ALUs, 4 FPUs, and 4 of the ALUs doubling as LSUs. It could do 16 branch predictions per cycle!
Those were enough execution resources for a 4-way SMT machine (the operating system sees it as a quad-CPU system). It also had a 10 (?) channel Rambus memory controller on-die. An SMT design such as this was the best of all worlds, combining the best of instruction-level parallelism and thread-level parallelism, but as can be seen, it's monstrously complex. Multi-core systems are just simpler and cheaper to implement.
Anyways, it's execution resources, not pipeline depth, that is really the indicator for suitability of SMT.
Originally posted by THT
On the other hand, the PPE is an in-order machine, with unknown branch prediction (BPU) capability. If the BPU isn't very good, SMT on the PPE could result in a dual-processor like speedup: ~60% on average compared to the poor speedups seen for the P4 (<15%) on single-thread to multi-threaded apps.
Branch stalls aren't the biggest opportunity for SMT in the PPE, there are much bigger latencies to be filled.
Originally posted by Programmer
Wow, THT is all over this thread. Nicely done.
Thank you. But lots of grammar problems.
Branch stalls aren't the biggest opportunity for SMT in the PPE, there are much bigger latencies to be filled.
You actually know what the PPE latencies are? The only one I've read about is the 1 cycle penalty for some sort of instruction cache access.
Originally posted by Scott Finlayson
Not so long ago, Apple was spouting about the "Megahertz Myth" and how PPC processors had (forgive my ignorance) a shorter pipeline (?) and could get more things done per cycle... etc. At least that was how I understood it.
Now... with a move to Intel, and rumors of a modified Pentium known as the Pentium M that runs on less power... will it also reflect a shorter pipeline...? Or is Apple going to concede some truth to the idea that "SOMETIMES", a faster cycle speed IS faster...?
Just curious about that end.
Since I am not too savvy on anything hardware related, it was easy for me to "buy into" the "MHz-Myth" story... just curious if (A) it was true to begin with, and if so (B) what is different now...? Why is it acceptable to have a longer pipeline now...?
(Not trying to be inflammatory... I really am curious!)
Whatever Apple uses, Apple will say it's better.
The big difference is that Apple will not have to worry about its machines being slower than the ones of the x86 world.
There will still be competition with AMD (good for Apple, and for customers in general, whether they use AMD or Intel chips), but nothing like what we see now (IBM against Intel and AMD).
Originally posted by THT
You actually know what the PPE latencies are? The only one I've read about is the 1 cycle penalty for some sort of instruction cache access.
Well for starters go and look at a SimG5 profile on some code running on a 970. The biggest stalls, by far, are cache misses. L2 is about 11 cycles, and memory is several hundred. The PPE's clock rate is higher, so it isn't going to be any better, that's for sure.
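A quick average-memory-access-time sketch shows why those misses dominate; the miss rates here are made up purely for illustration, and I'm taking "several hundred" as 300 cycles:

# Average memory access time (AMAT), using the latencies above: L1 hit in
# 1 cycle, L2 in ~11 cycles, memory taken as ~300 cycles (illustrative).
L1_HIT = 1
L2_LATENCY = 11
MEM_LATENCY = 300

def amat_cycles(l1_miss_rate, l2_miss_rate):
    return L1_HIT + l1_miss_rate * (L2_LATENCY + l2_miss_rate * MEM_LATENCY)

print(f"5% L1 misses, 10% of those to memory: {amat_cycles(0.05, 0.10):.2f} cycles")
print(f"5% L1 misses, 50% of those to memory: {amat_cycles(0.05, 0.50):.2f} cycles")
# A single miss all the way to memory (~300 cycles) dwarfs a ~20-cycle
# branch misprediction penalty -- those are the bubbles SMT can fill.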