x86 vs. PPC


Comments

  • Reply 21 of 36
    overtoasty Posts: 439, member
    Quote:

    Originally posted by rickag

    First off, remind me not to argue with you; your reasoning power is impressive.



    I do have a question, though. I have been told that Windows XP is based on the NT operating system, which at one time did run on PPC. Is this true, somewhat true, or complete hogwash?




    Garsh (blush) ... thanks!



    Last I heard, I think the story about RISC and Windows went something like this (I'm sure I won't get this entirely right, but the gist will have to do) ...



    One of the bigger reasons why Microsoft originally bothered with NT way back when was that they were afraid RISC would really take off (this was back in the very early '90s) and that the x86 chips they depended on would not be able to keep up.



    So essentially, NT was a kind of insurance policy ... I know that NT ran on RISC as far as 3.51, perhaps it even went as far as 4 ... but eventually, it was just dropped. Likely because the incredible code cracking that Intel did made the speed differential between RISC chips and x86 moot for most of the purposes the NT market was going after (and also because - so I hear - NT on RISC was MEGA-slow).



    Anybody else know anything about this?
  • Reply 22 of 36
    rickag Posts: 1,626, member
    More questions:



    CISC instructions must be translated into µops, correct? The CPU then executes these µops. Wouldn't these µops then need to be reassembled and paired up with the original CISC instruction to retire it?



    If this is the case, then the original CISC instruction must be stored somewhere, no? The G5 has what, 216 in-flight instructions; how many in-flight instructions do the latest, greatest Intel Pentium 4 CPUs have?



    If Intel is moving toward more parallelism with instructions, for example hyperthreading, might the need for storing and correctly matching retired, reassembled µops become an issue? More real estate (transistors) dedicated to the translation of more and more in-flight instructions? After all, the G5 has 58 million transistors and the 3.06GHz Pentium 4 has ??? over 100 million?



    Maybe this even affected Intel's implementation of SIMD. A dedicated SIMD unit like AltiVec, retiring 3 instructions/cycle in addition to other instructions, might have cost too much real estate in the CISC environment?



    Then again, I really am way out of my league here, and may have completely screwed this up.
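    rickag's retirement question can be made concrete with a toy model (a hypothetical sketch for illustration, not how any real Intel or IBM core is built): each decoded CISC instruction gets an entry in a reorder buffer, its µops carry a tag back to that entry, and the entry retires only once all of its µops have completed, oldest first.

    ```python
    # Hypothetical sketch of CISC-instruction retirement via a reorder buffer.
    # Not a model of any real CPU; it just illustrates the bookkeeping.

    class ReorderBuffer:
        def __init__(self):
            self.entries = []  # in program order: [instr_name, uops_left]

        def dispatch(self, instr_name, num_uops):
            """Decode one CISC instruction into num_uops tagged micro-ops."""
            self.entries.append([instr_name, num_uops])

        def complete_uop(self, instr_name):
            """Mark one micro-op of the named instruction as executed."""
            for entry in self.entries:
                if entry[0] == instr_name and entry[1] > 0:
                    entry[1] -= 1
                    return

        def retire(self):
            """Retire finished instructions, oldest first, in program order."""
            retired = []
            while self.entries and self.entries[0][1] == 0:
                retired.append(self.entries.pop(0)[0])
            return retired

    rob = ReorderBuffer()
    rob.dispatch("add [mem], eax", 3)   # e.g. load + add + store
    rob.dispatch("inc ebx", 1)
    rob.complete_uop("inc ebx")         # younger instruction finishes first...
    print(rob.retire())                 # ...but cannot retire past the older one: []
    for _ in range(3):
        rob.complete_uop("add [mem], eax")
    print(rob.retire())                 # now both retire, in program order
    ```

    The point of the bookkeeping is exactly what rickag guesses: the original instruction's identity has to live somewhere (here, the reorder buffer entry) until its last µop completes.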
  • Reply 23 of 36
    tht Posts: 3,309, member
    I think you guys are making a little too much of the CISC vs. RISC thing. For all intents and purposes, instruction set architecture has become fairly irrelevant to CPU design. It plays a part (it used to be a huge part), but nowadays the ISA maybe directly affects 10% of the CPU transistor budget, and increasingly less as time goes on.



    The single biggest factor in the x86 ISA is the stack based FPU, which seemingly hasn't caused AMD or Intel much trouble to overcome.



    Quote:

    Originally posted by OverToasty

    Moore's - instead of using all those transistors for code cracking, they could be better used for actual processing elsewhere, thus speeding things up.



    Amdahl's - let's assume we had two infinitely fast processors, one with the fastest code cracking front end that intel could come up with, and one that just accepted instructions directly ... the one without the code cracking front end would still be infinitely faster.




    If the architecture is pipelined, the additional stages for turning instructions into micro-ops have virtually zero penalty.



    Quote:

    The problem with this way of looking at things is it doesn't do much when the incoming code is close to as good as (or perhaps even better than) what any cracker could have come up with anyway ... then there's no escaping Moore's (original) law ... and any transistors not needed for cracking can be better put to use elsewhere actually processing.



    If the number of transistors dedicated to "cracking" is small, less than 10%, it won't affect the cost of a chip much. If you look at the number of transistors dedicated to logic in today's 50+ million transistor CPUs, you'll find that it's around 15 to 20 million. The rest is for cache. Tomorrow's CPUs will push the logic-to-cache transistor ratio lower and lower, thus making the cost of cracking lower and lower. And as before, pipelining eliminates much of the performance penalty induced by the cracking stages.
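    THT's two claims here can be sanity-checked with back-of-the-envelope arithmetic. The pipeline depths below are made-up illustrative numbers, and the cache figure assumes a standard 6-transistor SRAM cell, so treat this as a sketch rather than chip specs:

    ```python
    # Back-of-the-envelope numbers for the two claims above. Illustrative only.

    # 1) Pipelining: n instructions through a d-stage pipeline take roughly
    #    d + n - 1 cycles, so extra decode ("cracking") stages add only latency.
    def cycles(depth, n_instructions):
        return depth + n_instructions - 1

    without_crack = cycles(10, 1_000_000)   # hypothetical 10-stage pipe
    with_crack    = cycles(12, 1_000_000)   # same pipe plus 2 decode stages
    overhead = (with_crack - without_crack) / without_crack
    print(f"overhead from 2 extra stages: {overhead:.6%}")  # vanishingly small

    # 2) Cache dominates the budget: 6 transistors per SRAM bit.
    cache_bytes = 512 * 1024                  # e.g. a 512 KB L2
    cache_transistors = cache_bytes * 8 * 6   # about 25 million transistors
    print(f"cache alone: {cache_transistors / 1e6:.1f}M transistors")
    ```

    On those assumptions a half-megabyte cache by itself eats roughly 25 million transistors, which is why the logic (and the decoder inside it) is a shrinking slice of the total.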
  • Reply 24 of 36
    tht Posts: 3,309, member
    Quote:

    Originally posted by rickag

    If Intel is moving toward more parallelism with instructions, for example hyperthreading, might the need for storing and correctly matching retired reassembled µops become an issue?



    Can't answer the other questions, but for this one, no.



    Quote:

    More real estate (transistors) dedicated to the translation of more and more in-flight instructions? After all, the G5 has 58 million transistors and the 3.06GHz Pentium 4 has ??? over 100 million?



    The P4 has around 60 to 70 million, most of them in cache, as with the 970 and 7457.



    Quote:

    Maybe this even affected Intel's implementation of SIMD. A dedicated SIMD unit like AltiVec, retiring 3 instructions/cycle in addition to other instructions, might have cost too much real estate in the CISC environment?



    As I recall, the original MMX/SSE instructions used the ALU/FPU for backward compatibility reasons (for software). Not sure why the P4 is designed the way it is. Maybe Intel just doesn't like dedicated units. As for wasting real estate, I don't think that's the problem either, even though the P4 was huge at 180 nm.
  • Reply 25 of 36
    programmer Posts: 3,409, member
    Quote:

    Originally posted by THT

    I think you guys are making a little too much of the CISC vs. RISC thing. For all intents and purposes, instruction set architecture has become fairly irrelevant to CPU design. It plays a part (it used to be a huge part), but nowadays the ISA maybe directly affects 10% of the CPU transistor budget, and increasingly less as time goes on.



    Agreed -- it's not a huge deal. Given exactly the same process technology and design goals, a PowerPC would come out better... but not by much, and in practice we'll never see this really happen because IBM, Intel, and AMD have different processes (well, AMD is going to use IBM's now, but Intel has its own), their own design methodologies, their own techniques & technologies, and their own goals. These other factors will always overshadow the ISA differences, at least at this point. Back in the mid-'90s it was a big deal, but that is history.



    If you ignore the caches I expect the ISA decoder takes an approximately constant percentage of the transistor budget because it needs to be improved to feed a better core. The processors are shrinking relative to the cache sizes, however, so overall you are correct.
  • Reply 26 of 36
    overtoasty Posts: 439, member
    Quote:

    Originally posted by Programmer Earlier

    they aren't saddled with the baggage of decoding a stupid, backward, hacked together instruction set.



    Quote:

    Originally posted by Programmer Later

    Agreed -- it's not a huge deal.



    Okie Dokie ...



    ... but if the ISA's no biggie, why the earlier vitriol, Mr. P?



  • Reply 27 of 36
    apeiros Posts: 16, member
    Sorry if this was asked before, but what exactly does "in-flight instruction" mean - what is it?
  • Reply 28 of 36
    snoopy Posts: 1,901, member
    Quote:

    Originally posted by OverToasty





    . . . One of the bigger reasons why Microsoft originally bothered with NT way-back-when, was because they where afraid that RISC would really take off (this is back in the very early 90's) and that their x86 chips would not be able to keep up. . .







    It is my impression that Windows NT was developed to get away from DOS. Versions up to at least Windows 98 ran on top of DOS. Microsoft hired someone from DEC (I think) to work on it. He had worked on the VMS operating system and was not happy with the company. He called it Windows NT for the initials WNT. By going to the next letter in the alphabet, VMS becomes WNT, just as HAL becomes IBM. Just a bit of trivia.
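    The letter-shift trivia is easy to verify:

    ```python
    # Shift each letter one place forward in the alphabet.
    def shift_one(s):
        return "".join(chr(ord(c) + 1) for c in s)

    print(shift_one("VMS"))  # WNT
    print(shift_one("HAL"))  # IBM
    ```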
  • Reply 29 of 36
    powerdoc Posts: 8,123, member
    The 10% of transistors needed in the front end of an x86 chip is not a very big deal, but the SIMD implementation of the x86 chips is a real one.



    As someone mentioned, the MMX unit shares transistors with the FP unit, which imposes a steep performance penalty when switching from MMX mode to FP mode. If my memory is correct, SSE2 doesn't have this problem.

    But the problem with x86 is MMX, MMX2, 3DNow!, SSE, SSE2. What a nightmare for programmers. By contrast, AltiVec is a single set of 162 instructions operating on 128 bits, offered in nearly all Macs (with the exception of the iBook), and it gives a good bump in multimedia performance.



    Both Intel and AMD should start their SIMD units from scratch, make one like AltiVec, and send MMX and company to hell. Luckily for Mac supporters, there is only a tiny chance that this will occur. MMX was at the beginning essentially a marketing trick from Intel. AMD replied with its 3DNow! instructions, then there were SSE, SSE2..., all marketing tricks to gain an advantage. The result: no clear standard (even if the majority of the stuff is from Intel), and no unity. I am ready to bet that it won't change until the end of the x86 era.
  • Reply 30 of 36
    programmer Posts: 3,409, member
    Quote:

    Originally posted by OverToasty

    ... but if the ISA's no biggie, why the earlier Vitriol Mr P.?





    I'm not sure what you mean -- I suspect you are reading more vitriol into it than I am putting into my writing. At least when it comes to performance...



    The x86 ISA is a hacked and awkwardly designed (if you can even use the word "designed") mess. It galls me that people refer to things like Apple's DDR solution as a "hack" and yet accept the x86 as it is. I much prefer programming on the PowerPC as compared to the x86, especially if I have to do something in floating point or SIMD. It is far more elegant, orthogonal, predictable and understandable, so the human side of the design is far and away superior to the swamp that is the x86 ISA. And AMD's 64-bit extensions (not to mention the 4+ SIMD extensions!!!) just make a bad situation worse, whereas the PowerPC ISA was designed as 64-bit with a 32-bit subset from day one, and has been extended once to add SIMD in an elegant and powerful way. But this is something only a programmer cares about... the end user really just cares about price/performance and compatibility (the latter has always been the pillar of x86's strength).
  • Reply 31 of 36
    zapchud Posts: 844, member
    Quote:

    Originally posted by Powerdoc



    As someone mentioned, the MMX unit shares transistors with the FP unit, which imposes a steep performance penalty when switching from MMX mode to FP mode. If my memory is correct, SSE2 doesn't have this problem.




    SSE2 indeed has these problems. It's shared with the FPUs on the P4, and yes, there's a penalty, especially if you want to optimize just parts of crucial code for SSE2, let the rest of it be FPU, and optimize that, to ensure the code runs well on more than just the P4. Then the P4 cannot execute the code in parallel, and some of the point of SIMD is lost, IMO.
  • Reply 32 of 36
    rickag Posts: 1,626, member
    THT and Programmer



    Thanks for the information and, drat, I thought maybe there was more to the translation to µops than there apparently is. Especially concerning



    Quote:

    Quote:

    Originally posted by rickag

    If Intel is moving toward more parallelism with instructions, for example hyperthreading, might the need for storing and correctly matching retired reassembled µops become an issue?



    THT: Can't answer the other questions, but for this one no.



  • Reply 33 of 36
    overtoasty Posts: 439, member
    Quote:

    Originally posted by Programmer





    The x86 ISA is a hacked and awkwardly designed (if you can even use the word "designed") mess. It galls me that people refer to things like Apple's DDR solution as a "hack" and yet accept the x86 as it is. I much prefer programming on the PowerPC as compared to the x86, especially if I have to do something in floating point or SIMD. It is far more elegant, orthogonal, predictable and understandable, so the human side of the design is far and away superior to the swamp that is the x86 ISA.




    ... gotcha (I think) ... so basically, all things being equal, the cracking doesn't make all that much difference; the problem is, all other things aren't equal (as usual), and trying to write powerful, efficient code on x86 is like trying to unscramble a dog's breakfast ... once you've managed this messy feat, it all works pretty much just as peachy as it would using the PPC ISA (though perhaps AltiVec excepted?).



    This perhaps goes a long way to explaining the superiority of the G5 in Photoshop bake-offs ... while the existence of AltiVec might explain many of the speed advantages, the other half of the coin is that the PPC is just one hell of a lot easier to optimise for, so coders can put in the time speeding things up once they get them working - whereas, with the x86, just getting things working takes most of the time?



    Is this about right?

  • Reply 34 of 36
    programmer Posts: 3,409, member
    Quote:

    Originally posted by OverToasty

    ... gotcha (I think) ... so basically, all things being equal, the cracking doesn't make all that much difference; the problem is, all other things aren't equal (as usual), and trying to write powerful, efficient code on x86 is like trying to unscramble a dog's breakfast ... once you've managed this messy feat, it all works pretty much just as peachy as it would using the PPC ISA (though perhaps AltiVec excepted?).





    Close, although because of the OoOE (out-of-order execution) capabilities in the latest chips, figuring out the best instruction order is harder and less important. What I'm talking about is more just how hard it is to write & read code, which is something usually overlooked. These days writing code directly in the ISA is rare, but while debugging and tracking down problems programmers often have to read & understand it.



    AltiVec is superior for a variety of reasons, and a really big one is the extensions to C/C++ that AIM defined. This was possible because they did a really good job of designing the instruction set. The mess that is MMX, MMX2, SSE, SSE2, 3DNow, etc. is such a nightmare that the C extensions couldn't really be defined in a reasonable way (I think there are some compiler intrinsics now, but they suck next to AltiVec).



    Quote:



    This perhaps goes a long way to explaining the superiority of the G5 in Photoshop bake-offs ... while the existence of AltiVec might explain many of the speed advantages, the other half of the coin is that the PPC is just one hell of a lot easier to optimise for, so coders can put in the time speeding things up once they get them working - whereas, with the x86, just getting things working takes most of the time?



    Is this about right?




    For performance measurement I would disagree, actually. GCC's PPC code generation still isn't as good as its x86 code generation, and I doubt most of the applications tested have been intensely G5 optimized yet (Photoshop may be an exception since the guys at Adobe often get early access to new hardware and do a decent job of optimization).



    No, IMO the G5 does well on the application benchmarks (and SPECmarks when done on a level playing field as Apple had them done) because it is really damn fast.



    The fun will really start when developers get their hands on the new hardware and Apple's new performance tools, and then take the time to make the code fast on the Mac. This has been a huge problem for Apple -- due to the small size of the market many developers just don't spend the time optimizing their Mac version.
  • Reply 35 of 36
    Quote:

    These days writing code directly in the ISA is rare, but while debugging and tracking down problems programmers often have to read & understand it.



    Just to expand on that a little: it's not that x86 code is hard to read and understand on a superficial level; the challenges for me are in the details of each instruction. Oftentimes, data you use will have to be in a specific place, so you have to shuffle data around needlessly in your program. When I write assembly code, I have to have a textbook right next to me.



    Code:


    mov ax, cs
    mov ds, ax
    mov es, ax







    This code copies the contents of the cs register into the ax register, then copies the contents of ax into ds and es. Why not just copy from cs into ds and es? Well, I can't copy from cs directly to ds and es; I have to copy to ax first, and then copy from ax to ds and es. I can't remember if I can copy from cs to bx, cx, dx, si or di, so I only copy to ax.



    Oh yeah, and there's also the sheer number of instructions available on these things, oftentimes doing variations of the same thing or 'simplifying' several instructions down to one big complicated one. Weird stuff like repne scasd and xlat.
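    For readers who haven't met those two, here is a rough Python model of the work each packs into a single instruction (simplified: real x86 also sets flags, works through segment registers, and steps EDI by four bytes per dword):

    ```python
    # Rough models of two "one big complicated instruction" examples.
    # Simplified for illustration; real x86 also updates flags, etc.

    def repne_scasd(eax, memory, edi, ecx):
        """repne scasd: scan dwords at memory[edi...] until one equals eax
        or the ecx counter runs out."""
        while ecx != 0:
            match = (memory[edi] == eax)
            edi += 1          # real hardware advances EDI by 4 bytes
            ecx -= 1
            if match:
                break
        return edi, ecx

    def xlat(al, ebx, memory):
        """xlat: table lookup, AL = memory[EBX + AL]."""
        return memory[ebx + al]

    mem = [7, 9, 42, 3]
    print(repne_scasd(42, mem, 0, 4))   # stops just past the match: (3, 1)
    print(xlat(1, 2, [0, 0, 10, 20]))   # memory[2 + 1] -> 20
    ```

    Each of those loops or lookups is what a RISC ISA would spell out as several simple instructions; the CISC decoder has to turn the one opcode back into that sequence of µops.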
  • Reply 36 of 36
    smalm Posts: 657, member
    Quote:

    Originally posted by snoopy

    It is my impression that Windows NT was developed to get away from DOS. Versions up to at least Windows 98 ran on top of DOS. Microsoft hired someone from DEC (I think) to work on it. He had worked on the VMS operating system and was not happy with the company. He called it Windows NT for the initials WNT. By going to the next letter in the alphabet, VMS becomes WNT, just as HAL becomes IBM. Just a bit of trivia.



    MS hired six guys from DEC to do the job. The OS was originally built for the i860 (which had the code name N-10). They needed five years to get WinNT out the door and another three years to build a usable GUI (ever had to use WinNT 3.5?).



    But your explanation for WNT is more fun