G5: 64 bits or 32 bits?


Comments

  • Reply 41 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by 123:

    <strong>64bit addressing can wait for now. What we really need is 64bit FP SIMD (altivec)!

    </strong><hr></blockquote>



    I disagree... if you have vector doubles then you're only doing two operations per instruction (unless you double the size of the already huge vector register set and introduce all sorts of compatibility issues with current AltiVec code), but you still need to go through the effort of vectorizing your code. Much better would be to have multiple FPUs so that more than one double FP operation executes per clock cycle, and this happens on scalar and vector code.



    [quote]<strong>

    Actually, this depends a lot on the processor design and calling convention. Most processors have gp registers that are not saved for simple function calls. Then, there are leaf procedures... As for regular function calls, if you have a machine with multiple register windows (SPARC), you don't have to save your registers at all (well, most of them, most of the time). However, I entirely agree that generally stack size will increase, but how much is that?

    </strong><hr></blockquote>



    In multi-threaded object oriented code it can amount to quite a bit of bandwidth. It also has the effect of putting extra pressure on the caches in the system because the stack consumes cachelines that would otherwise be used for other data.



    [quote]<strong>

    ...comments about how reads come in as 64-bit anyhow...

    </strong><hr></blockquote>



    While it is true that individual reads are at least 64 bits (wider from L3 -&gt; L2 -&gt; L1), and they come in bursts of 4, this does not mean that reading a 64-bit integer is the same cost as reading an 8-bit integer. The problem is that you can't consider machine performance one element at a time. If all the machine is doing is reading just the one data element, and then doing nothing else, you are correct that either read is of equal cost. However, if the machine is reading a sequence of these values then the 8-bit reads will find 32 entries in the same cacheline, whereas the 64-bit reads will have to pull in 8 cachelines to get 32 entries. If you look at data structures in the aggregate (especially ones designed for good cache performance) then using larger types has a performance impact if you are memory bound ... and the G4 spends a lot of time waiting for memory.
  • Reply 42 of 126
    razzfazzrazzfazz Posts: 728member
    [quote]Originally posted by powerdoc:

    <strong>

    So how do you explain that IBM chose 64-bit CPUs for its high-end servers with the POWER3 and POWER4 64-bit chips, if the only important thing is the speed of the HD?</strong><hr></blockquote>



    They chose 64 bits for the larger memory address space, not for being able to handle larger volumes or file sizes on disk, as was suggested before.



    Bye,

    RazzFazz
  • Reply 43 of 126
    razzfazzrazzfazz Posts: 728member
    [quote]Originally posted by 123:

    <strong>64bit addressing can wait for now. What we really need is 64bit FP SIMD (altivec)!

    </strong><hr></blockquote>



    Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible.





    [quote]<strong>

    Today, if you read 8bits from memory, you will already read 64 bits, because that's the bus' width. Actually, you'll even read 4x64 (or 8x64 or 8 quads on DDR boards), because:



    - SDRAM bursts are cheap.

    - entire cache blocks are read at once.

    </strong><hr></blockquote>



    Yes, but getting data from a non-aligned address (32-bit ints not on addresses divisible by 4, for example) into a register is still usually more expensive, and sometimes not possible at all.





    [quote]<strong>

    Actually, this depends a lot on the processor design and calling convention. Most processors have gp registers that are not saved for simple function calls.</strong><hr></blockquote>



    While this may be true, he was talking about context switches, not simple function calls.





    [quote]<strong>As for regular function calls, if you have a machine with multiple register windows (SPARC), you don't have to save your registers at all (well, most of them, most of the time). </strong><hr></blockquote>



    Well, yes, but apart from SPARC, what other processor currently in use has register windows at all?





    [quote]<strong>However, I entirely agree that generally stack size will increase, but how much is that?

    </strong><hr></blockquote>



    I think he was more referring to the fact that a larger register file (32x64 GPR + 32x64 FPR instead of 32x32 GPR + 32x64 FPR) causes additional memory bus utilization on each and every context switch.



    Bye,

    RazzFazz
  • Reply 44 of 126
    powerdocpowerdoc Posts: 8,123member
    [quote]Originally posted by RazzFazz:

    <strong>



    They chose 64 bits for larger memory address space, not for being able to handle larger volume or file sizes on disk, as was suggested before.



    Bye,

    RazzFazz</strong><hr></blockquote>

    So if I understand you, RazzFazz, 64-bit CPUs are useless for the Mac, and it would be much better to have one or two more FPU units.

    That's my opinion too, but I am not a specialist like you or the Programmer.
  • Reply 45 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by RazzFazz:

    <strong>

    Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible.

    </strong><hr></blockquote>



    Didn't I just say that?



    [quote]<strong>

    Yes, but getting data from a non-aligned address (ints not on addresses divisible by 32, for example) into a register is still usually more expensive, and sometimes not possible at all.

    </strong><hr></blockquote>



    This usually isn't a factor since the vast majority of all accesses are cached and the size of the fetch from cache is irrelevant (at least on PPC). The important factor is locality of reference, and the effect data size has on that.



    [quote]<strong>

    While this may be true, he was talking about context switches, not simple function calls.

    ...

    I think he was more referring to the fact that a larger register file (32x64 GPR + 32x64 FPR instead of 32x32 GPR + 32x64 FPR) causes additional memory bus utilization on each and every context switch.

    </strong><hr></blockquote>



    Both, actually. Context switches & interrupts don't happen all that often -- on the order of a few hundred per second, typically. This doesn't really amount to a whole lot of bandwidth.



    Function calls, on the other hand, usually do need to save a fair bit of context... many functions are not leaf functions and thus end up needing to save at least about half the registers. Even leaf functions need to save registers if they are doing something register intensive.



    Don't over-estimate the cost I'm pointing out either -- I didn't say how expensive it is to carry around a 64-bit integer register file, I just said it was more expensive than carrying around a 32-bit integer register file. It probably doesn't amount to more than a few percent on most benchmarks. Just like going to DDR RAM doesn't double overall system performance, going to double sized integers doesn't cut performance in half.
  • Reply 46 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by powerdoc:

    <strong>

    So if I understand you, RazzFazz, 64-bit CPUs are useless for the Mac, and it would be much better to have one or two more FPU units.

    That's my opinion too, but I am not a specialist like you or the Programmer.</strong><hr></blockquote>



    I certainly wouldn't go that far! If I had to rank processor features in order of preference, this is the order I'd put them:



    - On-chip DDR memory controller + RapidIO bus.

    - Higher clock rate (~1.6 GHz).

    - More than 1 FPU.

    - Ability to retire more instructions per clock cycle.

    - More integer units (since we can now retire more...).

    - Super high speed interface to graphics chipset.

    - 64-bit

    - Much higher clock rate (~3 GHz). This is last because it requires much longer pipelines, which is a bad thing.





    From a developer's point of view having 64-bit and a capable virtual memory system (like MacOSX's) enables certain useful techniques -- like being able to memory map as many really large files as you want, or using huge sparsely allocated data structures.
  • Reply 47 of 126
    airslufairsluf Posts: 1,861member
  • Reply 48 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by AirSluf:

    <strong>

    Not if you added a second Altivec unit and laid the two logically side-by-side. Then extend the instruction set to handle 4x64 vectors by using both units simultaneously. Whether the real estate is available on the die is a separate matter. But a logical 256-bit Altivec does not have to break the previous implementation.</strong><hr></blockquote>



    Oh there is plenty of real estate available. The scheme you propose is fraught with difficulties, however, and it just feels like a hack. It is difficult enough for Apple to get people coding for AltiVec when there is just one version of it -- adding a second would be impossible to justify. Even Intel has trouble convincing developers to spend the effort to support MMX/SSE/SSE2. Adding more FPUs would just make existing code run faster. Adding another VPU or two would make existing AltiVec code run faster. Of course you'd have to ease the memory bottleneck first to make any of that worthwhile...
  • Reply 49 of 126
    123123 Posts: 278member
    [quote] Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible. <hr></blockquote>



    You could add some additional FPUs or whatever, I don't really care. At the moment, the G4 doesn't have competitive double precision units, and this is very important for a majority of scientific applications. In my opinion, 2x64 is a lot more than 0x64. Vectorization is not a big issue, because scientific calculations already are vectorized for either supercomputers, clusters or P4s (where everything dp is done in SIMD). Besides, I wouldn't exactly say those kinds of programmers are stupid... so I think they could handle AltiVec.



    [quote] However, if the machine is reading a sequence of these values then the 8-bit reads will find 32 entries in the same cacheline, whereas the 64-bit reads will have to pull in 8 cachelines to get 32 entries. If you look at data structures in the aggregate (especially ones designed for good cache performance) then using larger types has a performance impact if you are memory bound ... and the G4 spends a lot of time waiting for memory. <hr></blockquote>



    Correct, if you use larger types. However, I was talking about "it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits", my point being that today's machines already read at least 64 bits at a time.



    Just because the registers are 64 bits wide, that doesn't mean you have to use all the bits; there are still stb, sth, stw and their load equivalents. A halfword has to be aligned (RazzFazz) to 16 bits and a byte to 8 bits. So you can still have your 32 bytes in one 256-bit cache block. This has absolutely nothing to do with register size.





    [quote]

    Well, yes, but apart from SPARC, what other processor currently in use has register windows at all?<hr></blockquote>



    The discussion was becoming more and more one about 64 vs. 32 in general, not just PPC. My point was that a lot of the stated arguments are not generally true but rather design dependent, and I actually think one example is enough to prove that (Tensilica uses them too).





    [quote] While this may be true, he was talking about context switches, not simple function calls.<hr></blockquote>



    He was talking about function calls, and so was I (true, context switches are expensive, but I wasn't addressing them). I also admitted that stack size will increase (being the place where function parameters and registers are usually saved) and, with it, memory transfer. However, as I've said, this is hardware and software dependent; cache design plays an important role, multiple register sets etc. I still don't think it accounts for "a few percent on most benchmarks", I'd say a fraction of one.



    [ 04-03-2002: Message edited by: 123 ]
  • Reply 50 of 126
    airslufairsluf Posts: 1,861member
  • Reply 51 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by AirSluf:

    <strong>

    Very true as things are now. Looks like prime territory for compiler work. There is still tremendous opportunity for speed gains through compiler design to alleviate the programmer from having to tweak everything by hand. Especially automating the vectorizing where it would be advantageous to use it. What I am getting at is much more in depth than the current pathetic compiler auto-vectorizations.

    </strong><hr></blockquote>



    I agree that it would be really nice to have better compilers... but people have been saying that for decades, and for the most part it hasn't come true. Rather than counting on some highly speculative compiler improvements coming to save the day, it would be much more prudent to make the FPU super-scalar, staying fully backward compatible, and avoid complicating the execution model. Compilers frequently get new schedulers, but I'm not holding my breath for them to get a vectorizer. If vectorizing compilers do show up then they can still use the existing AltiVec unit for anything that doesn't require double precision (i.e. most things), and it can then operate in a super-scalar fashion alongside the multiple double precision FPUs.
  • Reply 52 of 126
    powerdocpowerdoc Posts: 8,123member
    [quote]Originally posted by Programmer:

    <strong>



    I certainly wouldn't go that far! If I had to rank processor features in order of preference, this is the order I'd put them:



    - On-chip DDR memory controller + RapidIO bus.

    - Higher clock rate (~1.6 GHz).

    - More than 1 FPU.

    - Ability to retire more instructions per clock cycle.

    - More integer units (since we can now retire more...).

    - Super high speed interface to graphics chipset.

    - 64-bit

    - Much higher clock rate (~3 GHz). This is last because it requires much longer pipelines, which is a bad thing.





    .</strong><hr></blockquote>



    Oops, you are right: DDR RAM, RapidIO, and more MHz are more important than the addition of one FPU unit. I did not write them because I found it evident that these new features will be incorporated in the G5.

    Concerning the ability to retire more instructions per cycle: is this not linked (but not only) to better management of the memory and more integer or FPU units? On the contrary, as you have mentioned, very deep pipelines go the wrong way for this.

    When you speak of a super high speed graphics interface, what do you refer to: the ability to use AGP 8X, or a mobo like the nForce (but still AGP 4X if you do not use the GeForce 2 MX graphics chipset of the mobo)? This last sentence leads me to another question: do you think we might see a new Apple mobo with nForce features, a 128-bit memory bus and a graphics chipset implemented directly on the mobo with the help of nVidia? A 128-bit memory bus doubles the memory data transfer, and an embedded graphics chipset can save some money; high-end users will still have the choice of adding an AGP graphics card.



    [ 04-04-2002: Message edited by: powerdoc ]
  • Reply 53 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by 123:

    <strong>

    I still don't think it accounts for "a few percent on most benchmarks", I'd say a fraction of one.</strong><hr></blockquote>



    This will obviously depend on the benchmark. A lot of modern code is heavily object-oriented and does a great many function calls.



    [quote]<strong>

    Correct, if you use larger types. However, I was talking about "it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits", my point being that today's machines already read at least 64 bits at a time.

    </strong><hr></blockquote>



    Well, if nothing else, you don't have a choice about the size of your pointers -- they will be twice as big. This includes the vtable pointer in every polymorphic C++ object (or equivalent in whatever language you are using).



    [quote]<strong>

    Vectorization is not a big issue, because scientific calculations already are vectorized for either supercomputers, clusters or P4s (where everything dp is done in SIMD). Besides, I wouldn't exactly say those kinds of programmers are stupid... so I think they could handle AltiVec.

    </strong><hr></blockquote>



    Unfortunately these things are rarely standardized and have to be re-coded for each vector unit. The supercomputer implementations, in particular, are often not even SIMD in nature. The P4 implementation is done in assembly. Most programmers, however, maintain a straight-C/C++ double precision implementation that is not SIMD-ified, and if your double precision FPU is the best performer then programmers don't need to do anything other than recompile (see message above for comments on auto-vectorizing compilers). And while good programmers are not stupid, they are "lazy" (at least in terms of not rewriting code that already works). Apple needs to make it trivial to get code onto the Mac platform, and that means minimizing the amount of Mac or PPC specific coding that needs to be done.



    It bears repeating -- splitting the PowerPC execution model is highly undesirable. It means there are more versions to be coded for & debugged, the market is fractured, new development tools are called for, new hardware must wait for code that takes advantage of it, etc. Speeding up the existing execution model is much preferable, and (relatively) easily done. IBM's POWER3/4 machines have almost the same execution model (minus AltiVec), but a better and superscalar FPU which gives them tremendous double precision performance.



    [ 04-04-2002: Message edited by: Programmer ]
  • Reply 54 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by powerdoc:

    <strong>

    Concerning the ability to retire more instructions per cycle: is this not linked (but not only) to better management of the memory and more integer or FPU units? On the contrary, as you have mentioned, very deep pipelines go the wrong way for this.</strong><hr></blockquote>



    Well, instruction retirement is usually broken out as a separate unit in most of the Motorola documentation -- whether this is a conceptual distinction or a physical one I don't know. IIRC the retirement limit doesn't depend on the kind of instructions, however, which implies some kind of a shared mechanism. As far as I know pipeline length and instruction retirement are fairly independent -- retirement is just what happens at the end of the pipeline, although it could be multiple steps itself. I'm definitely flirting with the edge of my knowledge of this subject, however, so I'll shut up now.
  • Reply 55 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by powerdoc:

    <strong>When you speak of a super high speed graphics interface, what do you refer to: the ability to use AGP 8X, or a mobo like the nForce (but still AGP 4X if you do not use the GeForce 2 MX graphics chipset of the mobo)? This last sentence leads me to another question: do you think we might see a new Apple mobo with nForce features, a 128-bit memory bus and a graphics chipset implemented directly on the mobo with the help of nVidia? A 128-bit memory bus doubles the memory data transfer, and an embedded graphics chipset can save some money; high-end users will still have the choice of adding an AGP graphics card.

    </strong><hr></blockquote>



    I think the day of the wide bus is done. nVidia is on the HyperTransport committee and has already implemented it in a couple of places (XBox & nForce). I think ATI has talked about using HT as well. I expect to see AGP either disappear or be relegated to a secondary port (it is only 1-2 GBytes/sec, after all). A graphics chipset on HyperTransport could have as much as 12 GBytes/sec bandwidth to memory and the processor, and do it with a much lower pin count. RapidIO can provide high bandwidths as well, but I think the graphics engine deserves its own channel. A slot for HT is possible, but hasn't been specified yet... although Apple is exactly the sort of company that I'd expect to push that envelope. They might also just build it into the motherboard like the iMac... hopefully providing for multiple options for people who want high end graphics. One added bonus for this kind of a setup might be finally getting rid of the GPU-specific memory, which has reached epic proportions (128 MBytes of RAM just for the graphics engine!!). This would make the graphics memory available to the main system when not needed by the graphics, and system memory available to the graphics when needed. Much more flexible.



    [ 04-04-2002: Message edited by: Programmer ]
  • Reply 56 of 126
    powerdocpowerdoc Posts: 8,123member
    [quote]Originally posted by Programmer:

    <strong>



    I'm definitely flirting with the edge of my knowledge of this subject, however, so I'll shut up now. </strong><hr></blockquote>



    Don't worry, you are already flying one hundred kilometers above my knowledge on that subject. If I had to shut up each time I am flirting with the edge of my knowledge, I would never speak on these forums...
  • Reply 57 of 126
    powerdocpowerdoc Posts: 8,123member
    [quote]Originally posted by Programmer:

    <strong>



    I think the day of the wide bus is done. nVidia is on the HyperTransport committee and has already implemented it in a couple of places (XBox & nForce). I think ATI has talked about using HT as well. I expect to see AGP either disappear or be relegated to a secondary port (it is only 1-2 GBytes/sec, after all). A graphics chipset on HyperTransport could have as much as 12 GBytes/sec bandwidth to memory and the processor, and do it with a much lower pin count. RapidIO can provide high bandwidths as well, but I think the graphics engine deserves its own channel. A slot for HT is possible, but hasn't been specified yet... although Apple is exactly the sort of company that I'd expect to push that envelope. They might also just build it into the motherboard like the iMac... hopefully providing for multiple options for people who want high end graphics. One added bonus for this kind of a setup might be finally getting rid of the GPU-specific memory, which has reached epic proportions (128 MBytes of RAM just for the graphics engine!!). This would make the graphics memory available to the main system when not needed by the graphics, and system memory available to the graphics when needed. Much more flexible.



    [ 04-04-2002: Message edited by: Programmer ]</strong><hr></blockquote>

    As you mentioned, AGP is limited to 1 to 2 GB/s (AGP 8X), but HyperTransport can reach 12 GB/s. I quite understand that direct communication between the G5 and the GPU can reach this speed, but even DDR RAM at PC2100 cannot go beyond 2 GB/s; in this case HyperTransport brings nothing more than AGP 8X on a PC2100-based mobo for memory access.

    A built-in GPU on the mobo can be great, but the main memory (if you do not use memory specific to the GPU, like in the iMac) has to be really fast. The minimum for this is the nForce specification, a 128-bit DDR RAM memory bus at 133 MHz (4 GB/s), but even with that, the GeForce 2 MX graphics of the nForce is just average compared to an AGP GeForce 2 MX graphics card. Without 128-bit DDR RAM this solution would be a regression compared to the present PowerMacs.

    For this reason I think that Apple may choose, as you mentioned, to make a HyperTransport slot with different daughter cards available. It could be an elegant solution.
  • Reply 58 of 126
    programmerprogrammer Posts: 3,503member
    [quote]Originally posted by powerdoc:

    <strong>For this reason I think that Apple may choose, as you mentioned, to make a HyperTransport slot with different daughter cards available. It could be an elegant solution.</strong><hr></blockquote>



    No argument there... system memory does need to get much faster, but I don't expect Apple to really push the envelope on it. DDR-II is coming eventually though, and this architectural change would enable Apple to hop on that bandwagon a lot faster. Another (rather odd) possibility is if the CPU could use the GPU's memory across the HT bus, rather than vice versa (or bidirectionally). This would give the flexibility without needing all the memory to be super-fast. Bit of a management nightmare though. :eek:



    The scary thing about putting the memory controller (or any other part of the system) on the CPU is that to improve it you have to upgrade the CPU. Now hopefully with the new modular architecture this can be done more quickly and effectively... but it still must be done. Hopefully the memory controller they are building in from day one is very forward looking. If it is, then suddenly Apple has the option of using really advanced types of RAM in the build-to-order store (since it's all on the daughtercard according to Dorsal M -- which sounds very plausible). This will allow a great deal of product differentiation within the PowerMac line and make it feasible to use some of the less plentiful types of memory for people who really want to spend the money and push the envelope.
  • Reply 59 of 126
    I think that is the crux of the PM G4 line.



    Differentiation.



    Apollo for lower end.

    G5 for top end.



    There should be Amiga style chips to boost Apple's push to multimedia supremacy.



    A more modular approach to design.



    With a 30% mark up they should be able to do it.



    They already use many common PC components to cut costs.



    It's about time they gave back to Powermac customers.



    They just look bad value-for-money compared to PCs.



    Still as somebody who is using a PC tower, I'm dying for the G5! (My four year old Mac tower was sold...and I've been waiting and waiting for...)



    Had a long look at the G4 dual gig...ramble...



    It's not enough. Yeesh the chip is years old already!



    Give us the G5!!!



    Lemon Bon Bon



    [Skeptical]
  • Reply 60 of 126
    razzfazzrazzfazz Posts: 728member
    [quote]Originally posted by Programmer:

    <strong>

    Didn't I just say that?</strong><hr></blockquote>



    Yeah, I noticed that too after I posted, but I was kinda too lazy to edit my post again just to remove it.





    [quote]<strong>

    Both, actually. Context switches & interrupts don't happen all that often -- on the order of a few hundred per second, typically. This doesn't really amount to a whole lot of bandwidth.

    </strong><hr></blockquote>



    Well, I was under the impression that the VRSAVE register was added exactly because the guys at Motorola thought that having to save / restore the whole vector register file on each context switch was too much.



    Also, that was pretty much what Godfrey van der Linden (Apple employee) said on the Darwin Development List on, uh, wait, March 12th:

    "In fact the kernel has made it difficult to get access to floating point and altivec mostly for efficiency reasons. If the kernel used these engines we would need to dump the entire altivec and floating point register sets every time we took a kernel transition from user land. As these register files are quite large this is a big performance hit."





    [quote]<strong>Don't over-estimate the cost I'm pointing out either -- I didn't say how expensive it is to carry around a 64-bit integer register file, I just said it was more expensive than carrying around a 32-bit integer register file. It probably doesn't amount to more than a few percent on most benchmarks. </strong><hr></blockquote>



    Well, I (obviously) didn't measure it or anything, just remembered the mailing list post above...



    Bye,

    RazzFazz