
G5 : 64 bits or 32 bits ? - Page 2

post #41 of 127
64bit addressing can wait for now. What we really need is 64bit FP SIMD (altivec)!


Amorph
[quote]In fact, it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits, and if you're reading in 8 bit ASCII characters then you're pulling 8 times the bandwidth the data actually requires across the bus, and since the bus is always a bottleneck, this actually hurts performance.<hr></blockquote>

Today, if you read 8 bits from memory, you will already read 64 bits, because that's the bus's width. Actually, you'll even read 4x64 (or 8x64, i.e. 8 quads, on DDR boards), because:

- SDRAM bursts are cheap.
- entire cache blocks are read at once.


Programmer
[quote]Context switching is the single largest data size cost because the size of the integer register file has doubled and must be saved on every function call and thread switch.<hr></blockquote>

Actually, this depends a lot on the processor design and calling convention. Most processors have gp registers that are not saved for simple function calls. Then, there are leaf procedures... As for regular function calls, if you have a machine with multiple register windows (SPARC), you don't have to save your registers at all (well, most of them, most of the time). However, I entirely agree that generally stack size will increase, but how much is that?

[ 04-02-2002: Message edited by: 123 ]

post #42 of 127
[quote]Originally posted by 123:
<strong>64bit addressing can wait for now. What we really need is 64bit FP SIMD (altivec)!
</strong><hr></blockquote>

I disagree... if you have vector doubles then you're only doing two operations per instruction (unless you double the size of the already huge vector register set and introduce all sorts of compatibility issues with current AltiVec code), but you still need to go through the effort of vectorizing your code. Much better would be to have multiple FPUs so that more than one double FP operation executes per clock cycle, and that benefits scalar and vector code alike.

[quote]<strong>
Actually, this depends a lot on the processor design and calling convention. Most processors have gp registers that are not saved for simple function calls. Then, there are leaf procedures... As for regular function calls, if you have a machine with multiple register windows (SPARC), you don't have to save your registers at all (well, most of them, most of the time). However, I entirely agree that generally stack size will increase, but how much is that?
</strong><hr></blockquote>

In multi-threaded object oriented code it can amount to quite a bit of bandwidth. It also has the effect of putting extra pressure on the caches in the system because the stack consumes cachelines that would otherwise be used for other data.

[quote]<strong>
...comments about how reads come in as 64-bit anyhow...
</strong><hr></blockquote>

While it is true that individual reads are at least 64 bits (wider from L3 -> L2 -> L1), and they come in bursts of 4, this does not mean that reading a 64-bit integer is the same cost as reading an 8-bit integer. The problem is that you can't consider machine performance one element at a time. If all the machine is doing is reading just the one data element, and then doing nothing else, you are correct that either read is of equal cost. However, if the machine is reading a sequence of these values then the 8-bit reads will find 32 entries in the same cacheline, whereas the 64-bit reads will have to pull in 8 cachelines to get 32 entries. If you look at data structures in the aggregate (especially ones designed for good cache performance) then using larger types has a performance impact if you are memory bound ... and the G4 spends a lot of time waiting for memory.
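
To put rough numbers on the cacheline point, here is a small sketch of my own (assuming the 32-byte cache lines implied by the figures above, and modern fixed-width types); it just counts how many lines a sequential scan of 32 entries touches for each element size:

[code]
// Counting how many 32-byte cache lines a sequential scan touches for
// 8-bit vs. 64-bit elements. The 32-byte line size is an assumption that
// matches the "32 entries per line" figure quoted above.
#include <cstddef>
#include <cstdint>
#include <cstdio>

const std::size_t kLineSize = 32;   // bytes per cache line (assumed)

template <typename T>
std::size_t linesTouched(std::size_t count) {
    return (count * sizeof(T) + kLineSize - 1) / kLineSize;
}

int main() {
    const std::size_t n = 32;       // 32 entries, as in the example above
    std::printf("8-bit elements : %lu line(s)\n", (unsigned long)linesTouched<std::uint8_t>(n));   // 1
    std::printf("64-bit elements: %lu line(s)\n", (unsigned long)linesTouched<std::uint64_t>(n));  // 8
    return 0;
}
[/code]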
Providing grist for the rumour mill since 2001.
post #43 of 127
[quote]Originally posted by powerdoc:
<strong>
So how do you explain that IBM chose 64-bit CPUs for its high-end servers with the 64-bit POWER3 and POWER4 chips, if the only important thing is the speed of the HD?</strong><hr></blockquote>

They chose 64 bits for larger memory address space, not for being able to handle larger volume or file sizes on disk, as was suggested before.

Bye,
RazzFazz
post #44 of 127
[quote]Originally posted by 123:
<strong>64bit addressing can wait for now. What we really need is 64bit FP SIMD (altivec)!
</strong><hr></blockquote>

Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible.


[quote]<strong>
Today, if you read 8 bits from memory, you will already read 64 bits, because that's the bus's width. Actually, you'll even read 4x64 (or 8x64, i.e. 8 quads, on DDR boards), because:

- SDRAM bursts are cheap.
- entire cache blocks are read at once.
</strong><hr></blockquote>

Yes, but getting data from a non-aligned address (a 32-bit int not on an address divisible by 4, for example) into a register is still usually more expensive, and sometimes not possible at all.
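
As a small illustration of my own (not from the post), this is what portable code typically does to pull a 32-bit value off an address that isn't 4-byte aligned; the commented-out cast is the version that can take an alignment penalty, or fault outright, on some architectures:

[code]
// Reading a 32-bit integer from a byte buffer at an arbitrary offset.
#include <cstdint>
#include <cstring>

std::uint32_t readUnaligned(const unsigned char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);   // the compiler emits whatever access the CPU allows
    return v;
}

// Risky version: if p + 1 is only byte-aligned, this cast is undefined behaviour
// in C/C++ and can trap or run slowly on hardware that dislikes misaligned loads:
// std::uint32_t bad = *reinterpret_cast<const std::uint32_t*>(p + 1);
[/code]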


[quote]<strong>
Actually, this depends a lot on the processor design and calling convention. Most processors have gp registers that are not saved for simple function calls.</strong><hr></blockquote>

While this may be true, he was talking about context switches, not simple function calls.


[quote]<strong>As for regular function calls, if you have a machine with multiple register windows (SPARC), you don't have to save your registers at all (well, most of them, most of the time). </strong><hr></blockquote>

Well, yes, but apart from SPARC, what other processor currently in use has register windows at all?


[quote]<strong>However, I entirely agree that generally stack size will increase, but how much is that?
</strong><hr></blockquote>

I think he was more referring to the fact that a larger register file (32x64 GPR + 32x64 FPR instead of 32x32 GPR + 32x64 FPR) causes additional memory bus utilization on each and every context switch.

Bye,
RazzFazz
post #45 of 127
Thread Starter 
[quote]Originally posted by RazzFazz:
<strong>

They chose 64 bits for larger memory address space, not for being able to handle larger volume or file sizes on disk, as was suggested before.

Bye,
RazzFazz</strong><hr></blockquote>
So if I understand you, RazzFazz, 64-bit CPUs are useless for the Mac; it would be much better to have one or two more FPU units.
That's my opinion too, but I am not a specialist like you or Programmer.
post #46 of 127
[quote]Originally posted by RazzFazz:
<strong>
Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible.
</strong><hr></blockquote>

Didn't I just say that?

[quote]<strong>
Yes, but getting data from a non-aligned address (a 32-bit int not on an address divisible by 4, for example) into a register is still usually more expensive, and sometimes not possible at all.
</strong><hr></blockquote>

This usually isn't a factor since the vast majority of all accesses are cached and the size of the fetch from cache is irrelevant (at least on PPC). The important factor is locality of reference, and the effect data size has on that.

<strong> [quote]
While this may be true, he was talking about context switches, not simple function calls.
...
I think he was more referring to the fact that a larger register file (32x64 GPR + 32x64 FPR instead of 32x32 GPR + 32x64 FPR) causes additional memory bus utilization on each and every context switch.
</strong><hr></blockquote>

Both, actually. Context switches & interrupts don't happen all that often -- on the order of a few hundred per second, typically. This doesn't really amount to a whole lot of bandwidth.
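
A back-of-the-envelope calculation of my own (picking 500 switches per second as "a few hundred"; the real number varies) shows how small the extra register-file traffic is:

[code]
// Extra save/restore traffic from doubling the 32 integer registers
// (4 bytes -> 8 bytes each) on every full context switch.
#include <cstdio>

int main() {
    const int gprs             = 32;
    const int extraBytesPerGpr = 8 - 4;     // 64-bit vs. 32-bit registers
    const int switchesPerSec   = 500;       // assumed: "a few hundred per second"
    const int extraBytesPerSec = gprs * extraBytesPerGpr * switchesPerSec;
    std::printf("extra traffic: %d bytes/s (~%.1f KB/s)\n",
                extraBytesPerSec, extraBytesPerSec / 1024.0);   // ~62.5 KB/s
    return 0;
}
[/code]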

Function calls, on the other hand, usually do need to save a fair bit of context... many functions are not leaf functions and thus end up needing to save at least about half the registers. Even leaf functions need to save registers if they are doing something register intensive.

Don't overestimate the cost I'm pointing out either -- I didn't say how expensive it is to carry around a 64-bit integer register file, I just said it was more expensive than carrying around a 32-bit integer register file. It probably doesn't amount to more than a few percent on most benchmarks. Just like going to DDR RAM doesn't double overall system performance, going to double sized integers doesn't cut performance in half.
Providing grist for the rumour mill since 2001.
post #47 of 127
[quote]Originally posted by powerdoc:
<strong>
So if I understand you, RazzFazz, 64-bit CPUs are useless for the Mac; it would be much better to have one or two more FPU units.
That's my opinion too, but I am not a specialist like you or Programmer.</strong><hr></blockquote>

I certainly wouldn't go that far! If I had to rank processor features in order of preference, this is the order I'd put them:

- On-chip DDR memory controller + RapidIO bus.
- Higher clock rate (~1.6 GHz).
- More than 1 FPU.
- Ability to retire more instructions per clock cycle.
- More integer units (since we can now retire more...).
- Super high speed interface to graphics chipset.
- 64-bit
- Much higher clock rate (~3 GHz). This is last because it requires much longer pipelines, which is a bad thing.


From a developer's point of view having 64-bit and a capable virtual memory system (like MacOSX's) enables certain useful techniques -- like being able to memory map as many really large files as you want, or using huge sparsely allocated data structures.
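
A minimal sketch of the memory-mapping technique (plain POSIX mmap as available on Mac OS X; the file name is simply taken from the command line). With 32-bit pointers a handful of multi-gigabyte mappings exhausts the address space; with 64-bit pointers you can keep mapping file after file:

[code]
// Map a whole file read-only; pages are faulted in lazily, so huge,
// sparsely touched mappings cost little until they are actually used.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void* base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) return 1;

    std::printf("mapped %lld bytes at %p\n", (long long)st.st_size, base);
    munmap(base, (size_t)st.st_size);
    close(fd);
    return 0;
}
[/code]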
Providing grist for the rumour mill since 2001.
post #48 of 127
post #49 of 127
[quote]Originally posted by AirSluf:
<strong>
Not if you added a second Altivec unit and laid the two logically side-by-side. Then extend the instruction set to handle 4x64 vectors by using both units simultaneously. Whether the real estate is available on the die is a separate matter. But a logical 256-bit Altivec does not have to break the previous implementation.</strong><hr></blockquote>

Oh there is plenty of real estate available. The scheme you propose is fraught with difficulties, however, and it just feels like a hack. It is difficult enough for Apple to get people coding for AltiVec when there is just one version of it -- adding a second would be impossible to justify. Even Intel has trouble convincing developers to spend the effort to support MMX/SSE/SSE2. Adding more FPUs would just make existing code run faster. Adding another VPU or two would make existing AltiVec code run faster. Of course you'd have to ease the memory bottleneck first to make any of that worthwhile...
Providing grist for the rumour mill since 2001.
post #50 of 127
[quote] Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless IMO (and boosting AltiVec to 256 bits would kill compatibility). Just adding another scalar FPU would be much more powerful and flexible. <hr></blockquote>

You could add some additional FPUs or whatever, I don't really care. At the moment, the G4 doesn't have competitive double precision units, and this is very important for a majority of scientific applications. In my opinion, 2x64 is a lot more than 0x64. Vectorization is not a big issue, because scientific calculations already are vectorized for either supercomputers, clusters or P4s (where everything dp is done in SIMD). Besides, I wouldn't exactly say those kinds of programmers are stupid... so I think they could handle AltiVec.

[quote] However, if the machine is reading a sequence of these values then the 8-bit reads will find 32 entries in the same cacheline, whereas the 64-bit reads will have to pull in 8 cachelines to get 32 entries. If you look at data structures in the aggregate (especially ones designed for good cache performance) then using larger types has a performance impact if you are memory bound ... and the G4 spends a lot of time waiting for memory. <hr></blockquote>

Correct, if you use larger types. However, I was talking about "it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits", my point being that today's machines already read at least 64 bits at a time.

Just because the registers are 64 bits wide doesn't mean you have to use all 64 bits; there are still stb, sth, stw and their load equivalents. A halfword has to be aligned (RazzFazz) to 16 bits and a byte to 8 bits, so you can still have your 32 bytes in one 256-bit cache block. This has absolutely nothing to do with register size.


[quote]
Well, yes, but apart from SPARC, what other processor currently in use has register windows at all?<hr></blockquote>

The discussion was becoming more and more one about 64 vs. 32 bits in general, not just PPC. My point was that a lot of the stated arguments are not generally true but rather design dependent, and I actually think one example is enough to prove that (Tensilica uses them too).


[quote] While this may be true, he was talking about context switches, not simple function calls.<hr></blockquote>

He was talking about function calls, and so was I (true, context switches are expensive, but I wasn't addressing them). I also admitted that stack size will increase (being the place where function parameters and registers are usually saved) and with it, memory transfer. However, as I've said, this is hardware and software dependent; cache design plays an important role, multiple register sets, etc. I still don't think it accounts for "a few percent on most benchmarks", I'd say a fraction of one percent.

[ 04-03-2002: Message edited by: 123 ]
post #51 of 127
post #52 of 127
[quote]Originally posted by AirSluf:
<strong>
Very true as things are now. Looks like prime territory for compiler work. There is still tremendous opportunity for speed gains through compiler design to alleviate the programmer from having to tweak everything by hand. Especially automating the vectorizing where it would be advantageous to use it. What I am getting at is much more in depth than the current pathetic compiler auto-vectorizations.
</strong><hr></blockquote>

I agree that it would be really nice to have better compilers... but people have been saying that for decades, and for the most part it hasn't come true. Rather than counting on some highly speculative compiler improvements coming to save the day, it would be much more prudent to make the FPU super-scalar, staying fully backward compatible, and avoid complicating the execution model. Compilers frequently get new schedulers, but I'm not holding my breath for them to get a vectorizer. If vectorizing compilers do show up then they can still use the existing AltiVec unit for anything that doesn't require double precision (i.e. most things), and it can then operate in a super-scalar fashion alongside the multiple double precision FPUs.
Providing grist for the rumour mill since 2001.
post #53 of 127
Thread Starter 
[quote]Originally posted by Programmer:
<strong>

I certainly wouldn't go that far! If I had to rank processor features in order of preference, this is the order I'd put them:

- On-chip DDR memory controller + RapidIO bus.
- Higher clock rate (~1.6 GHz).
- More than 1 FPU.
- Ability to retire more instructions per clock cycle.
- More integer units (since we can now retire more...).
- Super high speed interface to graphics chipset.
- 64-bit
- Much higher clock rate (~3 GHz). This is last because it requires much longer pipelines, which is a bad thing.


.</strong><hr></blockquote>

Oops, you are right: DDR RAM, RapidIO, and more MHz are more important than the addition of one FPU unit. I did not write it because I found it evident that these new features will be incorporated in the G5.

Concerning the ability to retire more instructions per cycle: isn't this linked (though not only) to better handling of memory and to more integer or FPU units? On the contrary, as you have mentioned, a very deep pipeline goes in the wrong direction for this.

When you speak of a super-high-speed graphics interface, what do you refer to: the ability to use AGP 8X, or a mobo like the nForce (but still AGP 4X if you do not use the GeForce2 MX graphics chipset of the mobo)? This last sentence leads me to another question: do you think we might see a new Apple mobo with nForce features, a 128-bit memory bus and a graphics chipset implemented directly on the mobo with the help of nVidia? A 128-bit memory bus doubles the memory data transfer, and an embedded graphics chipset can save some money; high-end users will still have the choice of adding an AGP graphics card.

[ 04-04-2002: Message edited by: powerdoc ]
post #54 of 127
[quote]Originally posted by 123:
<strong>
I still don't think it accounts for "a few percent on most benchmarks", I'd say a fraction of one.</strong><hr></blockquote>

This will obviously depend on the benchmark. A lot of modern code is heavily object-oriented and does a great many function calls.

[quote]<strong>
Correct, if you use larger types. However, I was talking about "it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits", my point being that today's machines already read at least 64 bits at a time.
</strong><hr></blockquote>

Well, if nothing else, you don't have a choice about the size of your pointers -- they will be twice as big. This includes the vtable pointer in every polymorphic C++ object (or equivalent in whatever language you are using).
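
A tiny illustration of my own of that point: the hidden vtable pointer is pointer-sized, so the same polymorphic class grows when you move from a 32-bit to a 64-bit ABI (typical results are 12 vs. 16 bytes for the second struct below, versus 8 bytes either way for the first):

[code]
#include <cstdio>

struct Plain   { int x, y; };                        // no vtable pointer
struct Virtual { int x, y; virtual ~Virtual() {} };  // adds one pointer-sized vptr

int main() {
    std::printf("Plain:   %lu bytes\n", (unsigned long)sizeof(Plain));
    std::printf("Virtual: %lu bytes\n", (unsigned long)sizeof(Virtual));
    return 0;
}
[/code]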

[quote]<strong>
Vectorization is not a big issue, because scientific calculations already are vectorized for either supercomputers, clusters or P4s (where everything dp is done in SIMD). Besides, I wouldn't exactly say those kinds of programmers are stupid... so I think they could handle AltiVec.
</strong><hr></blockquote>

Unfortunately these things are rarely standardized and have to be re-coded for each vector unit. The supercomputer implementations, in particular, are often not even SIMD in nature. The P4 implementation is done in assembly. Most programmers, however, maintain a straight-C/C++ double precision implementation that is not SIMD-ified, and if your scalar double precision FPU is the best performer then programmers don't need to do anything other than recompile (see message above for comments on auto-vectorizing compilers). And while good programmers are not stupid, they are "lazy" (at least in terms of not rewriting code that already works). Apple needs to make it trivial to get code onto the Mac platform, and that means minimizing the amount of Mac or PPC specific coding that needs to be done.

It bears repeating -- splitting the PowerPC execution model is highly undesirable. It means there are more versions to be coded for & debugged, the market is fractured, new development tools are called for, new hardware must wait for code that takes advantage of it, etc. Speeding up the existing execution model is much preferable, and (relatively) easily done. IBM's POWER3/4 machines have almost the same execution model (minus AltiVec), but a better and superscalar FPU which gives them tremendous double precision performance.

[ 04-04-2002: Message edited by: Programmer ]
Providing grist for the rumour mill since 2001.
post #55 of 127
[quote]Originally posted by powerdoc:
<strong>
Concerning the ability to retire more instructions per cycle: isn't this linked (though not only) to better handling of memory and to more integer or FPU units? On the contrary, as you have mentioned, a very deep pipeline goes in the wrong direction for this.</strong><hr></blockquote>

Well, the instruction retirement is usually broken out as a separate unit in most of the Motorola documentation -- whether this is a conceptual distinction or a physical one I don't know. IIRC the retirement limit doesn't depend on the kind of instructions, however, which implies some kind of a shared mechanism. As far as I know pipeline length and instruction retirement are fairly independent -- retirement is just what happens at the end of the pipeline, although it could be multiple steps itself. I'm definitely flirting with the edge of my knowledge of this subject, however, so I'll shut up now.
Providing grist for the rumour mill since 2001.
post #56 of 127
[quote]Originally posted by powerdoc:
<strong>When you speak of a super-high-speed graphics interface, what do you refer to: the ability to use AGP 8X, or a mobo like the nForce (but still AGP 4X if you do not use the GeForce2 MX graphics chipset of the mobo)? This last sentence leads me to another question: do you think we might see a new Apple mobo with nForce features, a 128-bit memory bus and a graphics chipset implemented directly on the mobo with the help of nVidia? A 128-bit memory bus doubles the memory data transfer, and an embedded graphics chipset can save some money; high-end users will still have the choice of adding an AGP graphics card.
</strong><hr></blockquote>

I think the day of the wide bus is done. nVidia is on the HyperTransport committee and has already implemented it in a couple of places (XBox & nForce). I think ATI has talked about using HT as well. I expect to see AGP either disappear or be relegated to a secondary port (it is only 1-2 GBytes/sec, after all). A graphics chipset on HyperTransport could have as much as 12 GBytes/sec bandwidth to memory and the processor, and do it with a much lower pin count. RapidIO can provide high bandwidths as well, but I think the graphics engine deserves its own channel. A slot for HT is possible, but hasn't been specified yet... although Apple is exactly the sort of company that I'd expect to push that envelope. They might also just build it into the motherboard like the iMac... hopefully providing for multiple options for people who want high end graphics. One added bonus for this kind of a setup might be finally getting rid of the GPU-specific memory, which has reached epic proportions (128 MBytes of RAM just for the graphics engine!!). This would make the graphics memory available to the main system when not needed by the graphics, and system memory available to the graphics when needed. Much more flexible.

[ 04-04-2002: Message edited by: Programmer ]
Providing grist for the rumour mill since 2001.
post #57 of 127
Thread Starter 
[quote]Originally posted by Programmer:
<strong>

. I'm definitely flirting with the edge of my knowledge of this subject, however, so I'll shut up now. </strong><hr></blockquote>

Don't worry, you are already flying one hundred kilometers above my knowledge on that subject. If I had to shut up each time I am flirting with the edge of my knowledge, I would never speak on these forums...
post #58 of 127
Thread Starter 
[quote]Originally posted by Programmer:
<strong>

I think the day of the wide bus is done. nVidia is on the HyperTransport committee and has already implemented it in a couple of places (XBox & nForce). I think ATI has talked about using HT as well. I expect to see AGP either disappear or be relegated to a secondary port (it is only 1-2 GBytes/sec, after all). A graphics chipset on HyperTransport could have as much as 12 GBytes/sec bandwidth to memory and the processor, and do it with a much lower pin count. RapidIO can provide high bandwidths as well, but I think the graphics engine deserves its own channel. A slot for HT is possible, but hasn't been specified yet... although Apple is exactly the sort of company that I'd expect to push that envelope. They might also just build it into the motherboard like the iMac... hopefully providing for multiple options for people who want high end graphics. One added bonus for this kind of a setup might be finally getting rid of the GPU-specific memory, which has reached epic proportions (128 MBytes of RAM just for the graphics engine!!). This would make the graphics memory available to the main system when not needed by the graphics, and system memory available to the graphics when needed. Much more flexible.

[ 04-04-2002: Message edited by: Programmer ]</strong><hr></blockquote>
As you mentioned, AGP is limited to 1 to 2 GB/s (AGP 8X), but HyperTransport can reach 12 GB/s. I quite understand that direct communication between the G5 and the GPU can reach this speed, but even DDR RAM at PC2100 cannot go beyond 2 GB/s; in this case HyperTransport brings nothing more for memory access than AGP 8X on a PC2100-based mobo.
A built-in GPU on the mobo can be great, but the main memory (if you do not use memory specific to the GPU, as in the iMac) has to be really fast. The minimum for this is the nForce specification: 128-bit DDR memory at 133 MHz, i.e. 4 GB/s; but even with that, the GeForce2 MX graphics of the nForce is just average compared to an AGP GeForce2 MX graphics card. Without 128-bit DDR this solution would be a regression compared to the present Power Macs.
For this reason I think that Apple may choose, as you mentioned, to make a HyperTransport slot with different daughter cards available. It could be an elegant solution.
post #59 of 127
[quote]Originally posted by powerdoc:
<strong>For this reason I think that Apple may choose, as you mentioned, to make a HyperTransport slot with different daughter cards available. It could be an elegant solution.</strong><hr></blockquote>

No argument there... system memory does need to get much faster, but I don't expect Apple to really push the envelope on it. DDR-II is coming eventually though, and this architectural change would enable Apple to hop on that bandwagon a lot faster. Another (rather odd) possibility is if the CPU could use the GPU's memory across the HT bus, rather than vice versa (or bidirectionally). This would give the flexibility without needing all the memory to be super-fast. Bit of a management nightmare though. :eek:

The scary thing about putting the memory controller (or any other part of the system) on the CPU is that to improve it you have to upgrade the CPU. Now hopefully with the new modular architecture this can be done more quickly and effectively... but it still must be done. Hopefully the memory controller they are building in from day one is very forward looking. If it is, then suddenly Apple has the option of using really advanced types of RAM in the build-to-order store (since it's all on the daughtercard according to Dorsal M -- which sounds very plausible). This will allow a great deal of product differentiation within the PowerMac line and make it feasible to use some of the less plentiful types of memory for people who really want to spend the money and push the envelope.
Providing grist for the rumour mill since 2001.
post #60 of 127
I think that is the crux of the PM G4 line.

Differentiation.

Apollo for lower end.
G5 for top end.

There should be Amiga style chips to boost Apple's push to multimedia supremacy.

A more modular approach to design.

With a 30% mark up they should be able to do it.

They already use many common PC components to cut costs.

It's about time they gave back to Powermac customers.

They just look bad value-for-money compared to PCs.

Still as somebody who is using a PC tower, I'm dying for the G5! (My four year old Mac tower was sold...and I've been waiting and waiting for...)

Had a long look at the G4 dual gig...ramble...

It's not enough. Yeesh the chip is years old already!

Give us the G5!!!

Lemon Bon Bon

<img src="graemlins/bugeye.gif" border="0" alt="[Skeptical]" />
We do it because Steve Jobs is the supreme defender of the Macintosh faith, someone who led Apple back from the brink of extinction just four years ago. And we do it because his annual keynote is...
post #61 of 127
[quote]Originally posted by Programmer:
<strong>
Didn't I just say that?</strong><hr></blockquote>

Yeah, noticed that too after I posted, but I kinda was too lazy to edit my post again just to remove it


[quote]<strong>
Both, actually. Context switches & interrupts don't happen all that often -- on the order of a few hundred per second, typically. This doesn't really amount to a whole lot of bandwidth.
</strong><hr></blockquote>

Well, I was under the impression that the VRSAVE register was added exactly because the guys at Motorola thought that having to save / restore the whole vector register file on each context switch was too much.

Also, that was pretty much what Godfrey van der Linden (Apple employee) said on the Darwin Development List on, uh, wait, March 12th:
"In fact the kernel has made it difficult to get access to floating
point and altivec mostly for efficiency reasons.
If the kernel used these engines we would need to dump the entire
altivec and floating point register sets every time we took a kernel
transition from user land. As these register files are quite large
this is a big performance hit."


[quote]<strong>Don't over-estimate the cost I'm pointing out either -- I didn't say how expensive it is to carry around a 64-bit integer register file, I just said it was more expensive than carrying around a 32-bit integer register file. It probably doesn't amount to more than a few percent on most benchmarks. </strong><hr></blockquote>

Well, I (obviously ) didn't measure it or anything, just remembered the Mailing List post above...

Bye,
RazzFazz
post #62 of 127
[quote]Originally posted by 123:
<strong>
In my opinion, 2x64 is a lot more than 0x64. </strong><hr></blockquote>

Well, to be fair, this should read "1x64" rather than "0x64" (the G4 does have a DP FPU).


[quote]<strong>
Vectorization is not a big issue, because scientific calculations already are vectororized for either supercomputers, clusters or P4s (where everything dp is done in SIMD).</strong><hr></blockquote>

Well, supercomputers probably have much longer vectors in the first place, so you'd have to rewrite again for AltiVec, and a program designed for a cluster won't help much for an on-chip SIMD unit either.
Since SSE2 uses the same vector types (assuming 4x32FP SIMD), it should be possible to at least partially reuse code for it.


[quote]<strong>
A halfword has to be aligned (RazzFazz) to 16 bits and a byte to 8 bits. So you can still have your 32 bytes in one 256 bit cache block. This has absolutely nothing to do with register size.
</strong><hr></blockquote>

Hm, not sure, but I thought I remembered that, at least for some architectures, reading byte-aligned values on a 32bit machine takes longer than reading ones aligned to the "native" size.

Also, early Alphas couldn't read individual bytes from memory at all, could they (i.e. you had to read a quadword and mask out the byte you were interested in)? Damn, I definitely need to beef up my long-term memory...


[quote]<strong>He was talking about function calls, and so was I (true, context switches are expensive, but I wasn't addressing them).</strong><hr></blockquote>

Oops, sorry, must have mixed up some quotes in that case.


[quote]<strong>I still don't think it accounts for "a few percent on most benchmarks", I'd say a fraction of one.
</strong><hr></blockquote>

As stated above, I had the quoted Darwin development posting in mind.

Bye,
RazzFazz

[ 04-07-2002: Message edited by: RazzFazz ]
post #63 of 127
[quote]Originally posted by RazzFazz:
<strong>
Well, I was under the impression that the VRSAVE register was added exactly because the guys at Motorola thought that having to save / restore the whole vector register file on each context switch was too much.

Also, that was pretty much what Godfrey van der Linden (Apple employee) said on the Darwin Development List on, uh, wait, March 12th:
"In fact the kernel has made it difficult to get access to floating
point and altivec mostly for efficiency reasons.
If the kernel used these engines we would need to dump the entire
altivec and floating point register sets every time we took a kernel
transition from user land. As these register files are quite large
this is a big performance hit."</strong><hr></blockquote>

Yes, this is a bit of a fuzzy case -- these transitions into the kernel are really function calls to the operating system, but they use a "software interrupt" mechanism which requires a context switch of sorts. These calls can be very frequent and, as you quoted, the repeated cost of saving the complete context would be brutal. VRSAVE ensures that only the registers you are using get saved, but it is up to the OS to enforce that and if you are using any of those registers they still need to be saved. One upside is that vector code doesn't usually make too many operating system calls.
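
A hedged sketch of the VRSAVE idea (this is not Apple's actual kernel code; read_vrsave() and save_vector_register() are hypothetical stand-ins for the real SPR read and store-vector instructions):

[code]
// Save only the vector registers whose VRSAVE bit is set. VRSAVE is a 32-bit
// register where, by convention, bit i indicates that v(i) is live.
#include <cstdint>

extern std::uint32_t read_vrsave();                       // hypothetical: mfspr from VRSAVE
extern void save_vector_register(int index, void* dst);   // hypothetical: stvx of v[index]

void saveLiveVectorRegisters(void* saveArea) {
    std::uint32_t live = read_vrsave();
    for (int i = 0; i < 32; ++i) {
        if (live & (0x80000000u >> i)) {                  // PowerPC numbers bit 0 as the MSB
            save_vector_register(i, static_cast<char*>(saveArea) + i * 16);  // 16 bytes per VR
        }
    }
}
[/code]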

Anyhow, I forgot to explicitly mention this case so thanks for pointing it out.
Providing grist for the rumour mill since 2001.
post #64 of 127
[quote]Originally posted by Programmer:
<strong>
One added bonus for this kind of a setup might be finally getting rid of the GPU-specific memory, which has reached epic proportions (128 MBytes of RAM just for the graphics engine!!). This would make the graphics memory available to the main system when not needed by the graphics, and system memory available to the graphics when needed. Much more flexible.
</strong><hr></blockquote>

Much more flexible indeed, but also a hell of a lot more expensive than current designs (those 128 MB of VRAM you mentioned are considerably faster than standard RAM, and consequently are also much, much more expensive).
Also, this might kill expandability. There are reasons why current graphics cards don't feature upgradeable VRAM any more. Beyond a certain speed, you can't stick that memory on an exchangeable module any more, but rather have to solder it onto the board directly (at least with current tech). Given that RAM is probably the most efficient and frequent upgrade to computers, I don't think this would be a good way to go.

Bye,
RazzFazz
post #65 of 127
[quote]Originally posted by RazzFazz:
<strong>Much more flexible indeed, but also a hell of a lot more expensive than current designs (those 128 MB of VRAM you mentioned are considerably faster than standard RAM, and consequently are also much, much more expensive).
Also, this might kill expandability. There are reasons why current graphics cards don't feature upgradeable VRAM any more. Beyond a certain speed, you can't stick that memory on an exchangeable module any more, but rather have to solder it onto the board directly (at least with current tech). Given that RAM is probably the most efficient and frequent upgrade to computers, I don't think this would be a good way to go.
</strong><hr></blockquote>

True, but the RAM on graphics boards isn't really much faster than the RAM on the motherboard (i.e. it's DDR266 or DDR333). The difference is in the organization... I wrote something on that earlier, was it this thread?
Providing grist for the rumour mill since 2001.
post #66 of 127
[quote]Originally posted by Programmer:
<strong>
True, but the RAM on graphics boards isn't really much faster than the RAM on the motherboard (i.e. it's DDR266 or DDR333). The difference is in the organization...</strong><hr></blockquote>

Are you sure? On first glance, most publications claim "650MHz memory clock" for the new GF4 Ti4600 boards, with between 2 and 3 ns access time. Even if these are in fact only DDRed 325MHz, that would still be almost twice as fast as DDR333.


[quote]<strong>I wrote something on that earlier, was it this thread?
</strong><hr></blockquote>

Nope, not in here...

Bye,
RazzFazz

[ 04-07-2002: Message edited by: RazzFazz ]
post #67 of 127
[quote]Originally posted by RazzFazz:
<strong>
Are you sure? On first glance, most publications claim "650MHz memory clock" for the new GF4 Ti4600 boards, with between 2 and 3 ns access time. Even if these are in fact only DDRed 325MHz, that would still be almost twice as fast as DDR333.</strong><hr></blockquote>

Yeah, you're right about that -- their latest boards do claim a 650 MHz memory clock (I had only seen their GeForce4 architecture stuff, not the actual clock specs). Wow, I knew their memory architecture had improved but I didn't realize by how much. Unless they are doing some weird arithmetic, that is (wouldn't be the first time). But saying 650 MHz memory clock is pretty clear.

I'm rather curious about this memory, actually... if it is actually clocked that fast then it may be plausible for Apple to start using it in the rumoured G5 machines since they will be tightly coupled to their memory just like the GPUs are. nVidia is shipping 128MBytes of this stuff on their $500 cards so price-wise it might not be out of reach for Apple's top machines. All sorts of memory expansion issues, of course.

They've also got this fancy cross-bar memory controller which would need to be replicated in the CPU's memory controller before a true unified memory model could happen. Who knows, maybe the future processors will use the GPU's memory controller and talk to it via HyperTransport.

Thanks.
Providing grist for the rumour mill since 2001.
post #68 of 127
Programmer, RazzFazz

I just want to say that I don't understand why you post just to have the last word, even if it's totally off topic. Do you need to increase your post count? Congrats, I lose big time.

programmer:
me: Correct, if you use larger types. However, I was talking about...
you: Well, if nothing else, you don't have a choice about the size of your pointers -- they will be twice as big. This includes the vtable..

Do you really think I don't realize that a 64-bit architecture actually means 64-bit pointers? I was not talking (do you read the things you quote?) about bigger types (which are either used because you have to (pointers), because you're lazy, or because the compiler chooses to use them (int = 64 bit), and which indeed do use more bandwidth). My post was about reading characters from memory, addressing Amorph's earlier post. It's a pleasure to quote it a third time:

"it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits, and if you're reading in 8 bit ASCII characters then you're pulling 8 times the bandwidth the data actually requires across the bus,..."

and here we go again, maybe this time...:

"it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits, and if you're reading in 8 bit ASCII characters then you're pulling 8 times the bandwidth the data actually requires across the bus,..."

And as before, my point is that this is not a difference between 32 and 64 bit processors. Because you would just not (ASCII, not pointers) PULL MORE DATA ACROSS THE BUS than you do now (even if it were slower from L1 cache to the register (RazzFazz), which it isn't!!!, this would still not be the issue here (BUS!)).

RazzFazz:
When we are discussing whether to implement 64bit fp altivec!! support, why do you write:

"Well, to be fair, this should read "1x64" rather than "0x64" (the G4 does have a DP FPU)."

??? We both know that the G4 does not have any dp fp support in AltiVec. If I were talking about the whole processor, I would have written 3x64 > 1x64, isn't that obvious?

you:
Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless...
me:
2x64 is a lot more than 0x64

It was of course all SIMD related! Now, we don't have to agree on everything, but please read the posts and their context a bit closer before posting. Thank you very much.


BTW: "(assuming 4x32FP SIMD)" : I was of course talking about 2x64 SIMD, not 4x32 which is already implemented.
post #69 of 127
Wow! Don't we think much of ourselves! (By 'we' I mean 'you', and by 'ourselves' I mean 'yourself'.)
post #70 of 127
I know this is off subject somewhat but I'd like to know if the G5 is a multicore chip and if it is, how many cores are used in it.

I've been looking at the transistor counts on the G4, G4e, and G5 and if the G5 is made from G4 cores, it looks like it could have up to 4 G4 cores on it based on the transistor count.

I used the G5 transistor count posted on geek.com so I hope their numbers are fairly accurate.
post #71 of 127
post #72 of 127
[quote]Originally posted by AirSluf:
<strong>123, I think you need to re-read before you try to pummel some intelligent posts. Me thinks you just like to selectively chip away in frustration even when the posts were correct. It's beginning to look like a trend... <img src="graemlins/oyvey.gif" border="0" alt="[No]" /> </strong><hr></blockquote>

I don't know, seems like 123 is correct in that people didn't read his post before answering.

For example:
and here we go again, maybe this time...:

"it can slow things down, because the minimum size of the data a 64 bit processor reads is 64 bits, and if you're reading in 8 bit ASCII characters then you're pulling 8 times the bandwidth the data actually requires across the bus,..."

And as before, my point is that this is not a difference between 32 and 64 bit processors. Because you would just not (ASCII, not pointers) PULL MORE DATA ACROSS THE BUS than you do now (even if it were slower from L1 cache to the register (RazzFazz), which it isn't!!!, this would still not be the issue here (BUS!)).

which essentially says you're reading 64 bits off the bus every time anyway (even on a 32-bit processor with a 64-bit bus).
I heard that geeks are a dime a dozen, I just want to find out who's been passin' out the dimes
----- Fred Blassie 1964
post #73 of 127
[quote]Originally posted by onlooker:
<strong>
You your self mentioned servers, and the words Unix, and Server go togeter like peanut butter, and jelly. Ruuun Foorrest!!! So 64bit seems like natural step towards a brighter future if you ask me.</strong><hr></blockquote>

quite possibly the hardest thing I've had to read since madtool and da truth train....maybe I need to get to bed though? I've been tired all day....too many damn commas ...and I use to many "..." I used to use too many commas though.
orange you just glad?
post #74 of 127
Hi. I noticed that all you people seem to think the only thing 64-bit addressing is good for is more RAM. It's not.

Probably the most important upgrade made feasible by a 64-bit computer is 64-bit A/V. Remember how incredibly cool it was, and how much nicer it looked, when you installed the Radius card in your Mac II and got 24-bit video? Remember how the 660AV and 840AV, due to their 16-bit audio synthesizer/digitizer, sounded so much better?

If any of you have ever used (I have) a high-end SGI machine (the ones equipped with 48-bit A/V I/O) and switched between the 48-bit and 24-bit A/V modes, you'll notice a dramatic difference, and the multimedia/graphics professionals that are Apple's bread and butter will DEFINITELY notice.

The key here is not so much the G5 (it's pretty much already here) or a 64-bit version of Mac OS X (basically inevitable); the key is those lazy audio/video hardware makers, and what Apple chooses to put on their motherboards. Remember how long it was before Apple started shipping machines with 24-bit graphics hardware? Notice how Macs STILL don't use 24-bit audio hardware? If Apple is dumb enough to ship G5 machines with 64-bit Mac OS X and no 48-64-bit A/V I/O, we're doomed. This will be the big stumbling block for 64-bit Macs.

As for compatibility: much like every other RISC architecture's bit-depth transition (MIPS, POWER, Alpha, SHx, ARM, etc.), the G5 will run both 64-bit and totally unmodified 32-bit code concurrently and without any speed penalty whatsoever.

And as for a definition of "x-bit": my definition is that when all parts of a CPU operate at and/or above a certain bit depth, the CPU as a whole can be considered as operating at that bit depth.

For example, while the external bus of the G4 runs at 64 bits, the VPU at 128 bits, and the FPU also at 64 bits, the IPU only operates at 32 bits; the G4 is thus a 32-bit CPU.


Eric,
post #75 of 127
post #76 of 127
[quote]Originally posted by 123:
<strong>Programmer, RazzFazz

I just want to say that I don't understand why you post just to have the last word, even if it's totally off topic. Do you need to increase your post count? Congrats, I lose big time.
</strong><hr></blockquote>

Chill, dude.
Why so aggressive?


[quote]<strong>
RazzFazz:
When we are discussing whether to implement 64bit fp altivec!! support, why do you write:

"Well, to be fair, this should read "1x64" rather than "0x64" (the G4 does have a DP FPU)."

??? We both know, that the G4 does not have any dp fp support in altivec. If I was talking about the whole processor, I would have written 3x64 &gt; 1x64, isnt' that obvious?
</strong><hr></blockquote>

I thought we were talking about how to catch up in DP FP power, SIMD or not (see my and programmer's above posts regarding "rather 2xscalar DP than Altivec DP"). In fact, you said yourself that "You could add some additional FPUs or whatever, I don't really care."


[quote]<strong>
you:
Adding 2x64 DP SIMD capabilities to AltiVec is pretty useless...
me:
2x64 is a lot more than 0x64

It was of course all SIMD related! Now, we don't have to agree on everything, but please read the posts and their context a bit closer before posting. Thank you very much.
</strong><hr></blockquote>

Well, again, you said "You could add some additional FPUs or whatever, I don't really care.", and that doesn't sound like "we're exclusively talking SIMD here" to me very much.


[quote]<strong>BTW: "(assuming 4x32FP SIMD)" : I was of course talking about 2x64 SIMD, not 4x32 which is already implemented.</strong><hr></blockquote>

My bad, sorry.

Bye,
RazzFazz
post #77 of 127
RazzFazz:
[quote]
and that doesn't sound like "we're exclusively talking SIMD here" to me very much.
<hr></blockquote>

Ok, I see your point. Yes, I said that I don't really care, but you said that SIMD was not an option, so I was defending the SIMD idea, because we already agreed on additional FPUs. I was actually quoting your 2x64 (you were talking exclusively about SIMD in that sentence), but I agree that you could read that either way; misunderstanding on both sides, my apologies.

Why so aggressive?
Because I really was/am frustrated. By now, I wanted to post only a few (3-4) times. But because nobody seems to be able to leave something just there, even if it is indisputable (I think ;-), I had to quote the same things (64-bit bus, funnel theory (other thread)) several times again and again... Then, you tell me Programmer wasn't talking about functions, even though I've even quoted the sentence; then he writes about bigger types, although that has nothing to do with what I was saying, and he even repeats that point. After that, you come up with 1x64, although I wasn't talking about the FPU (from my point of view)... I felt I had to ask people to read closer before posting a response, but I agree that it sounded too aggressive.


AirSluf:
[quote]
Razz and Programmer both gave pretty good explanations in their original posts and they have a long standing track record of low bull-shit high content posting. Someone else has a 2 thread record of low-fact high-warble responses. Which makes more sense to you? <hr></blockquote>


I agree on their track record, but I also think people should read first what other people have written and think about it before posting answers, no matter their name/knowledge/track record/reputation.

(Sorry everyone, this is from another thread, not even hardware related)
Same to you actually: how could you tell me that I was basically saying funnels are completely "un-needed"? If you had read my posts on that subject, you would have noticed that I was never talking about funnels in general, just that they're not responsible for the big Finder problems on SP machines. And about those "low-fact" responses: at least my facts are related to the topic and I'm not talking about Jacobians. I also don't just come up with some stuff I've read in a paper or book even if it's irrelevant (as you do), just to show off my tremendous knowledge. (BTW: it would have been pretty easy to dismiss Whyatt's main argument if you understood funnels, but I think you don't.)

[ 04-09-2002: Message edited by: 123 ]
post #78 of 127
[quote]Originally posted by Eric D.V.H:
<strong>Probably the most important upgrade made feasible by a 64-bit computer is 64-bit A/V. Remember how incredibly cool it was, and how much nicer it looked, when you installed the Radius card in your Mac II and got 24-bit video? Remember how the 660AV and 840AV, due to their 16-bit audio synthesizer/digitizer, sounded so much better?</strong><hr></blockquote>

Well, this is pretty much similar to the memory size argument: Most people are well capable of seeing a difference between a real life picture or sound and its 8bit digitized equivalent. The same is not true anymore if you go to 24bit or 16bit/44kHz resolutions, respectively (and even less so for 24/96 audio). Unless our eyes and ears suddenly go through an evolutionary quantum leap, there's no point in having resolutions that are way beyond what our sensory organs can discern.

Also, there's little point in having huge internal resolutions, when the interfaces are the quality bottleneck. Most audio DACs and ADCs on the market today, while technically featuring 24bit/96kHz, are far from even coming close to the theoretical SNR allowed by that. Similarly, current LCDs are limited to 24 bit colors at most, and even if this was just a limitation of the interface circuits inside the screen (as opposed to the panel itself), it still couldn't make use of higher bit depths.


[quote]<strong>If any of you have ever used(I have ) a high-end(The ones equipped with 48-bit A/V I/O) SGI machine. and switched between the 48-bit and 24-bit A/V modes. you'll notice a dramatic difference. and the multimedia/graphics professionals that are Apple's bread and butter will DEFINITELY notice.
</strong><hr></blockquote>

Well, since I haven't actually used any of those, I'll have to believe you on this matter.

One plausible explanation I could think of is that 24 bits per pixel, while featuring >16 million colors (more than you could put onto a single screen at once), only allows for 256 shades of every color component (red, green, blue) and consequently only 256 gray scales too. So in case you have a mainly monochromatic image (like a gray scale one), that limit would probably become apparent, whereas 16 bits per color component (i.e. 48bpp) would allow for a much more realistic representation.

On a side note, using AltiVec, the G4 could do 64bpp gfx already today (in fact, it could handle two such pixels per instruction), or even 128bpp (32 bit int or FP per color component).

Concerning audio, though, I'm pretty sure that a 32bit FP representation as used in OS X should be enough.

Bye,
RazzFazz
post #79 of 127
[quote]Originally posted by RazzFazz:
<strong>Well, this is pretty much similar to the memory size argument: Most people are well capable of seeing a difference between a real life picture or sound and its 8bit digitized equivalent. The same is not true anymore if you go to 24bit or 16bit/44kHz resolutions, respectively (and even less so for 24/96 audio). Unless our eyes and ears suddenly go through an evolutionary quantum leap, there's no point in having resolutions that are way beyond what our sensory organs can discern.</strong><hr></blockquote>

I doubt that. Try (if you have a large monitor set at a high resolution) popping open Photoshop and doing a large, simple gradient. Notice the slight striation and dithering? Now try this: get a photograph and start doing a bunch of brightness/contrast/hue/saturation-type adjustments. Notice the loss of detail and the posterization effect? These sorts of things matter big time to media professionals. And as for the debate over higher-bit-depth audio, go to a DVD-Audio or Super Audio CD site; you'll get an earful (heh heh).

[quote]Originally posted by RazzFazz:
<strong>Also, there's little point in having huge internal resolutions, when the interfaces are the quality bottleneck. Most audio DACs and ADCs on the market today, while technically featuring 24bit/96kHz, are far from even coming close to the theoretical SNR allowed by that.</strong><hr></blockquote>

Really? Have you checked out your PowerMac? How about your DVD player? Looked into your DV camcorder? I'd be willing to bet that the first has a 16-bit audio I/O setup, while the latter two both bear mere 10-bit audio hardware.

[quote]Originally posted by RazzFazz:
<strong>Similarly, current LCDs are limited to 24 bit colors at most, and even if this was just a limitation of the interface circuits inside the screen (as opposed to the panel itself), it still couldn't make use of higher bit depths.</strong><hr></blockquote>

Emphasis on "current". Try hooking an older analog LCD to a good SGI/HP/Compaq/IBM etc. machine. This is why I think digital I/O components (read: those stupid USB speakers and ADC monitors Apple makes) are a bad idea.

[quote]Originally posted by RazzFazz:
<strong>Well, since I haven't actually used any of those, I'll have to believe you on this matter.

One plausible explanation I could think of is that 24 bits per pixel, while featuring &gt;16mio colors (more than you could put onto a single screen at once), only allow for 256 shades of every color component (red, green, blue) and consequently only 256 gray scales too. So in case you have a mainly monochromatic image (like a gray scale one), that limit would probably become apparent, whereas 16 bits per color component (i.e. 48bpp) would allow for a much more realistic representation.</strong><hr></blockquote>

Yup. Whereas 24-bit only has around sixteen million (16,777,216) colors, 32-bit can theoretically do over four billion (4,294,967,296), 48-bit can do up to about two hundred and eighty-one trillion (281,474,976,710,656), and full 64-bit can do over a whopping eighteen quintillion (18,446,744,073,709,551,616)! The capability to easily fumble around with such immense numbers would bring the I/O quality up so high as to please even the most particular user. And the power of high-bit-depth AD/DA circuitry could be tapped in a less CPU-taxing fashion with palettes (remember how a lot of games could get WAY more than 256 colors on early Macs? They did this by selecting 256 individual colors out of a static 16M-color palette).
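
A rough sketch (my own names, nothing standard) of the palette trick being described: the framebuffer holds cheap 8-bit indices, and only the 256 palette entries carry the wide per-component colors:

[code]
#include <cstddef>
#include <cstdint>
#include <vector>

struct Color48 { std::uint16_t r, g, b; };   // 16 bits per component, 48 bits per color

// Expand an indexed image to full-precision colors with one table lookup per pixel.
std::vector<Color48> expandIndexed(const std::vector<std::uint8_t>& indices,
                                   const Color48 palette[256]) {
    std::vector<Color48> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        out.push_back(palette[indices[i]]);
    return out;
}
[/code]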

[quote]Originally posted by RazzFazz:
<strong>On a side note, using AltiVec, the G4 could do 64bpp gfx already today (in fact, it could handle two such pixels per instruction), or even 128bpp (32 bit int or FP per color component).</strong><hr></blockquote>

Using AltiVec, that is. Can you say "2x integer speed penalty"? Like I said before, you need AT LEAST 64 bits throughout ALL of the CPU for maximum speed.

[quote]Originally posted by RazzFazz:
<strong>Concerning audio, though, I'm pretty sure that a 32bit FP representation as used in OS X should be enough.</strong><hr></blockquote>

That's what they said about 16-bit CDs. Compare a 32-bit signal to an analog signal (a good one) on an oscilloscope; you'll notice tiny plateau-like steps in the digital signal. 32-bit is good, but 48-bit, or better yet, 64-bit audio is quite a bit better.

Eric,
post #80 of 127
Thread Starter 
[quote]Originally posted by Eric D.V.H:
<strong>

That's what they said about 16-bit CDs. Compare a 32-bit signal to an analog signal (a good one) on an oscilloscope; you'll notice tiny plateau-like steps in the digital signal. 32-bit is good, but 48-bit, or better yet, 64-bit audio is quite a bit better.

Eric,</strong><hr></blockquote>
What you say is true in the absolute, but there is no need for 24-bit audio if the analog conversion, amplification, and speakers are lame.
I doubt that we can hear a difference between a 16-bit audio signal and even a 64-bit signal on basic speakers (like the lame built-in speaker of the tower). The problem is not a huge price difference between a 24-bit audio chipset and a 16-bit one, but the huge difference in price between standard-quality amplification and speakers and very good ones.

This is the main reason why in the domain of hi-fi there are such huge differences in price. If you don't believe me, just go to an audio showroom and try some different equipment (if you are an audio pro, I don't need to convince you on that particular subject).

[ 04-09-2002: Message edited by: powerdoc ]