Comments
Never done video work in my life.
And what on *earth* type of color system requires >32 bits for fidelity??
The biggest gains for 64-bit are larger memory/file space access, and the *possibility* of cramming multiple lower-bit data accesses into the same bus fetch, à la AltiVec.
Originally posted by Kickaha
And what on *earth* type of color system requires >32 bits for fidelity??
Think of an image as a big 2D matrix of 32-bit numbers. You perform a function over that matrix. You get a small rounding error from truncating data after the calculation. Perform another function. More errors. Another. More errors.
It's a snowball effect that builds as artists apply several filters to their images. It's clearly visible under certain circumstances.
Editing in 32-bit color is already less than adequate today for some image work. Just a few days ago I was working in Photoshop, trying to correct some banding in what should have been a smooth dark grey gradient. Even with dithering, it was impossible to fix properly. I was able to make some very minor corrections because Photoshop 7 has an option for 16 bits/channel editing (rather than 8 bits/channel), but I practically fell out of my chair when I discovered that editing in the higher bit mode can only deal with one flattened layer. Arbitrary restriction by Adobe? Or a technological limitation of 32-bit processors? I'd lean more towards the latter.
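To put a number on that snowball, here is a minimal sketch in C. The filter is hypothetical (a gamma adjustment followed by its exact inverse, nothing like Photoshop's real pipeline); the point is how per-pass 8-bit truncation drifts while a high-precision copy of the same pixel stays put.

#include <math.h>
#include <stdio.h>

int main(void) {
    unsigned char q = 200;   /* 8 bits/channel: truncated after every step */
    double        p = 200.0; /* high-precision reference: never truncated  */

    for (int pass = 0; pass < 20; pass++) {
        /* apply a hypothetical gamma adjustment, then its exact inverse */
        q = (unsigned char)(pow(q / 255.0, 1.1) * 255.0);
        q = (unsigned char)(pow(q / 255.0, 1.0 / 1.1) * 255.0);
        p = pow(p / 255.0, 1.1) * 255.0;
        p = pow(p / 255.0, 1.0 / 1.1) * 255.0;
    }

    /* The 8-bit value drifts well below 200; the double does not.
       Spread unevenly across a gradient, that drift is banding. */
    printf("8-bit result: %d    high-precision result: %.2f\n", q, p);
    return 0;
}

Twenty passes is nothing for a heavy edit session, and each one compounds the truncation error.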
I think the original poster meant compiled/optimized for the 64-bit PPC 970, which will result in huge speed gains. I hope to see a fork (32/64) by 10.4 at the latest. I expect many apps will fork and be available as 32- or 64-bit binaries.
Gentoo Linux will be an interesting proving ground for the work IBM has done on PPC GCC 3.x. I expect many more advancements in speed and efficiency through clever scheduling.
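As a rough sketch of what that source-level 32/64 fork can look like: a single source file can branch on the compiler's predefined macros, assuming GCC's convention of defining __ppc64__ (Darwin) or __powerpc64__ (Linux) when targeting 64-bit PowerPC.

#include <stdio.h>

int main(void) {
#if defined(__ppc64__) || defined(__powerpc64__)
    /* built for 64-bit PowerPC: long and pointers are 8 bytes */
    printf("64-bit build: sizeof(long) = %d\n", (int)sizeof(long));
#else
    /* 32-bit build: long and pointers are 4 bytes */
    printf("32-bit build: sizeof(long) = %d\n", (int)sizeof(long));
#endif
    return 0;
}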
Programmer: 10.3 is optimized for the PPC970.
No, it is not.
Originally posted by 1337_5L4Xx0R
No, it is not.
On what grounds do you make this claim?
Originally posted by Brad
On what grounds do you make this claim?
Obviously not the PPC970 scheduler in GCC 3.3. Care to explain, 1337_5L4Xx0R?
Originally posted by Brad
Think of an image as a big 2D matrix of 32-bit numbers. You perform a function over that matrix. You get a small rounding error from truncating data after the calculation. Perform another function. More errors. Another. More errors.
It's a snowball effect that builds as artists apply several filters to their images. It's clearly visible under certain circumstances.
Editing in 32-bit color is already less than adequate today for some image work. Just a few days ago I was working in Photoshop, trying to correct some banding in what should have been a smooth dark grey gradient. Even with dithering, it was impossible to fix properly. I was able to make some very minor corrections because Photoshop 7 has an option for 16 bits/channel editing (rather than 8 bits/channel), but I practically fell out of my chair when I discovered that editing in the higher bit mode can only deal with one flattened layer. Arbitrary restriction by Adobe? Or a technological limitation of 32-bit processors? I'd lean more towards the latter.
You are probably thinking that in order to work with pixels deeper than 32 bits one needs to use 64-bit computations. Not so. A pixel consists of several components (red, green, blue, alpha, for example) that need to be manipulated individually during image processing. A 32-bit RGB pixel uses 8 bits of precision for each component. A 64-bit RGB pixel would use 16 bits of precision for each component. A 128-bit RGB pixel would use 32 bits of precision for each component. So even when working with 128-bit pixels one can use 32-bit computations, which is a good thing because it means you could use AltiVec to do simultaneous operations on all 32-bit components of the pixel. If you go to a 256-bit RGB pixel with 64-bit components, then AltiVec will not be of much use to you.
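In code, the idea looks something like the following sketch: a 64-bit RGBA pixel (16 bits per component) blended using nothing wider than 32-bit arithmetic. The Pixel64 and blend names are mine, purely for illustration; AltiVec would do the same work on several components at once, and this is the plain scalar version.

#include <stdint.h>
#include <stdio.h>

/* A 64-bit RGBA pixel: 16 bits of precision per component. */
typedef struct {
    uint16_t r, g, b, a;
} Pixel64;

/* 50/50 blend using only 32-bit intermediates -- no 64-bit math
   needed, even though the pixel itself is 64 bits wide. */
static Pixel64 blend(Pixel64 x, Pixel64 y) {
    Pixel64 out;
    out.r = (uint16_t)(((uint32_t)x.r + y.r) / 2);
    out.g = (uint16_t)(((uint32_t)x.g + y.g) / 2);
    out.b = (uint16_t)(((uint32_t)x.b + y.b) / 2);
    out.a = (uint16_t)(((uint32_t)x.a + y.a) / 2);
    return out;
}

int main(void) {
    Pixel64 white = { 65535, 65535, 65535, 65535 };
    Pixel64 red   = { 65535, 0,     0,     65535 };
    Pixel64 mid   = blend(white, red);
    printf("blended: %d %d %d %d\n", mid.r, mid.g, mid.b, mid.a);
    return 0;
}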
Originally posted by myahmac
By all of us I meant all who work with video. You can't really get out of using huge files when working with video. So if the guy earlier said that a 64-bit file system would boost speeds for large HDs, then that is good for people with huge HDs like me.
A 64-bit file system will not boost speeds for any HDs, large or small. The bottleneck in disk speed is the hardware data transfer rate, not the organization of the file system.
Originally posted by Tidris
The G5 does not slow down when doing 32-bit computations.
That's not the source of the slowdown.
Say you have a large text file - something you downloaded from the Gutenberg Project, perhaps. Each character in the text file is 8 bits (well, 7 if it's straight ASCII). How is it stored?
An n-bit processor most efficiently addresses data in chunks of size n. So data is usually aligned so that every individual datum is at the beginning of one of those chunks. For a 64-bit CPU, that means there's a character every 64 bits - but they're only 8 bits in size. You need 8 times more memory to store the text file than you would optimally (this phenomenon is known as "internal fragmentation"). That means that you're wasting a lot of precious bandwidth just moving the data around, and losing a lot of potential CPU performance on data that wastes 87.5% of its precious register and cache space.
So, you could pack the data, and put eight characters in each chunk (the chunk is usually called a "word" for this reason). But then you have to unpack it, because the CPU still deals with everything 64 bits at a time. So now you're memory efficient, but you have two ugly choices: Unaligned data access, which means that the CPU reads the same word 8 times, and extracts each character manually - whoops, and we're back to 12.5% bandwidth efficiency - or else reading 8 at a time and spending cycles and registers extracting out each character. The bottom line is that there's no way to handle really lightweight data without a tremendous amount of waste. And lightweight data is common.
The tradeoff, of course, is that when you have to crunch great big numbers, the CPU will be able to do that far more efficiently. Fortunately for CPU architects everywhere, high performance is more commonly crucial in these circumstances than it is when handling plain text, so it's always been worthwhile to increase the word size (or bitness) of an architecture once there was enough large data around.
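For the curious, the pack-and-extract dance described above looks like this in C. This is a sketch of the worst case, where the hardware offers nothing better than whole-word loads; as replies further down point out, real CPUs have byte loads that sidestep it.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const char text[8] = { 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r' };
    uint64_t word = 0;
    int i;

    /* pack: eight 8-bit characters into one 64-bit word */
    for (i = 0; i < 8; i++)
        word |= (uint64_t)(unsigned char)text[i] << (8 * i);

    /* unpack: each character costs a shift and a mask to dig back out */
    for (i = 0; i < 8; i++)
        putchar((int)((word >> (8 * i)) & 0xFF));
    putchar('\n');
    return 0;
}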
So (and forgive me if this is a really dumb question), exactly what is the benefit of 64 bits? After all, Apple seems to be touting this as a major speed factor. Is 64 bits gonna give us anything at all? Can I use iMovie to make a video without having to take another holiday while it renders?
Originally posted by 1337_5L4Xx0R
Programmer: 10.3 is optimized for the PPC970.
No, it is not.
Uh, yes it is. Go ask Apple. GCC 3.3 optimizes for the PPC970 and Apple recompiled the OS with that option turned on. Ergo, 10.3 is optimized for the PPC970.
Now, that doesn't mean it supports the full 64-bit capabilities of the PPC970, but that is different from being optimized for it. Adding the 64-bit capabilities will not improve performance except in a few unusual algorithms that either need 64-bit address spaces for efficiency or use full 64-bit math (and are compute-bound by it). The OS is not compute-bound by 64-bit math.
Originally posted by spooky
Is 64 bits gonna give us anything at all? Can I use iMovie to make a video without having to take another holiday while it renders?
Very little software needs 64-bit integer math and is speed-limited by it, but any that does will run considerably faster on a 64-bit machine (when compiled in 64-bit mode). The 64-bit addressing is more interesting, but again, most software doesn't need it. You've already got most of the performance that the 970 offers, so don't expect some magical performance jump. There are no doubt still plenty of processor-independent optimization opportunities left in 10.3, however, and plenty of software still hasn't been rebuilt with GCC 3.3 ...
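For a sense of what "needs 64-bit integer math and is speed-limited by it" means, here is a toy example (hypothetical, not from any real application): a multiply-heavy 64-bit hash loop. Compiled for a 32-bit PowerPC, each 64-bit multiply expands into several 32-bit multiplies and adds; in 64-bit mode on the 970 it is a single mulld instruction.

#include <stdint.h>
#include <stdio.h>

/* mix step built entirely from 64-bit operations */
static uint64_t mix64(uint64_t h, uint64_t x) {
    h ^= x;
    h *= 0x9E3779B97F4A7C15ULL; /* 64-bit constant multiply */
    return h ^ (h >> 32);
}

int main(void) {
    uint64_t h = 0;
    uint64_t i;
    for (i = 0; i < 10000000ULL; i++)
        h = mix64(h, i);
    printf("%llu\n", (unsigned long long)h);
    return 0;
}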
Originally posted by Amorph
That's not the source of the slowdown.
Say you have a large text file - something you downloaded from the Gutenberg Project, perhaps. Each character in the text file is 8 bits (well, 7 if it's straight ASCII). How is it stored?
An n-bit processor most efficiently addresses data in chunks of size n. So data is usually aligned so that every individual datum is at the beginning of one of those chunks. For a 64-bit CPU, that means there's a character every 64 bits - but they're only 8 bits in size. You need 8 times more memory to store the text file than you would optimally (this phenomenon is known as "internal fragmentation"). That means that you're wasting a lot of precious bandwidth just moving the data around, and losing a lot of potential CPU performance on data that wastes 87.5% of its precious register and cache space.
So, you could pack the data, and put eight characters in each chunk (the chunk is usually called a "word" for this reason). But then you have to unpack it, because the CPU still deals with everything 64 bits at a time. So now you're memory efficient, but you have two ugly choices: Unaligned data access, which means that the CPU reads the same word 8 times, and extracts each character manually - whoops, and we're back to 12.5% bandwidth efficiency - or else reading 8 at a time and spending cycles and registers extracting out each character. The bottom line is that there's no way to handle really lightweight data without a tremendous amount of waste. And lightweight data is common.
The tradeoff, of course, is that when you have to crunch great big numbers, the CPU will be able to do that far more efficiently. Fortunately for CPU architects everywhere, high performance is more commonly crucial in these circumstances than it is when handling plain text, so it's always been worthwhile to increase the word size (or bitness) of an architecture once there was enough large data around.
That is an interesting theory, but I have actually benchmarked 32-bit versus 64-bit integer number-crunching code on a G5. The 32-bit code was faster, plain and simple. And since someone is bound to ask: yes, I made sure the gcc 3.3 compiler was using the new 64-bit instructions for the G5. I used the gcc option -fast, which is documented here:
http://developer.apple.com/documenta...e-Options.html
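The post doesn't include the code, but a comparison along these lines can be reproduced with something like the following. This is my sketch, not the original benchmark: the same multiply-accumulate loop with 32-bit and 64-bit integers, timed separately (build with the -fast option discussed above, or your compiler's equivalent).

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 50000000

static uint32_t run32(void) {
    uint32_t acc = 1, i;
    for (i = 1; i < N; i++)
        acc = acc * 3 + i;   /* 32-bit multiply-accumulate */
    return acc;
}

static uint64_t run64(void) {
    uint64_t acc = 1, i;
    for (i = 1; i < N; i++)
        acc = acc * 3 + i;   /* same loop, 64-bit operations */
    return acc;
}

int main(void) {
    clock_t t0 = clock();
    uint32_t a = run32();
    clock_t t1 = clock();
    uint64_t b = run64();
    clock_t t2 = clock();
    printf("32-bit: %.2fs (result %u)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, a);
    printf("64-bit: %.2fs (result %llu)\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC, (unsigned long long)b);
    return 0;
}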
Originally posted by Tidris
That is an interesting theory, but I have actually benchmarked 32-bit versus 64-bit integer number-crunching code on a G5. The 32-bit code was faster, plain and simple. And since someone is bound to ask: yes, I made sure the gcc 3.3 compiler was using the new 64-bit instructions for the G5. I used the gcc option -fast, which is documented here:
http://developer.apple.com/documenta...e-Options.html
If the 970 can switch modes, it might be able to switch word sizes, in which case my "theory" (actually, material and inevitable fact) would be worked around.
Otherwise, the plain and simple fact is that a machine with a 64-bit word will waste 50% of the RAM and cache it uses to store 32-bit numbers, and 50% of the memory bandwidth it uses to move them around. That's just how things are.
Originally posted by Amorph
If the 970 can switch modes, it might be able to switch word sizes, in which case my "theory" (actually, material and inevitable fact) would be worked around.
Otherwise, the plain and simple fact is that a machine with a 64-bit word will waste 50% of the RAM and cache it uses to store 32-bit numbers, and 50% of the memory bandwidth it uses to move them around. That's just how things are.
Have you actually written any C/C++ code [edit] for the PowerPC? I can tell you that with the PowerPC compilers I have used, arrays of char, short, int, long, float, etc. are always packed in memory, even when targeting the G5. Each char in the array uses a single byte, not eight as you suggested earlier. The G5 has a very sophisticated data-prefetching mechanism you might want to read about here:
http://developer.apple.com/technotes/tn/tn2087.html
http://maul.deepsky.com/%7Emerovech/2038.html
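A quick way to check the packing claim on any C compiler (a sketch; the struct and its exact padding are ABI-dependent, and the names are mine):

#include <stdint.h>
#include <stdio.h>

struct mixed {
    char    c;  /* one byte of data...                                  */
    int64_t x;  /* ...then padding so x lands on its alignment boundary */
};

int main(void) {
    char text[8] = "Gutenbe"; /* seven chars plus the terminating NUL */

    /* arrays are packed: eight chars occupy exactly eight bytes */
    printf("sizeof(char[8])      = %d\n", (int)sizeof(text));

    /* alignment padding lives between struct fields, not between
       array elements -- typically 16 here, not 9 */
    printf("sizeof(struct mixed) = %d\n", (int)sizeof(struct mixed));
    return 0;
}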
Originally posted by Amorph
An n-bit processor most efficiently addresses data in chunks of size n. So data is usually aligned so that every individual datum is at the beginning of one of those chunks. For a 64-bit CPU, that means there's a character every 64 bits - but they're only 8 bits in size. You need 8 times more memory to store the text file than you would optimally (this phenomenon is known as "internal fragmentation"). That means that you're wasting a lot of precious bandwidth just moving the data around, and losing a lot of potential CPU performance on data that wastes 87.5% of its precious register and cache space.
So, you could pack the data, and put eight characters in each chunk (the chunk is usually called a "word" for this reason). But then you have to unpack it, because the CPU still deals with everything 64 bits at a time. So now you're memory efficient, but you have two ugly choices: Unaligned data access, which means that the CPU reads the same word 8 times, and extracts each character manually - whoops, and we're back to 12.5% bandwidth efficiency - or else reading 8 at a time and spending cycles and registers extracting out each character. The bottom line is that there's no way to handle really lightweight data without a tremendous amount of waste. And lightweight data is common.
This is not correct. As someone already said, data is packed in memory, and thus in the cache. Bandwidth out of the L1 cache is practically free, and extracting a field out of a word is literally free (loading a byte is the same speed as loading a word).
Processor designers are smart; they can handle this stuff.
Originally posted by Tidris
Have you actually written any C/C++ code?
For an Alpha. Which is where my information comes from.
If the PowerPC's done better, great. But that has absolutely nothing to do with how much code I've written in which language.
[edit: Nevertheless, nice to know my information's old, and the PPC's doing things better. The Alpha had ways of fetching unaligned data, too, but you didn't want to use them. I might have to code for a PPC sometime. It sounds like a nice architecture.]
Originally posted by wmf
This is not correct. As someone already said, data is packed in memory, and thus in the cache. Bandwidth out of the L1 cache is practically free, and extracting a field out of a word is literally free (loading a byte is the same speed as loading a word).
Processor designers are smart; they can handle this stuff.
On the other hand, if you are dealing with values 32 bits or smaller, having a 64-bit processor doesn't do you any good either. The only place it really makes a difference is when you have a 64-bit address space, and then the bandwidth increase in 64-bit mode is inescapable -- pointers are twice as big and therefore require more bandwidth, decreasing performance compared to an equivalent 32-bit application unless you actually require more than 4 GB of memory (something few apps need).
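To make that pointer tax concrete, here is a small sketch (the node type is hypothetical; exact sizes depend on the ABI, but the 32-bit/LP64 values in the comments are the typical ones):

#include <stdio.h>

/* A pointer-heavy structure: a tree node with a small payload. Under a
   32-bit ABI the two pointers cost 8 bytes; under LP64 they cost 16,
   so every node takes more cache and bus bandwidth for the same data. */
struct node {
    int          value;
    struct node *left;
    struct node *right;
};

int main(void) {
    printf("sizeof(void *)      = %d\n", (int)sizeof(void *));
    printf("sizeof(struct node) = %d\n", (int)sizeof(struct node));
    /* typically 12 bytes in a 32-bit build, 24 in a 64-bit build */
    return 0;
}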