GPU's and Maths-coprocessing

anna mated · July 6, 2003 4:43AM

Consider this, the newest Radeon and GeForce FX chips are something like >120million transistors vs PPC970 and P4 >50 million. Also these chips (GPU's) can be bought in a retail card for <$500. They do run considerably slower Mhz though.

These generation of GPU can do 64-bit FP throughout, and AFAIK can do this with 4 way parallelism, hence 256 bit moniker. Like Altivec for 64bit FP.

I wonder then if it is possible to send 64 bit "Textures" that are visual representations of a Mathematical dataset to a GPU, for instance (Fluid dynamics, Finite Element Analysis, Radiosity etc), as GPU's seem to be rather more efficient at FP, as this is what they really do.

Imagine Photoshop/Final cut or even Shake filters rendered in the GPU, nearly instant.

Would it not be posible, if utilising these chips for FP calculations is realistic, to build Maths-coprossers from these chips, and put them in PCI-x slots?

programmer · July 6, 2003 10:08AM

Quote:

Originally posted by Anna Mated

Would it not be posible, if utilising these chips for FP calculations is realistic, to build Maths-coprossers from these chips, and put them in PCI-x slots?

Soon. The current generation of nVidia GPUs are generalized enough that this is doable, and they maintain full precision through the entire pipeline. The real challenge at the moment is how to map your problems into the data set being fed to these processors, and how to map the results back out. Certainly the idea of having a massively parallel, deeply pipelines programmable floating point engine is an idea whose time has come.

I've been talking about it for a year or two now, just waiting for the hardware and tools to arrive. Now its time to start thinking hard about how to write software that uses this sort of thing. Once we have the ability to package up computations like that for GPU-like hardware, the same mechanism can be used in clusters or to feed computations to SMP SIMD machines (or clusters of SMP SIMD machines). I've got some ideas on the subject but so far haven't seen anybody publishing anything similar.

airsluf · July 6, 2003 3:15PM

programmer · July 6, 2003 4:34PM

Quote:

Originally posted by AirSluf

The long pole in this tent (to making it accessible and really useful) seems to be the appropriate tools to get the data out of graphics memory off-screen buffers without having to do something like a copyBits or getTexSubImage and then cast that into appropriate data structs and then verifying everything went hunky-dory.

Nah, the real "long pole in the tent" will be allowing the pipeline to be configured by the user, not just the programs in fixed stages.

gargoyle · July 6, 2003 5:03PM

not wanting to but in totally off topic, but is it really necessary to quote the entire message in the first reply... it has to be the single most annoying thing about these forums.

programmer · July 6, 2003 5:10PM

Quote:

Originally posted by Gargoyle

not wanting to but in totally off topic, but is it really necessary to quote the entire message in the first reply... it has to be the single most annoying thing about these forums.

Apologies, I was rushed. I have corrected the quote.

powerdoc · July 7, 2003 12:51AM

Quote:

Originally posted by Anna Mated

Consider this, the newest Radeon and GeForce FX chips are something like >120million transistors vs PPC970 and P4 >50 million. Also these chips (GPU's) can be bought in a retail card for <$500. They do run considerably slower Mhz though.

These generation of GPU can do 64-bit FP throughout, and AFAIK can do this with 4 way parallelism, hence 256 bit moniker. Like Altivec for 64bit FP.

Let's do a small comparison for FP against a G5 and a GPU of the lattest generation.

the GPU clock at a max of 500 mhz, and is a 4 way FP hand has a 2 GB/sec of memory bandwitch when the main memory is employed. He has a very good "L3 cache" (read the video ram, i said L3, because modern L2 cache are on die cache)

The CPU clock at a max of 2000 mhz has two independant but not fully symetrical FP, plus symetrical FP via altivec, and has a state of the art bus with more than 6,4 GB access to the main memory.

If we only took into account (it's an oversimplification) the vectorise FP unit (the 4 of the GPU, and the 2 of the Altivec unit), the G5 is two time fasters.

If we take into account non vectorized FP code, the G5 will be from 4 to 8 times faster.

If you take into accont the difficult access to the main memory of the video card, and the hard work of programming involved, you will understand that using the current FP unit on the G5 is the best solution. And if you take into account that you can use only one video card, but two G5, you will bet that GPU math coprocessing is not for tomorrow.

Concerning the possible and future use of PCI-X card : the idear is interesting but there is a problem of what supply. Apple said in his developper PDF document, that the PCI-X and AGP slot, has a power supply of 90 watts maximum. There is no room to feed many GPU with only 90 watts.

mmicist · July 7, 2003 11:27AM

Quote:

Originally posted by Anna Mated

These generation of GPU can do 64-bit FP throughout, and AFAIK can do this with 4 way parallelism, hence 256 bit moniker. Like Altivec for 64bit FP.

No. The cards can only do 32 bit (single precision) FP.

They do, however, have multiple pipes, each handling up to 4 32 bit values at the same time.

michael

anna mated · July 7, 2003 11:35AM

Powerdoc,

I wont quote or debunk your post, but.

Its quite simply fact that the GPU is a specialist chip for 1 area only, the graphics. These (new) chips are built entirely for 64bit FP more or less. I cant recall Apples direct quote, but when the Geforce 3 was first available for the mac, they quoted the amount of Flops/Gigaflops, which was an order of magnitute higher than the current G4, and still an order of magnitute higher than the G5 of today, this was for a GF3.

Still, whats the difference between a 64bit number and a 64bit pixel? I'd guess (though Im not sure) that they're just numbers expressed with 64bit precision.

Say for instance, you have a mathmatical dataset consisting of a million 64bit numbers, and say you want to divide all these numbers by two. In a Graphics API, ie Open GL or DirectX or whatever,(and im speculating, maybe programmer can give a firm example) you would simply overlay a screen of black at 50% opacity. The result (in Pixels on the screen) is that all these 64bit numbers have been divided by two, (I see this happens regularly in FPS games when you press escape to bring up a main options menu). No doubt with the power of todays cards, such a simple example could be performed at something like 5000 times a second, considering the card is simply overlaying a colour on a static background (Im purely speculating on numbers) Surely this is faster than looping a CPU intensive program to divide each number by two.

If I understand programmer correctly, the problem is then getting the result off the screen (ie the modified pixels), back into main ram, to witness the results as something useful.

Bear in mind Im not a programmer, so I simply speculate at the possibilities.

anna mated · July 7, 2003 11:37AM

Quote:

Originally posted by mmicist

No. The cards can only do 32 bit (single precision) FP.

They do, however, have multiple pipes, each handling up to 4 32 bit values at the same time.

michael

Please can you back this up? I thought the DirectX9 specs(sorry to quote a windows technology!) called for 64bit FP throughout the rendering pipeline.

powerdoc · July 7, 2003 1:17PM

Quote:

Originally posted by Anna Mated

Powerdoc,

I wont quote or debunk your post, but.

Please debunk it, the GPU have low clocking compared to CPU, even if they have massive parallelar FPU unit, 500 mhz is 4 time slower than a G5 at 2 ghz. Show me where i am wrong with this. A gpu card do not only calculate triangles, but do many others things, like texturing, antialiasing ... All this stuff is made via specialised sections.

Concerning the Gflop, keep in mind that the Gflop of a video card is not the same as Gflop of a CPU especially FP calculations.

anna mated · July 7, 2003 4:10PM

Quote:

Originally posted by Powerdoc

Please debunk it, the GPU have low clocking compared to CPU, even if they have massive parallelar FPU unit, 500 mhz is 4 time slower than a G5 at 2 ghz. Show me where i am wrong with this. A gpu card do not only calculate triangles, but do many others things, like texturing, antialiasing ... All this stuff is made via specialised sections.

Concerning the Gflop, keep in mind that the Gflop of a video card is not the same as Gflop of a CPU especially FP calculations.

I wont, because Im not trying to prove you wrong!

Yes the GPU runs 4x slower. I can see that. Is it not true though that each sub-Pixel (RGB+Alpha) is stored as a 64bit FP number? (I don't know, Im asking - DX9/ATI/NVIDIA literature leads me too believe it is), therefore what happens to each Pixel, happens to the sub-pixel components, of which there are 4. Already we have a 4way parallel processor, similar to altivec (albeit 32bits). For instance on a monitor set to 1280x1024, we have 1,310,720 pixels, or 5,242,880 sub pixels. Say 5 million sub-pixels for arguments sake. Say for instance we can program a Graphics API to calculate a 'transform' on each sub-pixel at 100 FPS (average FPS game, so not an unreasonable guess) - we are already doing 500 million FP calculations per second, which if I understand DX9/ATI/NVIDIA correctly is also performed at 64-bit Floating point precision throughout. Can you explain why this 64bit FP calculation is any different than a 64-bit calculation done on the CPU?

Suppose we load up a huge dataset of 64bit numbers on each of our sub-pixels, irrespective of whatever garbage we see on the screen, and apply a transformation to them (like my earlier example of an overlay). Why do you not agree that this is a considerable source of Floating Point calculating power. Apart from the load on the CPU and AGP/PCI-x slot of transferring the data, all the actual calculaton is done with no CPU.

So where do we get these huge datasets from? Well imagine an hour of Final-cut footage. You want to increase the brightness of the whole hour by 10%, (personally I have no idea how long this would take to render through the CPU), Currently (speculating) a programmer would write a loop to go through every-sub pixel of each frame and multiply the value by 110% (sorry if this does not equate a brightness transform), and you would get a result afer the loop has finished. During this time while it was rendering, your PC would be bogged down with the processors at 100%. Alternatively, you could write a Shader algorithm for the GPU to process, (assuming the bus can keep up) and render in the background, probably at many times the real-time FPS of the video, and you do not notice a thing, (remember I said these chips were installed as co-processors in a PCI-X slot, and not the actual video output)

I need to hear back from Programmer et all, but Im sure it could be possible.

anna mated · July 7, 2003 4:32PM

OK, ive done some web research,

You only have to read fully the introduction, to see what I am getting at....

http://www.multires.caltech.edu/pubs/GPUSim.pdf

And read the conclusion at the end, a 500MHZ GPU performed about 80% better than a 3.0GHZ P4 using SSE2 on the floating point tests they performed. I cant see anywhere whether these were 64bit calculations.

programmer · July 7, 2003 4:43PM

The 64-bit float requirement probably refers to 4 x 16-bit float SIMD. The ATI and nVidia also support 4 x 32-bit SIMD. They do not support double precision floating point.

The GPUs have a lot more computational power than you are crediting them with. They have ~4 vertex shader pipelines which (very) roughly correspond to a PowerPC AltiVec unit each, plus they have clipping hardware, fragment hardware, pixel shaders, frame buffer combiners, Z-buffer comparators, etc. They also have a really smart streaming memory system that understands how data is moving through the system.

As I said before, nVidia is moving down the road of having a general purpose massively parallel floating point engine which they set up as a graphics pipeline in their driver. With a different driver it could theoretically do something quite different. ATI's GPU is currently much more hardwired. For graphics it is debatable which is better, but if you're thinking of other uses for this hardware then the nVidia approach would be more useful.

anna mated · July 7, 2003 5:12PM

Quote:

Originally posted by Programmer

The 64-bit float requirement probably refers to 4 x 16-bit float SIMD. The ATI and nVidia also support 4 x 32-bit SIMD. They do not support double precision floating point.

The GPUs have a lot more computational power than you are crediting them with. They have ~4 vertex shader pipelines which (very) roughly correspond to a PowerPC AltiVec unit each, plus they have clipping hardware, fragment hardware, pixel shaders, frame buffer combiners, Z-buffer comparators, etc. They also have a really smart streaming memory system that understands how data is moving through the system.

As I said before, nVidia is moving down the road of having a general purpose massively parallel floating point engine which they set up as a graphics pipeline in their driver. With a different driver it could theoretically do something quite different. ATI's GPU is currently much more hardwired. For graphics it is debatable which is better, but if you're thinking of other uses for this hardware then the nVidia approach would be more useful.

Thank Prog

Its a shame that the computer industry alway lists bullshit for specs, you are indeed correct, 64 bit actually means 4x16 as ive found out from the web

"More precision everywhere ? The watchword for DX9 is precision, as you might have gathered by now. DX9 calls for larger, floating-point datatypes throughout the rendering pipeline, from texture storage to pixel shaders, from the Z-buffers to the frame buffers. 128-bit floating-point color precision is the most complex color mode, but the DX9 spec calls for a range of color formats, including 32, 40, and 64-bit integer modes (with red, green, blue, and alpha channels of 8:8:8:8, 10:10:10:10, and 16:16:16:16), plus 16 and 32-bit floating-point modes."

Still all this 32bit power is just sitting there waiting to be released, Imagine there are 4 altivec units (roughly) in each Gefore FX!!!!!!!!!!!

Powerdoc, hey, looks like we both had a point.

davechen · July 7, 2003 7:50PM

There's a web site devoted to using GPUs for general processing at UNC (my alma mater):

http://wwwx.cs.unc.edu/~harrism/gpgpu/

And there's a session at Siggraph devoted to the topic too:

http://www.siggraph.org/s2003/confer.../papers12.html

One of the talks will be by the CalTech folks. I'm thinking about implementing that paper, or maybe the paper by the people from Munich. One of the projects I'm working on uses a sparse matrix solver, so these papers are directly applicable.

[edited to fix URLs]

whisper · July 7, 2003 8:41PM

Quote:

Originally posted by davechen

There's a web site devoted to using GPUs for general processing at UNC (my alma mater):

http://wwwx.cs.unc.edu/~harrism/gpgpu/index.shtm

And there's a session at Siggraph devoted to the topic too:

http://www.siggraph.org/s2003/confer...s/papers12.htm

One of the talks will be by the CalTech folks. I'm thinking about implementing that paper, or maybe the paper by the people from Munich. One of the projects I'm working on uses a sparse matrix solver, so these papers are directly applicable.

The first link should end in "shtml", and the second in "html".

davechen · July 7, 2003 10:00PM

Quote:

Originally posted by Whisper

The first link should end in "shtml", and the second in "html".

oops. i suck.

powerdoc · July 8, 2003 12:34AM

Quote:

Originally posted by Anna Mated

OK, ive done some web research,

You only have to read fully the introduction, to see what I am getting at....

http://www.multires.caltech.edu/pubs/GPUSim.pdf

And read the conclusion at the end, a 500MHZ GPU performed about 80% better than a 3.0GHZ P4 using SSE2 on the floating point tests they performed. I cant see anywhere whether these were 64bit calculations.

Interesting link, however SSE sucks compared to altivec. I wonder how it compare to a dual G5. Some DNA sequencing are 5 time faster than a P4 under a G5.

Sure i underestimate the power of a GPU card, but we should never underestimate the power of altivec.

GPU's and Maths-coprocessing

Comments