Comments
Modern GPUs now have instruction sets which include branching and looping. They support full float arithmetic, and permit arbitrary program sizes and virtual memory...
Precisely what types of real-world applications are still better suited for Altivec/SSE than for GPU execution? And why?
Carni
Originally posted by gregmightdothat
...
Uninformed self-centered crap snipped.
Originally posted by gregmightdothat
The result is that any physics that can be done on a GPU are extraordinarily limited (no collision detections and such).
You don't know how to use Google: Collision detection on the GPU
If the rest of your "post" is as informed and reliable as this...
Originally posted by gregmightdothat
Carmack's just being egotistical-- it's his design decisions that made it slow on the Mac
This is just hilarious! Wake me up when you code Doom4.
Originally posted by UnixPoet
If the "limited model" is good enough for Carmack and HL2 then it's good enough for me.
GPUs are very, very fast vector units. Graphics happens to need these kinds of chips but so do other things. Physics is one - indeed there are demos of physics being done on a GPU.
Did I say that GPUs weren't a good thing? No. That doesn't mean they are better for everything, however. CPU-based vector units and GPU vector units are quite different in how they operate and how they can be programmed, leading to different approaches to problems. Just because you can do something on one or the other doesn't mean you should. Both should exist in a modern machine, and be applied as appropriate to the problems which need to be solved. The GPU is typically completely consumed doing graphics in a game, leaving everything else to the CPU and its vector units. Without the vector units in the CPU there is much more limited potential for what can be processed on the CPU.
quoted from Carniphage:
Modern GPUs now have instruction sets which include branching and looping. They support full float arithmetic, and permit arbitrary program sizes and virtual memory...
They still retain the vertex and pixel centric programming model. You write programs which operate conceptually on a single vertex or pixel at a time with no access to the others. This is great for allowing the GPU to execute them concurrently, but very limiting for most kinds of algorithms. The method of getting data in and out of the shader programs is also limited to textures and framebuffers, so it can be problematic. The pipelining of vertex to pixel shader output is also perfect for graphics, but less than ideal for non-graphical algorithms.
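To make the "in and out only through textures and framebuffers" point concrete, here is a rough host-side sketch of a single GPGPU pass as it looked on 2005-era OpenGL. It is an illustration only (the fragment shader binding is elided, and the function and parameter names are mine), not code from anyone in this thread:

```c
/* One GPGPU pass: pack the input into a texture, draw a full-screen quad,
 * read the result back out of the framebuffer. */
#include <OpenGL/gl.h>      /* <GL/gl.h> outside Mac OS X */

void run_one_gpu_pass(const float *input, float *output, int w, int h)
{
    GLuint tex;

    /* Inputs can only go in as textures. */
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0, GL_RGBA, GL_FLOAT, input);
    glEnable(GL_TEXTURE_2D);

    /* ... fragment shader that does the actual computation bound here ... */

    /* "Run" the program by drawing a quad that covers the framebuffer. */
    glViewport(0, 0, w, h);
    glBegin(GL_QUADS);
      glTexCoord2f(0, 0); glVertex2f(-1, -1);
      glTexCoord2f(1, 0); glVertex2f( 1, -1);
      glTexCoord2f(1, 1); glVertex2f( 1,  1);
      glTexCoord2f(0, 1); glVertex2f(-1,  1);
    glEnd();

    /* Outputs can only come back by reading the framebuffer. */
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, output);
    glDeleteTextures(1, &tex);
}
```

Every pass pays for that upload, draw and readback, which is exactly the overhead the surrounding discussion is about.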
quoted from UnixPoet:
BTW, Carmack had this to say (taken from a Slashdot post):
Carmack is talking about the realities of market share vs. justifiable effort. FWIW, I agree with him completely, which is why I don't spend my days optimizing Mac programs.
Quote:
Listening to you people it's as if AltiVec is God's own gift to the processor world. It's a good implementation, but you lot are missing the big picture. And in the big picture AltiVec is not important/relevant to 80% of the applications out there.
I don't care about 80% of the programs, so it's a pretty good match.
It is also more than "a good implementation". There are fundamental features of AltiVec which are unique to AltiVec and will remain so because of patents held by AIM. It also unifies all the data types in a single large set of registers, unlike SSE/MMX which have small split sets of registers.
Junkyard is exactly right: Apple is going to Intel so that it can use standard unmodified chips. They aren't going to customize these processors as that would defeat much of their motivation for switching. We can hope that SSE4 is better than SSE3, but if Intel continues its incremental improvement strategy then it will fragment the Mac software/hardware market like it has the PC (this is one reason why AltiVec did better on Mac than MMX/SSE/SSE2/SSE3/3DNow! on the PC... what the heck do the PC guys code for?!).
BTW: most of the comments about SSE3 being "better" than AltiVec actually refer to the use of SSE3 as a scalar floating point unit. AltiVec doesn't even try to do this... why? Because PPC has always had an excellent FPU (the 970 actually has 2) while the x86 FPU sucks. SSE3 attempts to correct this by piggybacking on the vector unit's registers. The 970 still hands Intel their hats in heavy FPU code.
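As a concrete illustration of the unified-register point (a minimal sketch, assuming GCC-style AltiVec and MMX/SSE intrinsics; the function names are mine): on the PPC side floats and integers share one 32-entry vector register file and get a single-instruction fused multiply-add, while pre-SSE2 x86 splits packed integers into the 8 MMX registers (aliased onto x87 state) and packed floats into the 8 XMM registers. SSE2 later moved integer SIMD onto XMM as well, but the file is still only 8 registers in 32-bit mode.

```c
#ifdef __ALTIVEC__
#include <altivec.h>

/* AltiVec: one register file, one header, and a fused multiply-add. */
vector float axpy(vector float a, vector float x, vector float y)
{
    return vec_madd(a, x, y);               /* a*x + y in a single op */
}

vector signed int add_ints(vector signed int a, vector signed int b)
{
    return vec_add(a, b);                   /* same registers, same unit */
}

#else
#include <mmintrin.h>    /* MMX: 64-bit packed integers, aliased onto x87 */
#include <xmmintrin.h>   /* SSE: 128-bit packed floats in XMM registers  */

__m128 axpy(__m128 a, __m128 x, __m128 y)
{
    return _mm_add_ps(_mm_mul_ps(a, x), y); /* two ops, no fused multiply-add */
}

void add_ints(const __m64 *a, const __m64 *b, __m64 *out)
{
    *out = _mm_add_pi16(*a, *b);            /* integer work goes to MMX... */
    _mm_empty();                            /* ...and x87 state must be cleared after */
}
#endif
```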
Thanks, Programmer. I always enjoy your posts; they are one of the things that make AI a worthwhile place to go.
WRT SSE3 etc., if Apple encapsulates this and uses this in a library, wouldn't that be worthwhile? At least Mac programmers wouldn't have to worry about the older stuff. The 'least-common-denominatorism' wouldn't bite nearly as hard.
Originally posted by Programmer
They still retain the vertex and pixel centric programming model. You write programs which operate conceptually on a single vertex or pixel at a time with no access to the others. This is great for allowing the GPU to execute them concurrently, but very limiting for most kinds of algorithms. The method of getting data in and out of the shader programs is also limited to textures and framebuffers, so it can be problematic. The pipelining of vertex to pixel shader output is also perfect for graphics, but less than ideal for non-graphical algorithms.
That's not strictly true, and it is becoming less true with each generation. Fixed-function pipelines are long gone. Textures are just 1D or 2D arrays. It's easy to write functions to access neighboring pixels/array elements. It's easy to access individual pixels (and it's also trivial to access the interpolated values between elements).
In fact, the more you look at modern GPU architecture, the more difficult it is to see programming problems that cannot be efficiently written as GPU code. So, as I said, I am interested in working out what class of programming task is better suited to AltiVec than it is to the GPU.
You've said that GPUs make some tasks more awkward, but do you have some actual examples?
It's also worth pointing out that NVidia and ATI have produced some great little development tools for GPU programming that allow the use of high-level languages and make writing this stuff relatively easy.
Carni.
Originally posted by UnixPoet
Uninformed self-centered crap snipped.
You don't know how to use Google: Collision detection on the GPU
If the rest of your "post" is as informed and reliable as this...
This is just hilarious! Wake me up when you code Doom4.
Carmack blames Doom's Mac performance on a poor GPU/CPU/... combination. Yet on machines with the same GPU, and CPUs that are normally fairly head to head, Doom does much worse on the Mac. Carmack can say whatever he wants; by process of elimination, the culprit is pretty much that Doom itself is slow.
The link you provided is fluff. Call me when that's usable.
You snipped the part where I call you on an amusing lapse of logic, and you call me self-centered. You snipped where I clearly explained why GPUs won't replace vector units, and call me uninformed. You, sir, are a crybaby.
Originally posted by Carniphage
In fact, the more you look at modern GPU architecture, the more difficult it is to see programming problems that cannot be efficiently written as GPU code. So, as I said, I am interested in working out what class of programming task is better suited to AltiVec than it is to the GPU.
Consider the fact that adding one or more VPUs is a lot cheaper than using the GPU, in terms of component cost, memory utilization, and heat. Let's say I'm performing FFTs on data and I need to do a lot of branching -- and perhaps some logic -- dependent on what the results of the transforms are. A GPU is certainly fine for this, but I can do it cheaper on the CPU/VPU, and take advantage of the fact that passing data between the CPU and VPU can be done very easily and efficiently with pointers. I would also suspect that it's much faster at non-batched load/store operations.
Here's another good one: I want to calculate spline-based derivative solids based on a collection of nurbs patches while maintaining a dynamic, OpenGL preview of the before-during-and-after.
I don't think anyone thinks GPUs are worse than a good VPU: most GPUs today are basically just very purpose-oriented VPUs. But in my eyes it's hard not to like an on-die VPU that addresses the same memory as the CPU.
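A minimal sketch of the FFT-then-branch pattern described above. fft_inplace() is a placeholder for whatever transform you actually call (vDSP, FFTW, a hand-rolled AltiVec kernel) and is only declared, not defined; the thresholds are made up. The point is that the branchy decision logic reads the results directly through pointers, with no upload or readback step:

```c
#include <math.h>
#include <stddef.h>

/* Placeholder: stands in for your real FFT routine (not defined here). */
void fft_inplace(float *re, float *im, size_t n);

/* Transform each block, then branch on what the spectrum looks like. */
void process_blocks(float *re, float *im, size_t n_blocks, size_t block)
{
    for (size_t b = 0; b < n_blocks; ++b) {
        float *br = re + b * block;
        float *bi = im + b * block;

        fft_inplace(br, bi, block);

        double energy = 0.0;
        for (size_t i = 0; i < block; ++i)
            energy += (double)br[i] * br[i] + (double)bi[i] * bi[i];

        if (energy < 1e-9)
            continue;                        /* near-silent block: skip it */

        if (br[0] * br[0] + bi[0] * bi[0] > 0.5 * energy) {
            br[0] = 0.0f;                    /* DC-dominated: just remove DC */
            bi[0] = 0.0f;
        } else {
            float g = (float)(1.0 / sqrt(energy));
            for (size_t i = 0; i < block; ++i) {
                br[i] *= g;                  /* otherwise normalize the block */
                bi[i] *= g;
            }
        }
    }
}
```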
Originally posted by Splinemodel
Consider the fact that adding one or more VPUs is a lot cheaper than using the GPU, in terms of component cost, memory utilization, and heat. Let's say I'm performing FFTs on data and I need to do a lot of branching -- and perhaps some logic -- dependent on what the results of the transforms are. A GPU is certainly fine for this, but I can do it cheaper on the CPU/VPU, and take advantage of the fact that passing data between the CPU and VPU can be done very easily and efficiently with pointers. I would also suspect that it's much faster at non-batched load/store operations.
Here's another good one: I want to calculate spline-based derivative solids based on a collection of nurbs patches while maintaining a dynamic, OpenGL preview of the before-during-and-after.
I don't think anyone thinks GPUs are worse than a good VPU: most GPUs today are basically just very purpose-oriented VPUs. But in my eyes it's hard not to like an on-die VPU that addresses the same memory as the CPU.
Thanks for that!
Do people still use Nurbs then? :-)
Personally I have never used AltiVec code. I have only experienced the PS2 vector unit, which was a traumatic experience too terrible to talk about.
Carni.
Originally posted by Carniphage
Thanks for that!
Do people still use Nurbs then? :-)
Yep. More so than before, due to the fact that CPU speed is ramping up a lot faster than DRAM and SRAM density. But the example doesn't really change if I replace the NURBs mesh with a SubD mesh.
Vector programming is something you have to want to learn, and it's more suited to EEs who have spent a lot of time in Matlab working on algorithms than it is to CS folks who have been working on ADTs, heuristics, and expert systems. Of course, writing for a GPU is really no different, especially if you have a custom application in mind and you're not just working with some API.
Quote:
SSE(1) was a joke.
Really technical argument here.
Quote:
It used the regular FPU.
No. You're wrong. SSE did not share any hardware with the x87 FPU. The SSE unit contained hardware for 70 additional instructions, with the ability to handle 128 bits at a time.
Quote:
IIRC, SSE2 fixed this, becoming a more independent unit on the CPU core, but it still doesn't have any instructions that I know of that are meant to operate on points within a single vector.
SSE2 was built as a modern, 128-bit replacement for the x87 unit. It was never built for vector analysis.
Quote:
So it requires more overhead, more clocks (supposedly 4x, but who really knows) and much more clever programming to match AltiVec's speed.
In what operations? Audio encoding?
Quote:
Then consider that most G4s and G5s have more than one AltiVec core per PPC core.
The Pentium 4 has 2 SSE2 units per core running at 2x the main clock. Thus a 3 GHz P4 has each SSE2 unit running at 6 GHz. I doubt the G4 or G5 can match that.
Quote:
SSE3 is better than SSE2.
Not better, just more expanded, at the expense of hardware and energy.
Quote:
Perhaps SSE4 or 5 will match AltiVec.
Match for what?
Talking about audio encoding: my 2.4 GHz P4 rips CDs into MP3s or Oggs or AAC or such at speeds that are limited by the sustained speed of my CD-ROM drive, which starts at about 20X and ends at the outer edge at about 33X.
Quote:
It's certainly possible, and I'd like to see it as a 256-bit unit rather than today's 128-bit.
Why? What do you need 256-bit precision for?
Why not 2048 bits?
Your arguments are hardly technical.
Quote:
No. You're wrong. SSE did not share any hardware with the x87 FPU. The SSE unit contained hardware for 70 additional instructions, with the ability to handle 128 bits at a time.
I was thinking the same, but I read Wikipedia's article on SSE, which contains the following:
"On the Pentium 3, however, SSE is implemented using the same circuitry as the FPU, meaning that, once again, the CPU cannot issue both FPU and SSE instructions at the same time for pipelining. The separate registers do allow SIMD and scalar floating point operations to be mixed without the performance hit from explicit MMX/floating point mode switching."
Originally posted by Carniphage
That's not strictly true, and it is becoming less true with each generation. Fixed-function pipelines are long gone. Textures are just 1D or 2D arrays. It's easy to write functions to access neighboring pixels/array elements. It's easy to access individual pixels (and it's also trivial to access the interpolated values between elements).
I wasn't talking about fixed-function pipelines at all, just the programmable shader models. As for the limitations of the models, just try passing a value from one pixel/vertex to the next without multi-passing the whole thing. Or try to create/destroy vertices in the shader (the best you can do currently is an early exit in some cases, which has non-obvious effects downstream)... tessellation is coming, but that introduces a whole other can of worms.
In fact, the more you look at modern GPU architecture, the more difficult it is to see programming problems that cannot be efficiently written as GPU code. So, as I said, I am interested in working out what class of programming task is better suited to AltiVec than it is to the GPU.
Well, I suggest you actually try it. And then try it in a situation where you have graphics to do and memory bandwidth is a bottleneck.
You've said that GPUs make some tasks more awkward, but do you have some actual examples?
Anything where you have to multi-pass the program, or accumulate across vertices or pixels. Some kinds of lookup operations just don't map well to the texture lookups or the input streams. Encryption problems are not well suited to the GPU. There are tons of them.
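One concrete example of the "accumulate across pixels" case: summing every element of a buffer is a single AltiVec loop on the CPU, while on a 2005-era GPU the same reduction takes a chain of ping-ponged render-to-texture passes. A rough sketch, assuming GCC-style AltiVec intrinsics, a 16-byte-aligned buffer, and a length that is a multiple of 4 (my illustration, not anyone's shipping code):

```c
#include <altivec.h>

float sum_all(const float *data, unsigned long n)
{
    /* Four partial sums ride along in one vector register. */
    vector float acc = (vector float)vec_splat_u32(0);

    for (unsigned long i = 0; i < n; i += 4)
        acc = vec_add(acc, vec_ld(0, data + i));   /* vec_ld needs 16-byte alignment */

    /* Fold the four lanes down to one scalar. */
    float lanes[4] __attribute__((aligned(16)));
    vec_st(acc, 0, lanes);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```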
It's also worth pointing out that NVidia and ATI have produced some great little development tools for GPU programming that allow the use of high-level languages and make writing this stuff relatively easy.
RenderMonkey, Cg, HLSL, GLSL... yes, I'm familiar with them all. Doesn't change the basic model. And realize that this model is the basic strength of the GPU that allows it to attack the problem in a massively parallel way. Any approach has downsides. There is no such thing as an efficient universal solution. Engineering is about trade-offs.
Quoted from Cubist:
WRT SSE3 etc., if Apple encapsulates this and uses this in a library, wouldn't that be worthwhile? At least Mac programmers wouldn't have to worry about the older stuff. The 'least-common-denominatorism' wouldn't bite nearly as hard.
SSE3 isn't worthless, it just isn't as powerful as AltiVec. Apple will convert all of its vector libraries and other system services to use SSE3 where it is a win. Unfortunately with vector units these kinds of pre-compiled libraries aren't nearly as much of a win as specifically coded algorithms. Things like MacSTL by PixelGlow will help, but if you look at his benchmarks you'll see that AltiVec typically stomps SSE (especially in all those cases where SSE can't even be used) and this is borne out in practice when hand-coded for them. I just find it sad that Apple had the best and now they are forced to "downgrade". Unfortunately their target market just isn't receiving the attention from IBM or FreeScale that they need it to, so Intel wins the war.
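To illustrate why a pre-compiled library call usually loses to a hand-coded loop, here is a hedged sketch of d[i] = a[i]*b[i] + c[i] done both ways. The first version chains two generic Accelerate/vDSP calls (vDSP_vmul, vDSP_vadd) and pays for a temporary buffer plus an extra trip through memory; the second assumes AltiVec, 16-byte-aligned buffers, and a length that is a multiple of 4, and does the whole thing in one fused pass. Function names are mine:

```c
#include <Accelerate/Accelerate.h>
#include <altivec.h>

/* Library version: two generic calls, with the intermediate product
 * making a full round trip through memory between them. */
void madd_vdsp(const float *a, const float *b, const float *c,
               float *tmp, float *d, vDSP_Length n)
{
    vDSP_vmul(a, 1, b, 1, tmp, 1, n);   /* tmp = a * b */
    vDSP_vadd(tmp, 1, c, 1, d, 1, n);   /* d   = tmp + c */
}

/* Hand-coded version: one pass, one fused multiply-add per 4 floats,
 * and no intermediate buffer at all. */
void madd_altivec(const float *a, const float *b, const float *c,
                  float *d, unsigned long n)
{
    for (unsigned long i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        vector float vc = vec_ld(0, c + i);
        vec_st(vec_madd(va, vb, vc), 0, d + i);
    }
}
```

The product in the library version has to be written out and re-read, which is exactly the kind of overhead that disappears when the algorithm is coded for the vector unit directly.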