OpenGL on the Mac Pros is multithreaded!
http://www.macrumors.com/pages/2006/...21223856.shtml
Looks like the Mac Pros ship by default with a special build of 10.4.7 that includes multithreaded OpenGL code, which will let multithreaded games and apps get quite a speed boost.
Blizzard has evidently made a beta recompile of WoW that's OpenGL-multithreaded, and they report a DOUBLING in frame rates.
Sounds cool!
Comments
Oh man. OpenGL needs this, as DirectX 10 is a farce.
For it being a farce, it sure has a lot of followers (both hardware and software)...
This doesn't really make sense to me. Just about all Macs now have multiple processors, so it seems like multithreaded OpenGL would benefit all of them. Why hide this feature?
Agreed.
grrrrm ... first, a Logic Pro update enabling 4-way processing on Mac Pros but NOT PowerMac Quads, now these findings of OpenGL being optimized for Intel and not PowerPC .... is this your idea of supporting the installed user base, Apple??!
If they go on in this fashion, Apple will face a new era of really p*ssed customers ...
Is this a ploy to make the Intel-based Macs look much better than the PPC or just an attempt to put the PPC days long behind them in a hurry? Either way, it's not a good way to treat existing G5 users.
I wasn't at WWDC so I don't know exactly what Apple said, but the chances that they multi-threaded OpenGL itself are pretty damn slim. OpenGL isn't a process that happens on the CPU; OpenGL is an API set that is almost entirely shuffled off to the GPU. As in, there are no magic OpenGL threads to make multi-processor aware, just OpenGL calls sprinkled into your own threads.
What do I think really happened? Apple has done some serious work on making their OpenGL drivers safely re-entrant. What does this do? It allows separate application threads to call the re-entrant APIs without trashing each other's state, or having to wait for a serial shot through the code, which was the previous solution to avoid the state trashing. This can remove a serious bottleneck and allow independent threads to progress without spending an inordinate amount of time just waiting for the other thread to get out of a particular OpenGL call.
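To make that concrete, the before/after looks roughly like this (untested sketch in plain C++; context creation is platform-specific and omitted, and streamOneTexture/drawOneFrame are made-up placeholders, not real API):

// Hedged sketch: what re-entrant GL drivers buy an application.
// Assumes each thread already has its own GL context created and made
// current (platform-specific, omitted). streamOneTexture() and
// drawOneFrame() are hypothetical placeholders, not real API.

#include <mutex>
#include <thread>

std::mutex gBigDriverLock;  // the old workaround: serialize every GL call

void streamOneTexture() { /* glTexImage2D(...) etc. on this thread's context */ }
void drawOneFrame()     { /* glDrawArrays(...) etc. on this thread's context */ }

// Old world: the driver wasn't safely re-entrant, so any thread touching
// GL had to take one big lock, and the renderer stalled behind the loader.
void loaderSerialized() {
    for (int i = 0; i < 100; ++i) {
        std::lock_guard<std::mutex> hold(gBigDriverLock);
        streamOneTexture();             // renderer is blocked while this runs
    }
}

// New world: re-entrant driver + one context per thread means the loader
// and the renderer make GL calls concurrently with no app-level lock.
void loaderConcurrent() {
    for (int i = 0; i < 100; ++i)
        streamOneTexture();
}

int main() {
    std::thread loader(loaderConcurrent);
    for (int frame = 0; frame < 100; ++frame)
        drawOneFrame();                 // main thread keeps drawing
    loader.join();
    return 0;
}

Same game code either way; the second form only works if the driver underneath really is re-entrant.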
I can see how some non-developer could mangle "Now I can safely multi-thread my OpenGL dependent game code without bottlenecking" into "OMG, OpenGL just got multi-threaded1!11 OMG!!!" Same end effect, big difference in how and what that means to taking advantage of it.
Part of this re-entrancy might be allowing some OpenGL instructions to call a proxy in an OS-owned thread to handle certain functions that take a while, like loading textures, but most of the API set isn't amenable to this treatment. Re-entrancy, though, is just more hard work under any call to get a nice return with no proxy threading. Apple has been attempting to do this type of thing throughout the OS for a few years now.
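The texture case is the easy one to picture. Something like this (untested sketch; the decoding is faked, and the GL calls are the standard ones, which still need a current context on the calling thread):

// Hedged sketch of the "proxy thread" idea for texture loading: the slow
// part (disk read, decode) runs on a worker thread, and only the cheap
// upload runs on the thread that owns the GL context.

#include <OpenGL/gl.h>   // <GL/gl.h> on non-Mac platforms
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct PendingTexture {
    int width = 0, height = 0;
    std::vector<unsigned char> rgba;   // decoded pixels, ready to upload
};

std::queue<PendingTexture> gReady;
std::mutex gReadyLock;

// Worker thread: everything that takes a while but needs no GL context.
// The game would kick it off with: std::thread(loaderThread).detach();
void loaderThread() {
    PendingTexture t;
    t.width = 256;
    t.height = 256;
    t.rgba.assign(t.width * t.height * 4, 0xFF);  // stand-in for real decoding
    std::lock_guard<std::mutex> hold(gReadyLock);
    gReady.push(std::move(t));
}

// Called once per frame from the thread that owns the GL context.
void uploadPendingTextures() {
    std::unique_lock<std::mutex> hold(gReadyLock);
    while (!gReady.empty()) {
        PendingTexture t = std::move(gReady.front());
        gReady.pop();
        hold.unlock();                 // never hold the app lock across GL calls
        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, t.width, t.height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, t.rgba.data());
        hold.lock();
    }
}

With a fully re-entrant driver you could go a step further and give the loader its own shared context so it does the upload itself, which is basically the proxy idea pulled back into the app.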
In a couple years everyone will be doing this; it is a big part of why Cell and the triple-core Xbox CPUs are the way they are. I would also bet that the changes will eventually make it back to the multi-processor PPCs by the time 10.5 is out; this is just a by-product of the order the QA has been certified, and it makes sense to release the big win on the newest box first so it doesn't get trashed for 6 months by the older one.
Six months from now, if the entire product lineup is running quad-optimized code, the G5s will be faster than they are today but still slower than the new quads, and everyone will be happy but still have motivation to buy a new box, which will make Apple happy too.
http://episteme.arstechnica.com/eve/...r=598007950831
They've been talking about it for some time and near as I can tell this isn't a slight against PPC but maybe it was easier to roll out more quickly on Intel.
This link will also help explain things (even some of the comments are useful)
http://arstechnica.com/staff/fatbits.ars/2006/8/17/5024
As well as:
The Low Level Virtual Machine web site (this is what actually does the heavy lifting).
http://llvm.org/
Finally, check this Macworld story for some good quotes like "For the games that are very graphics bound it could give us some very nice frame rate boosts" - Glenda Adams - Aspyr.
http://www.macworld.com/news/2006/08...read/index.php
Dave
Well, this sure sounds like someone mangled a message. And Placebo has it correct, not Ben.
Well, if Glenda Adams from Aspyr can be taken at her word, then it would seem that both you and Placebo have it wrong after all. Unless Glenda was somehow misquoted (see my post above) and the article was never corrected. (Not out of the realm of possibility.)
Dave
As for Glenda, after her last public commenting performance I don't take a thing she says at face value. She has also shown through the abysmal performance of her shop's ports that they don't really understand what they are doing [they can do the porting mechanics, just not good optimizations] or what benefits can be gained from threading outside of OpenGL.
To give her a sliver of credit in word choice though, she said "graphics bound" not GPU bound, and those have very different meanings. Aspyr's serious graphical shortcomings have been in getting the code off the CPU and into the GPU. The Apple changes will help this significantly. This is still an advance that will make a CPU bound graphical app faster, not a GPU bound one. [How much does anyone want to bet that Aspyr's awfully performing ports have been something that spurred Apple to make it easier to avoid those same screw-ups? The tools to do good optimizations have always been there for shops that really understand what they are doing; this will sidestep many of those techniques, allowing less accomplished teams to generate faster code without learning much more.]
Make no bones, this will be a very good thing, it's just not a GPU thing.
LLVM is being used for creating multiply-threaded OGL on the GPU, I thought, and will be out in 10.5, and not before.
This, in 10.4.7, appears to be re-entrant OGL in general, meaning that CPU utilization for multi-threaded OGL apps just got a lot better, but is *independent* of the LLVM technique.
In other words, this is *one* optimization technique, and others are coming with 10.5.
Right?
If so, OGL performance could get a *lot* better, very quickly.
Correct. The bulk of the changes are coming in 10.5, and not yet available, neither on Mac Pros nor elsewhere.
Okay, wait.
Well, everyone who actually knows is under NDA so we are left to best guesses, but I really don't think the concept of thread even applies to the GPU.
The GPU is generally a fixed function state machine that has some number of identical parallel pipelines. We get GPU programmability by replacing the fixed vertex processing unit with a programmable one and the fixed fragment unit with a programmable one, but all the pipelines have to use the current program (or the fixed unit) in the same fixed order and place in the pipeline. There is no way to split this up any finer.
The constraint on multi-threading OpenGL commands is mainly driven by the need to maintain a correct order when feeding vertices into the GPU and to keep those synchronized with the appropriate state change callouts. Once any single vertex enters the GPU we know exactly how it will be processed, because all states that affect it are already present in the pipeline. Nothing Apple can do will change this; it is driven by the spec and hardware implementations.
My read on LLVM is that it will vastly optimize vertex & fragment programs before they are sent to the GPU as a state change. It will also do good things for OpenGL code that never needs to hit the GPU. There is a lot of optimization at a very granular level that no scene graph system could ever hope to incorporate, because trying to capture all that customization would have the opposite effect and would make it too slow. In the bazillion tight loops that OpenGL uses, a well written JIT (just-in-time) compiler can theoretically make optimizations that are not physically possible at static compile time, without adding any complexity to the original program. [This is also why HotSpot Java and C# are getting so much faster nowadays.] I think this is where a great deal of the performance is coming from.
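To make the JIT point concrete, here's a toy illustration (plain C++, nothing Apple-specific): the generic per-vertex path has to assume a full 4x4 transform, but if the current matrix turns out at run time to be a pure uniform scale, a specialized loop can skip most of the arithmetic. A static compiler can't know which case will show up; a JIT can emit the specialized version on the fly.

// Toy illustration (not Apple's code) of the kind of win a JIT gets.
// Matrices use OpenGL's column-major layout: element m[col*4 + row].

#include <cstddef>

struct Vec4 { float x, y, z, w; };

// Generic path: full matrix multiply per vertex.
void transformGeneric(const float m[16], Vec4* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        Vec4 p = v[i];
        v[i].x = m[0]*p.x + m[4]*p.y + m[8]*p.z  + m[12]*p.w;
        v[i].y = m[1]*p.x + m[5]*p.y + m[9]*p.z  + m[13]*p.w;
        v[i].z = m[2]*p.x + m[6]*p.y + m[10]*p.z + m[14]*p.w;
        v[i].w = m[3]*p.x + m[7]*p.y + m[11]*p.z + m[15]*p.w;
    }
}

// Specialized path, valid only when the matrix is a uniform scale s.
void transformUniformScale(float s, Vec4* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        v[i].x *= s; v[i].y *= s; v[i].z *= s;   // w untouched
    }
}

Multiply that kind of win across every state combination and shader the driver sees and it adds up fast.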
Re-entrancy is the other big contributor, by assassinating stupid driver-imposed bottlenecks; sure, a good OGL coder could plan around those bottlenecks, but most outfits don't want to pay someone that good to spend the time required. Just eliminating those re-entrancy bottlenecks will allow an app to separate all calls that aren't pipeline related (the pipeline being vertices and state changes) into whatever thread you want without clobbering the whole stack. Now you can do expensive stuff such as loading textures and V/F programs like any other good app handles I/O, off to the side.
The main draw thread will probably stay as a single thread, but it will be doing less so it can iterate faster. Chopping the actual draw thread up introduces too much opportunity for state thrashing, and because state changes necessitate flushing, too many of them are performance killers. Personally I think there is a lot of cruft that could already come out of the draw thread, but so far run of the mill coders are still too scared of synchronization to see where it doesn't really matter and where it is really needed.
Maybe the multi-threading in the stack refers to running culling and cropping in their own threads, not something specific the program calls or does. This would be pretty safe and is what SGI did for a long time with its proprietary OpenGL pipeline and middleware products like Performer. That would offload A LOT of cycles from the main draw thread without causing any dependency issues. Worst case, the GPU gets a few extra vertices, but waits less for them. Since the wait was the killer, not the few extra vertices, absolutely perfect synchronization is not required, just pretty damn close. Considering cull/crop is done on the CPU and we usually know within a few degrees where the view frustum is even in a fast paced environment, this is not too hard. Hell, you could pay a few mil for a Reality Monster setup and get this 5 years ago. Why not today on the desktop for a few thousand?
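For what it's worth, the shape of that cull-thread setup is roughly this (untested sketch in plain C++; isInsideFrustum and submitToGL are stand-ins, and a real test would check all six frustum planes):

// Hedged sketch of culling in its own thread: the cull thread keeps
// refreshing the visible list, the draw thread just grabs the most recent
// finished one. Being a pass behind is fine -- a few extra vertices are
// cheaper than making the GPU wait. Camera/scene updates are omitted.

#include <atomic>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Object { float x, y, z, radius; };

// Stand-in for a real 6-plane frustum test.
bool isInsideFrustum(const Object& o) { return o.z < 0.0f; }

// Stand-in for the code that issues the actual GL draw calls.
void submitToGL(const Object&) {}

std::vector<Object> gScene;            // assumed stable during a cull pass
std::vector<std::size_t> gVisible;     // latest finished visible list
std::mutex gVisibleLock;
std::atomic<bool> gRunning{true};

void cullThread() {
    while (gRunning) {
        std::vector<std::size_t> pass;
        for (std::size_t i = 0; i < gScene.size(); ++i)
            if (isInsideFrustum(gScene[i]))
                pass.push_back(i);
        std::lock_guard<std::mutex> hold(gVisibleLock);
        gVisible.swap(pass);           // publish the finished list
    }
}

void drawFrame() {                     // runs on the draw thread
    std::vector<std::size_t> visible;
    {
        std::lock_guard<std::mutex> hold(gVisibleLock);
        visible = gVisible;            // cheap copy of indices
    }
    for (std::size_t idx : visible)
        submitToGL(gScene[idx]);
}

// Startup/shutdown from the main thread:
//   std::thread cull(cullThread);
//   ... render loop calling drawFrame() ...
//   gRunning = false;  cull.join();

The draw thread never blocks for more than a vector swap, which is the whole point.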
Sweet!
Hiro, you talk the talk wrt OpenGL and SGI. What do you do for a living? Do you have Barcos in your basement?
Just been in the simulation field awhile. The AI side is where my heart is but I have to have at least half a clue on the graphics side since that generally owns the processing budget. Unfortunately I didn't disguise the half-clue well enough so now I have to teach the stuff as I work on my next degree.
This doesn't really make sense to me. Just about all Macs now have multiple processors, so it seems like multithreaded OpenGL would benefit all of them. Why hide this feature?
It will likely be in 10.4.8.