Quote:
Originally Posted by Kickaha
Okay, wait.
LLVM is being used for creating multiply-threaded OGL on the GPU, I thought, and will be out in 10.5, and not before.
This, in 10.4.7, appears to be re-entrant OGL in general, meaning that CPU utilization for multi-threaded OGL apps just got a lot better, but is *independent* of the LLVM technique.
In other words, this is *one* optimization technique, and others are coming with 10.5.
Right?
If so, OGL performance could get a *lot* better, very quickly.
Well, everyone who actually knows is under NDA, so we are left with best guesses, but I really don't think the concept of a thread even applies to the GPU.
The GPU is generally a fixed-function state machine that has some number of identical parallel pipelines. We get GPU programmability by replacing the fixed vertex processing unit with a programmable one and the fixed fragment unit with a programmable one, but all the pipelines have to use the current program (or the fixed unit) in the same fixed order and place in the pipeline. There is no way to split this up any finer.
The constraint on multi-threading OpenGL commands is mainly driven by the need to feed vertices into the GPU in the correct order and keep them synchronized with the appropriate state change callouts. Once any single vertex enters the GPU, we know exactly how it will be processed, because all the states that affect it are already present in the pipeline. Nothing Apple can do will change this; it is driven by the spec and the hardware implementations.
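To make the ordering point concrete, here is a toy sketch in Python. It is not OpenGL and not how any driver works; ToyPipeline and its methods are names I made up for illustration. It only shows that a vertex is processed with whatever state is current when it enters, so letting a state change slip ahead of a vertex changes the result.

```python
# Toy model of an ordered command stream. This is NOT OpenGL; ToyPipeline
# and its methods are hypothetical names used only for illustration.

class ToyPipeline:
    def __init__(self):
        self.state = {"color": "white"}  # the current state
        self.output = []                 # (vertex, color) pairs as "drawn"

    def set_state(self, key, value):
        self.state[key] = value

    def vertex(self, v):
        # A vertex is processed with whatever state is current on entry.
        self.output.append((v, self.state["color"]))


def run(commands):
    """Feed a list of (method_name, *args) commands through a fresh pipeline."""
    pipe = ToyPipeline()
    for name, *args in commands:
        getattr(pipe, name)(*args)
    return pipe.output


in_order = [
    ("set_state", "color", "red"),
    ("vertex", (0, 0)),
    ("set_state", "color", "blue"),
    ("vertex", (1, 1)),
]

# Same four commands, but the second state change raced ahead of the
# first vertex, as could happen with unsynchronized threads.
reordered = [in_order[0], in_order[2], in_order[1], in_order[3]]

print(run(in_order))   # first vertex is red, second is blue
print(run(reordered))  # both vertices come out blue
```

Same commands, different picture, which is why the vertex and state change stream has to stay in order.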
My read on LLVM is that it will vastly optimize vertex and fragment programs before they are sent to the GPU as a state change. It will also do good things for OpenGL code that never needs to hit the GPU. There is a lot of optimization at a very granular level that no scene graph system could ever hope to incorporate, because trying to capture all that customization would have the opposite effect and make it too slow. In the bazillion tight loops that OpenGL uses, a well-written JIT (just-in-time) compiler can theoretically make optimizations that are simply not possible at static compile time, without adding any complexity to the original program. [This is also why Hotspot Java and C# are getting so much faster nowadays.] I think this is where a great deal of the performance is coming from.
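For anyone wondering what "optimizations that are only possible at run time" might look like, here is a rough Python sketch of the general idea of specializing a tight loop once the current state is known. This is my own illustration, not LLVM and not anything Apple has described; the state keys and function names are invented.

```python
# Toy illustration of runtime specialization. Not LLVM, not OpenGL; the
# state keys "scale" and "offset" and both functions are hypothetical.

def transform_generic(vertices, state):
    # The generic loop re-checks every state flag for every vertex.
    out = []
    for x, y in vertices:
        if state["scale"] != 1.0:
            x, y = x * state["scale"], y * state["scale"]
        if state["offset"] != (0.0, 0.0):
            x, y = x + state["offset"][0], y + state["offset"][1]
        out.append((x, y))
    return out


def specialize(state):
    # Inspect the state once and build a loop that contains only the
    # steps this particular state actually needs.
    steps = []
    if state["scale"] != 1.0:
        s = state["scale"]
        steps.append(lambda x, y: (x * s, y * s))
    if state["offset"] != (0.0, 0.0):
        ox, oy = state["offset"]
        steps.append(lambda x, y: (x + ox, y + oy))

    if not steps:
        # Identity state: nothing to do per vertex at all.
        return lambda vertices: list(vertices)

    def transform(vertices):
        out = []
        for x, y in vertices:
            for step in steps:
                x, y = step(x, y)
            out.append((x, y))
        return out

    return transform


state = {"scale": 2.0, "offset": (0.0, 0.0)}
verts = [(1.0, 2.0), (3.0, 4.0)]
fast = specialize(state)          # built once per state change
print(fast(verts) == transform_generic(verts, state))
```

A static compiler cannot know which state will be active, so it has to keep every branch. A real JIT would emit machine code rather than Python closures, but the principle is the same: decide once per state change instead of once per vertex.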
Re-entrancy is the other big contributor, by assassinating stupid driver-imposed bottlenecks. Sure, a good OGL coder could plan around those bottlenecks, but most outfits don't want to pay someone that good to spend the time required. Just eliminating those re-entrancy bottlenecks will allow an app to move every call that is not pipeline related (that is, everything other than vertices and state changes) into whatever thread you want without clobbering the whole stack. Now you can do expensive stuff such as loading textures and V/F programs the way any other good app handles I/O: off to the side.
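The "off to the side" pattern is the same one any app uses for I/O. Here is a minimal Python sketch, assuming a made-up load_texture that stands in for the slow disk and decode work; nothing here is a real OpenGL call. A worker thread does the loading and the draw loop picks up finished textures without ever blocking.

```python
import queue
import threading


def load_texture(name):
    # Stand-in for slow disk I/O and decoding; hypothetical, not a GL call.
    return {"name": name, "pixels": bytes(16)}


def loader(names, ready):
    # Runs in the worker thread and hands finished textures to the queue.
    for name in names:
        ready.put(load_texture(name))


def draw_loop(ready, expected):
    # The draw thread never waits on I/O; each frame it just collects
    # whatever the loader has finished so far.
    textures = {}
    frames = 0
    while len(textures) < expected:
        try:
            while True:
                tex = ready.get_nowait()
                textures[tex["name"]] = tex
        except queue.Empty:
            pass
        frames += 1  # a real app would issue its draw calls here
    return textures, frames


ready = queue.Queue()
names = ["grass", "rock", "sky"]
worker = threading.Thread(target=loader, args=(names, ready))
worker.start()
textures, frames = draw_loop(ready, len(names))
worker.join()
print(sorted(textures), "loaded while drawing", frames, "frame(s)")
```

The queue is the only synchronization point, which is about all the locking this kind of split needs.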
The main draw thread will probably stay as a single thread, but it will be doing less, so it can iterate faster. Chopping the actual draw thread up introduces too much opportunity for state thrashing, and because state changes necessitate flushing, too many of them are performance killers. Personally, I think there is a lot of cruft that could already come out of the draw thread, but so far run-of-the-mill coders are still too scared of synchronization to see where it doesn't really matter and where it is really needed.
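State thrashing is easy to show with a toy count. The Python below is only a cost model I made up (the shader and mesh names are invented): it counts how many times the "current state" has to change for a given draw order. Two threads taking turns with their own shaders produce the interleaved order; one thread that batches by state produces the sorted one.

```python
def count_state_changes(draws):
    # Each draw is (state, mesh); a change is needed whenever the state
    # differs from the one currently bound.
    changes = 0
    current = None
    for state, _mesh in draws:
        if state != current:
            changes += 1
            current = state
    return changes


# Hypothetical draw list: two sources taking turns, each with its own shader.
draws = [
    ("shaderA", "m1"), ("shaderB", "m2"),
    ("shaderA", "m3"), ("shaderB", "m4"),
    ("shaderA", "m5"), ("shaderB", "m6"),
]

interleaved = count_state_changes(draws)
# sorted() is stable, so meshes keep their relative order within a shader.
batched = count_state_changes(sorted(draws, key=lambda d: d[0]))

print(interleaved, "state changes interleaved,", batched, "batched")
```

Sorting like this is only safe where draw order doesn't affect the picture, but it shows why you want one thread deciding the order.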
Maybe the multi-threading in the stack refers to running culling and cropping in their own threads, not something specific the program calls or does. This would be pretty safe, and it is what SGI did for a long time with its proprietary OpenGL pipeline and middleware products like Performer. That would offload A LOT of cycles from the main draw thread without causing any dependency issues. Worst case is that the GPU gets a few extra vertices but waits less for them. Since the wait was the killer, not the few extra vertices, absolutely perfect synchronization is not required, just pretty damn close. Considering cull/crop is done on the CPU, and we usually know within a few degrees where the view frustum is even in a fast-paced environment, this is not too hard. Hell, you could pay a few mil for a Reality Monster setup and get this 5 years ago. Why not today on the desktop for a few thousand?
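Here is a rough Python sketch of why "pretty damn close" is good enough for culling. It is a flattened 2D toy with invented object names and numbers, and it is only my guess at the idea, not what Apple or SGI actually ship. The cull thread works from a slightly stale view direction and widens the frustum by a small margin, so it may pass a few extra objects but never drops one that is really visible, as long as the view moved less than the margin.

```python
import threading


def visible(objects, view_deg, half_fov_deg):
    # Keep objects whose bearing lies within half_fov_deg of the view direction.
    result = []
    for name, angle_deg in objects:
        # Smallest signed angular difference, in [-180, 180).
        diff = (angle_deg - view_deg + 180.0) % 360.0 - 180.0
        if abs(diff) <= half_fov_deg:
            result.append(name)
    return result


# Hypothetical scene: (name, bearing in degrees).
objects = [("tree", 10.0), ("rock", 40.0), ("house", 48.0), ("car", 170.0)]
HALF_FOV = 45.0
MARGIN = 5.0  # assumed bound on how far the view turns between updates

loose = []


def cull_worker(stale_view_deg):
    # Runs off the draw thread, using the last view direction it was given.
    loose.extend(visible(objects, stale_view_deg, HALF_FOV + MARGIN))


worker = threading.Thread(target=cull_worker, args=(0.0,))
worker.start()
worker.join()

# By the time we draw, the view has turned 3 degrees.
exact = visible(objects, -3.0, HALF_FOV)

print("cull thread passed:", loose)
print("strictly visible:  ", exact)
```

The cull thread lets the house through even though it has just left the view, which is the "few extra vertices" case, and nothing that should be drawn goes missing.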