Snow Leopard's Grand Central, Open CL boost app by 50%

backtomac · September 18, 2009 2:05PM

Quote:

Originally Posted by cjones051073

Don't expect to see any improvement in Handbrake due to these technologies.

Handbrake can already efficiently use all cores on multicore machines since the x264 library it uses supports this. Has done for some time, long before Grand Central Dispatch came along. So nothing to gain there.

Moreover, the x264 devs have already looked into OpenCL/CUDA and (from memory) deduced there is not much they can gain from that. GPUs may well be fast but they have some serious limitations, which in the case of H.264 encoding make them less than ideal (note I said encoding, not decoding...)

Last but not least, Handbrake is multi-platform. They support Linux and Windows as well as OS X, so are unlikely to start a widespread rewrite for some new technology only available on one platform.

Chris

Nice post but are you sure HB wouldn't benefit from OCL and GCD?

Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.
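To make the SIMD question concrete, here is a rough sketch of one data-parallel piece of an encoder: the sum of absolute differences (SAD) used when searching for a matching block. It is plain Python for illustration only, not x264's code; the names `sad` and `best_match` and the tiny 2x2 blocks are made up for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks.

    Every pixel pair gets the same operation (subtract, take the absolute
    value), which is the kind of work SIMD units and GPUs handle well.
    """
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(target, candidates):
    """Return (index, cost) of the candidate block closest to the target."""
    costs = [sad(target, candidate) for candidate in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


if __name__ == "__main__":
    target = [[1, 2], [3, 4]]
    candidates = [[[0, 0], [0, 0]], [[1, 2], [3, 5]]]
    print(best_match(target, candidates))
```

The per-pixel differences are independent of each other, but picking the lowest cost is a reduction over all candidates, so not every step of the search is equally parallel.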

Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?

brucep · September 18, 2009 2:19PM

Quote:

Originally Posted by solipsism

Yes, at least 7GB is real space that is freed up. Most of the additional is just a difference in reporting from binary to decimal depending on partition size, which is why you see reports of 20GB and more.
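A quick sketch of the binary-versus-decimal reporting difference mentioned above. The 250 GB figure is only an assumed example size, and the function names are made up for illustration.

```python
def to_decimal_gb(num_bytes):
    """Size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


def to_binary_gib(num_bytes):
    """Size in binary gigabytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def reporting_gap(num_bytes):
    """How much larger the decimal figure looks than the binary one."""
    return to_decimal_gb(num_bytes) - to_binary_gib(num_bytes)


if __name__ == "__main__":
    free_bytes = 250 * 10**9  # assumed example: 250 billion bytes free
    print(round(to_decimal_gb(free_bytes), 2))
    print(round(to_binary_gib(free_bytes), 2))
    print(round(reporting_gap(free_bytes), 2))
```

The gap grows with the size being reported, which is why the apparent savings vary from one partition to another.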

6 GB is saved, except if you need to use Rosetta, then you save 3 GB or less. The whole purpose of Snowy was to rebuild the system from the ground up:

LEAN

MEAN

FAST

64BIT

Lep is dead now.

Snowy is like OS 9.2.

The future needed a slim, powerful Snowy to move forward in our/Apple's brave new world.

I may be wrong here and there.

9

wooster · September 18, 2009 2:31PM

Quote:

Originally Posted by brucep

6 GB is saved, except if you need to use Rosetta, then you save 3 GB or less. The whole purpose of Snowy was to rebuild the system from the ground up:

LEAN

MEAN

FAST

64BIT

Lep is dead now.

Snowy is like OS 9.2.

The future needed a slim, powerful Snowy to move forward in our/Apple's brave new world.

I may be wrong here and there.

9

Sorry - but Rosetta is only a 2 to 2.5 MB install.

cjones051073 · September 18, 2009 2:52PM

Quote:

Originally Posted by backtomac

Nice post but are you sure HB wouldn't benefit from OCL and GCD?

Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.

Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?

I cannot be 100% sure, I'm not a Handbrake dev and as such cannot speak for them. However, it's a fact that x264 cannot gain from GCD as it already does what GCD is designed to do. The point of GCD is to help developers use multicore processors without having to delve into multi-threaded programming themselves too deeply, which is a PITA. x264 does this itself so has no need for GCD.

As far as OCL goes, I was just quoting from a post I read on the x264 list a while back where someone asked about CUDA. I forget the details but the general idea was they have taken a deep look into it and found it did not help in their particular case.

You are correct about the open source part though, I did not realise this. So the multi-platform part of my original post is not so much of an issue since in theory GCD can be available for Linux and Windows (that said, I am not aware of any implementations on other platforms as yet, but I guess it is early days...)

cjones051073 · September 18, 2009 3:03PM

Quote:

Originally Posted by backtomac

Nice post but are you sure HB wouldn't benefit from OCL and GCD?

Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.

Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?

Quote:

Originally Posted by addabox

True, but I'm curious about what GCD could do for Handbrake when Handbrake is competing for resources with other running processes.

My understanding is that one of the advantages of GCD is that it is system aware in a way that a given application cannot be, no matter how carefully coded for multicore optimization it may be, and allocates resources accordingly.

The main point of GCD is to help developers write multi-threaded applications. Writing good, robust multi-threaded applications is a serious undertaking and as such not one most devs would consider, unless (like x264) their code is a serious CPU user and as such big gains are possible by using all cores.

However, regardless of how a multi-threaded application is written, using GCD or not, at the end of the day the kernel just sees the threads. It's the kernel's job to balance these threads between the cores, and I imagine it would do that just as well regardless of how the applications are written.

So, no, I doubt Handbrake/x264 would benefit in any way from re-implementing its multi-threading to use GCD. At the end of the day, if you are utilising all your cores to 100%, then there is nothing left to gain ....

( Again, I am not a Handbrake or x264 dev, so the above is my personal opinion and in no way am I speaking for either Handbrake or x264. )

backtomac · September 18, 2009 3:06PM

Quote:

Originally Posted by cjones051073

From a post I read on the x264 list a while back where someone asked about CUDA. I forget the details but the general idea was they have taken a deep look into it and found it did not help in their particular case.

Wow, I'm surprised CUDA couldn't improve HB encoding. If that's true, then OCL which is very similar wouldn't help either.

I'm surprised.

cjones051073 · September 18, 2009 3:12PM

Quote:

Originally Posted by backtomac

Wow, I'm surprised CUDA couldn't improve HB encoding. If that's true, then OCL which is very similar wouldn't help either.

I'm surprised.

I'll see if I can dig up the details that went to the x264 list ... It was a while back though...

addabox · September 18, 2009 6:20PM

Quote:

Originally Posted by cjones051073

The main point of GCD is to help developers write multi-threaded applications. Writing good, robust multi-threaded applications is a serious undertaking and as such not one most devs would consider, unless (like x264) their code is a serious CPU user and as such big gains are possible by using all cores.

However, regardless of how a multi-threaded application is written, using GCD or not, at the end of the day the kernel just sees the threads. It's the kernel's job to balance these threads between the cores, and I imagine it would do that just as well regardless of how the applications are written.

So, no, I doubt Handbrake/x264 would benefit in any way from re-implementing its multi-threading to use GCD. At the end of the day, if you are utilising all your cores to 100%, then there is nothing left to gain ....

( Again, I am not a Handbrake or x264 dev, so the above is my personal opinion and in no way am I speaking for either Handbrake or x264. )

But my understanding is that GCD goes beyond thread balancing to dynamically create and close threads depending on system resources. From the Ars review:

Quote:

Let's say a program has a problem that can be split into eight separate, independent units of work. If this program then creates four threads on an eight-core machine, is this an example of creating too many or too few threads? Trick question! The answer is that it depends on what else is happening on the system.

If six of the eight cores are totally saturated doing some other work, then creating four threads will just require the OS to waste time rotating those four threads through the two available cores. But wait, what if the process that was saturating those six cores finishes? Now there are eight available cores but only four threads, leaving half the cores idle.

With the exception of programs that can reasonably expect to have the entire machine to themselves when they run, there's no way for a programmer to know ahead of time exactly how many threads he should create. Of the available cores on a particular machine, how many are in use? If more become available, how will my program know?

The bottom line is that the optimal number of threads to put in flight at any given time is best determined by a single, globally aware entity. In Snow Leopard, that entity is GCD. It will keep zero threads in its pool if there are no queues that have tasks to run. As tasks are dequeued, GCD will create and dole out threads in a way that optimizes the use of the available hardware. GCD knows how many cores the system has, and it knows how many threads are currently executing tasks. When a queue no longer needs a thread, it's returned to the pool where GCD can hand it out to another queue that has a task ready to be dequeued.

I don't claim to be particularly knowledgeable on this topic, but my impression is that this goes beyond "helping programmers write multi-threaded applications."
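The pattern in the Ars passage, handing independent units of work to a shared pool instead of creating a fixed number of threads yourself, can be sketched roughly in Python. This is only an analogy: `concurrent.futures` gives a per-process pool sized from the core count, not a system-wide scheduler like GCD, and `work_unit` and `run_units` are made-up placeholder names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def work_unit(n):
    """Hypothetical independent unit of work: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_units(sizes, max_workers=None):
    """Queue every unit and let the pool decide how many run at once.

    The caller describes the work, not the threads. The pool size defaults
    to the number of cores the OS reports, which is an assumption of this
    sketch and knows nothing about what else the machine is doing.
    """
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work_unit, sizes))


if __name__ == "__main__":
    # Eight independent units, as in the quoted example.
    print(run_units([10] * 8))
```

The difference from GCD is the part the quote stresses: this pool only sees its own process, whereas GCD adjusts its thread count based on load across the whole system.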

applebook · September 18, 2009 7:03PM

Quote:

Originally Posted by Tauron

Windows 7 will now require 4 GB of RAM as a minimum and applications running on Windows 7 will use 30% more CPU cycles to keep it from crashing.

I hope that you are joking because Windows 7 rarely uses over 1 GB on my 4 GB machine, and hardly more than 10% of each core of my quad is under load even when watching 1080p.

nvidia2008 · September 18, 2009 8:58PM

Quote:

Originally Posted by applebook

I hope that you are joking because Windows 7 rarely uses over 1 GB on my 4 GB machine, and hardly more than 10% of each core of my quad is under load even when watching 1080p.

Yeah I think Tauron was either joking or high when he talked about that. I am so glad I ditched Vista for Windows 7 on my PC.

Windows Explorer (not Internet Explorer) at one point kept crashing but one restart got rid of the problem. Right now I'm only using Azureus, Firefox and playing Wolfenstein but this Windows 7 64bit feels snappier and doesn't feel "bloated and bogged down" like the rubbish that Vista was.

jmmx · September 19, 2009 10:28PM

Quote:

Originally Posted by Cubert

Once again proving that Snow Leopard is more about positioning the Mac platform for the future than trying to drum up massive sales (which is happening anyway). And, I guess, it's also about "encouraging" people to upgrade their hardware, too.

I think you hit the nail on the head here.

Yes - massive sales of SL - but because they have the low price for current Leopard owners - which goes to your point.

My opinion is that Apple lowered the price in order to get widespread adoption of the new OS. Only with widespread adoption can they get developers to rewrite for GCD and OpenCL. Seems like they solved that problem.

Which again goes to prove your point. (IMHO)

jmmx · September 19, 2009 10:59PM

Quote:

Originally Posted by cjones051073

Don't expect to see any improvement in Handbrake due to these technologies.

Handbrake can already efficiently use all cores on multicore machines since the x264 library it uses supports this. Has done for some time, long before Grand Central Dispatch came along. So nothing to gain there.

Moreover, the x264 devs have already looked into OpenCL/CUDA and (from memory) deduced there is not much they can gain from that. GPUs may well be fast but they have some serious limitations, which in the case of H.264 encoding make them less than ideal (note I said encoding, not decoding...)

Chris

I have a degree in Computer Science including some work in development over the last 15 years, but more in SQA and Tech Writing. So I have a bit of expertise in the field - but I am certainly NOT an expert. I have read some of the documentation on GCD and OpenCL.

I do have some thoughts on the comments above by Chris and some other related issues.

First - Regarding Chris's remarks. It seems to me that GPUs already have built-in support for video decoding. This is what they exist to do. Therefore, in many cases, OpenCL is superfluous, the tasks are already optimized. This is the case with probably most graphic routines.

The real purpose of OpenCL is to harness this massively parallel GPU compute power for other tasks - such as math and physics computing. This is where it will have its strength. (I will address this further in my next post.)

GCD

The beauty of GCD is that it (as another poster has mentioned) handles the allocation of thread resources independent of hardware and fully aware of other processes that need the hardware. It thus frees the programmer from writing extremely complex thread management code. Now they only have to declare the threads and pass them off to the OS.

My naive guess is that this would save the programmer about 75% of their work, and 90% of SQA. It changes the task of a rewrite from an absolutely enormous endeavor, into one that may be daunting, but manageable.

If I have made any errors here - please correct me.

jmmx · September 19, 2009 11:13PM

Quote:

Originally Posted by manonthemove

Just so you guys know, a factor of 10-30 in performance is not uncommon for scientists who use CUDA. Given the similarities between OpenCL and CUDA, we should (hopefully) see a lot of improvement in the near future. Here is a link (http://sussi.megahost.dk/~frigaard/) to a standard piece of scientific code G2X (it does N-body calculations) modified to use CUDA by C. FRIGAARD (go to the bottom) which gets a factor of 30 for one subroutine and a factor of 10 overall.

CUDA is a lot like OpenCL - it gives access to the GPU processors to non-graphic applications.

The problems with CUDA (as I understand it) are:

1- Specific to NVIDIA GPUs

2- The code is written for one specific GPU and must be modified or recompiled for a different GPU with a different architecture (number of thread processors). (I am less sure of this requirement.) (If not, then it must at least have conditional code that checks for the architecture.)

OpenCL fixes both these problems.

1- It is an open standard that can be implemented on any type of processor. DSPs (digital signal processors), for example, so you could pass off numerical calculations to your sound card.

2- Processing tasks are defined by the application program, and the OS will configure it for the existing architecture at runtime. So this is never a problem.

As an example, you reference the C. Frigaard routines. How would they handle it if you suddenly added a second or third Graphics card to your computer? Would you have to rebuild? Would it be able to handle it at all?

jmmx · September 19, 2009 11:25PM

Footnote:

If you are writing and compiling your own math routines, then editing and recompiling once in a while for a new hardware system is not a big deal. If, however, you are writing a commercial application, then you must somehow anticipate and conditionally code for, all potential system configurations. This makes your application a whole order of magnitude more complex (if not more), and debugging becomes a real nightmare.

(And even if you DO write & compile your own math routines - wouldn't you rather go have a micro-brew than recompile and install that damn program one more time?)

jmmx · September 20, 2009 3:57PM

Not sure if anyone else is reading this thread or not but...

First - As has already been noted, one reason for a 50% speedup in the example given is that this was only CPU usage - if tasks were spawned off to the GPU that would decrease the CPU usage level.

The following link is the first in a tutorial for OpenCL. To see the demo, skip to about 80% point - until you see a pic of a molecule. The author is a scientist and this is his real life program. He is running on a new Mac Pro with Nehalem 2x4-core processors. (Each core has 2 virtual threads so all told it is 16 threads.)

The skinny is this: He runs his program first just standard C, then using 16 CPU threads, then with OpenCL using the nVIDIA GTX285 chip card with 240 cores.

Results:

C 1 thread => ~58 seconds

16 threads => 4.76 seconds

GPU => 0.18 sec

Speedup of OpenCL over the 16-thread CPU run ~= 26x (4.76 / 0.18)

OpenCL over non-threaded program ~= 322x
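For anyone who wants to check the ratios, here is a small sketch that computes them straight from the three timings listed in this post. The dictionary keys and the `speedup` helper are just names chosen for the example.

```python
def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


# Timings as listed in the post, in seconds.
TIMINGS = {
    "c_1_thread": 58.0,
    "cpu_16_threads": 4.76,
    "gpu_opencl": 0.18,
}

if __name__ == "__main__":
    threads_vs_serial = speedup(TIMINGS["c_1_thread"], TIMINGS["cpu_16_threads"])
    gpu_vs_threads = speedup(TIMINGS["cpu_16_threads"], TIMINGS["gpu_opencl"])
    gpu_vs_serial = speedup(TIMINGS["c_1_thread"], TIMINGS["gpu_opencl"])
    print(round(threads_vs_serial, 1))
    print(round(gpu_vs_threads, 1))
    print(round(gpu_vs_serial, 1))
```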

Now a 15" MacBook Pro (advanced model) comes with a built-in GPU AND an NVIDIA GeForce 9600M GT with 32 cores. One would expect that this would run at roughly 1/10 the speed of the faster system with the 240 cores. If so, it would still be about twice the speed of the 16-thread Nehalem system alone. This, however, is speculation.
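The guess above can be put into numbers with a crude sketch. It assumes run time scales inversely with GPU core count, which ignores clock speed, memory bandwidth and everything else, so treat the result as a ballpark only. The function name is made up for the example.

```python
def scaled_time(reference_seconds, reference_cores, target_cores):
    """Estimate run time on a smaller GPU, assuming linear scaling with cores."""
    return reference_seconds * reference_cores / target_cores


if __name__ == "__main__":
    gpu_240_cores = 0.18   # seconds, from the results in this post
    cpu_16_threads = 4.76  # seconds, from the results in this post

    by_core_count = scaled_time(gpu_240_cores, 240, 32)
    one_tenth_speed = gpu_240_cores * 10  # the "roughly 1/10 the speed" guess

    print(round(by_core_count, 2), round(cpu_16_threads / by_core_count, 1))
    print(round(one_tenth_speed, 2), round(cpu_16_threads / one_tenth_speed, 1))
```

Either way the estimate comes out ahead of the 16-thread CPU run by a factor of a few, which is in line with the "twice the speed" guess.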

See:

http://www.macresearch.org/files/opencl/Episode_1.mov

(Remember - skip to 80% point)
