Snow Leopard's Grand Central, OpenCL boost app by 50%

Comments

  • Reply 41 of 55
    Quote:
    Originally Posted by cjones051073 View Post


    Don't expect to see any improvement in Handbrake due to these technologies.



    Handbrake can already efficiently use all cores on multicore machines, since the x264 library it uses supports this. Has done for some time, long before Grand Central Dispatch came along. So nothing to gain there.



    Moreover, the x264 devs have already looked into OpenCL/CUDA and (from memory) deduced there is not much they can gain from that. GPUs may well be fast, but they have some serious limitations which, in the case of H.264 encoding, make them less than ideal (note I said encoding, not decoding...)



    Last but not least, Handbrake is multi-platform. They support Linux and Windows as well as OS X, so they are unlikely to start a widespread rewrite for some new technology only available on one platform.



    Chris



    Nice post but are you sure HB wouldn't benefit from OCL and GCD?



    Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.



    Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?
  • Reply 42 of 55
    brucep Posts: 2,823 member
    Quote:
    Originally Posted by solipsism View Post


    Yes, at least 7GB is real space that is freed up. Most of the additional space is just a difference in reporting, from binary to decimal units, that grows with partition size, which is why you see reports of 20GB and more.
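The binary-versus-decimal point in the quote above can be checked with a little arithmetic. The sketch below assumes free space that used to be shown in binary gigabytes (2^30 bytes) is now shown in decimal gigabytes (10^9 bytes); the function name and the sample sizes are made up for illustration and are not taken from any post.

```python
# Hypothetical illustration: the same free bytes, reported in two unit systems.
GIB = 2**30   # "binary" gigabyte, in bytes
GB = 10**9    # "decimal" gigabyte, in bytes

def apparent_gain_gb(free_binary_gb):
    """Extra gigabytes that seem to appear when free space once shown in
    binary units is re-reported in decimal units (no bytes are freed)."""
    free_bytes = free_binary_gb * GIB
    return free_bytes / GB - free_binary_gb

for size in (100, 200, 400):
    gain = apparent_gain_gb(size)
    print(f"{size} GB free (binary) -> about {gain:.1f} GB of apparent gain")
```

Under that assumption the apparent gain is roughly 7.4% of whatever was free before, so the bigger the partition, the bigger the reported "saving", on top of the space that was really freed.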



    6 GB IS SAVED, except if you need to use Rosetta, then you save 3 GB or less. The whole purpose of snowy was to rebuild the new system from the ground up:

    LEAN

    MEAN

    FAST



    64BIT



    lep is dead now

    snowy is like OS9.2

    The future needed a slim, powerful snowy to move forward in our/Apple's brave new world...



    I may be wrong here and there.



    9
  • Reply 43 of 55
    Quote:
    Originally Posted by brucep View Post


    6 GB IS SAVED, except if you need to use Rosetta, then you save 3 GB or less. The whole purpose of snowy was to rebuild the new system from the ground up:

    LEAN

    MEAN

    FAST



    64BIT



    lep is dead now

    snowy is like OS9.2

    The future needed a slim, powerful snowy to move forward in our/Apple's brave new world...



    I may be wrong here and there.



    9



    Sorry - but Rosetta is only a 2 to 2.5 MB install
  • Reply 44 of 55
    Quote:
    Originally Posted by backtomac View Post


    Nice post but are you sure HB wouldn't benefit from OCL and GCD?



    Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.



    Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?



    I cannot be 100% sure; I'm not a Handbrake dev and as such cannot speak for them. However, it's a fact that x264 cannot gain from GCD, as it already does what GCD is designed to do. The point of GCD is to help developers use multicore processors without having to delve too deeply into multi-threaded programming themselves, which is a PITA. x264 does this itself, so it has no need for GCD.



    As far as OCL goes, I was just quoting from a post I read on the x264 list a while back where someone asked about CUDA. I forget the details, but the general idea was that they had taken a deep look into it and found it did not help in their particular case.



    You are correct about the open source part though; I did not realise this. So the multi-platform part of my original post is not so much of an issue, since in theory GCD can be available for Linux and Windows (that said, I am not aware of any implementations on other platforms as yet, but I guess it is early days...)
  • Reply 45 of 55
    Quote:
    Originally Posted by backtomac View Post


    Nice post but are you sure HB wouldn't benefit from OCL and GCD?



    Isn't video encoding a SIMD (single instruction, multiple data) process? That's what OCL is supposed to shine at.



    Wouldn't the fact that OCL and now GCD are open sourced encourage their adoption on the other platforms?



    Quote:
    Originally Posted by addabox View Post


    True, but I'm curious about what GCD could do for Handbrake when Handbrake is competing for resources with other running processes.



    My understanding is that one of the advantages of GCD is that it is system aware in a way that a given application cannot be, no matter how carefully coded for multicore optimization it may be, and allocates resources accordingly.



    The main point of GCD is to help developers write multi-threaded applications. Writing good, robust multi-threaded applications is a serious undertaking, and as such not one most devs would consider unless (like x264) their code is a serious CPU user and big gains are possible by using all cores.



    However, regardless of how a multi-threaded application is written, using GCD or not, at the end of the day the kernel just sees the threads. It's the kernel's job to balance these threads between the cores, and I imagine it would do that just as well regardless of how the applications are written.



    So, no, I doubt Handbrake/x264 would benefit in any way from re-implementing its multi-threading to use GCD. At the end of the day, if you are utilising all your cores to 100%, then there is nothing left to gain ....



    ( Again, I am not a Handbrake or x264 dev, so the above is my personal opinion and in no way am I speaking for either Handbrake or x264. )
  • Reply 46 of 55
    Quote:
    Originally Posted by cjones051073 View Post


    From a post I read on the x264 list a while back where someone asked about CUDA. I forget the details, but the general idea was that they had taken a deep look into it and found it did not help in their particular case.



    Wow, I'm surprised CUDA couldn't improve HB encoding. If that's true, then OCL, which is very similar, wouldn't help either.



    I'm surprised.
  • Reply 47 of 55
    Quote:
    Originally Posted by backtomac View Post


    Wow, I'm surprised CUDA couldn't improve HB encoding. If that's true, then OCL, which is very similar, wouldn't help either.



    I'm surprised.



    I'll see if I can dig up the details that went to the x264 list ... It was a while back though...
  • Reply 48 of 55
    addabox Posts: 12,665 member
    Quote:
    Originally Posted by cjones051073 View Post


    The main point of GCD is to help developers write multi-threaded applications. Writing good, robust multi-threaded applications is a serious undertaking, and as such not one most devs would consider unless (like x264) their code is a serious CPU user and big gains are possible by using all cores.



    However, regardless of how a multi-threaded application is written, using GCD or not, at the end of the day the kernel just sees the threads. It's the kernel's job to balance these threads between the cores, and I imagine it would do that just as well regardless of how the applications are written.



    So, no, I doubt Handbrake/x264 would benefit in any way from re-implementing its multi-threading to use GCD. At the end of the day, if you are utilising all your cores to 100%, then there is nothing left to gain ....



    ( Again, I am not a Handbrake or x264 dev, so the above is my personal opinion and in no way am I speaking for either Handbrake or x264. )



    But my understanding is that GCD goes beyond thread balancing to dynamically create and close threads depending on system resources. From the Ars review:



    Quote:

    Let's say a program has a problem that can be split into eight separate, independent units of work. If this program then creates four threads on an eight-core machine, is this an example of creating too many or too few threads? Trick question! The answer is that it depends on what else is happening on the system.



    If six of the eight cores are totally saturated doing some other work, then creating four threads will just require the OS to waste time rotating those four threads through the two available cores. But wait, what if the process that was saturating those six cores finishes? Now there are eight available cores but only four threads, leaving half the cores idle.



    With the exception of programs that can reasonably expect to have the entire machine to themselves when they run, there's no way for a programmer to know ahead of time exactly how many threads he should create. Of the available cores on a particular machine, how many are in use? If more become available, how will my program know?



    The bottom line is that the optimal number of threads to put in flight at any given time is best determined by a single, globally aware entity. In Snow Leopard, that entity is GCD. It will keep zero threads in its pool if there are no queues that have tasks to run. As tasks are dequeued, GCD will create and dole out threads in a way that optimizes the use of the available hardware. GCD knows how many cores the system has, and it knows how many threads are currently executing tasks. When a queue no longer needs a thread, it's returned to the pool where GCD can hand it out to another queue that has a task ready to be dequeued.



    I don't claim to be particularly knowledgeable on this topic, but my impression is that this goes beyond "helping programmers write multi-threaded applications."
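The "trick question" in the quoted Ars passage can be put into numbers with a toy model. This is only a sketch under simple assumptions: every unit of work takes the same time, and context-switch overhead is ignored. The function name `rounds_needed` is made up for illustration and has nothing to do with the real GCD API.

```python
import math

def rounds_needed(work_units, threads, free_cores):
    """Rounds of equal-length work needed when `threads` workers share
    `free_cores` cores. Toy model: switching overhead is ignored, which
    would only make the oversubscribed case look worse."""
    running_at_once = min(threads, free_cores)
    return math.ceil(work_units / running_at_once)

# Eight independent units of work, as in the quoted example.
print(rounds_needed(8, threads=4, free_cores=2))  # 4: four threads share two cores
print(rounds_needed(8, threads=4, free_cores=8))  # 2: half of the cores sit idle
print(rounds_needed(8, threads=8, free_cores=8))  # 1: pool sized to what is free
```

A thread count fixed at compile time is right for at most one of these situations, which is the argument for letting one system-wide entity pick the number at run time.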
  • Reply 49 of 55
    Quote:
    Originally Posted by Tauron View Post


    Windows 7 will now require 4 GB of RAM as a minimum and applications running on Windows 7 will use 30% more CPU cycles to keep it from crashing.



    I hope that you are joking, because Windows 7 rarely uses over 1GB on my 4GB machine, and hardly more than 10% of each core of my quad is under load, even when watching 1080p.
  • Reply 50 of 55
    Quote:
    Originally Posted by applebook View Post


    I hope that you are joking, because Windows 7 rarely uses over 1GB on my 4GB machine, and hardly more than 10% of each core of my quad is under load, even when watching 1080p.



    Yeah I think Tauron was either joking or high when he talked about that. I am so glad I ditched Vista for Windows 7 on my PC.



    Windows Explorer (not Internet Explorer) at one point kept crashing but one restart got rid of the problem. Right now I'm only using Azureus, Firefox and playing Wolfenstein but this Windows 7 64bit feels snappier and doesn't feel "bloated and bogged down" like the rubbish that Vista was.
  • Reply 51 of 55
    jmmx Posts: 341 member
    Quote:
    Originally Posted by Cubert View Post


    Once again proving that Snow Leopard is more about positioning the Mac platform for the future than trying to drum up massive sales (which is happening anyway). And, I guess, it's also about "encouraging" people to upgrade their hardware, too.



    I think you hit the nail on the head here.



    Yes - massive sales of SL - but because they have the low price for current Leopard owners - which goes to your point.



    My opinion is that Apple lowered the price in order to get widespread adoption of the new OS. Only with widespread adoption can they get developers to rewrite for GCD and OpenCL. Seems like they solved that problem.



    Which again goes to prove your point. (IMHO)
  • Reply 52 of 55
    jmmx Posts: 341 member
    Quote:
    Originally Posted by cjones051073 View Post


    Don't expect to see any improvement in Handbrake due to these technologies.



    Handbrake can already efficiently use all cores on multicore machines, since the x264 library it uses supports this. Has done for some time, long before Grand Central Dispatch came along. So nothing to gain there.



    Moreover, the x264 devs have already looked into OpenCL/CUDA and (from memory) deduced there is not much they can gain from that. GPUs may well be fast, but they have some serious limitations which, in the case of H.264 encoding, make them less than ideal (note I said encoding, not decoding...)



    Chris



    I have a degree in Computer Science including some work in development over the last 15 years, but more in SQA and Tech Writing. So I have a bit of expertise in the field - but I am certainly NOT an expert. I have read some of the documentation on GCD and OpenCL.



    I do have some thoughts on the comments above by Chris and some other related issues.



    First - Regarding Chris's remarks. It seems to me that GPUs already have built-in support for video decoding. This is what they exist to do. Therefore, in many cases, OpenCL is superfluous; the tasks are already optimized. This is probably the case with most graphics routines.



    The real purpose of OpenCL is to harness this massively parallel GPU compute power for other tasks - such as math and physics computing. This is where it will have its strength. (I will address this further in my next post.)



    GCD

    The beauty of GCD is that it (as another poster has mentioned) handles the allocation of thread resources independently of the hardware and with full awareness of other processes that need the hardware. It thus frees the programmer from writing extremely complex thread management code. Now they only have to declare the tasks and pass them off to the OS.



    My naive guess is that this would save the programmer about 75% of their work, and 90% of SQA. It changes the task of a rewrite from an absolutely enormous endeavor, into one that may be daunting, but manageable.
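The "declare the work and pass it off" style described in the GCD section above looks roughly like the sketch below. It uses Python's standard thread pool purely as an analogy: this is not GCD, and unlike the GCD described in this thread, the pool here only knows about its own process. `encode_chunk` and `run_all` are hypothetical names standing in for real work.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_chunk(chunk_id):
    """Stand-in for one independent unit of work (hypothetical)."""
    return f"chunk {chunk_id} done"

def run_all(n_chunks):
    # No thread is created, scheduled, or joined by hand. The pool picks
    # its own worker count (by default derived from the CPU count).
    with ThreadPoolExecutor() as pool:
        # map() hands each task to the pool and returns results in order.
        return list(pool.map(encode_chunk, range(n_chunks)))

print(run_all(8))
```

The point of the pattern is that the program only describes the tasks; how many threads actually run them is somebody else's decision.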



    If I have made any errors here - please correct me.
  • Reply 53 of 55
    jmmx Posts: 341 member
    Quote:
    Originally Posted by manonthemove View Post


    Just so you guys know, a factor of 10-30 in performance is not uncommon for scientists who use CUDA. Given the similarities between OpenCL and CUDA, we should (hopefully) see a lot of improvement in the near future. Here is a link (http://sussi.megahost.dk/~frigaard/) to a standard piece of scientific code G2X (it does N-body calculations) modified to use CUDA by C. FRIGAARD (go to the bottom) which gets a factor of 30 for one subroutine and a factor of 10 overall.



    CUDA is a lot like OpenCL - it gives access to the GPU processors to non-graphic applications.



    The problems with CUDA (as I understand it) are:

    1- Specific to NVIDIA GPUs

    2- The code is written for one specific GPU and must be modified or recompiled for a different GPU with a different architecture (number of thread processors). (I am less sure of this requirement.) (If not, then it must at least have conditional code that checks for the architecture.)



    OpenCL fixes both these problems.

    1- It is an open standard that can be implemented on any type of processor. DSPs (digital signal processors), for example, so you could pass off numerical calculations to your sound card.



    2- Processing tasks are defined by the application program, and the OS will configure them for the existing architecture at runtime. So this is never a problem.



    As an example, you reference the C. Frigaard routines. How would they handle it if you suddenly added a second or third Graphics card to your computer? Would you have to rebuild? Would it be able to handle it at all?
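To make the "added a second or third graphics card" question concrete, here is a minimal sketch of deciding the work split at run time instead of baking it in at build time. It is plain Python, not OpenCL code, and it does not show how OpenCL itself does this; `split_work` and `devices_found` are hypothetical names used only for illustration.

```python
def split_work(n_items, n_devices):
    """Split range(n_items) into n_devices contiguous, near-equal slices."""
    base, extra = divmod(n_items, n_devices)
    slices, start = [], 0
    for d in range(n_devices):
        # The first `extra` devices take one additional item each.
        size = base + (1 if d < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices

# `devices_found` stands for whatever a runtime reports at launch (assumed).
for devices_found in (1, 2, 3):
    print(devices_found, [len(s) for s in split_work(1000, devices_found)])
```

The same program handles one, two, or three devices without a rebuild because the device count is an input rather than a constant.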
  • Reply 54 of 55
    jmmx Posts: 341 member
    Footnote:



    If you are writing and compiling your own math routines, then editing and recompiling once in a while for a new hardware system is not a big deal. If, however, you are writing a commercial application, then you must somehow anticipate, and conditionally code for, all potential system configurations. This makes your application a whole order of magnitude more complex (if not more), and debugging becomes a real nightmare.



    (And even if you DO write & compile your own math routines - wouldn't you rather go have a micro-brew than recompile and install that damn program one more time?)
  • Reply 55 of 55
    jmmx Posts: 341 member
    Not sure if anyone else is reading this thread or not but...



    First - As has already been noted, one reason for a 50% speedup in the example given is that this was only CPU usage - if tasks were spawned off to the GPU that would decrease the CPU usage level.



    The link below is the first episode of an OpenCL tutorial. To see the demo, skip to about the 80% point, where you see a picture of a molecule. The author is a scientist and this is his real-life program. He is running on a new Mac Pro with two quad-core Nehalem processors. (Each core runs 2 hardware threads, so all told it is 16 threads.)



    The skinny is this: He runs his program first as plain single-threaded C, then using 16 CPU threads, then with OpenCL on an NVIDIA GTX 285 card with 240 cores.



    Results:

    C 1 thread => ~58 seconds

    16 threads => 4.76 seconds

    GPU => 0.18 sec



    Speedup of OpenCL over the 16-thread CPU run ~= 22x

    OpenCL over the non-threaded program ~= 322x



    Now a 15" MacBook Pro (advanced model) comes with a built-in GPU AND an NVIDIA GeForce 9600M GT with 32 cores. One would expect this to run at roughly 1/10 the speed of the faster system with the 240 cores. If so, it would still be about twice the speed of the Nehalem CPUs alone. This, however, is speculation.
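A quick check of the arithmetic, using only the timings quoted in the Results list above (the 58-second figure is approximate). The 1/10 factor for the laptop GPU is the post's own guess, not a measurement, and the variable names are just for illustration.

```python
# Timings as quoted in the post, in seconds.
t_single = 58.0    # plain C, 1 thread (approximate)
t_threads = 4.76   # 16 CPU threads
t_gpu = 0.18       # OpenCL on the 240-core GPU

def speedup(slow, fast):
    """How many times faster the `fast` run is than the `slow` run."""
    return slow / fast

print(f"GPU vs 1 thread:        {speedup(t_single, t_gpu):.0f}x")
print(f"GPU vs 16 threads:      {speedup(t_threads, t_gpu):.1f}x")
print(f"16 threads vs 1 thread: {speedup(t_single, t_threads):.1f}x")

# Speculative laptop estimate from the post: assume 1/10 of the GPU speed.
t_laptop = t_gpu * 10
print(f"Laptop GPU guess: {t_laptop:.1f} s, "
      f"{speedup(t_threads, t_laptop):.1f}x the 16-thread run")
```

With these exact timings the GPU comes out about 26x faster than the 16-thread run, a bit higher than the ~22x quoted; the quoted figure may rest on slightly different timings from the video. The 322x figure and the "about twice the speed" laptop guess are both consistent with the numbers given.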



    See:



    http://www.macresearch.org/files/opencl/Episode_1.mov



    (Remember - skip to 80% point)