You guys need to think further outside the box. I wouldn't expect custom hardware like this to be at all related to conventional CPUs. You're right, there isn't much point if Apple is just going to try to do what nVidia or Intel are doing... so they ought to do something different that maximizes what they get from the other vendors.
See, I thought we were thinking outside the box here. I look at it this way: if Apple wants to build a parallel vector processor, they can do it all themselves or they can borrow an instruction set from somebody else. Why would they want to do that? Well, because you would still need a viable ALU for program control and other non-vector duties. Since Apple has had a long-term relationship with ARM, getting a license for part of the instruction set shouldn't be an issue. Apple's job then is simply extending the instruction set for vector ops and intra-core communications.
From the standpoint of cores, ARM is way ahead of the pack on size and power, so it would be easy to place an array of such cores on a chip. If they take an approach similar to Cell, they might be able to supply each core with a considerable amount of RAM. In fact I see on-chip RAM as being a key part of the equation, especially if they reach megabyte sizes for each core.
The question then becomes one of speed, that is, can Apple make it fast enough to justify the effort? I think they can, given that certain operations are realized in hardware. Cell kind of highlights why I want to see ARM at the core of these vector units: simply so that they have a bit of autonomy. I would want each vector unit to be able to manage its own memory and run a simple kernel.
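Just to sketch what I mean by each unit managing its own memory and running a simple kernel, something roughly like the loop below. All of the names and the fetch/DMA routines are made up purely for illustration; this isn't any real Apple or ARM interface.

Code:
/* Hypothetical control kernel for one autonomous vector unit.
   Every type and routine here is invented for illustration only. */
#include <stddef.h>
#include <stdint.h>

#define LOCAL_STORE_BYTES (1024 * 1024)   /* the per-core megabyte of on-chip RAM speculated above */

typedef struct {
    float  *src;     /* where the operands live in shared RAM */
    float  *dst;     /* where the results get written back    */
    size_t  count;   /* number of elements to process         */
} work_item;

static uint8_t local_store[LOCAL_STORE_BYTES];   /* RAM owned and managed by this core */

/* These would be tiny hardware-assisted routines on a real part (hypothetical). */
extern int  fetch_work(work_item *item);   /* pull the next descriptor, 0 if none */
extern void dma_in(void *dst, const void *src, size_t n);
extern void dma_out(void *dst, const void *src, size_t n);

void vector_unit_main(void)
{
    work_item item;
    float *buf = (float *)local_store;

    while (fetch_work(&item)) {
        dma_in(buf, item.src, item.count * sizeof *buf);   /* stage operands locally */
        for (size_t i = 0; i < item.count; i++)
            buf[i] = buf[i] * buf[i];                      /* stand-in for real vector ops */
        dma_out(item.dst, buf, item.count * sizeof *buf);  /* push results back out */
    }
}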
Dave
C99 Specification: http://www.open-std.org/JTC1/SC22/WG...docs/n1256.pdf
Selection from 6.5.2.2 Function Calls, the clause that permits recursive function calls, both directly and indirectly.
Apple is well aware of the uses for recursion, so I'd expect them to address this need.
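To be clear about what C99 itself allows: a plain recursive function like this is perfectly legal C99, and it's exactly the sort of thing the GPU toolchains discussed below won't accept in kernel code.

Code:
#include <stdio.h>

/* Legal C99 (6.5.2.2 permits recursive calls), but GPU kernel
   compilers such as CUDA's currently reject recursion in device code. */
static unsigned long factorial(unsigned n)
{
    return (n <= 1) ? 1UL : n * factorial(n - 1);
}

int main(void)
{
    printf("10! = %lu\n", factorial(10));
    return 0;
}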
As Programmer says, OpenCL isn't C99, it's just derived from it. CUDA and the AMD SDK don't support recursion either:
http://forums.nvidia.com/lofiversion...hp?t65244.html
http://www.isi.edu/~ddavis/GPU/Cours...rogramming.pdf
It's a hardware limitation but modifying the algorithms allows you to get round it to some extent. Like I say, there's going to be a hefty requirement for understanding parallel computing before we see any benefit. I think this introduces an element of risk into custom hardware. If developers don't take advantage of the special hardware properly then Apple's investment won't pay off.
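As a rough sketch of the kind of algorithm change I mean: the usual trick is to swap the call stack for an explicit stack you manage yourself, so the recursive logic becomes a flat loop that a recursion-free compiler will take. In plain C it looks something like this (a toy binary-tree sum, names made up for illustration):

Code:
#include <stddef.h>

typedef struct node {
    int          value;
    struct node *left, *right;
} node;

/* Recursive version: clear, but off-limits where recursion isn't supported. */
int sum_recursive(const node *n)
{
    if (n == NULL) return 0;
    return n->value + sum_recursive(n->left) + sum_recursive(n->right);
}

/* Same traversal using an explicit, fixed-size stack: just a loop, no recursion.
   MAX_DEPTH bounds the number of pending nodes; it assumes the tree never needs more. */
#define MAX_DEPTH 64

int sum_iterative(const node *root)
{
    const node *stack[MAX_DEPTH];
    int top = 0, total = 0;

    if (root != NULL) stack[top++] = root;
    while (top > 0) {
        const node *n = stack[--top];
        total += n->value;
        if (n->right != NULL && top < MAX_DEPTH) stack[top++] = n->right;
        if (n->left  != NULL && top < MAX_DEPTH) stack[top++] = n->left;
    }
    return total;
}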
On the other hand, one of the people in the above forum has a PhD thesis referring to specialized raytracing hardware:
http://www.handsfreeprogramming.com/...ect_report.pdf
"the authors developed an RPU (Ray Processing Unit) that resembles
GPUs, but with extended functionality and optimized for ray tracing instead of
rasterization. The authors describe the unit as being flexible like CPUs, but containing
the parallelism of a GPU."
"unlike modern GPUs, is their architecture supports
conditional branching, recursion, and a hardware-maintained register stack. This allows
for recursively tracing rays in shaders, which certainly adds flexibility, but is probably
not absolutely necessary. Another interesting feature of the architecture is the inclusion
of the TPU, or Traversal Processing Unit, which works with the SPU to traverse the
scene’s kd-tree. This is quite interesting, because in current graphics applications, spatial
divisions are contained in software; there is no notion of kd-trees, etc in GPUs."
But this is really what Larrabee is for, and I reckon Apple will be using it, meaning that in about a year they won't need their own custom solution. GPUs should get us by until then. Using them for computing will give a big enough boost to be noticeable and allow the software to be tuned, even if it's not ideal.
Quote:
Originally Posted by wizard69
What would be interesting is to know where the limitation on recursion is. Is it the hardware that OpenCL would target, such as a GPU? I know everybody gets starry-eyed when thinking about the compute power in a GPU, but let's face it, the hardware there isn't exactly general purpose.
Yup, hardware limitations, but then again the limitations are often in place in order to maximize performance vs power consumption, cost, etc. People have been doing parallel computing for ages, so there are methods to get round the limitations.
The thing is, even if recursion isn't supported, fewer, faster cores won't help as much as changing the algorithms. Our hardware is moving towards more and more cores instead of faster ones, so it's probably best we start to transition software to better match the hardware instead of trying to push out hardware that executes less efficient code faster.
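To put that in concrete terms, here's a minimal sketch in plain C with POSIX threads (nothing Apple- or OpenCL-specific): a serial accumulation restructured into independent chunks, so that adding cores, rather than clock speed, is what makes it faster.

Code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4            /* stand-in for the number of cores */
#define N        (1 << 20)

typedef struct { const float *data; size_t begin, end; double partial; } chunk;

/* Each thread sums its own slice and writes only its own slot: no locks needed. */
static void *sum_chunk(void *arg)
{
    chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        s += c->data[i];
    c->partial = s;
    return NULL;
}

int main(void)
{
    float *data = malloc(N * sizeof *data);
    for (size_t i = 0; i < N; i++) data[i] = 1.0f;

    pthread_t th[NTHREADS];
    chunk     ck[NTHREADS];
    size_t    step = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        ck[t] = (chunk){ data, t * step, (t == NTHREADS - 1) ? N : (t + 1) * step, 0.0 };
        pthread_create(&th[t], NULL, sum_chunk, &ck[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);
        total += ck[t].partial;   /* cheap serial reduction at the end */
    }
    printf("sum = %f\n", total);
    free(data);
    return 0;
}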
Trouble is, developers were expected to do this with AltiVec too and just didn't have or take the time to use it to its full potential, so in real-world tasks we saw little benefit. But what if Apple made a new Core component that abstracts commonly heavy computation? This is what rendering engine developers do: they allow you to code functions as you please, but when you need to access something common like raytracing, you can call a gathering function and it just gives you the results, and behind the scenes it optimizes the algorithm for you.
Apple could similarly update things like the QuickTime SDK so that plugin developers are abstracted away from the difficulties of parallel programming; they simply comply with certain rules and the compiler does the work for them. This way they don't have to worry about bypassing GPU limitations, as the core modules will overcome them behind the scenes. All the developer sees is the speed improvement.
That's another thing that they can probably improve too - software compilation itself.
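Purely to illustrate the shape such a plugin-facing abstraction could take (every name here is hypothetical, not a real Apple API): the plugin author writes one small, self-contained function, and the framework decides how to spread it across CPU cores or the GPU.

Code:
/* Hypothetical plugin-facing API: all names below are invented for illustration. */
typedef struct { int width, height; unsigned char *pixels; } frame;   /* RGBA, 8 bits per channel */

/* The plugin supplies one small function applied per pixel...           */
typedef void (*pixel_op)(unsigned char *rgba, int x, int y, void *params);

/* ...and the framework owns the parallelism, device choice, and tiling. */
extern void hv_apply(frame *f, pixel_op op, void *params);

/* Example plugin code: a simple desaturation effect. */
static void desaturate(unsigned char *rgba, int x, int y, void *params)
{
    (void)x; (void)y; (void)params;
    unsigned char grey = (unsigned char)((rgba[0] + rgba[1] + rgba[2]) / 3);
    rgba[0] = rgba[1] = rgba[2] = grey;
}

void plugin_process(frame *f)
{
    /* The plugin never sees threads, GPU limits, or recursion rules. */
    hv_apply(f, desaturate, NULL);
}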
Quote:
Originally Posted by wizard69
That is interesting because I had this view that OpenCL was more of a replacement for OpenGL.
I think it will be, for the parts that will work better in a general purpose language. Shader computation, for example, is one area it will excel in, because shaders are very small chunks of program code that execute when a piece of geometry is sampled. In fact, the mention in the paper of OpenCL having access to geometry and image processing points to this being an important use.
You attach a shader consisting of a few functions to a 3D model and when the rendering engine starts, it checks segments of your scene in buckets to maximize memory use and performance. As it moves per pixel through the output image, it queries the visible geometry and executes the shader code attached to the frontmost visible object. This code performs all sorts of functions and essentially returns the color and opacity of the sampled point.
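For anyone who hasn't written one, a shader really does boil down to a small function like the toy below (plain C, structure and names invented just to show the idea): it takes the sampled point and returns a color plus opacity.

Code:
typedef struct { float x, y, z; }    vec3;
typedef struct { float r, g, b, a; } color;   /* a = opacity */

/* Toy surface shader: called once per visible sample, returns color + opacity.
   A real shader would also use textures, multiple lights, the hit point, etc. */
static color shade_sample(vec3 point, vec3 normal, vec3 light_dir)
{
    (void)point;   /* a fuller shader would use the hit point for texturing */

    /* simple Lambertian (diffuse) term */
    float ndotl = normal.x * light_dir.x + normal.y * light_dir.y + normal.z * light_dir.z;
    if (ndotl < 0.0f) ndotl = 0.0f;

    color out = { 0.8f * ndotl, 0.2f * ndotl, 0.2f * ndotl, 1.0f };
    return out;
}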
With general-purpose processing, you don't have to rely on fixed-function shader processing, so the GPU can distribute the work more effectively and improve overall performance. GPUs are moving towards this unified design to help GPGPU processing, as in the G80:
http://www.anandtech.com/video/showdoc.aspx?i=2870&p=5
The display parts of OpenGL will be the same but the shaders that do all the effects processing will be done in a general purpose language. The physics calculations will likely also be moved from the CPU to the GPU.
This will essentially bring it up to par with DirectX as far as visuals go. Tessellation isn't done in shaders though so that might have to be addressed with the language elsewhere.
Quote:
Originally Posted by wizard69
With Intel's process technology you likely wouldn't need massive numbers of cores either, as you could just run fast.
Fast means more heat and power though, and that doesn't fit with Apple hardware. I would see them going for more low-power, low-speed cores. Apple usually underclocks hardware because of this.
Quote:
Originally Posted by wizard69
Let's face it, it will be a long time before Apple ships all of its hardware with the same GPU or even GPUs from the same family.
They only need GPUs that are above a certain grade. I'm not sure of the cut-off point for ATI hardware (possibly the X1900 XTX), but all of the GeForce 8 series hardware and above is supported by CUDA - every Nvidia card in use by Apple today is supported. This would imply, given how OpenCL uses CUDA, that all this hardware can be used by OpenCL too, and right now.
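If you want to check what the GPU in a particular machine reports, the CUDA runtime already exposes that; roughly like this (written against the CUDA runtime API as I understand it, so treat the details as approximate):

Code:
#include <stdio.h>
#include <cuda_runtime.h>

/* Lists CUDA-capable devices and their compute capability.
   Build with nvcc from the CUDA toolkit. */
int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU found.\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}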
You should be able to see what CUDA will bring on a MBP by downloading CUDA for OS X here:
http://www.nvidia.com/object/cuda_get.html
There are samples included. Most of them are mathematical sims that aren't very interesting, but there are some effects sims that look OK - there are some videos on YouTube showing the samples included with the download:
http://www.youtube.com/watch?v=LhTvMAbEC0Y
http://www.youtube.com/watch?v=9Do2Xav-nNU
There is a list of projects developed using CUDA here:
http://www.nvidia.com/object/cuda_home.html#state=home
The multiplier is how much of a performance boost they got by using the GPU. The Elemental H.264 encoder was 18 times faster, though that was likely using a GTX 280. The performance boost will vary based on the spec of the GPU.
There are graphical demos here:
http://www.nvidia.co.uk/content/forc...k/download.asp
but they'll only run under Windows and it says they require desktop models of the 8 series or higher. I think that's probably due to the PhysX requirement. Might just need modified drivers:
http://forum.notebookreview.com/showthread.php?t=284579
but Mac Pro owners who can boot Windows should be able to run them no problem. I would have expected the Badaboom encoder to work on a mobile GPU, though.
Quote:
Originally Posted by wizard69
Does anyone here happen to think that Apple engineers tune in to this thread just to get a laugh at the wild imaginings going on here?
Nah, they come here and steal all our good ideas. That's why the patent drawings have deformed hands. They have to draw them out quickly in case one of us patents the ideas first. They also seem to be backdating them - they've had lots of practice with this - every time a great idea comes up on AI, suspiciously, a patent appears from Apple as if from nowhere dated two years earlier.
Quote:
See, I thought we were thinking outside the box here. I look at it this way: if Apple wants to build a parallel vector processor, they can do it all themselves or they can borrow an instruction set from somebody else. Why would they want to do that? Well, because you would still need a viable ALU for program control and other non-vector duties.
Non-vector stuff lives on the x86 cores. And the program control components needn't be anything as generalized as an ARM core. In fact they may want to work quite differently than a scalar core does, so starting from ARM doesn't help and would stick them with an instruction set encoding that they don't want. Sure, they could use an existing core and add to it, but I don't think that would be even close to the best alternative, and it would leave them competing directly with the x86 cores... which doesn't make a lot of sense. The advantage here is to do something different.