What Apple's three GPU enhancements in A17 Pro and M3 actually do

Apps and games that utilize the Metal API target specific functions of Apple Silicon GPUs, which get even better with significant improvements to parallel processes in M3 and A17 Pro. Here's how it works.

Apple's M3 family benefits from new GPU features



Apple released a developer talk on these new Apple Silicon GPU features detailing exactly what's happening to achieve improved results. The video goes into great technical detail, but it provides enough to explain the features in basic terms.

Developers building apps with the Metal API don't need to make any changes to their apps to see performance improvements with M3 and A17 Pro. These chipsets utilize Dynamic Caching, hardware-accelerated ray tracing, and hardware-accelerated mesh shading to make the GPU more performant than ever.

Dynamic shader core memory



Dynamic Caching is made possible thanks to a next-generation shader core. When utilizing the latest GPU cores in A17 Pro and M3, these shaders can run in parallel much more efficiently than before, massively improving output performance.

Dotted lines represent wasted register memory



Normally, the GPU allocates register memory for an action based on the most register-hungry part of that action, and it holds that amount for the action's entire duration. Therefore, if one part of an action requires significantly more register memory than the rest, the whole action reserves that peak amount even while most of it sits unused.

Dynamic Caching allows the GPU to allocate exactly the right amount of register memory for every action it is taking. The previously unavailable register memory is freed, allowing for many more shader tasks to occur in parallel.
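
As a rough illustration of why this matters, here is a toy Python model that compares reserving the peak register count for an action's whole lifetime against allocating only what each phase needs. The register-file size, the phase durations, and the per-phase register counts are invented numbers for the sketch, not Apple specifications, and the function names are hypothetical.

```python
# Toy model only: all numbers below are made up for illustration.
REGISTER_FILE = 1024  # hypothetical registers available to one shader core

# Hypothetical shader with three phases; only the middle one is register-hungry.
# Each entry is (duration in time steps, registers needed per thread).
PHASES = [(6, 16), (2, 64), (4, 16)]


def static_occupancy(register_file, phases):
    """Threads that fit when each thread reserves its peak for its whole life."""
    peak = max(regs for _, regs in phases)
    return register_file // peak


def average_registers(phases):
    """Time-weighted average of the registers a thread actually needs."""
    total_time = sum(duration for duration, _ in phases)
    return sum(duration * regs for duration, regs in phases) / total_time


def dynamic_occupancy_estimate(register_file, phases):
    """Rough upper bound when registers are allocated phase by phase.

    Assumes threads are staggered so their peaks do not all coincide.
    """
    return int(register_file // average_registers(phases))


print(static_occupancy(REGISTER_FILE, PHASES))            # peak-based allocation
print(dynamic_occupancy_estimate(REGISTER_FILE, PHASES))  # per-phase allocation
```

With these made-up numbers, the peak-based scheme fits 16 threads, while the per-phase estimate fits 42, because each thread only needs its 64 registers for a small slice of its lifetime.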

Flexible on-chip memory



Previously, on-chip memory would have fixed memory allocation for register, threadgroup, and tile memory with a buffer cache. That meant significant portions of memory went unused if an action utilized more of one type of memory than another.

The entire on-chip memory can be used as cache



With flexible on-chip memory, all of the on-chip memory is a cache that can be utilized for any memory type. So, an action that heavily relies on threadgroup memory can utilize the entire span of the on-chip memory, and even overflow actions into main memory.

The shader core dynamically adjusts on-chip memory occupancy to maximize performance. That means developers can spend less time optimizing occupancy.
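
The difference between fixed slices and one shared pool can be sketched with a small Python model. The partition sizes and the workload below are hypothetical values chosen for the example, and the model ignores everything about real cache behavior except capacity.

```python
# Toy capacity model: sizes in KB are invented, not real hardware figures.
FIXED_PARTITIONS = {"register": 96, "threadgroup": 64, "tile": 64}
TOTAL_ON_CHIP = sum(FIXED_PARTITIONS.values())  # 224 KB in this sketch


def fixed_layout(demand):
    """Return (KB served on chip, KB spilled) with one fixed slice per type."""
    served = sum(
        min(demand.get(kind, 0), size) for kind, size in FIXED_PARTITIONS.items()
    )
    return served, sum(demand.values()) - served


def flexible_layout(demand):
    """Return (KB served on chip, KB spilled) when any type can use any space."""
    total = sum(demand.values())
    served = min(total, TOTAL_ON_CHIP)
    return served, total - served


# A hypothetical action that leans heavily on threadgroup memory.
workload = {"register": 32, "threadgroup": 160, "tile": 0}

print(fixed_layout(workload))     # threadgroup demand exceeds its fixed slice
print(flexible_layout(workload))  # the same demand fits in the shared pool
```

In the fixed layout, 96 KB of the threadgroup demand has nowhere to go on chip even though the register and tile slices are mostly idle. In the flexible layout the whole workload fits, and only demand beyond the full 224 KB would overflow to main memory.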

Shader core's high-performance ALU pipelines



Apple recommends developers execute FP16 math in their programs, but the high-performance ALUs execute different combinations of integer, FP32, and FP16 in parallel. Instructions are executed across different actions performed in parallel, which means ALU utilization is improved with higher occupancy.

Increased parallel operations with high-performance ALU pipelines



Basically, if different actions have integer, FP32, or FP16 instructions that would otherwise be executed at different points in time, those executions can be overlapped across the separate pipelines to increase parallelism.
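
To make the overlap concrete, here is a deliberately simplified Python scheduler. It assumes one pipeline per instruction type, in-order issue within each action, and one cycle per instruction. Those rules, along with the instruction streams, are assumptions for the sketch and not a description of Apple's actual hardware.

```python
from collections import deque


def cycles_single_issue(streams):
    """Cycles needed if only one instruction of any kind issues per cycle."""
    return sum(len(stream) for stream in streams)


def cycles_multi_pipeline(streams):
    """Cycles needed if each pipeline type can issue once per cycle.

    Each action issues in order and at most once per cycle, so overlap only
    comes from different actions using different pipelines at the same time.
    """
    queues = [deque(stream) for stream in streams]
    cycles = 0
    while any(queues):
        busy = set()  # pipelines already claimed this cycle
        for queue in queues:
            if queue and queue[0] not in busy:
                busy.add(queue.popleft())
        cycles += 1
    return cycles


# Three hypothetical actions with mixed instruction types.
a = ["fp32", "fp32", "fp16", "int"]
b = ["fp16", "fp16", "fp32", "int"]
c = ["int", "fp32", "fp16", "fp16"]

print(cycles_single_issue([a, b, c]))    # 12 cycles, one instruction at a time
print(cycles_multi_pipeline([a, b, c]))  # 6 cycles with overlap across actions
print(cycles_multi_pipeline([a]))        # 4 cycles, one action cannot overlap
```

The last line shows why occupancy matters in this model: a single action gains nothing, while three actions running together halve the cycle count.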

Hardware-accelerated graphics pipelines



Hardware-accelerated ray tracing makes the process much faster by taking the vital intersection calculations out of the GPU function and handing them to dedicated hardware. With that hardware taking care of a portion of the calculations, more operations can occur in parallel, which speeds up ray tracing.
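
For a sense of what an intersection calculation involves, here is a ray-triangle test in Python using the Möller-Trumbore method, a common software formulation. It is offered only as an example of the kind of math involved, and it is not a claim about how Apple's hardware performs the test. The helper names and the sample triangle are made up for the sketch.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance along the ray to the triangle, or None on a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


# Hypothetical triangle five units in front of the camera.
tri = ((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
hit = ray_triangle((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), *tri)
miss = ray_triangle((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), *tri)
print(hit, miss)
```

A renderer runs tests like this for huge numbers of rays against many triangles per frame, which is why moving them off the general-purpose shader path frees up so much time.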

Hardware-acceleration takes over from on-chip processes



Hardware-accelerated mesh shading utilizes a similar method. It takes the middle of the geometric calculations pipeline and passes it to a dedicated unit, thus allowing more parallel operations.

These are complex systems that can't be broken down into a few paragraphs. We recommend watching the video to get all the details with one thing in mind -- A17 Pro and M3 focus on computing parallelism to speed up tasks.

The M3 is available in the MacBook Pro and 24-inch iMac. The A17 Pro is available in the iPhone 15 Pro.

Read on AppleInsider


Comments

  • Reply 1 of 11
    blastdoor Posts: 3,308 member
    Does the GPU support 64 bit floating point?
  • Reply 2 of 11
    chasm Posts: 3,308 member
    Having watched the video and having had my head duly spun by the technical detail, I just want to say thanks for this summary. Your bottom line is 100 percent correct: being able to a) use the memory more efficiently and b) do more things in parallel adds up to waaaay faster and better graphics performance than perhaps any integrated GPU (cough HEY INTEL cough) has ever done, ever.

    Discrete GPUs will still rule the roost at the end of the day, but Apple has designed all this to meet the needs of Apple buyers, not hardcore all-day-and-night PC gamers. For typical user needs AND many game titles, this will bring a big boost in performance, but eventually Apple is going to have to allow third-party GPU compatibility for the minority of Mac users who actually do seriously need more.
  • Reply 3 of 11
    chasm said:
    Having watched the video and having had my head duly spun by the technical detail, I just want to say thanks for this summary. Your bottom line is 100 percent correct: being able to a) use the memory more efficiently and b) do more things in parallel adds up to waaaay faster and better graphics performance than perhaps any integrated GPU (cough HEY INTEL cough) has ever done, ever.

    Discrete GPUs will still rule the roost at the end of the day, but Apple has designed all this to meet the needs of Apple buyers, not hardcore all-day-and-night PC gamers. For typical user needs AND many game titles, this will bring a big boost in performance, but eventually Apple is going to have to allow third-party GPU compatibility for the minority of Mac users who actually do seriously need more.
    Not necessarily. Discrete GPUs aren’t better because they’re discrete. Indeed, the M3 Max beats quite a few discrete GPUs.

    The real answer is Apple beefing up their SoC GPU cores to perform on the level of big hitters like Nvidia’s RTX 4090.

    The difficulty is doing so in an efficient way, as Nvidia is basically selling thermonuclear reactors with huge power and cooling requirements to push performance. Meanwhile, Apple is pushing RTX 3090-esque performance with the M3 Max, and doing so efficiently, where the entire SoC uses a fraction of the power Nvidia does with only the GPU.

    Apple will continue to make strides in the GPU arena and will put the pressure on the companies who are freewheeling with power and thermals right now.

    Apple is already upping the GHz and adding core counts. I can see them significantly adding cores to future M-series chips and even offering an Ultra/Extreme with a whole new layer of just GPU cores surrounding the SoC with new interconnections. The sky’s the limit with what they will do. But two things seem to be set in stone: 1) Apple won’t look to third parties. And 2) Apple won’t sacrifice efficiency to push things forward. They’ll design it properly to perform at the architecture level.

    Although I’d love to see Apple design a desktop-only chip that is allowed to be a power glutton and just smash the others at their own game.
  • Reply 4 of 11
    As great as this is, in order for Apple to be taken seriously by the pro crowd, they need to bring back the Mac Pro's ability to have extra PCI slots that work with whatever people want to put in them, not just what Apple says is ok. They also need to bring back the ability to use Thunderbolt and external GPUs. Adding in some sort of ability to use Windows would be a plus, either with VM support or some new version of Boot Camp. Until THAT happens, I'm staying with my Intel Mac.
  • Reply 5 of 11
    sflocal Posts: 6,096 member
    There was a video that I saw a while back discussing the strange place the new Mac Pro resides. The PCIe slot argument was moot. The devices that those “pros” used to have a need for PCI slots have been replaced with external Thunderbolt interfaces. So the Mac Mini Ultra is basically the new Mac Pro.
  • Reply 6 of 11
    Apple should just buy Star Citizen and a couple other major AAA game companies and go for it. Would be huge. I suspect something like that is around the corner.
  • Reply 7 of 11
    entropys Posts: 4,168 member
    Yep, regardless of how great the performance is, the only way Macs become gaming machines is for Apple to do a Microsoft for x360 and buy a brace of upcoming gaming studios.
    edited November 2023
  • Reply 8 of 11
    Apple should just buy star citizen 
    Wow!  We have a new champion of stupidest "Apple should just..." !

    Star Citizen ffs  :D
  • Reply 9 of 11
    swat671 said:
    As great as this is, in order for Apple to be taken seriously by the pro crowd, they need to bring back the Mac Pro's ability to have extra PCI slots that work with whatever people want to put in them, not just what Apple says is ok. They also need to bring back the ability to use Thunderbolt and external GPUs. Adding in some sort of ability to use Windows would be a plus, either with VM support or some new version of Boot Camp. Until THAT happens, I'm staying with my Intel Mac.
    It is more complicated than that. Apple GPUs don't work quite like PC GPUs, so it would be a major challenge to support them for realtime graphics. It is not about just shifting that work to some third party like nVidia. It just can't be done without being beholden to those that make GPU decisions on the Windows side and Apple certainly doesn't want to have PC GPU decisions affect how they design their operating system. That is particularly true now that Apple is on their new GPU architecture that diverges even more from PCs.

    There are gaps, but mainly for certain groups of creatives and developers that Apple never (or sometimes barely) had in their camp. It would be nice to see support for nVidia in the Mac Pro for extra 3D rendering and AI work (stuff that is isolated enough it could run without forcing the operating system to adhere to others' GPU architectures), but not for realtime graphics since Apple processors are more than capable for that. Apple's chips are good at many ML areas, but I believe nVidia is still needed to round out support. The new M3 chips are probably great for 3D graphics, but regardless of how good they are many 3D artists may be throwing 2-4 high end GPUs at it for raytracing on the PC side (where GPUs 2, 3, and 4 were just used exclusively for raytracing anyway). Apple would also need to tempt these pro users to come over from the PC world where they have been entrenched for a long time and there are no gaps in tooling like there are on the Mac. The tooling gap would be the most difficult challenge for 3D rendering.

    There is the ability to use Windows already, but DX12/Vulkan support would need to improve. I can't see 3D graphics pros ever working in a VM. That doesn't make any sense. Boot Camp will never happen and that would defeat the purpose. Apple probably needs to shrink the gap so you might be able to get by with a lower end professional workstation as a secondary PC and shift everything else to the Mac. They would then only get people that already had a preference for the Mac operating system or are seeing productivity or mobility gains in Apple CPUs. Better PC interoperability would probably be nice for that. Apple should *really* bring Universal Control over to Windows to aid these transitions so you can work across both seamlessly.

    Improving the Mac image in gaming culture would help a lot. There are many 3D artists and developers in the gaming world that would never consider a Mac except as a build server for mobile games. I think Apple might be working on this, but they have a long way to go. Apple has had some success getting Capcom ports lately. I wouldn't be surprised if they could get Sony to port PlayStation games at the same time they do the ports for Windows since Apple and Sony get along very well. I'm just not sure how Apple expands past that, but if Apple is lucky the few partnerships they have will show there is interest in AAA gaming on mobile and that will bring it to Mac too. Sponsoring major Twitch streamers and providing free hardware for Mac/iOS titles would help too.

    I think this will take a long time to play out. Apple's best move might be to get people used to things like 3D sculpting on an iPad Pro and try to sell them on Mac later since the Apple Pencil is far superior to any Windows Wacom, MS Surface, or similar device and much more comfortable to hold. Ideally that would mean getting some level of Mac software compatibility on iPad. Getting ZBrush and Blender on the iPad Pro would be huge and there are rumors that Apple might be working toward this.

    edited November 2023
  • Reply 10 of 11
    blastdoor said:
    Does the GPU support 64 bit floating point?
    If you are asking about what is sometimes called the "Graphics Memory Interface," the recent AnandTech article says it is 128-bit on the M3, 192-bit on the M3 Pro, and 512-bit on the M3 Max. Remember, the Apple M-Series so far has the graphics and CPU share the main memory.
  • Reply 11 of 11
    Marvin Posts: 15,327 moderator
    blastdoor said:
    Does the GPU support 64 bit floating point?
    Someone made a library to support this here:

    https://github.com/philipturner/metal-float64

    It's going to be pretty slow compared to dedicated hardware. High-end consumer GPU double precision is around 1 TFLOPS:

    https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

    Dedicated GPUs for this are over 30-40 TFLOPS:

    https://www.techpowerup.com/gpu-specs/h100-pcie-96-gb.c4164
    https://www.techpowerup.com/gpu-specs/radeon-instinct-mi300.c4019

    but the Nvidia one costs over $40k:

    https://www.newegg.com/p/1VK-0066-00022

    It's probably most cost-effective using cloud services:

    https://evp.cloud/pricing

    The above Metal float64 support was to help add support for 64-bit atomics for Unreal Engine's Nanite feature:

    https://github.com/philipturner/metal-benchmarks#nanite-atomics

    "The Apple GPU architecture only supports 32-bit atomics on pointer values, while other architectures support texture atomics or 64-bit atomics. The latter two are required to run the current implementation of Nanite in Unreal Engine 5 (UE5). Nanite is a very novel rendering algorithm that removes the need for static LOD on vertex meshes. Rendering infinitely detailed meshes requires subpixel resolution and rasterizing pixels entirely in software. To implement a software-rasterized depth buffer, UE5 performs 64-bit atomic comparisons. The depth value is the upper 32 bits; the color is the lower 32. This algorithm is an example of a larger trend toward using GPGPU in rendering.

    There was a recent discovery that Nanite can run entirely on 32-bit buffer atomics, at a 2.5x bandwidth/5x latency cost. However, Apple added hardware acceleration to the M2 series of GPUs for Nanite atomics. This includes a single instruction for non-returning UInt64 min or max. It does not include the wider set of atomic instructions typically useful for GPGPU, although such instructions were effectively emulated in the prototypical metal-float64. The A15 and A16, part of the same GPU family as M2, do not support Nanite atomics. Hopefully the A17 will gain support in the next series of chips."
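
    The packing described in the quoted passage can be sketched in a few lines of Python. This only shows the arithmetic: depth in the upper 32 bits and color in the lower 32 bits, so a single 64-bit min keeps the closest fragment. The function names are hypothetical, the sketch assumes a smaller depth value means closer to the camera, and plain Python assignment stands in for the GPU's atomic operation.

```python
MASK32 = 0xFFFFFFFF


def pack(depth, color):
    """Pack a 32-bit depth (upper bits) and 32-bit color (lower bits)."""
    return ((depth & MASK32) << 32) | (color & MASK32)


def unpack(value):
    """Split a packed 64-bit value back into (depth, color)."""
    return value >> 32, value & MASK32


def write_pixel(buffer, index, depth, color):
    """Stand-in for one non-returning 64-bit atomic min on the depth buffer.

    Because depth sits in the upper bits, comparing the packed values
    compares depth first, so the closest fragment and its color win together.
    """
    buffer[index] = min(buffer[index], pack(depth, color))


# One-pixel buffer cleared to the farthest possible depth.
pixels = [pack(MASK32, 0)]
write_pixel(pixels, 0, 100, 0xFF0000FF)
write_pixel(pixels, 0, 200, 0x00FF00FF)  # farther away, so it is discarded
write_pixel(pixels, 0, 50, 0x0000FFFF)   # closest fragment wins
print(unpack(pixels[0]))
```

    With 32-bit atomics only, depth and color cannot be updated together in one step, which is where the extra bandwidth and latency cost mentioned in the quote comes from.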

    Apple added 64-bit atomics for M2 (Apple8), M3 and A17 (Apple9):

    https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf

    Consumer GPUs aren't well suited for double precision computing.

    https://www.tomshardware.com/news/intel-arc-will-not-support-fp64-hardware
    https://www.techpowerup.com/forums/threads/nerfed-fp64-performance-in-consumer-gpu-cards.272732/

    These are features they can add into Mac Pro versions of M3 Extreme. These models would be priced near $10k but if they can do 10-20 TFLOPS FP64, it will be useful to some people. I doubt the volume of buyers justifies the manufacturing though, which is why the Nvidia one is over $40k.