What Apple's three GPU enhancements in A17 Pro and M3 actually do
Apps and games built on the Metal API target specific functions of Apple Silicon GPUs, and the M3 and A17 Pro significantly improve how those GPUs run work in parallel. Here's how it works.
Apple's M3 family benefits from new GPU features
Apple released a developer talk on these new Apple Silicon GPU features that details how the improved results are achieved. The video is highly technical, but it provides enough to explain the features in basic terms.
Developers building apps with the Metal API don't need to make any changes to their apps to see performance improvements with M3 and A17 Pro. These chips use Dynamic Caching, hardware-accelerated ray tracing, and hardware-accelerated mesh shading to make the GPU more performant than ever.
Dynamic shader core memory
Dynamic Caching is made possible by a next-generation shader core. On the latest GPU cores in A17 Pro and M3, shaders can run in parallel much more efficiently than before, which improves overall performance.
Dotted lines represent wasted register memory
Normally, the GPU allocates register memory for an action based on the most register-hungry part of that action, and holds that allocation for the action's full duration. Therefore, if one part of an action requires significantly more register memory than the rest, the action reserves that peak amount the whole time, even while most of it sits unused.
Dynamic Caching allows the GPU to allocate exactly the right amount of register memory for every action it is taking. The previously unavailable register memory is freed, allowing for many more shader tasks to occur in parallel.
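To make the difference concrete, here is a minimal toy model in Python. It is only a sketch: the register file size and the per-phase register needs are made-up numbers, not Apple hardware values, and the function names are hypothetical.

```python
# Toy occupancy model. All numbers are illustrative assumptions,
# not real Apple GPU values.
REGISTER_FILE = 1024  # hypothetical registers available to one shader core

# Hypothetical shader: registers needed during each phase of its execution.
PHASES = [16, 16, 64, 16]


def static_occupancy(register_file, phases):
    """Threads that fit when each thread reserves its peak need throughout."""
    return register_file // max(phases)


def dynamic_occupancy(register_file, phases):
    """Threads that fit in each phase when only the current need is held."""
    return [register_file // need for need in phases]


def wasted_registers(register_file, phases):
    """Registers reserved but unused in each phase under static allocation."""
    threads = static_occupancy(register_file, phases)
    peak = max(phases)
    return [(peak - need) * threads for need in phases]


print(static_occupancy(REGISTER_FILE, PHASES))
print(dynamic_occupancy(REGISTER_FILE, PHASES))
print(wasted_registers(REGISTER_FILE, PHASES))
```

With these made-up numbers, peak-based allocation fits 16 threads at all times and leaves 768 of the 1,024 registers idle during the three light phases, while per-phase allocation could fit 64 threads in those phases.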
Flexible on-chip memory
Previously, on-chip memory would have fixed memory allocation for register, threadgroup, and tile memory with a buffer cache. That meant significant portions of memory went unused if an action utilized more of one type of memory than another.
The entire on-chip memory can be used as cache
With flexible on-chip memory, all of the on-chip memory is a cache that can be used for any memory type. So, an action that relies heavily on threadgroup memory can use the entire span of the on-chip memory, and even overflow into main memory.
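The fixed-versus-flexible trade-off can be sketched with another toy model. The partition sizes and the workload below are hypothetical, the buffer cache is left out for simplicity, and "spill" here just means demand that does not fit on chip.

```python
# Hypothetical on-chip partition sizes in KiB (not real hardware values).
FIXED_PARTITIONS = {"register": 64, "threadgroup": 32, "tile": 32}


def spill_fixed(demand, partitions):
    """KiB that do not fit when each memory type has its own fixed partition."""
    return sum(max(0, demand[kind] - size) for kind, size in partitions.items())


def spill_flexible(demand, partitions):
    """KiB that do not fit when all partitions form one shared pool."""
    pool = sum(partitions.values())
    return max(0, sum(demand.values()) - pool)


# A threadgroup-heavy workload (again, made-up numbers).
demand = {"register": 16, "threadgroup": 96, "tile": 0}

print(spill_fixed(demand, FIXED_PARTITIONS))
print(spill_flexible(demand, FIXED_PARTITIONS))
```

In this example the fixed layout cannot hold 64 KiB of the threadgroup demand even though 80 KiB of register and tile space is idle, while the shared pool holds everything.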
The shader core dynamically adjusts on-chip memory occupancy to maximize performance. That means developers can spend less time optimizing occupancy.
Shader core's high-performance ALU pipelines
Apple recommends developers execute FP16 math in their programs, but the high-performance ALUs execute different combinations of integer, FP32, and FP16 in parallel. Instructions are executed across different actions performed in parallel, which means ALU utilization is improved with higher occupancy.
Increased parallel operations with high-performance ALU pipelines
Basically, if different actions contain the same FP32 or FP16 instructions that would be executed at different points in time, the executions can be overlapped to increase parallelism.
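As a rough illustration of why overlapping helps, the sketch below counts cycles for two instruction streams. It assumes one pipeline per instruction type, one instruction per pipeline per cycle, and in-order issue within each stream. Those are simplifying assumptions for the example, not a description of Apple's actual scheduler.

```python
def cycles_serial(streams):
    """Cycles needed if instructions from all streams run one after another."""
    return sum(len(stream) for stream in streams)


def cycles_overlapped(streams):
    """Cycles needed if streams may issue together on different pipelines.

    Each cycle, every stream tries to issue its next instruction. It succeeds
    only if no earlier stream already used that instruction's pipeline
    ("fp32", "fp16", or "int") in the same cycle.
    """
    positions = [0] * len(streams)
    cycles = 0
    while any(pos < len(s) for pos, s in zip(positions, streams)):
        busy = set()  # pipelines already used this cycle
        for i, stream in enumerate(streams):
            if positions[i] < len(stream) and stream[positions[i]] not in busy:
                busy.add(stream[positions[i]])
                positions[i] += 1
        cycles += 1
    return cycles


a = ["fp32", "fp16", "fp32"]
b = ["fp16", "int", "fp16"]
print(cycles_serial([a, b]), cycles_overlapped([a, b]))
```

Here the two streams never need the same pipeline in the same cycle, so the overlapped version finishes in 3 cycles instead of 6. Two streams made entirely of FP32 instructions would gain nothing in this model.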
Hardware-accelerated graphics pipelines
Hardware-accelerated ray tracing speeds up the process by moving the vital intersection calculations out of the GPU function and into dedicated hardware. With that hardware handling a portion of the calculations, more operations can occur in parallel.
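For a sense of what an intersection calculation involves, here is the widely used Möller-Trumbore ray-triangle test written in plain Python. This is a standard software algorithm shown only for illustration. The article does not say which method Apple's hardware uses, and the helper names are chosen for this example.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance along the ray to the hit point, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det  # first barycentric coordinate
    if u < 0 or u > 1:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0 or u + v > 1:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None  # ignore hits behind the ray origin


triangle = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
print(ray_triangle((0.25, 0.25, 1), (0, 0, -1), *triangle))
print(ray_triangle((2, 2, 1), (0, 0, -1), *triangle))
```

A renderer runs tests like this for huge numbers of rays and triangles per frame, which is why handing them to a dedicated unit frees the shader cores for other work.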
Hardware-acceleration takes over from on-chip processes
Hardware-accelerated mesh shading utilizes a similar method. It takes the middle of the geometric calculations pipeline and passes it to a dedicated unit, thus allowing more parallel operations.
These are complex systems that can't be broken down into a few paragraphs. We recommend watching the video to get all the details with one thing in mind -- A17 Pro and M3 focus on computing parallelism to speed up tasks.
The M3 is available in the MacBook Pro and 24-inch iMac. The A17 Pro is available in the iPhone 15 Pro.
Comments
Star Citizen ffs
There are gaps, but mainly for certain groups of creatives and developers that Apple never (or sometimes barely) had in their camp. It would be nice to see support for Nvidia in the Mac Pro for extra 3D rendering and AI work (stuff that is isolated enough it could run without forcing the operating system to adhere to other GPU architectures), but not for realtime graphics since Apple processors are more than capable for that. Apple's chips are good at many ML areas, but I believe Nvidia is still needed to round out support. The new M3 chips are probably great for 3D graphics, but regardless of how good they are many 3D artists may be throwing 2-4 high end GPUs at it for raytracing on the PC side (where GPUs 2, 3, and 4 were just used exclusively for raytracing anyway). Apple would also need to tempt these pro users to come over from the PC world where they have been entrenched for a long time and there are no gaps in tooling like there are on the Mac. The tooling gap would be the most difficult challenge for 3D rendering.
There is the ability to use Windows already, but DX12/Vulkan support would need to improve. I can't see 3D graphics pros ever working in a VM. That doesn't make any sense. Boot Camp will never happen and that would defeat the purpose. Apple probably needs to shrink the gap so you might be able to get by with a lower end professional workstation as a secondary PC and shift everything else to the Mac. They would then only get people that already had a preference for the Mac operating system or are seeing productivity or mobility gains in Apple CPUs. Better PC interoperability would probably be nice for that. Apple should *really* bring Universal Control over to Windows to aid these transitions so you can work across both seamlessly.
Improving the Mac image in gaming culture would help a lot. There are many 3D artists and developers in the gaming world that would never consider a Mac except as a build server for mobile games. I think Apple might be working on this, but they have a long way to go. Apple has had some success getting Capcom ports lately. I wouldn't be surprised if they could get Sony to port PlayStation games at the same time they do the ports for Windows since Apple and Sony get along very well. I'm just not sure how Apple expands past that, but if Apple is lucky the few partnerships they have will show there is interest in AAA gaming on mobile and that will bring it to Mac too. Sponsoring major Twitch streamers and providing free hardware for Mac/iOS titles would help too.
I think this will take a long time to play out. Apple's best move might be to get people used to things like 3D sculpting on an iPad Pro and try to sell them on Mac later since the Apple Pencil is far superior to any Windows Wacom, MS Surface, or similar device and much more comfortable to hold. Ideally that would mean getting some level of Mac software compatibility on iPad. Getting ZBrush and Blender on the iPad Pro would be huge and there are rumors that Apple might be working toward this.
https://github.com/philipturner/metal-float64
It's going to be pretty slow compared to dedicated hardware. High-end consumer GPU double precision is around 1TFLOPs:
https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889
Dedicated GPUs for this are over 30-40TFLOPs:
https://www.techpowerup.com/gpu-specs/h100-pcie-96-gb.c4164
https://www.techpowerup.com/gpu-specs/radeon-instinct-mi300.c4019
but the Nvidia one costs over $40k:
https://www.newegg.com/p/1VK-0066-00022
It's probably most cost-effective using cloud services:
https://evp.cloud/pricing
The above Metal float64 support was to help add support for 64-bit atomics for Unreal Engine's Nanite feature:
https://github.com/philipturner/metal-benchmarks#nanite-atomics
"The Apple GPU architecture only supports 32-bit atomics on pointer values, while other architectures support texture atomics or 64-bit atomics. The latter two are required to run the current implementation of Nanite in Unreal Engine 5 (UE5). Nanite is a very novel rendering algorithm that removes the need for static LOD on vertex meshes. Rendering infinitely detailed meshes requires subpixel resolution and rasterizing pixels entirely in software. To implement a software-rasterized depth buffer, UE5 performs 64-bit atomic comparisons. The depth value is the upper 32 bits; the color is the lower 32. This algorithm is an example of a larger trend toward using GPGPU in rendering."
Apple added 64-bit atomics for M2 (Apple8), M3 and A17 (Apple9):
https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
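The depth-and-color packing from the Nanite quote can be sketched in a few lines of Python. This is only an illustration of the idea, not UE5's code: it assumes non-negative depth values where the larger value wins, and `atomic_max` is a hypothetical single-threaded stand-in for the GPU's 64-bit atomic.

```python
import struct


def float_bits(x):
    """Reinterpret a 32-bit float as an unsigned integer.

    For non-negative floats, the integer order matches the float order,
    which is what lets a single integer comparison pick the depth winner.
    """
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack(depth, color):
    """Put depth in the upper 32 bits and color in the lower 32."""
    return (float_bits(depth) << 32) | (color & 0xFFFFFFFF)


def unpack_color(word):
    return word & 0xFFFFFFFF


def atomic_max(buffer, index, word):
    """Stand-in for a 64-bit atomic max (assumption: larger depth wins)."""
    buffer[index] = max(buffer[index], word)


pixel = [0]  # a one-pixel "visibility buffer"
atomic_max(pixel, 0, pack(0.25, 0xFF0000FF))
atomic_max(pixel, 0, pack(0.75, 0x00FF00FF))
atomic_max(pixel, 0, pack(0.50, 0x0000FFFF))
print(hex(unpack_color(pixel[0])))
```

Because depth sits in the high bits, one 64-bit comparison updates depth and color together, so no other thread can see a depth from one triangle paired with the color of another. With only 32-bit atomics, that single-step update isn't available.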
Consumer GPUs aren't well suited for double precision computing.
https://www.tomshardware.com/news/intel-arc-will-not-support-fp64-hardware
https://www.techpowerup.com/forums/threads/nerfed-fp64-performance-in-consumer-gpu-cards.272732/
These are features they can add into Mac Pro versions of M3 Extreme. These models would be priced near $10k but if they can do 10-20TFLOPs FP64, it will be useful to some people. I doubt the volume of buyers justifies the manufacturing though, which is why the Nvidia one is over $40k.