Apple-Nvidia collaboration triples speed of AI model production
Apple's latest machine learning research could make creating models for Apple Intelligence faster, by coming up with a technique to almost triple the rate of generating tokens when using Nvidia GPUs.

Training models for machine learning is a processor-intensive task
One of the problems in creating large language models (LLMs) for tools and apps that offer AI-based functionality, such as Apple Intelligence, is inefficiencies in producing the LLMs in the first place. Training models for machine learning is a resource-intensive and slow process, which is often countered by buying more hardware and taking on increased energy costs.
Earlier in 2024, Apple published and open-sourced Recurrent Drafter, known as ReDrafter, a method of speculative decoding to improve performance in training. It used an RNN (Recurrent Neural Network) draft model combining beam search with dynamic tree attention for predicting and verifying draft tokens from multiple paths.
This sped up LLM token generation by up to 3.5 times per generation step versus typical auto-regressive token generation techniques.
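ReDrafter's draft-then-verify loop follows the general speculative decoding pattern: a cheap draft model proposes several tokens ahead, and the full model verifies them in a single pass, accepting the longest correct prefix. A minimal toy sketch of that pattern (not Apple's implementation — the models, vocabulary, and acceptance rule here are illustrative stand-ins) might look like:

```python
import random

random.seed(0)

# Toy vocabulary and "models". In ReDrafter the draft model is a small
# RNN and the target is the full LLM; here both are simple functions.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def target_next(context):
    # Deterministic toy "LLM": the next token cycles through the vocab.
    return VOCAB[len(context) % len(VOCAB)]

def draft_next(context):
    # Cheap draft model that agrees with the target most of the time.
    if random.random() < 0.8:
        return target_next(context)
    return random.choice(VOCAB)

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them with the target model.

    Always returns at least one token, because the target supplies a
    corrected token at the first mismatch."""
    # 1. Draft phase: propose k tokens with the cheap model.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify phase: the target model checks each drafted token.
    #    (A real implementation scores all k positions in one batched
    #    forward pass, which is where the speedup comes from.)
    accepted = []
    ctx = list(context)
    for tok in drafted:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: take the target's token instead and stop.
            accepted.append(expected)
            break
    return accepted

context = ["the"]
while len(context) < 12:
    context.extend(speculative_step(context))
print(" ".join(context))
```

Because every mismatch is corrected with the target model's own token, the output is identical to what plain greedy decoding would produce — the draft model only changes how many target-model passes are needed, not the result.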
In a post on Apple's Machine Learning Research site, the company explained that the work didn't stop with Apple Silicon. The new report, published on Wednesday, detailed how the team took the research behind ReDrafter and made it production-ready for use with Nvidia GPUs.
Nvidia GPUs are often employed in servers used for LLM generation, but the high-performance hardware often comes at a hefty cost. It's not uncommon for multi-GPU servers to cost in excess of $250,000 apiece for the hardware alone, let alone any required infrastructure or other connected costs.
Apple worked with Nvidia to integrate ReDrafter into TensorRT-LLM, Nvidia's inference acceleration framework. Because ReDrafter relies on operators that other speculative decoding methods don't, Nvidia had to add new operators to the framework for it to work.
With its integration, ML developers using Nvidia GPUs in their work can now use ReDrafter's accelerated token generation when using TensorRT-LLM for production, not just those using Apple Silicon.
The result, after benchmarking a production model with tens of billions of parameters on Nvidia GPUs, was a 2.7-times increase in generated tokens per second for greedy decoding.
The upshot is that the process could be used to minimize latency to users and reduce the amount of hardware required. In short, users could expect faster results from cloud-based queries, and companies could offer more while spending less.
In Nvidia's Technical Blog on the topic, the graphics card producer said the collaboration made TensorRT-LLM "more powerful and more flexible, enabling the LLM community to innovate more sophisticated models and easily deploy them."
The report's release follows Apple's public confirmation that it was investigating the potential use of Amazon's Trainium2 chip to train models for Apple Intelligence features. At the time, Apple said it expected a 50% improvement in pretraining efficiency with the chips over existing hardware.
Read on AppleInsider
Comments
Apple hasn’t used Nvidia since. So this is interesting and somewhat surprising.
https://www.zdnet.com/article/ati-on-apple-leak-our-fault/
https://www.pcguide.com/gpu/power-supply-rtx-4080/ 750-watt system recommendation, and the datacenter-class Nvidia stuff is even more out there. The current M2 Ultra takes 107 watts for everything, and the M4 is even more powerful and efficient than that, let alone what's coming up with the M5 and M6.
Currently the 4080 is about 3.3 times faster than the M2 Studio Ultra (Blender). It will be interesting to see how close the M4 Studio Ultra gets next year; we know it'll use a hell of a lot less power to achieve its performance.
Apple Silicon is definitely powerful enough now; were it not for market inertia that would already be clear, and by the time of the M5 it will be indisputable.
My 2019 iMac gets a Geekbench Metal score about a third of the top-end M2 Ultra's, but it's 4 years older than the M2 Ultra, and it didn't cost £5,200 plus a display. The Pro Vega 48 in my iMac was pretty sluggish compared to the equivalent Nvidia card at the time, getting 11,000 on 3DMark while the Nvidia RTX 2080 was getting nearly double that. That shows how Apple screwed over Mac users by refusing to use Nvidia. Nvidia also wrote their own Mac drivers, which were updated all the time and were much better than the Apple-written ATI drivers. And it shows that in the real world, right now, Apple Silicon GPUs are still a long way from matching dedicated GPUs.
At this current rate, Apple seems to be about 2 years behind dedicated graphics cards (since the M4 series is mostly compared to the RTX 4000 series), and the A-series chips are about 4 years behind the M series (since the A18 Pro is comparable to the M1). So if the math holds, an iPhone will be comparable to a modern high-end gaming PC in less than 10 years' time. Although technology seems to be developing faster, and we are nearing the end of the silicon era of development, so who knows how quickly tech will develop during the switch?