When the hyperscalers and cloud builders were smaller and the Arm collective had failed to storm the datacenter and AMD was not yet on its path to resurgence, it was Intel that controlled the cadence of new compute engine introductions into the datacenter.
Given this week, which started off with Intel chief executive officer Pat Gelsinger being ousted and Amazon Web Services hosting its annual re:Invent conference in Las Vegas with 60,000 people attending in person and 400,000 people attending online, it is all too obvious who controls the pace of technology rollouts at the hyperscalers and cloud builders.
They do.
And they also control when they do not roll out new technologies, because they do not have to have a new thing to sell like other chip designers do. They are not in the business of selling compute engines to ODMs and OEMs, like Intel and AMD and Nvidia are, but rather they create virtualized utilities and sell access to raw capacity directly to customers. It is a much smoother, and easier, business in many ways.
If you sat through the opening keynote late last night with Peter DeSantis, who is senior vice president of utility computing at AWS, and the keynotes today by Matt Garman, chief executive officer at AWS, and Andy Jassy, chief executive officer at parent Amazon, you were probably waiting, as we were, for some announcements about future compute engines such as Graviton5 server CPUs, Inferentia3 AI inference accelerators, or Trainium3 AI training accelerators.
Alas, with the exception of one slide from Garman that showed Trainium3 being etched using 3 nanometer processes (presumably from Taiwan Semiconductor Manufacturing Co), having twice the performance of Trainium2, and delivering 40 percent better performance per watt than Trainium2, there was no talk of future homegrown silicon coming out of AWS.
Garman added that Trainium3 was "coming later next year," which presumably means it will be launched at re:Invent 2025. Back in June, there was some buzz about an AWS executive confirming that Trainium3 will bust through 1,000 watts, which would not surprise us in the slightest. The top bin "Blackwell" B200 GPUs from Nvidia peak out at 1,200 watts.
This is still lower wattage than the hair dryers everyone else in my house uses and that I have not had a need for in over four decades. So we are not freaked out quite yet. But it is also a dozen incandescent bulbs, which is a weird thought, particularly if you never waited quite long enough for them to cool before taking them out, as we often did not.
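For what it is worth, the power arithmetic implied by that Trainium3 slide is easy to sketch out. Here is a little back-of-the-envelope calculation, with the caveat that we are assuming the 2X performance and 40 percent better performance per watt figures are both measured against Trainium2 and that nothing else changes:

```python
# Back of the envelope: what do the Trainium3 claims imply about power draw?
# Assumptions (ours, not AWS's): the 2X performance and the 40 percent better
# performance per watt figures are both relative to Trainium2.

perf_ratio = 2.0           # Trainium3 performance versus Trainium2
perf_per_watt_ratio = 1.4  # 40 percent better performance per watt

# Power scales as performance divided by efficiency.
power_ratio = perf_ratio / perf_per_watt_ratio
print(f"Implied power increase: {power_ratio:.2f}X")  # ~1.43X

# What would Trainium2 have to draw for Trainium3 to cross 1,000 watts?
breakeven_watts = 1000 / power_ratio
print(f"Trainium2 baseline needed to cross 1,000 W: ~{breakeven_watts:.0f} W")  # ~700 W
```

In other words, if Trainium2 is pulling anywhere north of 700 watts, busting through 1,000 watts with Trainium3 is entirely plausible.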
We were a little surprised that we did not already see a Graviton4E deep bin sort aimed at HPC applications at the SC24 supercomputing conference last month, which would have mirrored what AWS did with the plain vanilla Graviton3 in November 2021 and the boosted Graviton3E in November 2022. Graviton4, arguably one of the best Arm-based server CPUs on the market and certainly the one that is most available for anyone to use, came out in November 2023 and got a memory boost in September of this year.
AWS is under precisely zero pressure to have an annual cadence for its CPUs, AI accelerators, and DPUs, and if you look carefully at the GPU roadmaps from Nvidia and AMD, their core products are still coming out once every two years, with opportunistic memory upgrades or a performance tweak coming in the second year on essentially the same GPU that was announced in the first year.
The cadence at AWS for silicon looks to be two years, with a wiggle here and there. Graviton1 was really a "Nitro" DPU card on steroids and it sort of doesn't count. Graviton1 was "a signal into the market," as DeSantis put it in his keynote when it came out in 2018, testing the idea that customers were finally ready for Arm CPUs in the datacenter. With the Graviton2 in 2019, AWS jumped onto a modern 7 nanometer process from TSMC and used "Ares" N1 cores from Arm Ltd to create a 64-core device that could do useful work at 40 percent better bang for the buck compared to X86 CPUs from Intel and AMD running on the AWS cloud.
Two years later, Graviton3 came out using the much more powerful "Zeus" V1 core from Arm and could suddenly take on bigger jobs even though it "only" had 64 cores. Two years later, Graviton4 was out, and we think it was a shrink to 4 nanometer TSMC processes to cram 96 "Demeter" V2 cores on the socket against a dozen DDR5 memory controllers with 537.6 GB/sec of memory bandwidth. Core for core, Graviton4 offered 30 percent more oomph per core and 50 percent more cores than Graviton3, which is 2X the performance, generally speaking, and according to our pricing analysis here, somewhere between 13 percent and 15 percent better bang for the buck. On real-world benchmarks, Graviton4 sometimes delivered 40 percent more performance.
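If you want to check our math on that generational jump, here is a quick sketch, with the caveat that the DDR5-5600 speed we plug in for the memory controllers is our assumption working backwards from the 537.6 GB/sec figure, not something AWS spelled out:

```python
# Quick check on the Graviton3 to Graviton4 arithmetic cited above.

g3_cores, g4_cores = 64, 96
per_core_uplift = 1.30                    # 30 percent more oomph per core

socket_uplift = (g4_cores / g3_cores) * per_core_uplift
print(f"Socket-level uplift: {socket_uplift:.2f}X")   # ~1.95X, call it 2X

# Memory bandwidth: a dozen DDR5 controllers, assuming DDR5-5600 speeds (our guess).
channels = 12
bytes_per_transfer = 8                    # 64-bit DDR5 channel
megatransfers_per_sec = 5600
bandwidth_gbs = channels * bytes_per_transfer * megatransfers_per_sec / 1000
print(f"Aggregate memory bandwidth: {bandwidth_gbs:.1f} GB/sec")  # 537.6 GB/sec
```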
Frankly, AWS has to get two years out of a processor design to recoup what must be a pretty hefty investment. And so it was unreasonable - if not greedy - to expect any news about Graviton5 at this week's re:Invent 2024. Still, DeSantis or Garman or Jassy could have dropped a breadcrumb, just the same.
The top brass at AWS did offer some interesting stats about Graviton in their keynotes. Dave Brown, vice president of compute and networking services at AWS, showed this very interesting chart, and it explains in part why Intel's financials have been so awful in recent quarters:
Roughly speaking, about half of the processing underneath four core services at AWS - Redshift Serverless and Aurora databases, Managed Streaming for Kafka, and ElastiCache in-memory caching - is running on Graviton instances. On the just-passed Prime Day shopping event, Amazon rented over 250,000 Graviton processors to support the operation.
"And recently, we have reached a significant milestone," Brown went on to say. "Over the last two years, more than 50 percent of all the CPU capacity landed in our datacenters was on AWS Graviton. Think about that. That's more Graviton processors than all the other processor types combined."
This is exactly what Microsoft said it wanted to do so many years ago, and it is exactly what we expect. In the long run, X86 is a legacy platform with a legacy price. Just like mainframes and RISC/Unix before it. And RISC-V will perhaps eventually do this to the Arm architecture. (We shall see, but an open source ISA with open source, composable blocks and expert oversight seems to be the path. Look at how Linux conquered operating systems and turned Windows Server into a legacy platform.)
Garman had this to say, giving us a sense of the magnitude of the Graviton server fleet inside of AWS: "Graviton is growing like crazy. Let's put this into context. In 2019, all of AWS was a $35 billion business. Today, there's as much Graviton running in the AWS fleet as all compute in 2019. It's pretty impressive growth."
We would love to know how big the server fleet was in 2019 and where it is today. What can be honestly estimated, we think, is that the Graviton server fleet is growing faster than AWS itself, and probably by a very wide margin. And this has hurt Intel a hell of a lot more than it has hurt AMD, which has had better X86 server CPUs than Intel for years now.
The only reason Garman did talk about Trainium3 is because the need for high performance compute in AI training (and increasingly inference) is growing much faster than anyone can supply compute engines. With Nvidia ramping up its "Blackwell" B100 and B200 GPUs and AMD broadening out its "Antares" MI300 series this coming year, AWS can't look like it is not committed to revving its AI silicon if it wants customers to be comfortable porting their AI workloads to Trainium. Hence the breadcrumb about Trainium3.
That said, we do expect for AWS to say something else about Trainium3 before re:Invent comes around on the guitar next November or December, just because everyone else - Google and Microsoft are the ones that matter - will be making noise about their homegrown AI accelerators in 2025.
Like the Graviton line, we think the Trainium line is on a two year cadence of launches from here on out. These devices are expensive, and AWS has to amortize the cost of Trainium development over the largest possible number of devices to make the financials work out - just as it has had to do with Graviton CPUs. And like Gravitons, we do not think the day is too distant when AWS will have half of its AI training and inference capacity on its homegrown Annapurna Labs chips. Which means trouble for Nvidia and AMD in the long run. Especially if Google, Microsoft, Tencent, Baidu, and Alibaba all do the same thing.
AWS is not silly enough to try to take on Nvidia in the GPU accelerator market, but like Google with the TPU, SambaNova with the RDU, Groq with the GroqChip, and Graphcore with the IPU, the cloud builder absolutely thinks it can build a systolic array to do AI training and inference that is differentiated and that adds value for cloud customers - and presumably will have better margins or at least more control compared to just buying Nvidia GPUs and being done with it.
As we pointed out above, AWS executives didn't say much about Trainium3, but they were very excited about Trainium2 becoming available in Trn2 instances in UltraServer pods.
We detailed the architecture of the Trainium2 and its predecessor Trainium1 as well as its companion Inferentia1 and Inferentia2 accelerators for AI inference back in December 2023 after last year's re:Invent conference. (You can read it here.) This week, AWS talked a little bit more about the architecture of the systems that use the Trainium2 accelerators and also showed off the networking hardware it has built to scale up and scale out its AI clusters based on them.
As we pointed out last year, it looks like Trainium2 has two chiplets interlinked on a single package, probably using a NeuronLink die-to-die interconnect that is based on the fabric interconnect used to connect Trainium1 and Trainium2 chips to each other to share work coherently across their shared HBM memories.
A Trainium2 server has a head node with a pair of host processors (presumably they are Graviton4, but DeSantis did not say) coupled to three Nitro DPUs, like this:
And here is a top view of the compute node, which has four Nitros on the front end and two Trainium2s on the back end, with a cableless design to speed up deployment:
Two switch sleds, one host sled, and eight compute sleds make up a Trainium2 server, which uses 2 TB/sec NeuronLink cables to interconnect the sixteen Trainium2 chips into a 2D torus configuration that shares the 96 GB of HBM3 main memory on each device with all of the other devices. Each Trainium2 server has 1.5 TB of HBM3 memory with 46 TB/sec of aggregate memory bandwidth (that is a little less than 3 TB/sec per Trainium2 card). This node has 20.8 petaflops of performance on dense FP8 data and 83.3 petaflops on sparse FP8 data. (AWS is getting a 4:1 compression ratio on sparse data, compared to 2:1 with Nvidia's "Hopper" and "Blackwell" GPUs and 10:1 with the Cerebras Systems waferscale engines.)
Four of these servers are interconnected to create a Trainium2 UltraServer, which has 6 TB of total HBM3 memory capacity across 64 of the AI accelerators, with an aggregate of 184 TB/sec of memory bandwidth. This pod has 12.8 Tb/sec of Ethernet bandwidth for interconnectivity using EFAv3 adapters. The UltraServer pod has 83.2 petaflops of oomph on dense FP8 data and 332.8 petaflops on sparse FP8 data.
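The server and UltraServer numbers hang together if you do the arithmetic from the per-chip figures. Here is a quick sketch; note that the per-chip FP8 throughput is our inference from the server total, not a number AWS quotes directly:

```python
# Recomputing the Trn2 server and UltraServer figures from the per-chip numbers.
# The per-chip FP8 throughput is inferred from the server total, not quoted by AWS.

chips_per_server = 16
servers_per_ultraserver = 4

hbm_per_chip_gb = 96
bandwidth_per_server_tbs = 46
server_dense_fp8_pflops = 20.8
sparse_ratio = 4                          # 4:1 sparsity speedup on Trainium2

# Per-server totals
print(f"Server HBM3: {chips_per_server * hbm_per_chip_gb / 1024:.1f} TB")               # 1.5 TB
print(f"Per-chip bandwidth: {bandwidth_per_server_tbs / chips_per_server:.3f} TB/sec")   # 2.875
print(f"Per-chip dense FP8: {server_dense_fp8_pflops / chips_per_server:.1f} petaflops") # 1.3

# UltraServer totals (four servers lashed together with NeuronLink)
chips = chips_per_server * servers_per_ultraserver                   # 64 accelerators
hbm_tb = chips * hbm_per_chip_gb / 1024                              # 6 TB of HBM3
bandwidth = bandwidth_per_server_tbs * servers_per_ultraserver       # 184 TB/sec
dense = server_dense_fp8_pflops * servers_per_ultraserver            # 83.2 petaflops
print(f"UltraServer: {chips} chips, {hbm_tb:.0f} TB HBM3, {bandwidth} TB/sec, "
      f"{dense:.1f} PF dense FP8, {dense * sparse_ratio:.1f} PF sparse FP8")  # 332.8 sparse
```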
Here is DeSantis showing off the iron behind the Trn2 UltraServer instance:
At the top of the rack, buried behind a lot of wires, is a pair of switches, which comprise the endpoints of the 3.2 Tb/sec EFAv3 Ethernet network that links multiple Trainium2 servers to each other to create the UltraServer pod and to link pods to each other and to the outside world:
Don't think that is all there is to networking. If you want to run large-scale foundation models, you are going to need a lot more than 64 accelerators. And to lash together machines with hundreds of thousands of accelerators that can do a hero training run, AWS has cooked up a network fabric - presumably based on Ethernet - called 10p10u that has the goal of delivering tens of petabits per second of bandwidth with under 10 microseconds of latency across the network.
Here is what a rack of the 10p10u network fabric looks like:
That wiring above in the patch rack gets pretty hairy, so AWS invented a fiber optic trunk cable that puts hundreds of fiber optic connections in a single fat pipe and delivers a 16:1 reduction in the number of cables to manage. This makes the patch rack simpler, as is shown below:
The patch rack on the right is using the fiber optic trunk cable, and it is a lot cleaner and a lot smaller as well. Fewer connections and wires to manage means fewer mistakes, and this matters when you are trying to build out AI infrastructure fast.
As far as we know, this 10p10u network is not used exclusively for AI workloads, but AI workloads are clearly driving its adoption. And DeSantis showed how fast it has ramped compared to older - and presumably slower - Ethernet networks created by AWS. Take a gander:
Assuming that this is cumulative link counts, which is all that makes sense to count, the older Euclid network fabric (presumably 100 Gb/sec) has risen gradually over four years to reach nearly 1.5 million ports. The network called One Fabric launched about the same time as the 10p10u network did in the middle of 2022, and we presume that one is using 400 Gb/sec Ethernet while 10p10u is almost certainly based on 800 Gb/sec Ethernet. But those are admittedly guesses. One Fabric has around 1 million links, while 10p10u looks like it has around 3.3 million links.
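If you take our speed guesses at face value, the arithmetic on that 10p10u fabric is straightforward to sketch, with the caveat that the 800 Gb/sec per-link figure is our assumption and not something AWS has confirmed:

```python
# Rough arithmetic on the 10p10u fabric, using the link speed guesses above.
# The 800 Gb/sec per-link figure is our assumption, not an AWS-confirmed spec.

target_pbps = 10            # "tens of petabits per second" -- take 10 Pb/sec as the floor
link_gbps = 800             # assumed per-link speed for 10p10u

links_per_fabric = target_pbps * 1_000_000 / link_gbps
print(f"Links needed for {target_pbps} Pb/sec at {link_gbps} Gb/sec each: "
      f"{links_per_fabric:,.0f}")                         # 12,500 links

# Fleet-wide, the roughly 3.3 million 10p10u links deployed so far would imply
# an enormous amount of raw capacity across all installations of the fabric.
fleet_links = 3_300_000
print(f"Implied fleet-wide capacity: {fleet_links * link_gbps / 1_000_000:,.0f} Pb/sec")  # ~2,640
```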
To wrap it all up and put a bow on it, Garman says that the Trn2 instances will yield somewhere between 30 percent and 40 percent better bang for the buck compared to GPU-based instances on the AWS cloud. Now where have we heard those numbers before? Oh right. . . . Graviton's price/performance advantage over X86 on the AWS cloud.
AWS can make those gaps between outside compute engines and its homegrown ones whatever it wants, of course. And that is probably the right gap to keep if it wants Trainium to be half of its AI training fleet in the not too distant future.
One last thing. As part of the keynotes, both DeSantis and Garman talked about a supercluster code-named Project Rainier that AWS was building so AI model partner Anthropic, which Amazon has pumped $8 billion into so far, has machinery on which to train its next-generation Claude 4 foundation models. Garman said that Project Rainier would have "hundreds of thousands" of Trainium2 chips and would have 5X the performance of the machine that the Claude 3 models were trained upon.