Fast training of machine learning (ML) models is critical for research and engineering teams that deliver new products, services, and research breakthroughs that were previously out of reach. Here at Google, recent ML-enabled advances have included more helpful search results and a single ML model that can translate 100 different languages.
The latest results from the industry-standard MLPerf benchmark competition demonstrate that Google has built the world’s fastest ML training supercomputer. Using this supercomputer, as well as our latest Tensor Processing Unit (TPU) chip, Google set performance records in six out of eight MLPerf benchmarks.
We achieved these results with ML model implementations in TensorFlow, JAX, and Lingvo. Four of the eight models were trained from scratch in under 30 seconds. To put that in perspective, consider that in 2015, it took more than three weeks to train one of these models on the most advanced hardware accelerator available. Google’s latest TPU supercomputer can train the same model almost five orders of magnitude faster just five years later.
In this blog post we’ll look at some of the details of the competition, how our submissions achieve such high performance, and what it all means for your model training speed.