TF-Replicator: Distributed Machine Learning for Researchers
At DeepMind, the Research Platform Team builds infrastructure to empower and accelerate our AI research. Today, we are excited to share how we developed TF-Replicator, a software library that helps researchers deploy their TensorFlow models on GPUs and Cloud TPUs with minimal effort and no previous experience with distributed systems. TF-Replicator's programming model has now been open sourced as part of TensorFlow's tf.distribute.Strategy. This blog post gives an overview of the ideas and technical challenges underlying TF-Replicator. For a more comprehensive description, please read our arXiv paper.

A recurring theme in recent AI breakthroughs – from AlphaFold to BigGAN to AlphaStar – is the need for effortless and reliable scalability. Increasing amounts of computational capacity allow researchers to train ever-larger neural networks with new capabilities. To address this, the Research Platform Team developed TF-Replicator, which allows researchers to target different hardware accelerators for machine learning, scale up workloads to many devices, and seamlessly switch between different types of accelerators.
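As a rough sketch of how this programming model surfaces in TensorFlow today, the snippet below uses tf.distribute.MirroredStrategy to replicate training across all local GPUs; swapping in a different strategy (such as tf.distribute.TPUStrategy) retargets the same code to other accelerators. The toy model and dataset here are illustrative placeholders, not TF-Replicator's own examples.

```python
import tensorflow as tf

# MirroredStrategy replicates variables and computation across all
# local GPUs; a TPUStrategy would target Cloud TPUs with the same code.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables created inside the strategy's scope are mirrored across
# devices and kept in sync during training.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Keras splits each global batch across the replicas automatically.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, batch_size=256, epochs=1)
```

Because the distribution logic lives in the strategy object rather than the model code, scaling from one device to many, or switching accelerator types, is a one-line change rather than a rewrite.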