Machine Learning (ML) is being employed by businesses in an ever-widening range of commercial applications. Five companies stand out as giants of ML today: Netflix, Google, Twitter, Facebook and Uber. Netflix recently brought them together by hosting a Machine Learning Platform meetup at its headquarters in Los Gatos. Faisal Siddiqi, Engineering Manager for Personalization and Data Infrastructure, posted a blog detailing the talks from top practitioners at each of these companies, summarized here.
Netflix – VectorFlow, a Neural Network Library for Sparse Data
ML has been used for several years on Netflix’s Recommendations and Personalization challenges. More recently, ML has expanded into a range of newer applications, including Efficient Content Delivery, Adaptive Streaming QoE, Content Promotion, Content Price Modeling and Programmatic Marketing. Speeding up and scaling the application of ML is a top priority for the researchers, engineers, and data scientists at Netflix.
At the Platform meetup, Netflix's Benoit Rostykus, Senior Machine Learning Researcher, presented VectorFlow, an open-sourced neural network library for handling sparse data and not-so-deep networks, available on GitHub.
Rostykus argued that while some ML applications are well served by more training data and additional layers (e.g. images in convolutional nets), a surprising number of applications sit at the opposite end of the spectrum, such as programmatic advertising, causal inference, real-time bidding and survival regression. VectorFlow was intended for these contexts, and that intent informed its design decisions. For instance, the choice of the D language addresses the goal of using a single language for both ad-hoc and production work, offering the power of C++ combined with the clarity of Python.
VectorFlow speeds up training considerably because it optimizes for latency rather than throughput. By avoiding large data transfers between RAM and CPU/GPU memory, and by circumventing memory allocations during forward- and back-propagation, Rostykus demonstrated that for a training set of 33M rows with 7% data sparsity, VectorFlow ran each pass of SGD in 4.2 seconds on a single machine with 16 commodity CPU cores.
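The core idea behind cheap training on sparse data can be illustrated with a minimal sketch. This is not VectorFlow code (VectorFlow is written in D); it is a hypothetical Python illustration of why sparse rows make SGD inexpensive: the forward pass and the gradient update touch only the non-zero features of each row, never a full dense vector.

```python
import math

def sgd_sparse_pass(rows, weights, lr=0.1):
    """One pass of logistic-regression SGD over sparse rows.

    Each row is (features, label), where features is a dict
    {feature_index: value} holding only the non-zero entries, so the
    forward pass and the gradient update touch just those indices.
    """
    for features, label in rows:
        # Forward pass: dot product over the non-zero features only.
        z = sum(weights.get(i, 0.0) * v for i, v in features.items())
        p = 1.0 / (1.0 + math.exp(-z))
        # Backward pass: the gradient (p - label) * x is equally sparse.
        g = p - label
        for i, v in features.items():
            weights[i] = weights.get(i, 0.0) - lr * g * v
    return weights

# Tiny demo: feature 0 always co-occurs with label 1, feature 1 with label 0.
rows = [({0: 1.0}, 1), ({1: 1.0}, 0)] * 100
w = {}
for _ in range(20):
    sgd_sparse_pass(rows, w)
```

With a 7% sparsity level like the one Rostykus quoted, each update does roughly one-fourteenth of the work a dense pass would.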
The roadmap for VectorFlow includes deeper sparsity support, additional complex nodes, and more optimizers, without giving up the principle underlying VectorFlow that simple beats complex.
Google – Focused Learning and Factorized Deep Retrieval
Ed Chi, Research Scientist at Google, presented first, discussing various implementation challenges that the TensorFlow team encountered and solved in the internal TFX (TensorFlow Extended) system. The team intends to add some of these features to the open-source TF Serving.
Chi focused on recommender systems and how to learn user and item representations for sparse data, particularly in relation to Focused Learning and Factorized Deep Retrieval. He began with the "tyranny of the majority": the outsized, often negative impact that dense sub-zones have on otherwise sparse data sets. As a solution, he described Focused Learning, in which subsets of the data are chosen to focus on, and models are then matched to subsets of differing data density. As an example, he explained how using separate focused and unfocused regularization factors in the cost function allowed TensorFlow to model the problem more successfully. For movie recommendations, he noted that a recommender must both ensure relevance and surface fresh content, and he showed that Focused Learning led to surprisingly large improvements in prediction quality on sparse subsets (e.g. documentaries, one of the sparsest categories).
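The focused/unfocused regularization idea can be sketched schematically. This is not Google's actual formulation; the function below is a hypothetical illustration of a cost with separate L2 penalties for parameters tied to the focused data subset versus the rest:

```python
def focused_objective(sq_errors, params, in_focus, lam_focused, lam_unfocused):
    """Squared-error cost with separate L2 penalties for parameters tied
    to the focused data subset versus everything else.

    Giving the focused slice its own regularization strength lets a model
    fit a sparse sub-population without the dense majority washing it out.
    """
    data_term = sum(sq_errors)
    reg_term = sum((lam_focused if in_focus[i] else lam_unfocused) * p * p
                   for i, p in enumerate(params))
    return data_term + reg_term

# Two squared errors, one "focused" parameter (weak penalty, 0.1) and one
# "unfocused" parameter (strong penalty, 1.0).
cost = focused_objective([1.0, 1.0], [1.0, 2.0], [True, False], 0.1, 1.0)
```

Tuning `lam_focused` independently is what lets the sparse slice (e.g. documentaries) keep expressive parameters while the dense majority stays heavily regularized.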
Factorized Deep Retrieval is especially good at predicting co-watch patterns of videos when the number of items to recommend is massive, as on YouTube. Chi presented TensorFlow's solution for picking sparse but relevant candidates: a distributed implementation of WALS (weighted alternating least squares) factorization. Online ranking for serving was staged in a tiered approach, with a first-pass nominator picking a judiciously small set; subsequent rankers refined the selection further until only highly relevant candidates were shown to the user.
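The tiered serving flow, a cheap first-pass nominator followed by a more expensive ranker, can be sketched as follows. This is a hypothetical illustration, not Google's serving code; `rerank` stands in for an arbitrarily expensive second-stage model:

```python
def retrieve(query_vec, item_vecs, nominate_k, final_k, rerank):
    """Two-stage retrieval: a cheap dot-product nominator scans every
    item and keeps a small candidate set; an expensive ranker then
    re-scores only those candidates."""
    # First pass: score all items with a cheap dot product.
    scores = [(sum(q * x for q, x in zip(query_vec, v)), i)
              for i, v in enumerate(item_vecs)]
    candidates = [i for _, i in sorted(scores, reverse=True)[:nominate_k]]
    # Second pass: the costly ranker only ever sees nominate_k items,
    # not the full catalog.
    return sorted(candidates, key=rerank, reverse=True)[:final_k]

top = retrieve([1.0], [[1.0], [3.0], [2.0], [0.0]],
               nominate_k=3, final_k=2, rerank=lambda i: -i)
```

The point of the tiering is that the expensive model's cost scales with `nominate_k`, not with the size of the full item catalog.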
Twitter – Twitter’s Parameter Server for Online Learning
Following Google, Joe Xie, Senior Software Engineer and Tech Lead at Twitter, discussed ML for real-time use cases at Twitter. He described their merged online training and prediction pipeline and the parameter-server approach his team takes to scaling training and serving. Xie showed how the team scales up one area at a time, stepping through the three stages of Twitter's parameter server: (i) decoupling training and prediction requests, then (ii) scaling up training traffic, and finally (iii) scaling up model size.
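The pull/push contract at the heart of a parameter server can be sketched in a few lines. This is a toy illustration, not Twitter's implementation; the class and method names below are hypothetical:

```python
class ToyParameterServer:
    """Minimal pull/push parameter server: trainers push gradients,
    predictors pull current weights. Keeping the two request types on
    separate paths is what allows each side to be scaled independently."""

    def __init__(self):
        self.weights = {}

    def pull(self, keys):
        # Prediction path: read-only snapshot of the requested weights.
        return {k: self.weights.get(k, 0.0) for k in keys}

    def push(self, grads, lr=0.1):
        # Training path: apply one gradient-descent update per key.
        for k, g in grads.items():
            self.weights[k] = self.weights.get(k, 0.0) - lr * g

ps = ToyParameterServer()
ps.push({"w1": 1.0, "w2": -2.0}, lr=0.5)
snapshot = ps.pull(["w1", "w2", "w3"])
```

In a real deployment the weight table is sharded across machines, which is what makes stages (ii) and (iii), scaling traffic and model size, tractable.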
By separating training and prediction requests in v1, they increased prediction queries per second tenfold. In v2, they increased the training set size twenty-fold, and by v3 they could accommodate models ten times larger.
As next steps, they intend to investigate feature sharding and data-parallel optimizations across their v3 distributed architecture.
Uber – Horovod for Distributed TensorFlow
Alex Sergeev, Senior Software Engineer at Uber, introduced the meetup to Horovod, a distributed training library that Uber recently open-sourced, intended to make distributed TensorFlow workloads more straightforward.
Sergeev began by setting out the problem of scaling up model training: with ever-growing data sizes and the need for faster training, Uber decided to explore distributed TensorFlow. His team immediately ran into challenges with the distributed TensorFlow package, particularly a lack of clarity about which knobs to use, and poor GPU utilization when training at scale.
The Uber ML team explored various approaches to data parallelism in distributed training. Motivated by earlier investigations at Baidu and Facebook, the team took a new approach to deep network training: the training set is split into chunks of data, each processed in parallel by executors that compute gradients, which are then averaged and fed back into the model.
Conventionally in TensorFlow, workers compute gradients and send them to one or more parameter servers for averaging. Horovod instead employs a data-parallel "ring-allreduce" algorithm that removes the need for a parameter server: each worker both computes gradients and participates in averaging them, communicating peer-to-peer via NVIDIA's NCCL2 library.
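The ring-allreduce pattern can be simulated in plain Python to show why no central server is needed. This is a single-process sketch of the algorithm's two phases, not Horovod code (Horovod delegates the actual communication to NCCL and MPI):

```python
def ring_allreduce(grads):
    """Simulate ring-allreduce: n workers each start with their own
    gradient vector and all end with the element-wise sum, exchanging
    data only with ring neighbors. Assumes len(vector) % n == 0."""
    n = len(grads)
    chunk = len(grads[0]) // n
    bufs = [list(g) for g in grads]

    def seg(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Each step, worker r passes one chunk to
    # its ring neighbor, which accumulates it. After n-1 steps, worker r
    # holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        data = [[bufs[r][i] for i in seg(c)] for r, c in sends]
        for (r, c), vals in zip(sends, data):
            dst = (r + 1) % n
            for i, v in zip(seg(c), vals):
                bufs[dst][i] += v

    # Phase 2: all-gather. The reduced chunks circulate around the ring
    # so every worker ends with the complete summed vector.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        data = [[bufs[r][i] for i in seg(c)] for r, c in sends]
        for (r, c), vals in zip(sends, data):
            dst = (r + 1) % n
            for i, v in zip(seg(c), vals):
                bufs[dst][i] = v

    return bufs

out = ring_allreduce([[1.0, 2.0, 3.0],
                      [10.0, 20.0, 30.0],
                      [100.0, 200.0, 300.0]])
```

Because every worker sends and receives a fixed-size chunk at each step, bandwidth use per worker stays constant as workers are added, which is the property that makes the ring attractive at scale.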
Facebook – GPU Training at Scale
Aapo Kyrola, Software Engineer (AI) at Facebook, shared his team's experiences with GPU optimization for large-scale training.
Kyrola gave an overview of Caffe2, a lightweight ML framework, stressing how training can be modeled as a directed graph problem. He compared synchronous SGD (stochastic gradient descent) with asynchronous SGD and described how Caffe2 pushed the boundaries of sync SGD with GLOO, an open-sourced fast communication library that leverages NVIDIA's NCCL2 for inter-GPU reductions.
Kyrola also gave a case study of Facebook's recently published landmark achievement: training a ResNet-50 (a 50-layer residual convolutional net) on the ImageNet-1K dataset in under one hour. The "all-reduce" algorithm from GLOO was used in a data-parallel sync SGD approach, which allowed them to adjust the learning rate piece-wise with an 8K mini-batch size and achieve outstanding error-rate metrics.
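The relationship between mini-batch size and learning rate in that work can be sketched as a simple schedule. This is a simplified, hypothetical illustration of the linear-scaling-with-warmup idea; the actual published schedule also steps the rate down piece-wise at later epochs:

```python
def scaled_lr(base_lr, base_batch, batch, warmup_epochs, epoch):
    """Learning-rate schedule for large mini-batches: when the batch
    grows by a factor k over the reference size, scale the target
    learning rate by k, and ramp up linearly ("warmup") over the first
    few epochs to keep early training stable."""
    target = base_lr * batch / base_batch
    if epoch < warmup_epochs:
        # Linear warmup from base_lr toward the scaled target.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

# With a reference batch of 256 at lr 0.1, an 8K batch targets lr 3.2,
# reached only after the warmup period.
lr_epoch0 = scaled_lr(0.1, 256, 8192, warmup_epochs=5, epoch=0)
lr_epoch5 = scaled_lr(0.1, 256, 8192, warmup_epochs=5, epoch=5)
```

The warmup matters because jumping straight to a 32x-scaled rate tends to diverge in the first epochs, when the network's weights are still far from any good region.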
Kyrola closed by describing how far sync SGD can be pushed on modern GPUs, and how the learning rate is a critical hyper-parameter to tune alongside mini-batch size.
For full details, check out Faisal Siddiqi’s blog.