Earlier this year, Google announced its Learn2Compress technology as part of its ML Kit. Google’s AI research team developed Learn2Compress to power its automatic model compression service. Learn2Compress shrinks large AI models into compact versions that fit on resource-constrained hardware, enabling on-device machine learning, both training and inference. On-device ML brings AI processing directly to mobile and IoT devices at the edge, “with the benefits of data privacy and access everywhere, regardless of connectivity”.
While ML Kit has been available to all mobile developers since it launched at Google I/O in May, Learn2Compress is currently available only to a small number of developers, though Google says it will soon be “offered more broadly”. Developers who want to explore Learn2Compress now can join a sign-up list.
The Bigger Picture: Why is AI Processing Moving to the Edge?
In general, AI, including ML, has been performed in the cloud, as successful deep learning models typically demand large amounts of compute, memory, and power to train and run. Over the last six years, the industry has experienced a 300,000x growth in computing requirements, with graphics processing units (GPUs) contributing most of that horsepower.
As computing at the edge rises in popularity (as we are detailing in our current research series), and as smaller devices – mobile and IoT – require at least some on-device data processing capability, the need for compression techniques for AI processing has become urgent. Instead of all data being sent back to the cloud as the main site for AI models, a more decentralized architecture is emerging, in which some or all AI processing, both training and inference, takes place at the edge rather than in a centralized, cloud-based environment.
The Compute Opportunity for AI at the Edge
In Tractica’s recent report, Artificial Intelligence for Edge Devices, the tech-focused market research company estimated that the compute opportunity for AI at the edge will grow to $51.6 billion by 2025. The main drivers for the transition, according to Tractica, include, “the need for data privacy, issues with bandwidth, cost, latency, and security”. As key reasons behind the migration of AI processing to the edge, Tractica also cites the innovative impact of model compression technologies such as Google’s Learn2Compress, along with the rise in federated learning and blockchain-based decentralized AI architectures.
Tractica forecasts that AI edge device shipments worldwide will grow from 161.4 million units in 2018 to 2.6 billion units annually by 2025. In terms of unit volumes, Tractica’s analysis indicates that the leading AI-enabled edge devices will be mobile phones, smart speakers and PCs/tablets in that order, followed by head-mounted displays, automotive, drones, consumer and enterprise robots, and security cameras.
“Privacy, security, cost, latency, and bandwidth all need to be considered when evaluating cloud versus edge processing,” says London-based research director Aditya Kaul. “Depending on the AI application and device category, there are several hardware options for performing AI edge processing. These options include CPUs, GPUs, ASICs, FPGAs, and SoC accelerators.”
Another interesting aspect of Tractica’s research (for another time), also raised in Kaul’s recent blog post, is the intersection between 5G and edge computing, and the question it raises: if 5G lives up to its hype, could it make edge computing irrelevant? While Kaul points to the compute possibilities for AI processing at the edge, he admits this “transition becomes a bit blurry with 5G entering the picture”.
Moving AI Inference to the Edge
Edge processing today is largely concentrated on migrating AI inference (which takes place post-training) to the device. AI inference is less compute-intensive than training, so it is the more practical candidate for moving to the edge device. Google’s on-device ML models have largely focused on running inference directly on its devices.
Back to Google: Machine Learning at Google
At the Google I/O ‘18 conference, Brahim Elbouchikhi, ML Product Manager, led a presentation on the ML Kit and Google’s work in the field. He opened by describing the context for Google’s developments in the space and discussing the last few years in machine learning. According to Google Trends, there has been a 50x increase in search interest in deep learning over the last eight years. Elbouchikhi reminded the audience that the promise of deep learning has been around since the 1970s, but what has changed is that there is now the computing power to back up that promise and run the models that have been developed for some time; increasingly, that promise is being brought not only to desktop computing, but also to mobile. As Elbouchikhi writes in an accompanying blog post, “In today’s fast-moving world, people have come to expect mobile apps to be intelligent – adapting to users’ activity or delighting them with surprising smarts. As a result, we think machine learning will become an essential tool in mobile development”.
Google’s ML Kit: an SDK for Developers of All Backgrounds
This is the context behind the release of Google’s ML Kit in beta, an SDK that “brings Google’s machine learning expertise to mobile developers in a powerful, yet easy-to-use package on Firebase”. The goal is to make machine learning a viable option for all developers, regardless of their current machine learning expertise. It does this in several ways, including by offering five production-ready “base” APIs: developers feed data into ML Kit and get back an intuitive response. The five APIs address common mobile use cases:
- Text recognition
- Face detection
- Barcode scanning
- Image labeling
- Landmark recognition
The APIs are offered in both on-device and cloud-based versions. The on-device APIs work without a network connection, allowing for quick processing of data, while the cloud-based APIs “leverage the power of Google Cloud Platform’s machine learning technology to give a higher level of accuracy”.
The ML Kit also gives more seasoned developers the opportunity to develop their own TensorFlow Lite models and build a custom API. Models are uploaded to the Firebase console, and Google then manages hosting and serving them to app users. This means that updates can be made without developers needing to re-publish their apps. It also reduces app install size, since models are kept out of the APK/bundle.
Another challenge Google is addressing in terms of data size is the general growth in app sizes, which can hurt app store install rates and potentially cost users more in data overages. Machine learning can exacerbate this trend, as models can reach tens of megabytes in size. Learn2Compress and model compression are a response to this: developers upload a full TensorFlow model and training data, and receive in return a compressed TensorFlow Lite model.
Learn2Compress: In Detail
In the words of Sujith Ravi, Senior Staff Research Scientist, Google Expander Team, Learn2Compress “enables custom on-device deep learning models in TensorFlow Lite that run efficiently on mobile devices, without developers having to worry about optimizing for memory and speed”.
How it Works
Learn2Compress takes the learning framework introduced in previous on-device ML work like ProjectionNet and combines it with various cutting-edge techniques for compressing neural network models. The user feeds a large pre-trained TensorFlow model into Learn2Compress, which then performs training and optimization and automatically creates ready-to-use on-device models that are smaller, more memory- and power-efficient, and quicker at inference, with “minimal loss in accuracy”.
The various neural network optimization and compression techniques Learn2Compress uses to do this include:
- Pruning shrinks model size by removing weights or operations that contribute least to predictions, such as low-scoring weights. This can be particularly effective for on-device models involving sparse inputs or outputs. Pruned models can be reduced up to 2x in size while retaining 97% of the original prediction quality.
- Quantization techniques are most effective when applied during training and can improve inference speed by reducing the number of bits used for model weights and activations. Using 8-bit fixed-point representation instead of floats, for instance, can speed up model inference, lower power consumption, and reduce model size by 4x.
- Joint training and distillation approaches follow a teacher-student learning strategy: Google uses a larger teacher network (in the case of Learn2Compress, the user-provided TensorFlow model) to train a compact student network (the on-device model) with minimal loss in accuracy. The teacher network can be jointly optimized or held fixed, and it can simultaneously train several student models of different sizes. Learn2Compress can therefore generate multiple on-device models in a single shot, at different sizes and inference speeds, allowing the developer to select the one best suited to their specific application.
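Google has not published Learn2Compress’s internals, but the first two ideas above are simple enough to sketch. The following is a minimal NumPy illustration, not Google’s implementation: magnitude pruning zeroes the smallest-magnitude weights, and 8-bit quantization maps float32 weights to integers with a per-tensor scale, cutting storage by 4x.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights (a simple form of pruning)."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_8bit(weights):
    """Map float32 weights to int8 with a single per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix

pruned = magnitude_prune(w, sparsity=0.5)
q, scale = quantize_8bit(w)

print("fraction zeroed by pruning:", np.mean(pruned == 0))   # ~0.5
print("bytes before:", w.nbytes, "after:", q.nbytes)         # 4x smaller
print("max quantization error:", np.abs(dequantize(q, scale) - w).max())
```

In practice, pruned weights are stored in a sparse format to realize the size savings, and quantization-aware training (as the article notes) recovers most of the accuracy lost to the coarser representation.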
These techniques also help improve the efficiency of the compression process along with its scalability to large-scale datasets.
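The teacher-student idea behind distillation can likewise be sketched. Again, this is a generic illustration of the technique, not Learn2Compress’s code: the student is trained on a loss that blends cross-entropy against ground-truth labels with cross-entropy against the teacher’s temperature-softened output distribution (the `temperature` and `alpha` values below are illustrative defaults, not values Google has disclosed).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label term from the teacher."""
    n = len(labels)
    # Hard loss: cross-entropy against the ground-truth labels.
    student_probs = softmax(student_logits)
    hard = -np.log(student_probs[np.arange(n), labels] + 1e-12).mean()
    # Soft loss: cross-entropy against the teacher's softened distribution.
    soft_targets = softmax(teacher_logits, temperature)
    log_student = np.log(softmax(student_logits, temperature) + 1e-12)
    soft = -(soft_targets * log_student).sum(axis=-1).mean()
    # T^2 rescaling keeps the soft-term gradients comparable across temperatures.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 10)) * 3.0  # a confident stand-in "teacher"
student_logits = rng.normal(size=(8, 10))        # an untrained stand-in "student"
labels = teacher_logits.argmax(axis=-1)

loss = distillation_loss(student_logits, teacher_logits, labels)
print("distillation loss:", loss)
```

A real training loop would minimize this loss with respect to the student’s parameters; training several students of different sizes against one teacher, as the article describes, simply means evaluating this loss once per student against the same teacher outputs.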
Learn2Compress in Use
In Ravi’s post, he describes how Google has been able to demonstrate the effectiveness of Learn2Compress by using it “to build compact on-device models of several state-of-the-art deep networks used in image and natural language tasks such as MobileNets, NASNet, Inception, ProjectionNet, among others”. In doing so, the research team has shown that, “for a given task and dataset, we can generate multiple on-device models at different inference speeds and model sizes”.
Image classification, currently Learn2Compress’ primary function, is one example: it is able “to generate small and fast models with good prediction accuracy suited for mobile applications. For example, on ImageNet task, Learn2Compress achieves a model 22x smaller than Inception v3 baseline and 4x smaller than MobileNet v1 baseline with just 4.6-7% drop in accuracy. On CIFAR-10, jointly training multiple Learn2Compress models with shared parameters takes only 10% more time than training a single Learn2Compress large model, but yields 3 compressed models that are up to 94x smaller in size and up to 27x faster with up to 36x lower cost and good prediction quality (90-95% top-1 accuracy)”.
The research team has also been experimenting with how well Learn2Compress performs on developer use cases. For Fishbrain, a social platform for fishing enthusiasts, Learn2Compress was used to compress the company’s existing cloud-based image classification model (80MB+ in size, 91.8% top-3 accuracy) to a far smaller on-device model, under 5MB in size, with a similar degree of accuracy. In various cases, Google has found the compressed models can “even slightly outperform the original large model’s accuracy due to better regularization effects”.
Ravi stresses that the company is looking to improve Learn2Compress “with future advances in ML and deep learning, and extend to more use-cases beyond image classification. We are excited and looking forward to make this available soon through ML Kit’s compression service on the Cloud. We hope this will make it easy for developers to automatically build and optimize their own on-device ML models so that they can focus on building great apps and cool user experiences involving computer vision, natural language and other machine learning applications”.