
Lyra: A New Very Low-Bitrate Codec for Speech Compression

Connecting to others online via voice and video calls is something that is increasingly a part of everyday life. The real-time communication frameworks, like WebRTC, that make this possible depend on efficient compression techniques, codecs, to encode (or decode) signals for transmission or storage. A vital part of media applications for decades, codecs allow bandwidth-hungry applications to efficiently transmit data, and have led to an expectation of high-quality communication anywhere at any time.

As such, a continuing challenge in developing codecs, both for video and audio, is to provide increasing quality, using less data, and to minimize latency for real-time communication. Even though video might seem much more bandwidth hungry than audio, modern video codecs can reach lower bitrates than some high-quality speech codecs used today. Combining low-bitrate video and speech codecs can deliver a high-quality video call experience even in low-bandwidth networks. Yet historically, the lower the bitrate for an audio codec, the less intelligible and more robotic the voice signal becomes. Furthermore, while some people have access to a consistent high-quality, high-speed network, this level of connectivity isn’t universal, and even those in well connected areas at times experience poor quality, low bandwidth, and congested network connections.

To solve this problem, we have created Lyra, a high-quality, very low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this, we’ve applied traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.

Lyra Overview
The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response. On the other end, a generative model uses those features to recreate the speech signal. In this sense, Lyra is very similar to other traditional parametric codecs, such as MELP.
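To make the feature extraction step concrete, here is a minimal sketch of computing log mel spectrogram features from an audio signal using the librosa library. The parameter values (sample rate, number of mel bands, FFT size) are illustrative placeholders, not Lyra's actual configuration.

```python
# Illustrative sketch: extracting log mel spectrogram features from speech.
# The parameter values below (sample rate, hop size, number of mel bands) are
# placeholders for illustration, not Lyra's actual configuration.
import numpy as np
import librosa

sample_rate = 16000                      # 16 kHz speech, a common wideband rate
hop_length = int(0.040 * sample_rate)    # one feature frame every 40 ms
n_mels = 80                              # number of mel frequency bands (illustrative)

# A stand-in signal; in practice this would be a chunk of microphone audio.
duration_s = 1.0
t = np.linspace(0, duration_s, int(sample_rate * duration_s), endpoint=False)
speech = 0.1 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)

mel = librosa.feature.melspectrogram(
    y=speech, sr=sample_rate, n_fft=1024, hop_length=hop_length, n_mels=n_mels)
log_mel = librosa.power_to_db(mel)       # log-compress the band energies

print(log_mel.shape)  # (n_mels, n_frames): one small feature vector per 40 ms frame
```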

However, traditional parametric codecs, which simply extract from speech critical parameters that can then be used to recreate the signal at the receiving end, achieve low bitrates, but often sound robotic and unnatural. These shortcomings have led to the development of a new generation of high-quality audio generative models that have revolutionized the field by being able to not only differentiate between signals, but also generate completely new ones. DeepMind’s WaveNet was the first of these generative models that paved the way for many to come. Additionally, WaveNetEQ, the generative model-based packet-loss-concealment system currently used in Duo, has demonstrated how this technology can be used in real-world scenarios.

A New Approach to Compression with Lyra
Using these models as a baseline, we’ve developed a new model capable of reconstructing speech using minimal amounts of data. Lyra harnesses the power of these new natural-sounding generative models to maintain the low bitrate of parametric codecs while achieving high quality, on par with state-of-the-art waveform codecs used in most streaming and communication platforms today. The drawback of waveform codecs is that they achieve this high quality by compressing and sending over the signal sample-by-sample, which requires a higher bitrate and, in most cases, isn’t necessary to achieve natural sounding speech.

One concern with generative models is their computational complexity. Lyra avoids this issue by using a cheaper recurrent generative model, a WaveRNN variation, that works at a lower rate, but generates in parallel multiple signals in different frequency ranges that it later combines into a single output signal at the desired sample rate. This trick enables Lyra to not only run on cloud servers, but also on-device on mid-range phones in real time (with a processing latency of 90ms, which is in line with other traditional speech codecs). This generative model is then trained on thousands of hours of speech data and optimized, similarly to WaveNet, to accurately recreate the input audio.

Comparison with Existing Codecs
Since the inception of Lyra, our mission has been to provide the best quality audio using a fraction of the bitrate of alternatives. Currently, the royalty-free open-source codec Opus is the most widely used codec for WebRTC-based VOIP applications and, with audio at 32kbps, typically obtains transparent speech quality, i.e., indistinguishable from the original. However, while Opus can be used in more bandwidth-constrained environments down to 6kbps, it starts to demonstrate degraded audio quality. Other codecs are capable of operating at comparable bitrates to Lyra (Speex, MELP, AMR), but each suffers from increased artifacts and results in a robotic sounding voice.

Lyra is currently designed to operate at 3kbps and listening tests show that Lyra outperforms any other codec at that bitrate and compares favorably to Opus at 8kbps, thus achieving more than a 60% reduction in bandwidth. Lyra can be used wherever the bandwidth conditions are insufficient for higher bitrates and existing low-bitrate codecs do not provide adequate quality.

Audio samples (clean speech and a noisy environment): the original recordings compared with Opus@6kbps, Lyra@3kbps, and Speex@3kbps.

Ensuring Fairness
As with any ML-based system, the model must be trained to make sure that it works for everyone. We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verified the audio quality with expert and crowdsourced listeners. One of the design goals of Lyra is to ensure universally accessible high-quality audio experiences. Lyra trains on a wide dataset, including speakers in a myriad of languages, to make sure the codec is robust to any situation it might encounter.

Societal Impact and Where We Go From Here
The implications of technologies like Lyra are far-reaching, both in the short and long term. With Lyra, billions of users in emerging markets can have access to an efficient low-bitrate codec that allows them to have higher quality audio than ever before. Additionally, Lyra can be used in cloud environments enabling users with various network and device capabilities to chat seamlessly with each other. Pairing Lyra with new video compression technologies, like AV1, will allow video chats to take place, even for users connecting to the internet via a 56kbps dial-up modem.

Duo already uses ML to reduce audio interruptions, and is currently rolling out Lyra to improve audio call quality and reliability on very low bandwidth connections. We will continue to optimize Lyra’s performance and quality to ensure maximum availability of the technology, with investigations into acceleration via GPUs and TPUs. We are also beginning to research how these technologies can lead to a low-bitrate general-purpose audio codec (i.e., music and other non-speech use cases).

Acknowledgements
Thanks to everyone who made Lyra possible including Jan Skoglund, Felicia Lim, Michael Chinen, Bastiaan Kleijn, Tom Denton, Andrew Storus, Yero Yeh (Chrome Media), Henrik Lundin, Niklas Blum, Karl Wiberg (Google Duo), Chenjie Gu, Zach Gleicher, Norman Casagrande, Erich Elsen (DeepMind).


The Technology Behind Cinematic Photos

Looking at photos from the past can help people relive some of their most treasured moments. Last December we launched Cinematic photos, a new feature in Google Photos that aims to recapture the sense of immersion felt the moment a photo was taken, simulating camera motion and parallax by inferring 3D representations in an image. In this post, we take a look at the technology behind this process, and demonstrate how Cinematic photos can turn a single 2D photo from the past into a more immersive 3D animation.

Camera 3D model courtesy of Rick Reitano.

Depth Estimation
Like many recent computational photography features such as Portrait Mode and Augmented Reality (AR), Cinematic photos requires a depth map to provide information about the 3D structure of a scene. Typical techniques for computing depth on a smartphone rely on multi-view stereo, a geometry method to solve for the depth of objects in a scene by simultaneously capturing multiple photos at different viewpoints, where the distances between the cameras are known. In the Pixel phones, the views come from two cameras or dual-pixel sensors.

To enable Cinematic photos on existing pictures that were not taken in multi-view stereo, we trained a convolutional neural network with an encoder-decoder architecture to predict a depth map from just a single RGB image. Using only one view, the model learned to estimate depth using monocular cues, such as the relative sizes of objects, linear perspective, defocus blur, etc.
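As a rough illustration of what such a network looks like, here is a minimal Keras sketch of a convolutional encoder-decoder that maps a single RGB image to a dense depth map. The layer counts and widths are made up for clarity and do not reflect the production architecture.

```python
# Minimal sketch of a convolutional encoder-decoder that maps an RGB image to a
# dense depth map. Layer counts and widths are illustrative, not the production model.
import tensorflow as tf

def build_depth_net(height=256, width=192):
    inputs = tf.keras.Input(shape=(height, width, 3))
    # Encoder: progressively downsample while increasing channel count.
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: upsample back to the input resolution.
    x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    # One channel per pixel: the predicted (relative) depth.
    depth = tf.keras.layers.Conv2D(1, 3, padding="same")(x)
    return tf.keras.Model(inputs, depth)

model = build_depth_net()
model.summary()
```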

Because monocular depth estimation datasets are typically designed for domains such as AR, robotics, and self-driving, they tend to emphasize street scenes or indoor room scenes instead of features more common in casual photography, like people, pets, and objects, which have different composition and framing. So, we created our own dataset for training the monocular depth model using photos captured on a custom 5-camera rig as well as another dataset of Portrait photos captured on Pixel 4. Both datasets included ground-truth depth from multi-view stereo that is critical for training a model.

Mixing several datasets in this way exposes the model to a larger variety of scenes and camera hardware, improving its predictions on photos in the wild. However, it also introduces new challenges, because the ground-truth depth from different datasets may differ from each other by an unknown scaling factor and shift. Fortunately, the Cinematic photo effect only needs the relative depths of objects in the scene, not the absolute depths. Thus we can combine datasets by using a scale-and-shift-invariant loss during training and then normalize the output of the model at inference.
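The sketch below illustrates the idea of a scale-and-shift-invariant loss: before comparing a predicted depth map to ground truth, solve for the scale and shift that best align the two in a least-squares sense. This is a simplified illustration in the spirit of losses from the monocular-depth literature, not necessarily the exact training loss used here.

```python
# Sketch of a scale-and-shift-invariant depth loss: before comparing prediction
# and ground truth, solve for the scale s and shift t that best align them in a
# least-squares sense, so datasets with unknown depth scaling can be mixed.
# This is a simplified illustration, not the exact training loss.
import numpy as np

def scale_shift_invariant_loss(pred, gt):
    pred = pred.ravel()
    gt = gt.ravel()
    # Solve min_{s,t} || s * pred + t - gt ||^2 via ordinary least squares.
    A = np.stack([pred, np.ones_like(pred)], axis=1)   # (N, 2)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    aligned = s * pred + t
    return np.mean((aligned - gt) ** 2)

# Example: a prediction that differs from ground truth only by scale and shift
# incurs (near) zero loss.
gt = np.random.rand(64, 64)
pred = 3.0 * gt + 0.5
print(scale_shift_invariant_loss(pred, gt))  # ~0
```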

The Cinematic photo effect is particularly sensitive to the depth map’s accuracy at person boundaries. An error in the depth map can result in jarring artifacts in the final rendered effect. To mitigate this, we apply median filtering to improve the edges, and also infer segmentation masks of any people in the photo using a DeepLab segmentation model trained on the Open Images dataset. The masks are used to pull forward pixels of the depth map that were incorrectly predicted to be in the background.
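The following sketch shows roughly what such post-processing could look like, assuming the depth map and a person mask are available as arrays. The median filter call is standard; the "pull forward" rule (clamping suspiciously deep person pixels to a robust estimate of the person's depth) is a simplified stand-in for the production heuristic.

```python
# Illustrative depth post-processing: median-filter the depth map to clean up
# edges, then use a person segmentation mask to pull mistakenly "background"
# pixels inside the person forward. The clamping rule is a simplified stand-in
# for the production heuristic.
import numpy as np
from scipy import ndimage

def refine_depth(depth, person_mask, median_size=5):
    # depth: (H, W) float array, smaller values = closer to the camera (assumed).
    # person_mask: (H, W) boolean array, True where a person was segmented.
    refined = ndimage.median_filter(depth, size=median_size)

    if person_mask.any():
        # A robust estimate of how close the person is (10th percentile of depth
        # inside the mask), used as a ceiling for all person pixels.
        person_depth = np.percentile(refined[person_mask], 10)
        threshold = person_depth * 1.5        # anything much deeper is suspicious
        wrong = person_mask & (refined > threshold)
        refined[wrong] = person_depth         # pull those pixels forward
    return refined
```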

Camera Trajectory
There can be many degrees of freedom when animating a camera in a 3D scene, and our virtual camera setup is inspired by professional video camera rigs to create cinematic motion. Part of this is identifying the optimal pivot point for the virtual camera’s rotation in order to yield the best results by drawing one’s eye to the subject.

The first step in 3D scene reconstruction is to create a mesh by extruding the RGB image onto the depth map. In doing so, neighboring points in the mesh can have large depth differences. While this is not noticeable from the “face-on” view, the more the virtual camera is moved, the more likely it is to see polygons spanning large changes in depth. In the rendered output video, this will look like the input texture is stretched. The biggest challenge when animating the virtual camera is to find a trajectory that introduces parallax while minimizing these “stretchy” artifacts.

The parts of the mesh with large depth differences become more visible (red visualization) once the camera is away from the “face-on” view. In these areas, the photo appears to be stretched, which we call “stretchy artifacts”.

Because of the wide spectrum in user photos and their corresponding 3D reconstructions, it is not possible to share one trajectory across all animations. Instead, we define a loss function that captures how much of the stretchiness can be seen in the final animation, which allows us to optimize the camera parameters for each unique photo. Rather than counting the total number of pixels identified as artifacts, the loss function triggers more heavily in areas with a greater number of connected artifact pixels, which reflects a viewer’s tendency to more easily notice artifacts in these connected areas.

We utilize padded segmentation masks from a human pose network to divide the image into three different regions: head, body and background. The loss function is normalized inside each region before computing the final loss as a weighted sum of the normalized losses. Ideally, the generated output video is free from artifacts, but in practice this is rare. Weighting the regions differently biases the optimization process to pick trajectories that prefer artifacts in the background regions, rather than artifacts near the image subject.
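A toy version of such a loss might look like the sketch below: connected clusters of artifact pixels are penalized in proportion to their size, the penalty is normalized within each region, and the regions are combined with weights that tolerate background artifacts more than artifacts on the subject. All constants are illustrative.

```python
# Sketch of a "stretchiness" loss of the kind described above: connected clusters
# of artifact pixels are penalized super-linearly (large connected blobs hurt
# more than scattered pixels), the penalty is normalized within each region
# (head / body / background), and the regions are combined with weights that
# prefer artifacts in the background. All constants here are illustrative.
import numpy as np
from scipy import ndimage

def stretchiness_loss(artifact_mask, region_masks, region_weights):
    # artifact_mask: (H, W) bool, True where the rendered texture looks stretched.
    # region_masks: dict name -> (H, W) bool; region_weights: dict name -> float.
    labels, num = ndimage.label(artifact_mask)
    cluster_sizes = ndimage.sum(artifact_mask, labels, index=range(1, num + 1))

    # Per-pixel penalty: each artifact pixel counts proportionally to the size of
    # the connected cluster it belongs to, so big connected blobs dominate.
    per_pixel = np.zeros(artifact_mask.shape, dtype=np.float64)
    for cluster_id, size in enumerate(cluster_sizes, start=1):
        per_pixel[labels == cluster_id] = size

    total = 0.0
    for name, mask in region_masks.items():
        area = max(mask.sum(), 1)
        region_loss = per_pixel[mask].sum() / area   # normalize inside the region
        total += region_weights[name] * region_loss
    return total

# Example weights (made up) that penalize artifacts near the subject more.
weights = {"head": 5.0, "body": 2.0, "background": 1.0}
```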

During the camera trajectory optimization, the goal is to select a path for the camera with the least amount of noticeable artifacts. In these preview images, artifacts in the output are colored red while the green and blue overlay visualizes the different body regions.

Framing the Scene
Generally, the reprojected 3D scene does not neatly fit into a rectangle with portrait orientation, so it was also necessary to frame the output with the correct aspect ratio while still retaining the key parts of the input image. To accomplish this, we use a deep neural network that predicts per-pixel saliency of the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible while ensuring that the rendered mesh fully occupies every output video frame. This sometimes requires the model to shrink the camera’s field of view.

Heatmap of the predicted per-pixel saliency. We want the creation to include as much of the salient regions as possible when framing the virtual camera.

Conclusion
Through Cinematic photos, we implemented a system of algorithms – with each ML model evaluated for fairness – that work together to allow users to relive their memories in a new way, and we are excited about future research and feature improvements. Now that you know how they are created, keep an eye open for automatically created Cinematic photos that may appear in your recent memories within the Google Photos app!

Acknowledgments
Cinematic Photos is the result of a collaboration between Google Research and Google Photos teams. Key contributors also include: Andre Le, Brian Curless, Cassidy Curtis, Ce Liu‎, Chun-po Wang, Daniel Jenstad, David Salesin, Dominik Kaeser, Gina Reynolds, Hao Xu, Huiwen Chang, Huizhong Chen‎, Jamie Aspinall, Janne Kontkanen, Matthew DuVall, Michael Kucera, Michael Milne, Mike Krainin, Mike Liu, Navin Sarma, Orly Liba, Peter Hedman, Rocky Cai‎, Ruirui Jiang‎, Steven Hickson, Tracy Gu, Tyler Zhu, Varun Jampani, Yuan Hao, Zhongli Ding.


Introducing Model Search: An Open Source Platform for Finding Optimal ML Models

The success of a neural network (NN) often depends on how well it can generalize to various tasks. However, designing NNs that can generalize well is challenging because the research community’s understanding of how a neural network generalizes is currently somewhat limited: What does the appropriate neural network look like for a given problem? How deep should it be? Which types of layers should be used? Would LSTMs be enough or would Transformer layers be better? Or maybe a combination of the two? Would ensembling or distillation boost performance? These tricky questions are made even more challenging when considering that some machine learning (ML) domains come with better intuition and deeper understanding than others.

In recent years, AutoML algorithms have emerged [e.g., 1, 2, 3] to help researchers find the right neural network automatically without the need for manual experimentation. Techniques like neural architecture search (NAS) use algorithms, such as reinforcement learning (RL), evolutionary algorithms, and combinatorial search, to build a neural network out of a given search space. With the proper setup, these techniques have demonstrated they are capable of delivering results that are better than their manually designed counterparts. But more often than not, these algorithms are compute heavy, and need to train thousands of models before converging. Moreover, they explore search spaces that are domain specific and incorporate substantial prior human knowledge that does not transfer well across domains. As an example, in image classification, traditional NAS searches for two good building blocks (convolutional and downsampling blocks) that it arranges following traditional conventions to create the full network.

To overcome these shortcomings and to extend access to AutoML solutions to the broader research community, we are excited to announce the open source release of Model Search, a platform that helps researchers develop the best ML models, efficiently and automatically. Instead of focusing on a specific domain, Model Search is domain agnostic, flexible, and capable of finding the appropriate architecture that best fits a given dataset and problem, while minimizing coding time, effort and compute resources. It is built on TensorFlow, and can run either on a single machine or in a distributed setting.

Overview
The Model Search system consists of multiple trainers, a search algorithm, a transfer learning algorithm and a database to store the various evaluated models. The system runs both training and evaluation experiments for various ML models (different architectures and training techniques) in an adaptive, yet asynchronous, fashion. While each trainer conducts experiments independently, all trainers share the knowledge gained from their experiments. At the beginning of every cycle, the search algorithm looks up all the completed trials and uses beam search to decide what to try next. It then invokes mutation over one of the best architectures found thus far and assigns the resulting model back to a trainer.
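A hypothetical, self-contained sketch of this loop is shown below. The block names, the mutation rule, and the stand-in training function are invented for illustration; this is not the actual Model Search API.

```python
# Hypothetical, self-contained sketch of the adaptive search loop described above:
# trainers share completed trials, a beam of the best trials is kept, one of them
# is mutated, and the mutated candidate is trained next. All names and the toy
# "training" function are made up; this is not the Model Search API.
import random

BLOCK_TYPES = ["conv", "lstm", "transformer", "dense"]   # illustrative block set

def random_architecture():
    return [random.choice(BLOCK_TYPES) for _ in range(3)]

def mutate(architecture):
    child = list(architecture)
    if random.random() < 0.5:
        child.append(random.choice(BLOCK_TYPES))          # make the network deeper
    else:
        child[random.randrange(len(child))] = random.choice(BLOCK_TYPES)
    return child

def train_and_evaluate(architecture):
    # Stand-in for a real training run; returns a fake validation score.
    return random.random() + 0.1 * len(architecture)

completed_trials = []                                     # shared "database" of trials
for cycle in range(20):
    beam = sorted(completed_trials, key=lambda t: t[1], reverse=True)[:5]
    if beam:
        parent, _ = random.choice(beam)                   # pick one of the best so far
        candidate = mutate(parent)
    else:
        candidate = random_architecture()
    score = train_and_evaluate(candidate)
    completed_trials.append((candidate, score))           # share the result

print(max(completed_trials, key=lambda t: t[1]))
```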

Model Search schematic illustrating the distributed search and ensembling. Each trainer runs independently to train and evaluate a given model. The results are shared with the search algorithm, which stores them. The search algorithm then invokes mutation over one of the best architectures and sends the new model back to a trainer for the next iteration. S is the set of training and validation examples and A are all the candidates used during training and search.

The system builds a neural network model from a set of predefined blocks, each of which represents a known micro-architecture, like LSTM, ResNet or Transformer layers. By using blocks of pre-existing architectural components, Model Search is able to leverage existing best knowledge from NAS research across domains. This approach is also more efficient, because it explores structures, not their more fundamental and detailed components, therefore reducing the scale of the search space.

Neural network micro architecture blocks that work well, e.g., a ResNet Block.

Because the Model Search framework is built on TensorFlow, blocks can implement any function that takes a tensor as an input. For example, imagine that one wants to introduce a new search space built with a selection of micro architectures. The framework will take the newly defined blocks and incorporate them into the search process so that algorithms can build the best possible neural network from the components provided. The blocks provided can even be fully defined neural networks that are already known to work for the problem of interest. In that case, Model Search can be configured to simply act as a powerful ensembling machine.
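To illustrate the idea that a block is simply a tensor-in, tensor-out function, here is a small ResNet-style block written with plain Keras layers. It shows the concept only and does not use Model Search's actual block-registration API.

```python
# Illustration of the "block" idea: any function that maps a tensor to a tensor
# can serve as a building block. Written with plain Keras layers for clarity;
# this is not Model Search's actual block-registration API.
import tensorflow as tf

def residual_conv_block(x, filters=64):
    """A small ResNet-style block: two convolutions plus a skip connection."""
    shortcut = tf.keras.layers.Conv2D(filters, 1, padding="same")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.ReLU()(y)

# The block composes like any other tensor-in, tensor-out function.
inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = residual_conv_block(residual_conv_block(inputs))
model = tf.keras.Model(inputs, outputs)
model.summary()
```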

The search algorithms implemented in Model Search are adaptive, greedy and incremental, which makes them converge faster than RL algorithms. They do, however, imitate the “explore & exploit” nature of RL algorithms by separating the search for a good candidate (explore step) and boosting accuracy by ensembling good candidates that were discovered (exploit step). The main search algorithm adaptively modifies one of the top k performing experiments (where k can be specified by the user) after applying random changes to the architecture or the training technique (e.g., making the architecture deeper).

An example of an evolution of a network over many experiments. Each color represents a different type of architecture block. The final network is formed via mutations of high performing candidate networks, in this case adding depth.

To further improve efficiency and accuracy, transfer learning is enabled between various internal experiments. Model Search does this in two ways — via knowledge distillation or weight sharing. Knowledge distillation allows one to improve candidates’ accuracies by adding a loss term that matches the high performing models’ predictions in addition to the ground truth. Weight sharing, on the other hand, bootstraps some of the parameters (after applying mutation) in the network from previously trained candidates by copying suitable weights from previously trained models and randomly initializing the remaining ones. This enables faster training, which allows opportunities to discover more (and better) architectures.
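The distillation part of this can be sketched as an extra loss term, as below. The mixing weight is illustrative, and the snippet is a generic distillation loss rather than Model Search's exact implementation.

```python
# Sketch of the knowledge-distillation idea used for transfer learning between
# experiments: the candidate is trained on the ground truth plus an extra term
# that matches the predictions of a previously found high-performing model.
# The weighting below is illustrative.
import tensorflow as tf

def distillation_loss(labels, student_logits, teacher_logits, alpha=0.5):
    # Standard cross-entropy against the ground-truth labels.
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Extra term: match the teacher's (soft) predicted distribution.
    soft = tf.keras.losses.categorical_crossentropy(
        tf.nn.softmax(teacher_logits), student_logits, from_logits=True)
    return tf.reduce_mean((1.0 - alpha) * hard + alpha * soft)
```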

Experimental Results
Model Search improves upon production models with minimal iterations. In a recent paper, we demonstrated the capabilities of Model Search in the speech domain by discovering a model for keyword spotting and language identification. In fewer than 200 iterations, the resulting model slightly improved in accuracy upon internal state-of-the-art production models designed by experts, while using ~130K fewer trainable parameters (184K compared to 315K parameters).

Model accuracy by iteration in our system compared to the previous production model for keyword spotting. A similar graph for language identification can be found in the linked paper.

We also applied Model Search to find an architecture suitable for image classification on the heavily explored CIFAR-10 imaging dataset. Using a set of known convolutional blocks, including convolutions, ResNet blocks (i.e., two convolutions and a skip connection), NAS-A cells, fully connected layers, etc., we observed that we were able to quickly reach a benchmark accuracy of 91.83 in 209 trials (i.e., exploring only 209 models). In comparison, previous top performers reached the same threshold accuracy in 5807 trials for the NasNet algorithm (RL), and 1160 for PNAS (RL + Progressive).

Conclusion
We hope the Model Search code will provide researchers with a flexible, domain-agnostic framework for ML model discovery. By building upon previous knowledge for a given domain, we believe that this framework is powerful enough to build models with state-of-the-art performance on well studied problems when provided with a search space composed of standard building blocks.

Acknowledgements
Special thanks to all code contributors to the open sourcing and the paper: Eugen Ehotaj, Scotty Yak, Malaika Handa, James Preiss, Pai Zhu, Aleks Kracun, Prashant Sridhar, Niranjan Subrahmanya, Ignacio Lopez Moreno, Hyun Jin Park, and Patrick Violette.


Mastering Atari with Discrete World Models

Deep reinforcement learning (RL) enables artificial agents to improve their decisions over time. Traditional model-free approaches learn which of the actions are successful in different situations by interacting with the environment through a large amount of trial and error. In contrast, recent advances in deep RL have enabled model-based approaches to learn accurate world models from image inputs and use them for planning. World models can learn from fewer interactions, facilitate generalization from offline data, enable forward-looking exploration, and allow reusing knowledge across multiple tasks.

Despite their intriguing benefits, existing world models (such as SimPLe) have not been accurate enough to compete with the top model-free approaches on the most competitive reinforcement learning benchmarks — to date, the well-established Atari benchmark requires model-free algorithms, such as DQN, IQN, and Rainbow, to reach human-level performance. As a result, many researchers have focused instead on developing task-specific planning methods, such as VPN and MuZero, which learn by predicting sums of expected task rewards. However, these methods are specific to individual tasks and it is unclear how well they would generalize to new tasks or learn from unsupervised datasets. Similar to the recent breakthrough of unsupervised representation learning in computer vision [1, 2], world models aim to learn patterns in the environment that are more general than any particular task to later solve tasks more efficiently.

Today, in collaboration with DeepMind and the University of Toronto, we introduce DreamerV2, the first RL agent based on a world model to achieve human-level performance on the Atari benchmark. It constitutes the second generation of the Dreamer agent that learns behaviors purely within the latent space of a world model trained from pixels. DreamerV2 relies exclusively on general information from the images and accurately predicts future task rewards even when its representations were not influenced by those rewards. Using a single GPU, DreamerV2 outperforms top model-free algorithms with the same compute and sample budget.

Gamer normalized median score across the 55 Atari games after 200 million steps. DreamerV2 substantially outperforms previous world models. Moreover, it exceeds top model-free agents within the same compute and sample budget.
Behaviors learned by DreamerV2 for some of the 55 Atari games. These videos show images from the environment. Video predictions are shown below in the blog post.

An Abstract Model of the World
Just like its predecessor, DreamerV2 learns a world model and uses it to train actor-critic behaviors purely from predicted trajectories. The world model automatically learns to compute compact representations of its images that discover useful concepts, such as object positions, and learns how these concepts change in response to different actions. This lets the agent generate abstractions of its images that ignore irrelevant details and enables massively parallel predictions on a single GPU. During 200 million environment steps, DreamerV2 predicts 468 billion compact states for learning its behavior.

DreamerV2 builds upon the Recurrent State-Space Model (RSSM) that we introduced for PlaNet and was also used for DreamerV1. During training, an encoder turns each image into a stochastic representation that is incorporated into the recurrent state of the world model. Because the representations are stochastic, they do not have access to perfect information about the images and instead extract only what is necessary to make predictions, making the agent robust to unseen images. From each state, a decoder reconstructs the corresponding image to learn general representations. Moreover, a small reward network is trained to rank outcomes during planning. To enable planning without generating images, a predictor learns to guess the stochastic representations without access to the images from which they were computed.

Learning process of the world model used by DreamerV2. The world model maintains recurrent states (h1–h3) that receive actions (a1–a2) and incorporate information about the images (x1–x3) via stochastic representations (z1–z3). A predictor guesses the representations as (ẑ1–ẑ3) without access to the images from which they were generated.

Importantly, DreamerV2 introduces two new techniques to RSSM that lead to a substantially more accurate world model for learning successful policies. The first technique is to represent each image with multiple categorical variables instead of the Gaussian variables used by PlaNet, DreamerV1, and many more world models in the literature [1, 2, 3, 4, 5]. This leads the world model to reason about the world in terms of discrete concepts and enables more accurate predictions of future representations.

The encoder turns each image into 32 distributions over 32 classes each, the meanings of which are determined automatically as the world model learns. The one-hot vectors sampled from these distributions are concatenated to a sparse representation that is passed on to the recurrent state. To backpropagate through the samples, we use straight-through gradients that are easy to implement using automatic differentiation. Representing images with categorical variables allows the predictor to accurately learn the distribution over the one-hot vectors of the possible next images. In contrast, earlier world models that use Gaussian predictors cannot accurately match the distribution over multiple Gaussian representations for the possible next images.
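A simplified, standalone sketch of this latent (32 categorical distributions over 32 classes each, sampled with straight-through gradients) might look as follows; it follows the description above rather than the actual DreamerV2 code.

```python
# Simplified sketch of the categorical latent: 32 distributions over 32 classes,
# sampled as one-hot vectors with straight-through gradients so the sampling step
# stays differentiable. A standalone illustration, not the DreamerV2 code.
import tensorflow as tf

def sample_categorical_latent(logits):
    # logits: (batch, 32, 32) = (batch, num_variables, num_classes)
    _, num_vars, num_classes = logits.shape
    probs = tf.nn.softmax(logits, axis=-1)

    flat = tf.reshape(logits, [-1, num_classes])
    indices = tf.random.categorical(flat, num_samples=1)          # (batch*32, 1)
    one_hot = tf.one_hot(tf.reshape(indices, [-1, num_vars]), num_classes)

    # Straight-through estimator: the forward pass uses the hard one-hot sample,
    # the backward pass uses the gradient of the probabilities.
    sample = one_hot + probs - tf.stop_gradient(probs)
    return tf.reshape(sample, [-1, num_vars * num_classes])       # sparse 1024-d vector

latent = sample_categorical_latent(tf.random.normal([8, 32, 32]))
print(latent.shape)   # (8, 1024)
```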

Multiple categoricals that represent possible next images can be accurately predicted by a categorical predictor, whereas a Gaussian predictor is not flexible enough to accurately predict multiple possible Gaussian representations.

The second new technique of DreamerV2 is KL balancing. Many previous world models use the ELBO objective that encourages accurate reconstructions while keeping the stochastic representations (posteriors) close to their predictions (priors) to regularize the amount of information extracted from each image and facilitate generalization. Because the objective is optimized end-to-end, the stochastic representations and their predictions can be made more similar by bringing either of the two towards the other. However, bringing the representations towards their predictions can be problematic when the predictor is not yet accurate. KL balancing lets the predictions move faster toward the representations than vice versa. This results in more accurate predictions, a key to successful planning.
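KL balancing can be sketched as computing the same KL term twice with stop-gradients on different sides and mixing the two with an asymmetric weight, as below. The weight value is illustrative.

```python
# Sketch of KL balancing for categorical posteriors (representations) and priors
# (predictions): the same KL term is computed twice with stop-gradients on
# different sides, so the prior is pulled toward the posterior faster than the
# posterior is pulled toward the prior. The mixing weight here is illustrative.
import tensorflow as tf

def categorical_kl(p_logits, q_logits):
    # KL(p || q) for independent categorical variables, summed over variables.
    p = tf.nn.softmax(p_logits, axis=-1)
    kl = p * (tf.nn.log_softmax(p_logits, axis=-1) - tf.nn.log_softmax(q_logits, axis=-1))
    return tf.reduce_sum(kl, axis=[-2, -1])

def kl_balancing_loss(posterior_logits, prior_logits, alpha=0.8):
    # Train the prior toward the (frozen) posterior...
    prior_term = categorical_kl(tf.stop_gradient(posterior_logits), prior_logits)
    # ...and regularize the posterior toward the (frozen) prior, but more weakly.
    posterior_term = categorical_kl(posterior_logits, tf.stop_gradient(prior_logits))
    return tf.reduce_mean(alpha * prior_term + (1.0 - alpha) * posterior_term)
```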

Long-term video predictions of the world model for holdout sequences. Each model receives 5 frames as input (not shown) and then predicts 45 steps forward given only actions. The video predictions are only used to gain insights into the quality of the world model. During planning, only compact representations are predicted, not images.

Measuring Atari Performance
DreamerV2 is the first world model that enables learning successful behaviors with human-level performance on the well-established and competitive Atari benchmark. We select the 55 games that many previous studies have in common and recommend this set of games for future work. Following the standard evaluation protocol, the agents are allowed 200M environment interactions using an action repeat of 4 and sticky actions (25% chance that an action is ignored and the previous action is repeated instead). We compare to the top model-free agents IQN and Rainbow, as well as to the well-known C51 and DQN agents implemented in the Dopamine framework.
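For concreteness, the sketch below wraps a generic environment with these two protocol details, an action repeat of 4 and 25% sticky actions. It is an illustrative wrapper, not the exact evaluation code.

```python
# Minimal sketch of the evaluation protocol details mentioned above: an action
# repeat of 4 and sticky actions (25% chance that the previous action is repeated
# instead of the new one). A generic wrapper for illustration only.
import random

class StickyActionRepeat:
    def __init__(self, env, repeat=4, sticky_prob=0.25):
        self.env = env
        self.repeat = repeat
        self.sticky_prob = sticky_prob
        self.previous_action = None

    def reset(self):
        self.previous_action = None
        return self.env.reset()

    def step(self, action):
        # With 25% probability the new action is ignored and the previous one repeated.
        if self.previous_action is not None and random.random() < self.sticky_prob:
            action = self.previous_action
        self.previous_action = action

        total_reward, done, obs, info = 0.0, False, None, {}
        for _ in range(self.repeat):                 # action repeat of 4
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```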

Different standards exist for aggregating the scores across the 55 games. Ideally, a new algorithm would perform better under all conditions. For all four aggregation methods, DreamerV2 indeed outperforms all compared model-free algorithms while using the same computational budget.

DreamerV2 outperforms the top model-free agents according to four methods for aggregating scores across the 55 Atari games. We introduce and recommend the Clipped Record Mean (right-most plot) as an informative and robust performance metric.

The first three aggregation methods were previously proposed in the literature. We identify important drawbacks in each and recommend a new aggregation method, the clipped record mean, to overcome them.

  • Gamer Median. Most commonly, scores for each game are normalized by the performance of a human gamer that was assessed for the DQN paper and the median of the normalized scores of all games is reported. Unfortunately, the median ignores the scores of many simpler and harder games.
  • Gamer Mean. The mean takes the scores for all games into account but is mainly influenced by a small number of games where the human gamer performed poorly. This makes it easy for an algorithm to achieve large normalized scores on some games (e.g., James Bond, Video Pinball) that then dominate the mean.
  • Record Mean. Prior work recommends normalization based on the human world record instead, but such a metric is still overly influenced by a small number of games where it is easy for the artificial agents to outscore the human record.
  • Clipped Record Mean. We introduce a new metric that normalizes scores by the world record and clips them to not exceed the record. This yields an informative and robust metric that takes the performance on all games into account to an approximately equal amount.
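As a small worked example of the clipped record mean, the sketch below normalizes made-up per-game scores against a random baseline and the human world record, clips at 1, and averages. Normalizing relative to a random policy follows the usual Atari convention; the numbers are invented.

```python
# Small sketch of the clipped record mean: normalize each game's score relative to
# a random policy and the human world record, clip at 1 so no single game can
# dominate, then average across games. The scores below are made up.
import numpy as np

def clipped_record_mean(agent, random_policy, world_record):
    normalized = (agent - random_policy) / (world_record - random_policy)
    return np.mean(np.clip(normalized, None, 1.0))

agent        = np.array([120.0, 9500.0, 31.0])   # made-up per-game scores
random_score = np.array([ 10.0,  150.0,  2.0])
record       = np.array([400.0, 3000.0, 90.0])   # the agent beats the record on game 2

print(clipped_record_mean(agent, random_score, record))   # game 2 is clipped to 1.0
```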

While many current algorithms exceed the human gamer baseline, they are still quite far behind the human world record. As shown in the right-most plot above, DreamerV2 leads by achieving 25% of the human record on average across games. Clipping the scores at the record line lets us focus our efforts on developing methods that come closer to the human world record on all of the games rather than exceeding it on just a few games.

What matters and what doesn’t
To gain insights into the important components of DreamerV2, we conduct an extensive ablation study. Importantly, we find that categorical representations offer a clear advantage over Gaussian representations despite the fact that Gaussians have been used extensively in prior works. KL balancing provides an even more substantial advantage over the KL regularizer used by most generative models.

By preventing the image reconstruction or reward prediction gradients from shaping the model states, we study their importance for learning successful representations. We find that DreamerV2 relies completely on universal information from the high-dimensional input images and its representations enable accurate reward predictions even when they were not trained using information about the reward. This mirrors the success of unsupervised representation learning in the computer vision community.

Atari performance for various ablations of DreamerV2 (Clipped Record Mean). Categorical representations, KL balancing, and learning about the images are crucial for the success of DreamerV2. Using reward information, which is specific to narrow tasks, offers no additional benefit for learning the world model.

Conclusion
We show how to learn a powerful world model to achieve human-level performance on the competitive Atari benchmark and outperform the top model-free agents. This result demonstrates that world models are a powerful approach for achieving high performance on reinforcement learning problems and are ready to use for practitioners and researchers. We see this as an indication that the success of unsupervised representation learning in computer vision [1, 2] is now starting to be realized in reinforcement learning in the form of world models. An unofficial implementation of DreamerV2 is available on GitHub and provides a productive starting point for future research projects. We see world models that leverage large offline datasets, long-term memory, hierarchical planning, and directed exploration as exciting avenues for future research.

Acknowledgements
This project is a collaboration with Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. We further thank everybody on the Brain Team and beyond who commented on our paper draft and provided feedback at any point throughout the project.


First mlverse survey results – software, applications, and beyond

Thank you everyone who participated in our first mlverse survey!

Wait: What even is the mlverse?

The mlverse originated as an abbreviation of multiverse, which, on its part, came into being as an intended allusion to the well-known tidyverse. As such, although mlverse software aims for seamless interoperability with the tidyverse, or even integration when feasible (see our recent post featuring a wholly tidymodels-integrated torch network architecture), the priorities are probably a bit different: Often, mlverse software’s raison d’être is to allow R users to do things that are commonly known to be done with other languages, such as Python.

As of today, mlverse development takes place mainly in two broad areas: deep learning, and distributed computing / ML automation. By its very nature, though, it is open to changing user interests and demands. Which leads us to the topic of this post.

The survey

GitHub issues and community questions are valuable feedback, but we wanted something more direct. We wanted a way to find out how you, our users, employ the software, and what for; what you think could be improved; what you wish existed but is not there (yet). To that end, we created a survey. Complementing software- and application-related questions for the above-mentioned broad areas, the survey had a third section, asking about how you perceive ethical and social implications of AI as applied in the “real world”.

A few things upfront:

Firstly, the survey was completely anonymous, in that we asked for neither identifiers (such as e-mail addresses) nor things that render one identifiable, such as gender or geographic location. In the same vein, we had collection of IP addresses disabled on purpose.

Secondly, just like GitHub issues are a biased sample, this survey’s participants must be. Main venues of promotion were rstudio::global, Twitter, LinkedIn, and RStudio Community. As this was the first time we did such a thing (and under significant time constraints), not everything was planned to perfection – not wording-wise and not distribution-wise. Nevertheless, we got a lot of interesting, helpful, and often very detailed answers – and for the next time we do this, we’ll have our lessons learned!

Thirdly, all questions were optional, naturally resulting in different numbers of valid answers per question. On the other hand, not having to select a bunch of “not applicable” boxes freed respondents to spend time on topics that mattered to them.

As a final pre-remark, most questions allowed for multiple answers.

In sum, we ended up with 138 completed surveys. Thanks again everyone who participated, and especially, thank you for taking the time to answer the – many – free-form questions!

Deep learning Areas and applications

Our first goal was to find out in which settings, and for what kinds of applications, deep-learning software is being used.

Overall, 72 respondents reported using DL in their jobs in industry, followed by academia (23), studies (21), spare time (43), and not-actually-using-but-wanting-to (24).

Of those working with DL in industry, more than twenty said they worked in consulting, finance, and healthcare (each). IT, education, retail, pharma, and transportation were each mentioned more than ten times:

Number of users reporting to use DL in industry. Smaller groups not displayed.


In academia, dominant fields (as per survey participants) were bioinformatics, genomics, and IT, followed by biology, medicine, pharmacology, and social sciences:

Number of users reporting to use DL in academia. Smaller groups not displayed.


What application areas matter to larger subgroups of “our” users? Nearly a hundred (of 138!) respondents said they used DL for some kind of image-processing application (including classification, segmentation, and object detection). Next up was time-series forecasting, followed by unsupervised learning.

The popularity of unsupervised DL was a bit unexpected; had we anticipated this, we would have asked for more detail here. So if you’re one of the people who selected this – or if you didn’t participate, but do use DL for unsupervised learning – please let us know a bit more in the comments!

Next, NLP was about on par with the former, followed by DL on tabular data and anomaly detection. Bayesian deep learning, reinforcement learning, recommendation systems, and audio processing were still mentioned frequently.

Applications deep learning is used for. Smaller groups not displayed.


Frameworks and skills

We also asked what frameworks and languages participants were using for deep learning, and what they were planning on using in the future. Single-time mentions (e.g., deeplearning4J) are not displayed.

Framework / language used for deep learning. Single mentions not displayed.


An important thing for any software developer or content creator to investigate is proficiency/levels of expertise present in their audiences. It (nearly) goes without saying that expertise is very different from self-reported expertise. I’d like to be very cautious, then, in interpreting the results below.

While with regard to R skills, the aggregate self-ratings look plausible (to me), I would have guessed a slightly different outcome re DL. Judging from other sources (e.g., GitHub issues), I tend to suspect more of a bimodal distribution (a far stronger version of the bimodality we’re already seeing, that is). To me, it seems like we have rather many users who know a lot about DL. In agreement with my gut feeling, though, is the bimodality itself – as opposed to, say, a Gaussian shape.

But of course, sample size is moderate, and sample bias is present.

Self-rated skills re R and deep learning.


Wishes and suggestions

Now, to the free-form questions. We wanted to know what we could do better.

I’ll address the most salient topics in order of frequency of mention. For DL, this is surprisingly easy (as opposed to Spark, as you’ll see).

“No Python”

The number one concern with deep learning from R, for survey respondents, clearly has to do not with R but with Python. This topic appeared in various forms, the most frequent being frustration over how hard it can be, dependent on the environment, to get Python dependencies for TensorFlow/Keras correct. (It also appeared as enthusiasm for torch, which we are very happy about.)

Let me clarify and add some context.

TensorFlow is a Python framework (nowadays subsuming Keras, which is why I’ll be addressing both of those as “TensorFlow” for simplicity) that is made available from R through packages tensorflow and keras. As with other Python libraries, objects are imported and accessible via reticulate. While tensorflow provides the low-level access, keras brings idiomatic-feeling, nice-to-use wrappers that let you forget about the chain of dependencies involved.

On the other hand, torch, a recent addition to mlverse software, is an R port of PyTorch that does not delegate to Python. Instead, its R layer directly calls into libtorch, the C++ library behind PyTorch. In that way, it is like a lot of heavy-duty R packages, making use of C++ for performance reasons.

Now, this is not the place for recommendations. Here are a few thoughts though.

Clearly, as one respondent remarked, as of today the torch ecosystem does not offer functionality on par with TensorFlow, and for that to change, time and – hopefully! more on that below – your, the community’s, help is needed. Why? Because torch is so young, for one; but also, there is a “systemic” reason! With TensorFlow, as we can access any symbol via the tf object, it is always possible, if inelegant, to do from R what you see done in Python. With respective R wrappers nonexistent, quite a few blog posts (see, e.g., https://blogs.rstudio.com/ai/posts/2020-04-29-encrypted_keras_with_syft/, or A first look at federated learning with TensorFlow) relied on this!

Switching to the topic of tensorflow’s Python dependencies causing problems with installation, my experience (from GitHub issues, as well as my own) has been that difficulties are quite system-dependent. On some OSes, complications seem to appear more often than on others; and low-control (to the individual user) environments like HPC clusters can make things especially difficult. In any case though, I have to (unfortunately) admit that when installation problems appear, they can be very tricky to solve.

tidymodels integration

The second most frequent mention clearly was the wish for tighter tidymodels integration. Here, we wholeheartedly agree. As of today, there is no automated way to accomplish this for torch models generically, but it can be done for specific model implementations.

Last week, torch, tidymodels, and high-energy physics featured the first tidymodels-integrated torch package. And there’s more to come. In fact, if you are developing a package in the torch ecosystem, why not consider doing the same? Should you run into problems, the growing torch community will be happy to help.

Documentation, examples, teaching materials

Thirdly, several respondents expressed the wish for more documentation, examples, and teaching materials. Here, the situation is different for TensorFlow than for torch.

For tensorflow, the website has a multitude of guides, tutorials, and examples. For torch, reflecting the discrepancy in respective lifecycles, materials are not that abundant (yet). However, after a recent refactoring, the website has a new, four-part Get started section addressed to both beginners in DL and experienced TensorFlow users curious to learn about torch. After this hands-on introduction, a good place to get more technical background would be the section on tensors, autograd, and neural network modules.

Truth be told, though, nothing would be more helpful here than contributions from the community. Whenever you solve even the tiniest problem (which is often how things appear to oneself), consider creating a vignette explaining what you did. Future users will be thankful, and a growing user base means that over time, it’ll be your turn to find that some things have already been solved for you!

Community, community, community

The remaining items discussed didn’t come up quite as often (individually), but taken together, they all have something in common: They all are wishes we happen to have, as well!

This definitely holds in the abstract – let me cite:

“Develop more of a DL community”

“Larger developer community and ecosystem. Rstudio has made great tools, but for applied work is has been hard to work against the momentum of working in Python.”

We wholeheartedly agree, and building a larger community is exactly what we’re trying to do. I like the formulation “a DL community” insofar it is framework-independent. In the end, frameworks are just tools, and what counts is our ability to usefully apply those tools to problems we need to solve.

Concrete wishes include

  • More paper/model implementations (such as TabNet).

  • Facilities for easy data reshaping and pre-processing (e.g., in order to pass data to RNNs or 1-d convnets in the expected 3-d format).

  • Probabilistic programming for torch (analogously to TensorFlow Probability).

  • A high-level library (such as fast.ai) based on torch.

In other words, there is a whole cosmos of useful things to create; and no small group alone can do it. This is where we hope we can build a community of people, each contributing what they’re most interested in, and to whatever extent they wish.

Spark Areas and applications

For Spark, questions broadly paralleled those asked about deep learning.

Overall, judging from this survey (and unsurprisingly), Spark is predominantly used in industry (n = 39). For academic staff and students (taken together), n = 8. Seventeen people reported using Spark in their spare time, while 34 said they wanted to use it in the future.

Looking at industry sectors, we again find finance, consulting, and healthcare dominating.

Number of users reporting to use Spark in industry. Smaller groups not displayed.


What do survey respondents do with Spark? Analyses of tabular data and time series dominate:

Applications Spark is used for. Smaller groups not displayed.


Frameworks and skills

As with deep learning, we wanted to know what language people use to do Spark. If you look at the below graphic, you see R appearing twice: once in connection with sparklyr, once with SparkR. What’s that about?

Both sparklyr and SparkR are R interfaces for Apache Spark, each designed and built with a different set of priorities and, consequently, trade-offs in mind.

sparklyr, on the one hand, will appeal to data scientists at home in the tidyverse, as they’ll be able to use all the data manipulation interfaces they’re familiar with from packages such as dplyr, DBI, tidyr, or broom.

SparkR, on the other hand, is a light-weight R binding for Apache Spark, and is bundled with Apache Spark itself. It’s an excellent choice for practitioners who are well-versed in Apache Spark and just need a thin wrapper to access various Spark functionalities from R.

Language / language bindings used to do Spark.


When asked to rate their expertise in R and Spark, respectively, respondents showed behavior similar to that observed for deep learning above: Most people seem to think more of their R skills than their theoretical Spark-related knowledge. However, even more caution should be exercised here than above: The number of responses here was significantly lower.

Self-rated skills re R and Spark.


Wishes and suggestions

Just like with DL, Spark users were asked what could be improved, and what they were hoping for.

Interestingly, answers were less “clustered” than for DL. While with DL, a few things cropped up again and again, and there were very few mentions of concrete technical features, here we see about the opposite: The great majority of wishes were concrete, technical, and often only came up once.

Probably though, this is not a coincidence.

Looking back at how sparklyr has evolved from 2016 until now, there is a persistent theme of it being the bridge that joins the Apache Spark ecosystem to numerous useful R interfaces, frameworks, and utilities (most notably, the tidyverse).

Many of our users’ suggestions were essentially a continuation of this theme. This holds, for example, for two features already available as of sparklyr 1.4 and 1.2, respectively: support for the Arrow serialization format and for Databricks Connect. It also holds for tidymodels integration (a frequent wish), a simple R interface for defining Spark UDFs (frequently desired, this one too), out-of-core direct computations on Parquet files, and extended time-series functionalities.

We’re thankful for the feedback and will evaluate carefully what could be done in each case. In general, integrating sparklyr with some feature X is a process to be planned carefully, as modifications could, in theory, be made in various places (sparklyr; X; both sparklyr and X; or even a newly-to-be-created extension). In fact, this is a topic deserving of much more detailed coverage, and has to be left to a future post.

Ethics and AI in society

To start, this is probably the section that will profit most from more preparation, the next time we do this survey. Due to time pressure, some (not all!) of the questions ended up being too suggestive, possibly resulting in social-desirability bias.

Next time, we’ll try to avoid this, and questions in this area will likely look pretty different (more like scenarios or what-if stories). However, I was told by several people they’d been positively surprised by simply encountering this topic at all in the survey. So perhaps this is the main point – although there are a few results that I’m sure will be interesting by themselves!

Anticlimactically, the most non-obvious results are presented first.

“Are you worried about societal/political impacts of how AI is used in the real world?”

For this question, we had four answer options, formulated in a way that left no real “middle ground”. (The labels in the graphic below verbatim reflect those options.)

Number of users responding to the question 'Are you worried about societal/political impacts of how AI is used in the real world?' with the answer options given.


The next question is definitely one to keep for future editions, as from all questions in this section, it definitely has the highest information content.

“When you think of the near future, are you more afraid of AI misuse or more hopeful about positive outcomes?”

Here, the answer was to be given by moving a slider, with -100 signifying “I tend to be more pessimistic”; and 100, “I tend to be more optimistic”. Although it would have been possible to remain undecided, choosing a value close to 0, we instead see a bimodal distribution:

When you think of the near future, are you more afraid of AI misuse or more hopeful about positive outcomes?


Why worry, and what about

The following two questions are those already alluded to as possibly being overly prone to social-desirability bias. They asked what applications people were worried about, and for what reasons, respectively. Both questions allowed respondents to select however many responses they wanted, intentionally not forcing people to rank things that are not comparable (the way I see it). In both cases though, it was possible to explicitly indicate None (corresponding to “I don’t really find any of these problematic” and “I am not extensively worried”, respectively.)

What applications of AI do you feel are most problematic?

Number of users selecting the respective application in response to the question: What applications of AI do you feel are most problematic?


If you are worried about misuse and negative impacts, what exactly is it that worries you?

Number of users selecting the respective impact in response to the question: If you are worried about misuse and negative impacts, what exactly is it that worries you?


Complementing these questions, it was possible to enter further thoughts and concerns in free-form. Although I can’t cite everything that was mentioned here, recurring themes were:

  • Misuse of AI to the wrong purposes, by the wrong people, and at scale.

  • Not feeling responsible for how one’s algorithms are used (the I’m just a software engineer topos).

  • Reluctance, in AI but in society overall as well, to even discuss the topic (ethics).

Finally, although this was mentioned just once, I’d like to relay a comment that went in a direction absent from all provided answer options, but that probably should have been there already: AI being used to construct social credit systems.

“It’s also that you somehow might have to learn to game the algorithm, which will make AI application forcing us to behave in some way to be scored good. That moment scares me when the algorithm is not only learning from our behavior but we behave so that the algorithm predicts us optimally (turning every use case around).”

Conclusion

This has become a long text. But seeing how much time respondents took to answer the many questions, often including lots of detail in the free-form answers, it seemed like a matter of decency to go into some detail in the analysis and report as well.

Thanks again to everyone who took part! We hope to make this a recurring thing, and will strive to design the next edition in a way that makes answers even more information-rich.

Thanks for reading!

  1. Calling it an abbreviation is, in..


Rearranging the Visual World

Rearranging objects (such as organizing books on a bookshelf, moving utensils on a dinner table, or pushing piles of coffee beans) is a fundamental skill that can enable robots to physically interact with our diverse and unstructured world. While easy for people, accomplishing such tasks remains an open research challenge for embodied machine learning (ML) systems, as it requires both high-level and low-level perceptual reasoning. For example, when stacking a pile of books, one might consider where the books should be stacked, and in which order, while ensuring that the edges of the books align with each other to form a neat pile.

Across many application areas in ML, simple differences in model architecture can exhibit vastly different generalization properties. Therefore, one might ask whether there are certain deep network architectures that favor simple underlying elements of the rearrangement problem. Convolutional architectures, for example, are common in computer vision as they encode translational invariance, yielding the same response even if an image is shifted, while Transformer architectures are common in language processing because they exploit self-attention to capture long-range contextual dependencies. In robotics applications, one common architectural element is to use object-centric representations such as poses, keypoints, or object descriptors inside learned models, but these representations require additional training data (often manually annotated) and struggle to describe difficult scenarios such as deformables (e.g., playdough), fluids (honey), or piles of stuff (chopped onions).

Today, we present the Transporter Network, a simple model architecture for learning vision-based rearrangement tasks, which appeared as a publication and plenary talk during CoRL 2020. Transporter Nets use a novel approach to 3D spatial understanding that avoids reliance on object-centric representations, making them general for vision-based manipulation but far more sample efficient than benchmarked end-to-end alternatives. As a consequence, they are fast and practical to train on real robots. We are also releasing an accompanying open-source implementation of Transporter Nets together with Ravens, our new simulated benchmark suite of ten vision-based manipulation tasks.

Transporter Networks: Rearranging the Visual World for Robotic Manipulation
The key idea behind the Transporter Network architecture is that one can formulate the rearrangement problem as learning how to move a chunk of 3D space. Rather than relying on an explicit definition of objects (which is bound to struggle at capturing all edge cases), 3D space is a much broader definition for what could serve as the atomic units being rearranged, and can broadly encompass an object, part of an object, or multiple objects, etc. Transporter Nets leverage this structure by capturing a deep representation of the 3D visual world, then overlaying parts of it on itself to imagine various possible rearrangements of 3D space. It then chooses the rearrangements that best match those it has seen during training (e.g., from expert demonstrations), and uses them to parameterize robot actions. This formulation allows Transporter Nets to generalize to unseen objects and enables them to better exploit geometric symmetries in the data, so that they can extrapolate to new scene configurations. Transporter Nets are applicable to a wide variety of rearrangement tasks for robotic manipulation, expanding beyond our earlier models, such as affordance-based manipulation and TossingBot, that focus only on grasping and tossing.

Transporter Nets capture a deep representation of the visual world, then overlay parts of it on itself to imagine various possible rearrangements of 3D space to find the best one and inform robot actions.
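
To make the “overlay parts of it on itself” idea a bit more concrete, here is a minimal sketch in R using the torch package (not the released implementation; the shapes, channel counts, and crop size are invented): the features cropped around a candidate pick are used as a convolution kernel and cross-correlated with the scene features, so that high responses mark placements resembling those seen in demonstrations.

library(torch)

# Invented shapes: one scene feature map and one feature crop around a candidate pick.
scene_feat <- torch_randn(1, 16, 160, 160)  # [batch, channels, height, width]
crop_feat  <- torch_randn(1, 16, 32, 32)    # features of the "chunk of 3D space" to move

# Slide the crop over the scene (a cross-correlation): each output value scores
# how well the scene at that location matches the cropped chunk.
place_scores <- nnf_conv2d(scene_feat, crop_feat, padding = 15)

# The highest-scoring location can then parameterize where the robot places the chunk.
best_place <- torch_argmax(place_scores)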

Ravens Benchmark
To evaluate the performance of Transporter Nets in a consistent environment for fair comparisons to baselines and ablations, we developed Ravens, a benchmark suite of ten simulated vision-based rearrangement tasks. Ravens features a Gym API with a built-in stochastic oracle to evaluate the sample efficiency of imitation learning methods. Ravens avoids assumptions that cannot transfer to a real setup: observation data contains only RGB-D images and camera parameters; actions are end effector poses (transposed into joint positions with inverse kinematics).

Experiments on these ten tasks show that Transporter Nets are orders of magnitude more sample efficient than other end-to-end methods, and are capable of achieving over 90% success on many tasks with just 100 demonstrations, while the baselines struggle to generalize with the same amount of data. In practice, this makes collecting enough demonstrations a more viable option for training these models on real robots (which we show examples of below).

Our new Ravens benchmark includes ten simulated vision-based manipulation tasks, including pushing and pick-and-place, for which experiments show that Transporter Nets are orders of magnitude more sample efficient than other end-to-end methods. Ravens features a Gym API with a built-in stochastic oracle to evaluate the sample efficiency of imitation learning methods.

Highlights
Given 10 example demonstrations, Transporter Nets can learn pick and place tasks such as stacking plates (surprisingly easy to misplace!), multimodal tasks like aligning any corner of a box to a marker on the tabletop, or building a pyramid of blocks.

By leveraging closed-loop visual feedback, Transporter Nets have the capacity to learn various multi-step sequential tasks with a modest number of demonstrations: such as moving disks for Tower of Hanoi, palletizing boxes, or assembling kits of new objects not seen during training. These tasks have considerably “long horizons”, meaning that to solve the task the model must correctly sequence many individual choices. Policies also tend to learn emergent recovery behaviors.

One surprising thing about these results was that beyond just perception, the models were starting to learn behaviors that resemble high-level planning. For example, to solve Towers of Hanoi, the models have to pick which disk to move next, which requires recognizing the state of the board based on the current visible disks and their positions. With a box-palletizing task, the models must locate the empty spaces of the pallet, and identify how new boxes can fit into those voids. Such behaviors are exciting because they suggest that with all the baked-in invariances, the model can focus its capacity on learning the more high-level patterns in manipulation.

Transporter Nets can also learn tasks that use any motion primitive defined by two end effector poses, such as pushing piles of small objects into a target set, or reconfiguring a deformable rope to connect the two end-points of a 3-sided square. This suggests that rigid spatial displacements can serve as useful priors for nonrigid ones.

Conclusion
Transporter Nets bring a promising approach to learning vision-based manipulation, but are not without limitations. For example, they can be susceptible to noisy 3D data, we have only demonstrated them for sparse waypoint-based control with motion primitives, and it remains unclear how to extend them beyond spatial action spaces to force or torque-based actions. But overall, we are excited about this direction of work, and we hope that it provides inspiration for extensions beyond the applications we’ve discussed. For more details, please check out our paper.

Acknowledgements
This research was done by Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Vikas Sindhwani, and Johnny Lee, with special thanks to Ken Goldberg, Razvan Surdulescu, Daniel Seita, Ayzaan Wahid, Vincent Vanhoucke, Anelia Angelova, Kendra Byrne, for helpful feedback on writing; Sean Snyder, Jonathan Vela, Larry Bisares, Michael Villanueva, Brandon Hurd for operations and hardware support; Robert Baruch for software infrastructure, Jared Braun for UI contributions; Erwin Coumans for PyBullet advice; Laura Graesser for video narration.


3D Scene Understanding with TensorFlow 3D

The growing ubiquity of 3D sensors (e.g., Lidar, depth sensing cameras and radar) over the last few years has created a need for scene understanding technology that can process the data these devices capture. Such technology can enable machine learning (ML) systems that use these sensors, like autonomous cars and robots, to navigate and operate in the real world, and can create an improved augmented reality experience on mobile devices. The field of computer vision has recently begun making good progress in 3D scene understanding, including models for mobile 3D object detection, transparent object detection, and more, but entry to the field can be challenging due to the limited availability of tools and resources that can be applied to 3D data.

In order to further improve 3D scene understanding and reduce barriers to entry for interested researchers, we are releasing TensorFlow 3D (TF 3D), a highly modular and efficient library that is designed to bring 3D deep learning capabilities into TensorFlow. TF 3D provides a set of popular operations, loss functions, data processing tools, models and metrics that enables the broader research community to develop, train and deploy state-of-the-art 3D scene understanding models.

TF 3D contains training and evaluation pipelines for state-of-the-art 3D semantic segmentation, 3D object detection and 3D instance segmentation, with support for distributed training. It also enables other potential applications like 3D object shape prediction, point cloud registration and point cloud densification. In addition, it offers a unified dataset specification and configuration for training and evaluation of the standard 3D scene understanding datasets. It currently supports the Waymo Open, ScanNet, and Rio datasets. However, users can freely convert other popular datasets, such as NuScenes and Kitti, into a similar format and use them in the pre-existing or custom created pipelines, and can leverage TF 3D for a wide variety of 3D deep learning research and applications, from quickly prototyping and trying new ideas to deploying a real-time inference system.

An example output of the 3D object detection model in TF 3D on a frame from Waymo Open Dataset is shown on the left. An example output of the 3D instance segmentation model on a scene from ScanNet dataset is shown on the right.

Here, we will present the efficient and configurable sparse convolutional backbone that is provided in TF 3D, which is the key to achieving state-of-the-art results on various 3D scene understanding tasks. Furthermore, we will go over each of the three pipelines that TF 3D currently supports: 3D semantic segmentation, 3D object detection and 3D instance segmentation.

3D Sparse Convolutional Network
The 3D data captured by sensors often consists of a scene that contains a set of objects of interest (e.g. cars, pedestrians, etc.) surrounded mostly by open space, which is of limited (or no) interest. As such, 3D data is inherently sparse. In such an environment, standard implementation of convolutions would be computationally intensive and consume a large amount of memory. So, in TF 3D we use submanifold sparse convolution and pooling operations, which are designed to process 3D sparse data more efficiently. Sparse convolutional models are core to the state-of-the-art methods applied in most outdoor self-driving (e.g. Waymo, NuScenes) and indoor benchmarks (e.g. ScanNet).
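
To see why sparsity matters, consider a toy voxelization in R (illustrative only, not TF 3D code): most cells of a dense grid stay empty, and a submanifold sparse convolution only ever visits the occupied ones.

# Voxelize a toy point cloud and keep one entry per occupied voxel.
voxelize <- function(points, voxel_size = 0.1) {
  idx <- floor(points / voxel_size)                      # integer voxel coordinates
  key <- paste(idx[, 1], idx[, 2], idx[, 3], sep = "_")  # hash each voxel
  keep <- !duplicated(key)
  idx[keep, , drop = FALSE]                              # coordinates of occupied voxels
}

pts <- matrix(runif(3000, min = 0, max = 10), ncol = 3)  # 1,000 random 3D points
occupied <- voxelize(pts)
nrow(occupied)   # far fewer occupied voxels than the 100^3 cells of a dense grid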

We also use various CUDA techniques to speed up the computation (e.g., hashing, partitioning / caching the filter in shared memory, and using bit operations). Experiments on the Waymo Open dataset show that this implementation is around 20x faster than a well-designed implementation with pre-existing TensorFlow operations.

TF 3D then uses the 3D submanifold sparse U-Net architecture to extract a feature for each voxel. The U-Net architecture has proven to be effective by letting the network extract both coarse and fine features and combining them to make the predictions. The U-Net network consists of three modules, an encoder, a bottleneck, and a decoder, each of which consists of a number of sparse convolution blocks with possible pooling or un-pooling operations.

A 3D sparse voxel U-Net architecture. Note that a horizontal arrow takes in the voxel features and applies a submanifold sparse convolution to it. An arrow that is moving down performs a submanifold sparse pooling. An arrow that is moving up will gather back the pooled features, concatenate them with the features coming from the horizontal arrow, and perform a submanifold sparse convolution on the concatenated features.

The sparse convolutional network described above is the backbone for the 3D scene understanding pipelines that are offered in TF 3D. Each of the models described below uses this backbone network to extract features for the sparse voxels, and then adds one or multiple additional prediction heads to infer the task of interest. The user can configure the U-Net network by changing the number of encoder / decoder layers and the number of convolutions in each layer, and by modifying the convolution filter sizes, which enables a wide range of speed / accuracy tradeoffs to be explored through the different backbone configurations.

3D Semantic Segmentation
The 3D semantic segmentation model has only one output head for predicting the per-voxel semantic scores, which are mapped back to points to predict a semantic label per point.

3D semantic segmentation of an indoor scene from ScanNet dataset.

3D Instance Segmentation
In 3D instance segmentation, in addition to predicting semantics, the goal is to group the voxels that belong to the same object together. The 3D instance segmentation algorithm used in TF 3D is based on our previous work on 2D image segmentation using deep metric learning. The model predicts a per-voxel instance embedding vector as well as a semantic score for each voxel. The instance embedding vectors map the voxels to an embedding space where voxels that correspond to the same object instance are close together, while those that correspond to different objects are far apart. In this case, the input is a point cloud instead of an image, and it uses a 3D sparse network instead of a 2D image network. At inference time, a greedy algorithm picks one instance seed at a time, and uses the distance between the voxel embeddings to group them into segments.
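
A rough sketch of that greedy grouping step, in R (my own simplification; the actual TF 3D inference code differs): pick an unassigned voxel as a seed, claim every voxel whose embedding lies within a threshold of the seed’s, and repeat until everything is assigned.

greedy_group <- function(embeddings, threshold = 0.5) {
  n <- nrow(embeddings)
  instance <- rep(NA_integer_, n)
  current <- 0L
  while (anyNA(instance)) {
    seed <- which(is.na(instance))[1]                 # next unassigned voxel becomes a seed
    current <- current + 1L
    diff <- sweep(embeddings, 2, embeddings[seed, ])  # difference of every voxel to the seed
    d <- sqrt(rowSums(diff^2))                        # Euclidean distance in embedding space
    instance[is.na(instance) & d < threshold] <- current
  }
  instance                                            # instance id per voxel
}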

3D Object Detection
The 3D object detection model predicts per-voxel size, center, and rotation matrices and the object semantic scores. At inference time, a box proposal mechanism is used to reduce the hundreds of thousands of per-voxel box predictions into a few accurate box proposals, and then at training time, box prediction and classification losses are applied to per-voxel predictions. We apply a Huber loss on the distance between predicted and the ground-truth box corners. Since the function that estimates the box corners from its size, center and rotation matrix is differentiable, the loss will automatically propagate back to those predicted object properties. We use a dynamic box classification loss that classifies a box that strongly overlaps with the ground-truth as positive and classifies the non-overlapping boxes as negative.
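
In symbols, the corner loss described above has roughly the following form (my notation; the exact distance and Huber threshold used in TF 3D may differ):

L_{box} = \sum_{k=1}^{8} H\left( \lVert c_k^{pred} - c_k^{gt} \rVert \right), \quad \text{with } H(x) = \tfrac{1}{2} x^2 \text{ if } |x| \le \delta, \text{ and } H(x) = \delta \left( |x| - \tfrac{1}{2}\delta \right) \text{ otherwise.}

Because the predicted corners c_k^{pred} are a differentiable function of the per-voxel size, center, and rotation matrix, gradients on L_{box} propagate back to all three properties, which is exactly the point made above.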

Our 3D object detection results on ScanNet dataset.

In our recent paper, “DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes”, we describe in detail the single-stage weakly supervised learning algorithm used for object detection in TF 3D. In addition, in a follow up work, we extended the 3D object detection model to leverage temporal information by proposing a sparse LSTM-based multi-frame model. We go on to show that this temporal model outperforms the frame-by-frame approach by 7.5% in the Waymo Open dataset.

The 3D object detection and shape prediction model introduced in the DOPS paper. A 3D sparse U-Net is used to extract a feature vector for each voxel. The object detection module uses these features to propose 3D boxes and semantic scores. At the same time, the other branch of the network predicts a shape embedding that is used to output a mesh for each object.

Ready to Get Started?
We’ve certainly found this codebase to be useful for our 3D computer vision projects, and we hope that you will as well. Contributions to the codebase are welcome and please stay tuned for our own further updates to the framework. To get started please visit our github repository.

Acknowledgements
The release of the TensorFlow 3D codebase and model has been the result of widespread collaboration among Google researchers with feedback and testing from product groups. In particular we want to highlight the core contributions by Alireza Fathi and Rui Huang (work performed while at Google), with special additional thanks to Guangda Lai, Abhijit Kundu, Pei Sun, Thomas Funkhouser, David Ross, Caroline Pantofaru, Johanna Wald, Angela Dai and Matthias Niessner.


Uncovering Unknown Unknowns in Machine Learning

The performance of machine learning (ML) models depends both on the learning algorithms, as well as the data used for training and evaluation. The role of the algorithms is well studied and the focus of a multitude of challenges, such as SQuAD, GLUE, ImageNet, and many others. In addition, there have been efforts to also improve the data, including a series of workshops addressing issues for ML evaluation. In contrast, research and challenges that focus on the data used for evaluation of ML models are not commonplace. Furthermore, many evaluation datasets contain items that are easy to evaluate, e.g., photos with a subject that is easy to identify, and thus they miss the natural ambiguity of real world context. The absence of ambiguous real-world examples in evaluation undermines the ability to reliably test machine learning performance, which makes ML models prone to develop “weak spots”, i.e., classes of examples that are difficult or impossible for a model to accurately evaluate, because that class of examples is missing from the evaluation set.

To address the problem of identifying these weaknesses in ML models, we recently launched the Crowdsourcing Adverse Test Sets for Machine Learning (CATS4ML) Data Challenge at HCOMP 2020 (open until 30 April, 2021 to researchers and developers worldwide). The goal of the challenge is to raise the bar in ML evaluation sets and to find as many examples as possible that are confusing or otherwise problematic for algorithms to process. CATS4ML relies on people’s abilities and intuition to spot new data examples about which machine learning is confident, but actually misclassifies.

What are ML “Weak Spots”?
There are two categories of weak spots: known unknowns and unknown unknowns. Known unknowns are examples for which a model is unsure about the correct classification. The research community continues to study this in a field known as active learning, and has found the solution to be, in very general terms, to interactively solicit new labels from people on uncertain examples. For example, if a model is not certain whether or not the subject of a photo is a cat, a person is asked to verify; but if the system is certain, a person is not asked. While there is room for improvement in this area, what is comforting is that the confidence of the model is correlated with its performance, i.e., one can see what the model doesn’t know.
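
As a toy illustration of that idea (nothing to do with the actual CATS4ML tooling), routing low-confidence predictions to people is just a filter on model confidence:

probs <- c(0.99, 0.52, 0.97, 0.61, 0.88)  # made-up confidences for the predicted class
ask_a_person <- which(probs < 0.7)        # known unknowns: the model itself is unsure
ask_a_person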

Unknown unknowns, on the other hand, are examples where a model is confident about its answer, but is actually wrong. Efforts to proactively discover unknown unknowns (e.g., Attenberg 2015 and Crawford 2019) have helped uncover a multitude of unintended machine behaviours. In contrast to such approaches for the discovery of unknown unknowns, generative adversarial networks (GANs) generate unknown unknowns for image recognition models in the form of optical illusions for computers that cause deep learning models to make mistakes beyond human perception. While GANs uncover model exploits in the event of an intentional manipulation, real-world examples can better highlight a model’s failures in its day-to-day performance. These real-world examples are the unknown unknowns of interest to CATS4ML — the challenge aims to gather unmanipulated examples that humans can reliably interpret but on which many ML models would confidently disagree.
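
Unknown unknowns, by contrast, only surface once an independent human label disagrees with a confident prediction; a minimal sketch with made-up vectors:

pred  <- c("cat", "dog", "cat", "fox")     # model predictions
truth <- c("cat", "dog", "fox", "fox")     # human-verified labels
conf  <- c(0.99, 0.95, 0.97, 0.60)         # model confidence in its prediction
unknown_unknowns <- which(pred != truth & conf > 0.9)
unknown_unknowns                           # confidently wrong: the cases CATS4ML is after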

Example illustrating how optical illusions for computers caused by adversarial noise help discover machine manipulated unknown unknowns for ML models (based on Brown 2018).

First Edition of CATS4ML Data Challenge: Open Images Dataset
The CATS4ML Data Challenge focuses on visual recognition, using images and labels from the Open Images Dataset. The target images for the challenge are selected from the Open Images Dataset along with a set of 24 target labels from the same dataset. The challenge participants are invited to invent new and creative ways to explore this existing publicly available dataset and, focussed on a list of pre-selected target labels, discover examples of unknown unknowns for ML models.

Examples from the Open Images Dataset as possible unknown unknowns for ML models.

CATS4ML is a complementary effort to FAIR’s recently introduced DynaBench research platform for dynamic data collection. Where DynaBench tackles issues with static benchmarks using ML models with humans in the loop, CATS4ML focuses on improving evaluation datasets for ML by encouraging the exploration of existing ML benchmarks for adverse examples that can be unknown unknowns. The results will help detect and avoid future errors, and also will give insights to model explainability.

In this way, CATS4ML aims to raise greater awareness of the problem by providing dataset resources that developers can use to uncover the weak spots of their algorithms. This will also inform researchers on how to create benchmark datasets for machine learning that are more balanced, diverse and socially aware.

Get Involved
We invite the global community of ML researchers and practitioners to join us in the effort of discovering interesting, difficult examples from the Open Images Dataset. Register on the challenge website, download the target images and labeled data, contribute the images you discover and join the competition for the winning participant!

To score points in this competition, participants should submit a set of image-label pairs that will be confirmed by human-in-the-loop raters, whose votes should be in disagreement with the average machine score for the label over a number of machine learning models.

An example of how a submitted image can score points. The same image can score as a false positive (Left) and as a false negative (Right) with two different labels. In both cases the human verification is in disagreement with the machine score. Participants score on submitted image-label pairs, which means that one and the same image can be an example of an ML unknown unknown for different labels.

The challenge is open until 30 April, 2021 to researchers and developers worldwide. To learn more about CATS4ML and how to join, please review these slides and visit the challenge website.

Acknowledgements
The release of the CATS4ML Data Challenge has been possible thanks to the hard work of a lot of people including, but not limited to, the following (in alphabetical order of last name): Osman Aka, Ken Burke, Tulsee Doshi, Mig Gerard, Victor Gomes, Shahab Kamali, Igor Karpov, Devi Krishna, Daphne Luong, Carey Radebaugh, Jamie Taylor, Nithum Thain, Kenny Wibowo, Ka Wong, and Tong Zhou.


torch, tidymodels, and high-energy physics

So what’s with the clickbait (high-energy physics)? Well, it’s not just clickbait. To showcase TabNet, we will be using the Higgs dataset (Baldi, Sadowski, and Whiteson (2014)), available at UCI Machine Learning Repository. I don’t know about you, but I always enjoy using datasets that motivate me to learn more about things. But first, let’s get acquainted with the main actors of this post!

TabNet

TabNet was introduced in Arik and Pfister (2020). It is interesting for three reasons:

  • It claims highly competitive performance on tabular data, an area where deep learning has not gained much of a reputation yet.

  • TabNet includes interpretability1 features by design.

  • It is claimed to profit significantly from self-supervised pre-training, again in an area where such pre-training is anything but common, and thus well worth a mention.

In this post, we won’t go into (3), but we do expand on (2), the ways TabNet allows access to its inner workings.

How do we use TabNet from R? The torch ecosystem includes a package – tabnet – that not only implements the model of the same name, but also allows you to make use of it as part of a tidymodels workflow.

tidymodels

To many R-using data scientists, the tidymodels framework will not be a stranger. tidymodels provides a high-level, unified approach to model training, hyperparameter optimization, and inference.

tabnet is the first (of many, we hope) torch model that lets you use a tidymodels workflow all the way: from data pre-processing through hyperparameter tuning to performance evaluation and inference. While the first, as well as the last, may seem nice-to-have but not “mandatory”, the tuning experience is likely to be something you won’t want to do without!

Using tabnet with tidymodels

In this post, we first showcase a tabnet-using workflow in a nutshell, making use of hyperparameter settings reported in the paper.

Then, we initiate a tidymodels-powered hyperparameter search, focusing on the basics but also, encouraging you to dig deeper at your leisure.

Finally, we circle back to the promise of interpretability, demonstrating what is offered by tabnet and ending in a short discussion.

In the flow with TabNet

As usual, we start by loading all required libraries. We also set a random seed, on the R as well as the torch sides. When model interpretation is part of your task, you will want to investigate the role of random initialization.

library(torch)
library(tabnet)
library(tidyverse)
library(tidymodels)
library(finetune) # to use tuning functions from the new finetune package
library(vip) # to plot feature importances

set.seed(777)
torch_manual_seed(777)

Next, we load the dataset.

# download from https://archive.ics.uci.edu/ml/datasets/HIGGS
higgs <- read_csv(
  "HIGGS.csv",
  col_names = c("class", "lepton_pT", "lepton_eta", "lepton_phi", "missing_energy_magnitude",
                "missing_energy_phi", "jet_1_pt", "jet_1_eta", "jet_1_phi", "jet_1_b_tag",
                "jet_2_pt", "jet_2_eta", "jet_2_phi", "jet_2_b_tag", "jet_3_pt", "jet_3_eta",
                "jet_3_phi", "jet_3_b_tag", "jet_4_pt", "jet_4_eta", "jet_4_phi", "jet_4_b_tag",
                "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"),
  col_types = "fdddddddddddddddddddddddddddd"
  )

What’s this about? In high-energy physics, the search for new particles takes place at powerful particle accelerators, such as (and most prominently) CERN’s Large Hadron Collider. In addition to actual experiments, simulation plays an important role. In simulations, “measurement” data are generated according to different underlying hypotheses, resulting in distributions that can be compared with each other. Given the likelihood of the simulated data, the goal then is to make inferences about the hypotheses.

The above dataset (Baldi, Sadowski, and Whiteson (2014)) results from just such a simulation. It explores what features could be measured assuming two different processes. In the first process, two gluons collide, and a heavy Higgs boson is produced; this is the signal process, the one we’re interested in. In the second, the collision of the gluons results in a pair of top quarks – this is the background process.

Through different intermediaries, both processes result in the same end products – so tracking these does not help. Instead, what the paper authors did was simulate kinematic features (momenta, specifically) of decay products, such as leptons (electrons and muons) and particle jets. In addition, they constructed a number of high-level features, features that presuppose domain knowledge. In their article, they showed that, in contrast to other machine learning methods, deep neural networks did nearly as well when presented with only the low-level features (the momenta) as when given only the high-level features.

Certainly, it would be interesting to double-check these results on tabnet, and then, look at the respective feature importances. However, given the size of the dataset, non-negligible computing resources (and patience) will be required.

Speaking of size, let’s take a look:

higgs %>% glimpse()
Rows: 11,000,000
Columns: 29
$ class                    <fct> 1.000000000000000000e+00, 1.000000…
$ lepton_pT                <dbl> 0.8692932, 0.9075421, 0.7988347, 1…
$ lepton_eta               <dbl> -0.6350818, 0.3291473, 1.4706388, …
$ lepton_phi               <dbl> 0.225690261, 0.359411865, -1.63597…
$ missing_energy_magnitude <dbl> 0.3274701, 1.4979699, 0.4537732, 1…
$ missing_energy_phi       <dbl> -0.68999320, -0.31300953, 0.425629…
$ jet_1_pt                 <dbl> 0.7542022, 1.0955306, 1.1048746, 1…
$ jet_1_eta                <dbl> -0.24857314, -0.55752492, 1.282322…
$ jet_1_phi                <dbl> -1.09206390, -1.58822978, 1.381664…
$ jet_1_b_tag              <dbl> 0.000000, 2.173076, 0.000000, 0.00…
$ jet_2_pt                 <dbl> 1.3749921, 0.8125812, 0.8517372, 2…
$ jet_2_eta                <dbl> -0.6536742, -0.2136419, 1.5406590,…
$ jet_2_phi                <dbl> 0.9303491, 1.2710146, -0.8196895, …
$ jet_2_b_tag              <dbl> 1.107436, 2.214872, 2.214872, 2.21…
$ jet_3_pt                 <dbl> 1.1389043, 0.4999940, 0.9934899, 1…
$ jet_3_eta                <dbl> -1.578198314, -1.261431813, 0.3560…
$ jet_3_phi                <dbl> -1.04698539, 0.73215616, -0.208777…
$ jet_3_b_tag              <dbl> 0.000000, 0.000000, 2.548224, 0.00…
$ jet_4_pt                 <dbl> 0.6579295, 0.3987009, 1.2569546, 0…
$ jet_4_eta                <dbl> -0.01045457, -1.13893008, 1.128847…
$ jet_4_phi                <dbl> -0.0457671694, -0.0008191102, 0.90…
$ jet_4_b_tag              <dbl> 3.101961, 0.000000, 0.000000, 0.00…
$ m_jj                     <dbl> 1.3537600, 0.3022199, 0.9097533, 0…
$ m_jjj                    <dbl> 0.9795631, 0.8330482, 1.1083305, 1…
$ m_lv                     <dbl> 0.9780762, 0.9856997, 0.9856922, 0…
$ m_jlv                    <dbl> 0.9200048, 0.9780984, 0.9513313, 0…
$ m_bb                     <dbl> 0.7216575, 0.7797322, 0.8032515, 0…
$ m_wbb                    <dbl> 0.9887509, 0.9923558, 0.8659244, 1…
$ m_wwbb                   <dbl> 0.8766783, 0.7983426, 0.7801176, 0…

Eleven million “observations” (kind of) – that’s a lot! Like the authors of the TabNet paper (Arik and Pfister (2020)), we’ll use 500,000 of these for validation. (Unlike them, though, we won’t be able to train for 870,000 iterations!)

The first variable, class, is either 1 or 0, depending on whether a Higgs boson was present or not. While in experiments, only a tiny fraction of collisions produce one of those, both classes are about equally frequent in this dataset.

As for the predictors, the last seven are high-level (derived). All others are “measured”.
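
Two quick checks (not part of the original recipe, but cheap to run) confirm this: counting the two classes, and listing the derived columns.

higgs %>% count(class)                     # both classes roughly equally frequent

higgs %>% select(m_jj:m_wwbb) %>% names()  # the seven high-level (derived) features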

Data loaded, we’re ready to build a tidymodels workflow, resulting in a short sequence of concise steps.

First, split the data:

n <- 11000000
n_test <- 500000
test_frac <- n_test/n

split <- initial_time_split(higgs, prop = 1 - test_frac)
train <- training(split)
test  <- testing(split)

Second, create a recipe. We want to predict class from all other features present:

rec <- recipe(class ~ ., train) 

Third, create a parsnip model specification of class tabnet. The parameters passed are those reported by the TabNet paper, for the S-sized model variant used on this dataset.2

# hyperparameter settings (apart from epochs) as per the TabNet paper (TabNet-S)
mod <- tabnet(epochs = 3, batch_size = 16384, decision_width = 24, attention_width = 26,
              num_steps = 5, penalty = 0.000001, virtual_batch_size = 512, momentum = 0.6,
              feature_reusage = 1.5, learn_rate = 0.02) %>%
  set_engine("torch", verbose = TRUE) %>%
  set_mode("classification")

Fourth, bundle recipe and model specifications in a workflow:

wf <- workflow() %>%
  add_model(mod) %>%
  add_recipe(rec)

Fifth, train the model. This will take some time. Training finished, we save the trained parsnip model, so we can reuse it at a later time.

fitted_model <- wf %>% fit(train)

# access the underlying parsnip model and save it to RDS format
# depending on when you read this, a nice wrapper may exist
# see https://github.com/mlverse/tabnet/issues/27  
fitted_model$fit$fit$fit %>% saveRDS("saved_model.rds")
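
Re-loading later is the mirror image (a sketch; re-attaching the restored object to a parsnip / workflow wrapper may require the helper discussed in the issue linked above):

# restore the raw tabnet model saved above
reloaded_model <- readRDS("saved_model.rds")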

After three epochs, loss was at 0.609.

Sixth – and finally – we ask the model for test-set predictions and have accuracy computed.

preds <- test %>% 
  bind_cols(predict(fitted_model, test))

yardstick::accuracy(preds, class, .pred_class)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.672

We didn’t quite arrive at the accuracy reported in the TabNet paper (0.783), but then, we only trained for a tiny fraction of the time.

In case you’re thinking: well, that was a nice and effortless way of training a neural network! – just wait and see how easy hyperparameter tuning can get. In fact, no need to wait, we’ll take a look right now.

TabNet tuning

For hyperparameter tuning, the tidymodels framework makes use of cross-validation. With a dataset of considerable size, some time and patience are needed; for the purpose of this post, I’ll use 1/1,000 of the observations.

Changes to the above workflow start at model specification. Let’s say we’ll leave most settings fixed, but vary the TabNet-specific hyperparameters decision_width, attention_width, and num_steps, as well as the learning rate:3

mod <- tabnet(epochs = 1, batch_size = 16384, decision_width = tune(), attention_width = tune(),
              num_steps = tune(), penalty = 0.000001, virtual_batch_size = 512, momentum = 0.6,
              feature_reusage = 1.5, learn_rate = tune()) %>%
  set_engine("torch", verbose = TRUE) %>%
  set_mode("classification")

Workflow creation looks the same as before:

wf <- workflow() %>%
  add_model(mod) %>%
  add_recipe(rec)

Next, we specify the hyperparameter ranges we’re interested in, and call one of the grid construction functions from the dials package to build one for us. If it weren’t for demonstration purposes, we’d probably want more than eight alternatives, and would pass a higher size to grid_max_entropy().

grid <-
  wf %>%
  parameters() %>%
  update(
    decision_width = decision_width(range = c(20, 40)),
    attention_width = attention_width(range = c(20, 40)),
    num_steps = num_steps(range = c(4, 6)),
    learn_rate = learn_rate(range = c(-2.5, -1))
  ) %>%
  grid_max_entropy(size = 8)

grid
# A tibble: 8 x 4
  learn_rate decision_width attention_width num_steps
       <dbl>          <int>           <int>     <int>
1    0.00529             28              25         5
2    0.0858              24              34         5
3    0.0230              38              36         4
4    0.0968              27              23         6
5    0.0825              26              30         4
6    0.0286              36              25         5
7    0.0230              31              37         5
8    0.00341             39              23         5

To search the space, we use tune_race_anova() from the new finetune package, making use of five-fold cross-validation:

ctrl <- control_race(verbose_elim = TRUE)
folds <- vfold_cv(train, v = 5)
set.seed(777)

res <- wf %>% 
    tune_race_anova(
    resamples = folds, 
    grid = grid,
    control = ctrl
  )

We can now extract the best hyperparameter combinations:

res %>% show_best("accuracy") %>% select(- c(.estimator, .config))
# A tibble: 5 x 8
  learn_rate decision_width attention_width num_steps .metric   mean     n std_err
       <dbl>          <int>           <int>     <int> <chr>    <dbl> <int>   <dbl>
1     0.0858             24              34         5 accuracy 0.516     5 0.00370
2     0.0230             38              36         4 accuracy 0.510     5 0.00786
3     0.0230             31              37         5 accuracy 0.510     5 0.00601
4     0.0286             36              25         5 accuracy 0.510     5 0.0136 
5     0.0968             27              23         6 accuracy 0.498     5 0.00835

It’s hard to imagine how tuning could be more convenient!
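
As a possible next step, not shown in the original run, one could finalize the workflow with the winning combination and refit on the full training set, using standard tune / workflows helpers:

best_params <- select_best(res, "accuracy")        # the top row from show_best() above
final_wf    <- finalize_workflow(wf, best_params)  # plug the chosen values into the workflow
final_fit   <- final_wf %>% fit(train)             # refit on all training data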

Now, we circle back to the original training workflow, and inspect TabNet’s interpretability features.

TabNet interpretability features

TabNet’s most prominent characteristic is the way – inspired by decision trees – it executes in distinct steps. At each step, it again looks at the original input features, and decides which of those to consider based on lessons learned in prior steps. Concretely, it uses an attention mechanism to learn sparse masks which are then applied to the features.

Now, these masks being “just” model weights means we can extract them and draw conclusions about feature importance. Depending on how we proceed, we can either

  • aggregate mask weights over steps, resulting in global per-feature importances;

  • run the model on a few test samples and aggregate over steps, resulting in observation-wise feature importances; or

  • run the model on a few test samples and extract individual weights observation- as well as step-wise.

This is how to accomplish the above with tabnet.

Per-feature importances

We continue with the fitted_model workflow object we ended up with at the end of part 1. vip::vip is able to display feature importances directly from the parsnip model:

fit <- pull_workflow_fit(fitted_model)
vip(fit) + theme_minimal()
Global feature importances.

Together, two high-level features dominate, accounting for nearly 50% of overall attention. Along with a third high-level feature, ranked fourth, they occupy about 60% of “importance space”.

Observation-level feature importances

We choose the first hundred observations in the test set to extract feature importances. Due to how TabNet enforces sparsity, we see that many features have not been made use of:

ex_fit <- tabnet_explain(fit$fit, test[1:100, ])

ex_fit$M_explain %>%
  mutate(observation = row_number()) %>%
  pivot_longer(-observation, names_to = "variable", values_to = "m_agg") %>%
  ggplot(aes(x = observation, y = variable, fill = m_agg)) +
  geom_tile() +
  theme_minimal() + 
  scale_fill_viridis_c()
Per-observation feature importances.

Per-step, observation-level feature importances

Finally, on the same selection of observations, we again inspect the masks, but this time per decision step:

ex_fit$masks %>% 
  imap_dfr(~mutate(
    .x, 
    step = sprintf("Step %d", .y),
    observation = row_number()
  )) %>% 
  pivot_longer(-c(observation, step), names_to = "variable", values_to = "m_agg") %>% 
  ggplot(aes(x = observation, y = variable, fill = m_agg)) +
  geom_tile() +
  theme_minimal() + 
  theme(axis.text = element_text(size = 5)) +
  scale_fill_viridis_c() +
  facet_wrap(~step)
Per-observation, per-step feature importances.

This is nice: We clearly see how TabNet makes use of different features at different times.

So what do we make of this? It depends. Given the enormous societal importance of this topic – call it interpretability, explainability, or whatever – let’s finish this post with a short discussion.

Interpretable, explainable, …? Beyond the arbitrariness of definitions

An internet search for “interpretable vs. explainable ML” immediately turns up a number of sites confidently stating “interpretable ML is …” and “explainable ML is …”, as though there were no arbitrariness in common-speech definitions. Going deeper, you find articles such as Cynthia Rudin’s “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead” (Rudin (2018)) that present you with a clear-cut, deliberate, instrumentalizable distinction that can actually be used in real-world scenarios.

In a nutshell, what she decides to call explainability is: approximate a black-box model by a simpler (e.g., linear) model and, starting from the simple model, make inferences about how the black-box model works. One of the examples she gives for how this could fail is so striking that I’d like to cite it in full:

Even an explanation model that performs almost identically to a black box model might use completely different features, and is thus not faithful to the computation of the black box. Consider a black box model for criminal recidivism prediction, where the goal is to predict whether someone will be arrested within a certain time after being released from jail/prison. Most recidivism prediction models depend explicitly on age and criminal history, but do not explicitly depend on race. Since criminal history and age are correlated with race in all of our datasets, a fairly accurate explanation model could construct a rule such as “This person is predicted to be arrested because they are black.” This might be an accurate explanation model since it correctly mimics the predictions of the original model, but it would not be faithful to what the original model computes.

What she calls interpretability, in contrast, is deeply related to domain knowledge:

Interpretability is a domain-specific notion […] Usually, however, an interpretable machine learning model is constrained in model form so that it is either useful to someone, or obeys structural knowledge of the domain, such as monotonicity [e.g., 8], causality, structural (generative) constraints, additivity [9], or physical constraints that come from domain knowledge. Often for structured data, sparsity is a useful measure of interpretability […]. Sparse models allow a view of how variables interact jointly rather than individually. […] e.g., in some domains, sparsity is useful, and in others it is not.

If we accept these well-thought-out definitions, what can we say about TabNet? Is looking at attention masks more like constructing a post-hoc model or more like having domain knowledge incorporated? I believe Rudin would argue the former, since

  • the image-classification example she uses to point out weaknesses of explainability techniques employs saliency maps, a technical device comparable, in some ontological sense, to attention masks;

  • the sparsity enforced by TabNet is a technical, not a domain-related constraint;

  • we only know what features were used by TabNet, not how it used them.

On the other hand, one could disagree with Rudin (and others) about the premises. Do explanations have to be modeled after human cognition to be considered valid? Personally, I guess I’m not sure, and to cite from a post by Keith O’Rourke on just this topic of interpretability,

As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

In any case though, we can be sure that this topic’s importance will only grow with time. While in the very early days of the GDPR (the EU General Data Protection Regulation) it was said that Article 22 (on automated decision-making) would have significant impact on how ML is used4, unfortunately the current view seems to be that its wordings are far too vague to have immediate consequences (e.g., Wachter, Mittelstadt, and Floridi (2017)). But this will be a fascinating topic to follow, from a technical as well as a political point of view.

Thanks for reading!

Arik, Sercan O., and Tomas Pfister. 2020. “TabNet: Attentive Interpretable Tabular Learning.” http://arxiv.org/abs/1908.07442.

Baldi, P., P. Sadowski, and D. Whiteson. 2014. “Searching for exotic particles in high-energy physics with deep learning.” Nature Communications 5 (July): 4308. https://doi.org/10.1038/ncomms5308.

Rudin, Cynthia. 2018. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” http://arxiv.org/abs/1811.10154.

Wachter, Sandra, Brent Mittelstadt, and Luciano Floridi. 2017. “Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation.” International Data Privacy Law 7 (2): 76–99. https://doi.org/10.1093/idpl/ipx005.

  1. I’m using the term naively here. For a short discussion, see the final section, Interpretable, explainable, …? Beyond the arbitrariness of definitions.↩︎

  2. Apart from the number of epochs, that is.↩︎

  3. The number of epochs is set to one for demonstration purposes only; in reality, you will want to tune this as well.↩︎

  4. See, e.g., <http://www.odbms.org/2018/07/ai-machine-learning-and-the-gdpr-are-the-wild-west-days-of-advanced-analytics-over/>.↩︎


The ants and the pheromones

TL;DR: this is the last edition of The Morning Paper for now. Plus: one strand of research you won’t want to miss!

I was listening to a BBC Radio 4 podcast recently (More or Less: Behind the Stats – Ants and Algorithms) in which the host Tim Harford is interviewing David Sumpter about his recent book, ‘The ten equations that rule the world.’ One of those equations, the ‘reward equation’, models how ants communicate using pheromones, and how our own brains keep track of rewards using dopamine.

About 4 and a half minutes into the podcast Tim asks a fascinating question: the reward equation includes a decay or ‘forgetting’ parameter, so what happens if you disrupt established solutions for long enough that their hold is broken? For example, the complete disruption to our established routines that Covid has caused over the last year? The answer for ants, if you disrupt all of the pheromone trails around their nest, is that they converge on a new solution in the environment, but it won’t necessarily look the same as the one they had before the disruption. (If you’re interested in the amazing problem-solving skills of ants and how we can learn from them in computer science, I covered ‘Ant algorithms for discrete optimization’ in a previous edition of The Morning Paper). It’s highly likely that the same thing will happen to us when we can eventually return to normal – the patterns that we establish won’t necessarily be the same as the ones we had before the series of lockdowns began.
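
For the curious, a common form of such a reward-tracking update (my paraphrase; not necessarily the exact equation from Sumpter’s book) is

Q_{t+1} = (1 - \rho) Q_t + r_t

where Q_t is the current estimate (the pheromone level, or the expected reward), r_t is the newly observed reward (fresh pheromone), and \rho is the decay or ‘forgetting’ rate: the larger \rho, the faster old trails fade, and the easier it is for a disrupted colony, or routine, to settle on a new solution.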

The lockdowns (as I write this, we’re in another strict lockdown in England, with no end date given) have certainly disrupted my own routines. I’ve lost the time and space that I depended on for studying and writing The Morning Paper – the one-hour each way train journey on my morning commute, and more crucially with two older children both full-time studying from home, the time and space within the home for the many hours of concentrated work required. I don’t think my love of learning will ever leave me though, and at the same time I’ve been branching out and studying other things: philosophy, ethics, physics, a little maths, a little biology,… I’m really enjoying that. My love of computer science remains of course, but when we finally get to lay down our new pheromone trails and establish a new normal, I’m not sure I’m going to want to focus on computer science to the exclusion of all else. It’s been an intense six-and-a-half years doing largely that while writing the blog. For the time being then, I’m putting The Morning Paper back on pause.

Before I wrap up though, I can’t resist pointing you in the direction of one incredibly exciting research project from the Hydro team at Berkeley’s RISELab. Joe Hellerstein recently posted a whole bunch of links and resources in this Twitter thread:

The “PACT” paper is here: New Directions in Cloud Programming, Cheung et al., CIDR 2021.