Google AI Blog

Capturing Special Video Moments with Google Photos

Wednesday, April 3, 2019

Posted by Sudheendra Vijayanarasimhan and David Ross, Software Engineers

Paper: Rethinking the Faster R-CNN Architecture for Temporal Action Localization

An example of the detected action "blowing out candles"

Identifying Actions for Model Training

Comparison to Object Detection

Faster R-CNN architecture for object detection

Temporal Action Localization

Architecture for temporal action localization

Special Considerations for Temporal Action Localization

Actions have much larger variations in durations
The temporal extent of actions varies dramatically—from a fraction of a second to minutes. For long actions, it is not important to understand each and every frame of the action. Instead, we can get a better handle on the action by skimming quickly through the video, using dilated temporal convolutions. This approach allows TALNet to search the video for temporal patterns, while skipping over alternate frames based on a given dilation rate. Analysing the video with several different rates that are selected automatically according to the anchor segment's length enables efficient identification of actions as large as the entire video or as short as a second.
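
As a rough illustration of the idea above, the sketch below applies a 1D dilated convolution to a per-frame feature sequence and picks a dilation rate from an anchor length. It is not TALNet's implementation: `dilated_conv1d` and `dilation_for_anchor` are hypothetical helper names, and the rule for choosing the rate (the largest dilation whose receptive field still fits inside the anchor) is an assumption made for this example.

```python
from typing import List


def dilated_conv1d(signal: List[float], kernel: List[float], dilation: int) -> List[float]:
    """'Valid' 1D convolution that reads every `dilation`-th frame."""
    span = (len(kernel) - 1) * dilation + 1  # receptive field, in frames
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out


def dilation_for_anchor(anchor_len: int, kernel_size: int = 3) -> int:
    """Hypothetical rule: largest dilation whose receptive field fits the anchor.

    Falls back to a dilation of 1 for anchors shorter than the kernel.
    """
    return max(1, (anchor_len - 1) // (kernel_size - 1))


frames = [1, 2, 3, 4, 5]
rate = dilation_for_anchor(anchor_len=5)           # receptive field of 5 frames
summary = dilated_conv1d(frames, [1, 1, 1], rate)  # reads frames 0, 2 and 4
```

With a kernel of three taps, a longer anchor simply maps to a larger dilation, so the same small kernel covers the whole segment while skipping intermediate frames.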

The context before and after an action is important
The moments preceding and following an action instance contain critical information for localization and classification, arguably more so than the spatial context of an object. Therefore, we explicitly encode the temporal context by extending the length of proposal segments on both the left and right by a fixed percentage of the segment's length in both the proposal generation stage and the classification stage.
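
The extension step described above can be sketched as follows. The post only says that a fixed percentage is used; the default `context_ratio` of 0.5, the clamping to the video bounds, and the function name `extend_segment` are assumptions made for this example.

```python
from typing import Optional, Tuple


def extend_segment(start: float, end: float, context_ratio: float = 0.5,
                   video_len: Optional[float] = None) -> Tuple[float, float]:
    """Pad a proposal on both sides by a fixed fraction of its own length."""
    pad = context_ratio * (end - start)
    new_start = max(0.0, start - pad)  # do not run past the start of the video
    new_end = end + pad
    if video_len is not None:
        new_end = min(video_len, new_end)  # do not run past the end of the video
    return new_start, new_end


# A 10-second proposal gains 5 seconds of context on each side.
padded = extend_segment(10.0, 20.0)
```

Because the padding scales with the segment length, short actions receive a little context and long actions receive proportionally more.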

Actions require multi-modal input
Actions are defined by appearance, motion and sometimes even audio information. Therefore, it is important to consider multiple modalities of features for the best results. We use a late fusion scheme for both the proposal generation network and the classification network: each modality has a separate proposal generation network, and the outputs of these networks are combined to obtain the final set of proposals. These proposals are then classified by a separate classification network for each modality, and the per-modality scores are averaged to obtain the final predictions.
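
The final averaging step of such a late fusion scheme might look like the sketch below. It covers only the score averaging for a single proposal; the modality names, the example scores and the helper `late_fusion` are made up for illustration and are not taken from the TALNet code.

```python
from typing import Dict, List


def late_fusion(scores_by_modality: Dict[str, List[float]]) -> List[float]:
    """Average per-class scores produced independently by each modality."""
    all_scores = list(scores_by_modality.values())
    num_classes = len(all_scores[0])
    if any(len(s) != num_classes for s in all_scores):
        raise ValueError("every modality must score the same set of classes")
    return [sum(s[c] for s in all_scores) / len(all_scores) for c in range(num_classes)]


# Hypothetical class scores for one proposal from an appearance and a motion network.
fused = late_fusion({"rgb": [0.8, 0.2], "flow": [0.4, 0.6]})
```

Each network is trained and run on its own input, and the modalities only meet at the very end, which is what distinguishes late fusion from concatenating features early in the network.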

TALNet in Action

An example of the detected action "sliding down a slide"

An example of the detected actions "jump into the pool" (left), "twirl in a dress" (center) and "feed baby a spoonful" (right).

Next steps

Acknowledgements

Special thanks to Tim Novikoff and Yu-Wei Chao, as well as Bryan Seybold, Lily Kharevych, Siyu Gu, Tracy Gu, Tracy Utley, Yael Marzan, Jingyu Cui, Balakrishnan Varadarajan and Paul Natsev, for their critical contributions to this project.

Using Deep Learning to Improve Usability on Mobile Devices

Tuesday, April 2, 2019

Posted by Yang Li, Research Scientist, Google AI

Paper: Modeling Mobile Interface Tappability Using Crowdsourcing and Deep Learning (CHI'19)

Predicting Tappability with Deep Learning

Heatmaps displaying the accuracy of tappable and non-tappable elements by location, where warmer colors represent areas of higher accuracy. Users labeled non-tappable elements more accurately towards the upper center of the interface, and tappable elements towards the bottom center of the interface.

Evaluation of the Model

The scatterplot of the tappability probability predicted by the model (the Y axis) versus the consistency in the human user labels (the X axis) for each element in the consistency dataset.

Acknowledgements

This research was joint work by Amanda Swangson, summer intern at Google, and Yang Li, a Research Scientist in Deep Learning and Human Computer Interaction.

Unifying Physics and Deep Learning with TossingBot

Tuesday, March 26, 2019

Posted by Andy Zeng, Student Researcher, Robotics at Google

The Challenges

Throwing depends on many factors, from how the object is picked up to its properties and dynamics.

Unifying Physics and Deep Learning

TossingBot starts out performing poorly (left), but progressively learns to grasp and toss overnight (right).

Generalizing to New Scenarios

TossingBot can generalize to new objects, and is more accurate at throwing than the average Googler.

TossingBot uses Residual Physics to throw objects to unforeseen locations.

Emerging Semantics from Interaction

TossingBot learns deep features that distinguish object categories without explicit supervision.

Limitations and Future Work

Acknowledgements

This research was done by Andy Zeng, Shuran Song (faculty at Columbia University), Johnny Lee, Alberto Rodriguez (faculty at MIT), and Thomas Funkhouser (faculty at Princeton University), with special thanks to Ryan Hickman for valuable managerial support, Ivan Krasin and Stefan Welker for fruitful technical discussions, Brandon Hurd, Julian Salazar and Sean Snyder for hardware support, Chad Richards and Jason Freidenfelds for helpful feedback on writing, Erwin Coumans for advice on PyBullet, Laura Graesser for video narration, and Regina Hickman for photography. An early preprint is available on arXiv.

Simulated Policy Learning in Video Models

Monday, March 25, 2019

Posted by Łukasz Kaiser and Dumitru Erhan, Research Scientists, Google AI

Paper: Model-Based Reinforcement Learning for Atari

Learning a SimPLe World Model

Main loop of SimPLe. 1) The agent starts interacting with the real environment. 2) The collected observations are used to update the current world model. 3) The agent updates the policy by learning inside the world model.
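
The three-step loop in the caption can be sketched on a toy problem. Everything below is a stand-in chosen to keep the example short and runnable: the environment is a five-state corridor rather than an Atari game, the world model is a lookup table rather than a video prediction network, and the policy is improved by value iteration inside the model rather than by the policy optimization used in the paper. All function names are hypothetical.

```python
import random
from typing import Dict, Tuple

N_STATES = 5        # toy corridor with states 0..4; reaching state 4 pays reward 1
ACTIONS = (-1, 1)   # step left or step right

Model = Dict[Tuple[int, int], Tuple[int, float]]
Policy = Dict[int, int]


def real_step(state: int, action: int) -> Tuple[int, float]:
    """The 'real environment': deterministic corridor dynamics."""
    nxt = min(N_STATES - 1, max(0, state + action))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def collect(policy: Policy, model: Model, rng: random.Random,
            episodes: int = 5, horizon: int = 10, eps: float = 0.3) -> None:
    """Steps 1 and 2: act in the real environment and update the world model."""
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            explore = state not in policy or rng.random() < eps
            action = rng.choice(ACTIONS) if explore else policy[state]
            nxt, reward = real_step(state, action)
            model[(state, action)] = (nxt, reward)  # lookup table as world model
            if reward > 0:
                break  # the episode ends at the goal
            state = nxt


def improve_policy(model: Model, sweeps: int = 50, gamma: float = 0.9) -> Policy:
    """Step 3: improve the policy using only transitions the model predicts."""
    value = {s: 0.0 for s in range(N_STATES)}
    policy: Policy = {}
    for _ in range(sweeps):
        for s in range(N_STATES):
            options = [(model[(s, a)][1] + gamma * value[model[(s, a)][0]], a)
                       for a in ACTIONS if (s, a) in model]
            if options:
                value[s], policy[s] = max(options)
    return policy


def simple_loop(iterations: int = 3, seed: int = 0) -> Tuple[Policy, Model]:
    """Alternate between real interaction and learning inside the model."""
    rng = random.Random(seed)
    model: Model = {}
    policy: Policy = {}
    for _ in range(iterations):
        collect(policy, model, rng)
        policy = improve_policy(model)
    return policy, model
```

The point of the structure is that `improve_policy` never calls `real_step`: all policy learning happens against the learned model, so the number of real interactions is bounded by what `collect` gathers.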

One example of an issue arising from stochasticity is seen when the SimPLe model is applied to Kung Fu Master. In the animation, the left panel is the output of the model, the middle is the ground truth, and the right panel is the pixel-wise difference between the two. Here the model's predictions deviate from the real game by spawning a different number of opponents.

SimPLe Efficiency

The number of interactions needed by the respective model-free algorithms (left - Rainbow; right - PPO) to match the score achieved using our SimPLe training method. The red line indicates the number of interactions used by our method.

SimPLe Success

Nearly pixel-perfect predictions can be made by SimPLe on Breakout (top) and Freeway (bottom). In each animation, the left panel is the output of the model, the middle is the ground truth, and the right panel is the pixel-wise difference between the two.

SimPLe Surprises

In Battlezone, we find the model struggles with predicting small, relevant parts, such as the bullet.

Conclusion

Acknowledgements

This work was done in collaboration with the University of Illinois at Urbana-Champaign, the University of Warsaw and deepsense.ai. We would like to give special recognition to paper co-authors Mohammad Babaeizadeh, Piotr Miłos, Błażej Osiński, Roy H Campbell, Konrad Czechowski, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker and Henryk Michalewski.

Reducing the Need for Labeled Data in Generative Adversarial Networks

Wednesday, March 20, 2019

Posted by Mario Lučić, Research Scientist and Marvin Ritter, Software Engineer, Google AI Zürich

Evolution of the generated samples as training progresses on ImageNet. The generator network is conditioned on the class (e.g., "great gray owl" or "golden retriever").

Paper: High-Fidelity Image Generation With Fewer Labels

Improvements via Semi-supervision and Self-supervision

An unlabeled image is randomly rotated and the network is tasked with predicting the rotation angle. Successful models need to capture semantically meaningful image features which can then be used for other vision tasks.
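
A minimal sketch of how such a rotation task can generate its own labels is shown below. It assumes rotations are restricted to multiples of 90 degrees and represents an image as a nested list of pixel values; `rotate90` and `make_rotation_example` are hypothetical helper names, not functions from the Compare GAN library.

```python
import random
from typing import List, Tuple

Image = List[List[int]]


def rotate90(img: Image, k: int) -> Image:
    """Rotate a 2D grid clockwise by k * 90 degrees."""
    out = [list(row) for row in img]  # copy so the input is left untouched
    for _ in range(k % 4):
        out = [list(row) for row in zip(*out[::-1])]
    return out


def make_rotation_example(img: Image, rng: random.Random) -> Tuple[Image, int]:
    """Return a rotated image and the rotation index the network should predict."""
    k = rng.randrange(4)  # 0, 90, 180 or 270 degrees
    return rotate90(img, k), k


image = [[1, 2], [3, 4]]
rotated, label = make_rotation_example(image, random.Random(0))
```

No human annotation is involved: the label is known because the rotation was applied by the data pipeline itself.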

Given a latent vector the generator network produces an image. In each row, linear interpolation between the latent codes of the leftmost and the rightmost image results in a semantic interpolation in the image space.
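
The interpolation in latent space described in the caption amounts to the sketch below. Only the latent codes are computed here; feeding each code to a trained generator is left out, and the two-dimensional codes and helper names are illustrative assumptions.

```python
from typing import List


def lerp(z0: List[float], z1: List[float], t: float) -> List[float]:
    """Linear interpolation between two latent codes, with t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolation_path(z0: List[float], z1: List[float], steps: int) -> List[List[float]]:
    """Evenly spaced latent codes from z0 to z1, endpoints included."""
    if steps < 2:
        raise ValueError("need at least the two endpoints")
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


# Each code in the path would be passed to the generator to render one image of a row.
path = interpolation_path([0.0, 2.0], [4.0, 0.0], steps=3)
```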

Compare GAN: A Library for Training and Evaluating GANs

Training on GPUs and TPUs.

Lightweight configuration via Gin (examples).

A plethora of data sets via the TensorFlow datasets library.

Conclusions and Future Work

Acknowledgments

Work conducted in collaboration with colleagues on the Google Brain team in Zürich, ETH Zürich and UCLA. We would like to thank our paper co-authors Michael Tschannen, Xiaohua Zhai, Olivier Bachem and Sylvain Gelly for their input and feedback. We would like to thank Alexander Kolesnikov, Lucas Beyer and Avital Oliver for helpful discussion on self-supervised learning and semi-supervised learning. We would like to thank Karol Kurach and Marcin Michalski for their major contributions to the Compare GAN library. We would also like to thank Andy Brock, Jeff Donahue and Karen Simonyan for their insights into training GANs on TPUs. The work described in this post also builds upon our work on “Self-Supervised Generative Adversarial Networks” with Ting Chen and Neil Houlsby.

Measuring the Limits of Data Parallel Training for Neural Networks

Tuesday, March 19, 2019

Posted by Chris Shallue, Senior Software Engineer and George Dahl, Senior Research Scientist, Google AI

Paper: Measuring the Effects of Data Parallelism in Neural Network Training

Universal Relationship Between Batch Size and Training Time

For all workloads we tested, we observed a universal relationship between batch size and training speed with three distinct regimes: perfect scaling (following the dashed line), diminishing returns (diverging from the dashed line), and maximal data parallelism (where the trend plateaus). The transition points between the regimes vary dramatically between different workloads.
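
One way to read the three regimes off a set of measurements is sketched below. The tolerance, the function name `classify_regimes` and, above all, the numbers are assumptions for illustration: the step counts are made up and are not measurements from the study.

```python
from typing import List


def classify_regimes(batch_sizes: List[int], steps: List[int], tol: float = 0.1) -> List[str]:
    """Label each increase in batch size by how much it reduced the steps needed."""
    labels = []
    for i in range(1, len(batch_sizes)):
        batch_ratio = batch_sizes[i] / batch_sizes[i - 1]
        step_ratio = steps[i - 1] / steps[i]  # speedup measured in training steps
        if step_ratio >= batch_ratio * (1 - tol):
            labels.append("perfect scaling")           # steps fall in proportion
        elif step_ratio > 1 + tol:
            labels.append("diminishing returns")       # steps fall, but less than hoped
        else:
            labels.append("maximal data parallelism")  # steps no longer fall
    return labels


# Made-up numbers: steps needed to reach a fixed out-of-sample error.
regimes = classify_regimes([64, 128, 256, 512, 1024], [8000, 4000, 2600, 2000, 1950])
```

Where the transitions fall in a list like this is exactly what the study found to vary so much between workloads.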

Optimizing Workloads

Left: A transformer neural network scales to much larger batch sizes than an LSTM neural network on the LM1B dataset. Right: The Common Crawl dataset does not benefit from larger batch sizes than the LM1B dataset, even though it is 1,000 times the size.

Future Work

Acknowledgements

The authors of this study were Chris Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig and George Dahl (Chris and Jaehoon contributed equally). Many researchers have done work in this area that we have built on, so please see our paper for a full discussion of related work.

A Summary of the Google Flood Forecasting Meets Machine Learning Workshop

Monday, March 18, 2019

Posted by Sella Nevo, Senior Software Engineer and Rainier Aliment, Program Manager

Panel on challenges and opportunities in flood forecasting, featuring (from left to right): Prof. Paolo Burlando (ETH Zürich), Dr. Tyler Erickson (Google Earth Engine), Dr. Peter Salamon (Joint Research Centre) and Prof. Dawei Han (University of Bristol).

An overview of research areas in flood forecasting addressed in the workshop.

Dr. Dhanya C. T. of IIT Delhi gave a talk on satellite precipitation error characterization.

Adarsh M. S., Assistant Director of the Indian Ministry of Water Resources, presented the role and challenges of India's Central Water Commission.

Prof. Andras Bardossy of the University of Stuttgart discussed variation in discharge series and the challenges this presents.

Frederik Kratzert of Johannes Kepler University presented recent work on hydrologic modeling using LSTMs.

Prof. Paul Bates of the University of Bristol gave a keynote on the potential uses of machine learning in inundation modelling.

Prof. Emmanouil Anagnostou of the University of Connecticut spoke about hyper-resolution hydrologic simulations at global scale.

Prof. Efrat Morin of the Hebrew University highlighted flood prediction challenges in dry climate regions.

Dr. Zachary Flamig of the University of Chicago presented NASA's new global flash flood prediction project.

Vova Anisimov presented our progress in hydraulic modeling.

Ami Weisel presented our research on remote discharge estimation.

Stephan Hoyer presented our work on a data-driven discretization approach to solving partial differential equations.

Jason Hickey presented our efforts using machine learning for precipitation prediction.

Avinatan Hassidim presented lessons learned from previous projects in Google, and how they apply to our flood forecasting efforts.

Acknowledgements

We would like to thank Avinatan Hassidim, Carla Bromberg, Doron Kukliansky, Efrat Morin, Gal Elidan, Guy Shalev, Jennifer Ye, Nadav Rabani and Sasha Goldshtein for their contributions to making this workshop happen.

Google Faculty Research Awards 2018

Friday, March 15, 2019

Posted by Maggie Johnson, VP, Education and Negar Saei, Program Manager, University Relations

Harnessing Organizational Knowledge for Machine Learning

Thursday, March 14, 2019

Posted by Alex Ratner, Stanford University and Cassandra Xia, Google AI

Paper: Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

In our example of a labeling function, rather than hand-labeling a data point (1), one utilizes an existing knowledge resource—in this case, a NER model (2)—together with some simple logic expressed in code (3) to heuristically label data.
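
The pattern in the caption, an existing knowledge resource plus a few lines of logic, might look like the sketch below. This is plain Python rather than the Snorkel API: `toy_ner` is a hypothetical stand-in for a real NER model, and the label values and the rule itself are assumptions made for illustration.

```python
from typing import List, Tuple

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def toy_ner(text: str) -> List[Tuple[str, str]]:
    """Stand-in for an existing NER model: returns (mention, entity type) pairs."""
    known_people = {"Ada Lovelace", "Alan Turing"}
    return [(name, "PERSON") for name in sorted(known_people) if name in text]


def lf_no_person_mentioned(text: str) -> int:
    """Labeling function: content that mentions no person is voted NEGATIVE.

    When a person is mentioned the heuristic has nothing to say, so it abstains.
    """
    entities = toy_ner(text)
    if not any(entity_type == "PERSON" for _, entity_type in entities):
        return NEGATIVE
    return ABSTAIN


votes = [lf_no_person_mentioned(t) for t in ("Weather update for Tuesday",
                                             "A biography of Alan Turing")]
```

A labeling function is allowed to be noisy and to abstain; its value comes from being cheap to write and from being combined with many others.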

Using Diverse Knowledge Sources as Weak Supervision

Heuristics and rules: e.g. existing human-authored rules about the target domain.

Topic models, taggers, and classifiers: e.g. machine learning models about the target domain or a related domain.

Aggregate statistics: e.g. tracked metrics about the target domain.

Knowledge or entity graphs: e.g. databases of facts about the target domain.

In Snorkel DryBell, the goal is to train a machine learning model (C), for example to do content or event classification over web data. Rather than hand-labeling training data to do this, in Snorkel DryBell users write labeling functions that express various organizational knowledge resources (A), which are then automatically reweighted and combined (B).
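
The reweighting and combination step (B) can be illustrated with an accuracy-weighted vote. This sketch is a simplification and not the method used in Snorkel DryBell: it assumes binary labels, a uniform prior, conditionally independent labeling functions, and accuracies that are simply given, whereas the system described in the post estimates them from unlabeled data. The function name and the numbers are hypothetical.

```python
import math
from typing import List

ABSTAIN = -1  # a labeling function may decline to vote


def combine_votes(votes: List[int], accuracies: List[float]) -> float:
    """Return P(label = 1) from binary votes weighted by each source's accuracy."""
    score = 0.0
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # abstentions carry no evidence
        weight = math.log(acc / (1 - acc))  # log-odds that this source is right
        score += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-score))


# A 90%-accurate source votes 1 and a 60%-accurate source votes 0.
prob = combine_votes([1, 0], [0.9, 0.6])
```

The more accurate source dominates the disagreement, and the resulting probabilistic label can then serve as a training target for the end model (C).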

Modeling the Accuracies to Combine & Repurpose Existing Sources

Transferring Non-Servable Knowledge to Servable Models

In many settings, users write labeling functions that leverage organizational knowledge resources that are not servable in production (a)—e.g. aggregate statistics, internal models, or knowledge graphs that are too slow or expensive to use in production—in order to train models that are only defined over production-servable features (b), e.g. cheap, real-time web signals.

Next Steps

Acknowledgments

This research was done in collaboration between Google, Stanford, and Brown. We would like to thank all the people who were involved, including Stephen Bach (Brown), Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Souvik Sen, Braden Hancock (Stanford), Houman Alborzi, Rahul Kuchhal, Christopher Ré (Stanford) and Rob Malkin.

An All-Neural On-Device Speech Recognizer

Tuesday, March 12, 2019

Posted by Johan Schalkwyk, Google Fellow, Speech Team

Paper: Streaming End-to-End Speech Recognition for Mobile Devices

This video compares the production, server-side speech recognizer (left panel) to the new on-device recognizer (right panel) when recognizing the same spoken sentence. Video credit: Akshay Kannan and Elnaz Sarbar

A Bit of History

Recurrent Neural Network Transducers

Representation of an RNN-T, with the input audio samples, x, and the predicted symbols, y. The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_{u-1}, ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers. Image credit: Chris Thornton

Offline Recognition

Acknowledgements:

Raziel Alvarez, Michiel Bacchiani, Tom Bagby, Françoise Beaufays, Deepti Bhatia, Shuo-yiin Chang, Zhifeng Chen, Chung-Chen Chiu, Yanzhang He, Alex Gruenstein, Anjuli Kannan, Bo Li, Wei Li, Qiao Liang, Ian McGraw, Patrick Nguyen, Ruoming Pang, Rohit Prabhavalkar, Golan Pundak, Kanishka Rao, David Rybach, Tara Sainath, Haşim Sak, June Yuan Shangguan, Matt Shannon, Mohammadinamul Sheik, Khe Chai Sim, Gabor Simko, Trevor Strohman, Mirkó Visontai, Ron Weiss, Yonghui Wu, Ding Zhao, Dan Zivkovic, and Yu Zhang.
