Pierre Sermanet

Research Scientist @ Google Brain
Self-Supervised Learning, Computer Vision & Robotics

Grounding Language in Play

Corey Lynch and Pierre Sermanet

We present a simple and scalable approach for controlling robots with natural language: play through teleoperation, then answer “how do I go from start to finish?” for random episodes. We can then type in commands in real time.

Hooking up our English-trained model to a pre-trained language embedding, itself trained on lots of text in many languages, not only improves control but also allows commanding the model in 16 languages.

Combining natural language with play provides a breadth of skills with no tasks determined in advance. This yields flexible task specification. For example, we can compose tasks on the fly: “pick up the object”, then “put the object in the trash”.
Learning Latent Plans from Play

Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet

How to scale up multi-task learning?
Self-supervise plan representations from lots of cheap unlabeled play data (no RL was used).
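
A minimal sketch of how unlabeled play could be turned into training examples, assuming that random windows are cut from the play log and that the last observation of each window is treated as the goal that was reached. The function name, the log format, and the window lengths are hypothetical and chosen only for illustration.

```python
import random


def sample_play_windows(play_log, num_windows, min_len=2, max_len=32, seed=0):
    """Cut random (start, goal, actions) windows out of one long play log.

    play_log: list of (observation, action) pairs recorded during teleoperation.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    windows = []
    for _ in range(num_windows):
        # Pick a window length, then a start index that keeps the window in range.
        length = rng.randint(min_len, min(max_len, len(play_log)))
        start = rng.randint(0, len(play_log) - length)
        segment = play_log[start:start + length]
        windows.append({
            "start_obs": segment[0][0],
            # The state that was actually reached is relabeled as the goal.
            "goal_obs": segment[-1][0],
            "actions": [action for _, action in segment],
        })
    return windows


# Toy play log: observations are just timesteps, actions are string tags.
play_log = [(t, f"a{t}") for t in range(100)]
windows = sample_play_windows(play_log, num_windows=5)
```

Each window is a self-labeled example of going from a start to a finish, so no task needs to be defined or annotated before collecting data.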
Self-Supervised Actionable Representations

Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, Pierre Sermanet @ IROS 2018

We learn continuous control entirely from raw pixels.
We use a multi-frame TCN to self-supervise task-agnostic representations from vision only, using 2 slightly different views of the cheetah.
Then, using RL on top of our embeddings, we learn the cheetah task almost as well as if we were using the cheetah's true proprioceptive states.

Time-Contrastive Networks (TCN)

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine @ ICRA 2018

We propose a general self-supervised method for learning representations from raw unlabeled videos.
We show that the self-supervised representations are rich enough to perform robotic tasks.
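
One way to get a training signal from unlabeled video is a triplet loss over time: frames close in time are pulled together in embedding space and frames far apart in time are pushed apart. The sketch below assumes that formulation. The margin, the window sizes, and the function names are illustrative assumptions, not values taken from this page.

```python
import numpy as np


def sample_triplet_indices(num_frames, pos_window=2, neg_gap=10, rng=None):
    """Pick (anchor, positive, negative) frame indices from a single video."""
    if num_frames < 2 * neg_gap + 1:
        raise ValueError("video too short for the requested negative gap")
    rng = rng if rng is not None else np.random.default_rng(0)
    t = int(rng.integers(num_frames))
    # Positive: a different frame within pos_window steps of the anchor.
    pos_candidates = [
        i for i in range(max(0, t - pos_window), min(num_frames, t + pos_window + 1))
        if i != t
    ]
    # Negative: any frame at least neg_gap steps away from the anchor.
    neg_candidates = [i for i in range(num_frames) if abs(i - t) >= neg_gap]
    return t, int(rng.choice(pos_candidates)), int(rng.choice(neg_candidates))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on squared distances; zero once the negative is far enough away."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


# Toy embeddings: the first negative is well separated, the second is not.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
loss_far = triplet_loss(anchor, positive, np.array([[1.0, 0.0]]))
loss_near = triplet_loss(anchor, positive, np.array([[0.1, 0.1]]))
```

With several camera views, the positive can instead come from another view at the same instant, which encourages viewpoint invariance.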

We use the distance in our learned embedding space to a video demonstration as a reward. An RL algorithm can learn to perform a pouring task using this reward. The robot has learned to pour in only 9 iterations using a single video demonstration, while never receiving any labels.
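
A minimal sketch of a distance-based reward, assuming the robot rollout and the demonstration have already been embedded and are compared step by step. The squared Euclidean distance and the function name are assumptions made for illustration.

```python
import numpy as np


def embedding_reward(robot_embeddings, demo_embeddings):
    """Per-step reward: negative squared distance to the demo embedding at the same step."""
    robot = np.asarray(robot_embeddings, dtype=float)
    demo = np.asarray(demo_embeddings, dtype=float)
    # Compare only the overlapping steps if the two sequences differ in length.
    steps = min(len(robot), len(demo))
    return -np.sum((robot[:steps] - demo[:steps]) ** 2, axis=-1)


demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
perfect = embedding_reward(demo, demo)                        # matches the demo exactly
shifted = embedding_reward(demo + np.array([0.0, 1.0]), demo)  # off by 1 in one dimension
```

The reward is highest (zero) when the rollout matches the demonstration in embedding space, so an RL algorithm that maximizes it is pushed toward reproducing the demonstrated behavior.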

We also show that a robot can teach itself how to imitate people: by training a single TCN on videos of both humans and robots performing random motions, the TCN model is able to find correspondences between humans and robots, despite never being given any label correspondences.

Unsupervised Perceptual Rewards

Pierre Sermanet, Kelvin Xu, Sergey Levine @ RSS 2017

We propose learning unsupervised perceptual rewards that can be fed to an RL system and show it is able to learn a robotic task such as door opening from a few human demonstrations.
Visual Attention

Pierre Sermanet, Andrea Frome, Esteban Real @ ICLR 2015 (workshop)

We demonstrate a foveated attention RNN that is able to perform fine-grained classification.
Tracking naturally emerges from our foveated model when run on videos, even though it was only trained on still images.
Inception / GoogLeNet

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich @ CVPR 2015

A deep architecture for computer vision. Our model obtained 1st place for the classification and detection tasks in the 2014 ImageNet Challenge.
Dogs vs. Cats Kaggle challenge

Pierre Sermanet (2014)

1st place in an image classification Kaggle challenge between dog and cat images. Most of the top entries are based on our OverFeat model.
OverFeat

Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun @ ICLR 2014

This model obtained 1st place in the 2013 ImageNet object localization challenge. The model and pre-trained features were later released to the public.

OverFeat has been used by Apple for on-device face detection in iPhones: blogpost
Pedestrian Detection

Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann LeCun @ CVPR 2013

State-of-the-art results on pedestrian detection datasets using deep ConvNets in the EBLearn framework.
Convolutional Neural Networks Applied to House Numbers Digit Classification

Pierre Sermanet, Soumith Chintala, Yann LeCun @ ICPR 2012

State-of-the-art results in house numbers classification using deep ConvNets.

Traffic Sign Recognition

Pierre Sermanet, Yann LeCun @ IJCNN 2011

This deep model obtained 2nd place in a traffic sign recognition challenge using the EBLearn framework. It uses skip connections in deep ConvNets to better combine low-level and high-level learned features.
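
A minimal sketch of the skip-connection idea, assuming the classifier receives the flattened feature maps of an early stage concatenated with those of a later stage. The shapes and the function name are hypothetical, and the maps are placeholders rather than outputs of a trained network.

```python
import numpy as np


def multi_scale_features(stage1_maps, stage2_maps):
    """Flatten and concatenate feature maps from two stages into one classifier input."""
    low = np.asarray(stage1_maps).reshape(-1)    # low-level features (skip branch)
    high = np.asarray(stage2_maps).reshape(-1)   # high-level features (main branch)
    return np.concatenate([low, high])


stage1 = np.ones((4, 8, 8))     # hypothetical early stage: 4 channels of 8x8
stage2 = np.zeros((16, 2, 2))   # hypothetical later stage: 16 channels of 2x2
features = multi_scale_features(stage1, stage2)
print(features.shape)  # (320,)
```

Without the skip branch, the classifier would only see the 64 high-level values; with it, fine local detail from the early stage is available alongside the more global later features.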
Unsupervised Convolutional Feature Hierarchies

Koray Kavukcuoglu, Pierre Sermanet, Y-Lan Boureau, Karol Gregor, Michael Mathieu, Yann LeCun @ NIPS 2010

An unsupervised method for learning multi-stage hierarchies of sparse convolutional features. One of the few instances from this period where unsupervised pretraining improved results in a supervised task.
EBLearn

Pierre Sermanet, Koray Kavukcuoglu, Yann LeCun @ ICTAI 2009
Additional help from Soumith Chintala.

A C++ deep learning framework similar to Torch and used for multiple state-of-the-art results in computer vision.
Teaching Assistant for NYU Robotics class

Pierre Sermanet, Yann LeCun (2009)

LAGR: Learning Applied to Ground Robots

Yann LeCun, Urs Muller, Pierre Sermanet, Marco Scoffier, Chris Crudele, Beat Flepp, Ayse Erkan, Matt Grimes, Raia Hadsell, Koray Kavukcuoglu, Marc'Aurelio Ranzato, Jan Ben, Sumit Chopra, Jeff Han, Marc Peyote, Ilya Rosenberg, Yury Sulsky

A DARPA challenge where the NYU-NetScale team developed ConvNets for long-range off-road navigation from 2004 to 2008.
Learning Long-Range Vision for Autonomous Off-Road Driving

Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Marco Scoffier, Koray Kavukcuoglu, Urs Muller, Yann LeCun @ JFR 2009

An overview paper of our self-supervised deep learning vision model.
Collision-Free Off-Road Robot Navigation

Pierre Sermanet, Raia Hadsell, Marco Scoffier, Matt Grimes, Jan Ben, Ayse Erkan, Chris Crudele, Urs Muller, Yann LeCun @ JFR 2009

An overview paper of our navigation system designed to naturally handle errors and outputs coming out of a deep vision model. This model decouples the fast and short-range navigation from the slow and long-range navigation to achieve robustness.
Learning Maneuver Dictionaries for Ground Robot Planning

Pierre Sermanet, Marco Scoffier, Chris Crudele, Urs Muller, Yann LeCun @ ISR 2008

Instead of computing the theoretical dynamics of a vehicle, we propose to simply record the observed dynamics while a human operator "plays" with the robot, essentially trying all possible moves. At test time, the model has a bank of observed possible trajectories for every state of the motors. Trajectories leading to collisions are discarded, while the fastest available trajectory is selected. While we observed many collisions using the baseline system, we did not observe collisions after introducing this model.
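
The selection step described above can be sketched as follows, assuming the bank is a mapping from a discretized motor state to recorded trajectories and that a collision check on individual points is available. The `Maneuver` type, the `select_maneuver` function, and the toy obstacle are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Maneuver:
    points: List[Point]  # observed (x, y) positions relative to the robot
    speed: float         # average speed observed while recording


def select_maneuver(
    bank: Dict[str, List[Maneuver]],
    motor_state: str,
    collides: Callable[[Point], bool],
) -> Optional[Maneuver]:
    """Return the fastest recorded maneuver for this motor state that hits nothing."""
    candidates = bank.get(motor_state, [])
    # Discard every trajectory with at least one point inside an obstacle.
    safe = [m for m in candidates if not any(collides(p) for p in m.points)]
    if not safe:
        return None
    return max(safe, key=lambda m: m.speed)


bank = {
    "slow_forward": [
        Maneuver(points=[(0.5, 0.0), (1.5, 0.0)], speed=2.0),  # fast, but runs into the wall
        Maneuver(points=[(0.5, 0.2), (0.8, 0.6)], speed=1.2),  # veers left, stays clear
        Maneuver(points=[(0.2, 0.0), (0.4, 0.0)], speed=0.5),  # creeps forward
    ]
}


def wall_ahead(point: Point) -> bool:
    """Toy obstacle: a wall occupying everything beyond x = 1.0."""
    return point[0] > 1.0


best = select_maneuver(bank, "slow_forward", wall_ahead)
```

Because every candidate was actually driven by the robot, the planner never proposes a motion the vehicle cannot execute from its current motor state.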
Mapping and Planning under Uncertainty in Mobile Robots with Long-Range Perception

Pierre Sermanet, Raia Hadsell, Marco Scoffier, Urs Muller, Yann LeCun @ IROS 2008

A hyperbolic-polar coordinate mapping system that is naturally suited to handle imprecisions in long-range visual navigation.
Deep Belief Net Learning in a Long-Range Vision System

Raia Hadsell, Ayse Erkan, Pierre Sermanet, Marco Scoffier, Urs Muller, Yann LeCun @ IROS 2008

Self-supervised long-range visual navigation with deep ConvNets.
Online Learning for Offroad Robots

Raia Hadsell, Pierre Sermanet, Ayse Naz Erkan, Jan Ben, Jefferson Han, Beat Flepp, Urs Muller, Yann LeCun @ RSS 2007

Online adaptation of long-range vision by self-supervising with short-range stereo vision.
EUROBOT 2004 Competition

Computer vision, navigation and behaviors by Pierre Sermanet, Philippe Rambert, Jean-Baptiste Mouret
Entire team: Evolutek

Vision-based behaviors in a robot-rugby challenge.

Acknowledgments

This article was prepared using the Distill template.