In this podcast, Prof. Sergey Levine talked about the bottlenecks to generalization in reinforcement learning, why simulation is doomed to succeed, and how to pick good research problems.
Transcript generated by Whisper
I do think that in science, it is a really good idea to sometimes see how extreme of
a design can still work because you learn a lot from doing that.
And this is by the way, something I get a lot of comments on this, like, you know, I'll
be talking to people and they'll be like, well, we know how to do like robotic grasping
and we know how to do inverse kinematics and we know how to do this and this.
So why don't you like use those parts?
And it's like, yeah, you could, but if you want to understand the utility, the value
of some particular new design, it kind of makes sense to really zoom in on that and
really isolate it and really just understand its value instead of trying to put in all
these crutches to compensate for all the parts where we might have better existing
kind of ideas.
Hey there, I'm your host, Kanjun, and we are Generally Intelligent, an independent
research lab developing AI agents that mirror the fundamentals of human-like intelligence
and that can learn to safely solve problems in the real world.
On our podcast, we interview researchers about their behind the scenes ideas, opinions,
and intuitions that are hard to share in papers and talks.
We hope you learn as much as we have in our quest to understand and build the mind.
In this episode, we're interviewing Sergey Levine, one of the pioneers of modern deep
He's an assistant professor of EECS at UC Berkeley, and his research focuses on developing
general purpose algorithms for autonomous agents to be able to learn to solve any task.
We're really excited to have Sergey.
Welcome to the podcast.
We always start with, how did you develop your initial research interests and how have
they evolved over time?
I know you took a class from Andrew Ng, where you decided to switch mid-grad school
to machine learning.
What were you doing before, and then what happened after?
When I started graduate school, I wanted to work on computer graphics.
I was always very interested in virtual worlds, CG film, video games, that sort of thing.
And it was just really fascinating to me how you could essentially create kind of a synthetic
environment in a computer.
So I really wanted to figure out how we could advance the technology of doing that.
When I thought about which area of computer graphics to concentrate in, I decided on computer
animation, specifically character animation, because by that point, at least in 2009, we
already had pretty good technology for simulating physics, for simulating how inanimate objects
And really the big challenge was getting plausible behavior out of virtual humans.
And when I started working on this, I pretty quickly discovered that, you know, essentially
the bottleneck with virtual humans is simulating their minds.
All the way from very basic things like, you know, how do you decide how to move your legs
if you want to climb some stairs, to more complex things like, how should you conduct
yourself if you're playing a soccer game with teammates and opponents?
And this naturally leads us to think about decision-making and AI, you know, in my case,
initially in service to creating plausible virtual humans, but once I realized how big
that problem really was, it became natural to think of it more as just developing artificial
And certainly the initial wave of the deep learning revolution, which started around
that same time, was, you know, a really big part of what got me to switch over from pure
computer graphics research into things that involved a combination of control and machine learning.
Was that in 2011, 2012?
So my first paper on, well, my first machine learning paper was actually earlier than that,
but my first paper that involved something that we might call deep learning was in 2012.
And it was actually right around the same time as the DeepMind Atari work came out.
It was actually a little before that.
And it focused on using what today we would call deep reinforcement learning, which back
then was not really a term that was widely used, for locomotion behaviors for 3D humanoids.
And then what happened after that?
I worked on this problem for a little while for the last couple of years in grad school.
And then after I finished my PhD, I started looking for postdoc jobs because, you know,
I was really only about partway through switching from graphics to machine learning.
So I wasn't particularly well established really in either community at that point.
This is perhaps a lesson for the PhD students that are listening to this, that if you want
to switch, doing it in like the fourth year of your PhD is a little chancy because I sort
of ended up with one foot on either side of that threshold and nobody really knowing me
So I decided to do a postdoc in some area where I could get a little bit more established
in machine learning.
So, and I wanted to stay in the Bay Area for personal reasons.
So I got in touch with Professor Pieter Abbeel, who's now my colleague here at UC Berkeley,
about a postdoc position.
It was kind of interesting because I interviewed for this job and I thought the interview went
horribly because, well, really it wasn't my fault because when I showed up for the interview
at UC Berkeley with Pieter's lab, they had moved the deadline for IROS, which was a major
It was supposed to be earlier.
So after the deadline and then all, you know, all the students would listen to my talk and
Pieter presumably would be a little relaxed.
They moved the deadline to be that evening.
So everyone listening to my talk was kind of stressed out.
I could tell that, like, their minds were elsewhere. Afterwards,
there was a certain remark, I'm sure Pieter won't mind me sharing this, but he mentioned
something to the effect of like, Oh, you know, I don't think that I want my lab working on
all this like animation stuff.
So I kind of felt like I really blew it, but he gave me a call a few weeks later and offered
me the job, which was fantastic.
And I guess it was kind of generous on his part because at the time he was presumably taking
a little bit of a chance, but it worked out really well.
And I switched over to robotics and that was actually a very positive change in that a
lot of the things that I was trying to figure out in computer animation, they would be tested
more rigorously and more thoroughly in the robotics domain because there you really had
to deal with the complexity of the real world.
That makes sense.
Now, how do you think about all the kind of progress with generative environments and
Do you feel like the original problems you were working on in animation are largely solved
or do you feel like there's a lot more to do there?
Yeah, that's a good question.
So I took a break from the computer graphics world for a while, but then over the last
five years, there was actually a student in my lab, Jason Peng, who's now a professor
at Simon Fraser University in Canada.
He just graduated last year and he more or less in his PhD, I would say basically solved
the problems that I had tried in my own PhD a decade prior.
I think he did a much better job with it than I ever did.
So he had several works that essentially took deep RL techniques and combined them with
large scale generative adversarial networks to more or less provide a pretty comprehensive
solution to the computer animation problem.
So his latest work, which was actually done in collaboration with NVIDIA, the kind of
approach that he adopts is he takes a large data set of motion capture data.
You can kind of think of it as like all the motion capture data we can get our hands on.
It trains a latent variable GAN on it that will generate human-like motion and embed
it into a latent space that will provide a kind of a higher level space for control.
So you can sort of think of this method as producing this model where you feed it in random numbers
and for every random number, it'll produce some natural motion, running, jumping, whatever.
And then those random numbers serve as a higher level action space, so now
everything in that latent space is plausible motion, and then you can train some higher level policy
that will steer it in that latent space.
And that actually turns out to be a really effective way to do animation because once
you get that latent space, now you can forget about worrying about whether the motion is
It'll always be plausible and realistic, and now you can just be entirely goal-driven in
that latent space.
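A toy sketch of the two-level structure described above: a frozen generator turns any latent vector into a bounded motion, and a higher level policy only chooses latents. This is not the method from Jason Peng's work. The linear-plus-tanh decoder `decode_motion`, the random-search `high_level_policy`, and all constants are hypothetical stand-ins, chosen so the example runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained motion generator. Every latent z
# decodes to a 2-D step of length at most MAX_STEP, so any choice of z
# yields a "plausible" motion by construction.
MAX_STEP = 0.5
LATENT_DIM = 4
W = rng.normal(size=(2, LATENT_DIM))  # frozen decoder weights (assumed pretrained)


def decode_motion(z):
    """Low level: map a latent vector to a bounded 2-D displacement."""
    raw = np.tanh(W @ z)                  # each component lies in [-1, 1]
    return MAX_STEP * raw / np.sqrt(2.0)  # so the norm is at most MAX_STEP


def high_level_policy(position, goal, n_candidates=64):
    """High level: pick the latent whose decoded step ends closest to the goal.

    The zero latent (which decodes to "stand still") is always a candidate,
    so the chosen step never moves the character farther from the goal.
    """
    candidates = np.vstack(
        [np.zeros(LATENT_DIM), rng.normal(size=(n_candidates, LATENT_DIM))]
    )
    ends = np.array([position + decode_motion(z) for z in candidates])
    best = np.argmin(np.linalg.norm(ends - goal, axis=1))
    return candidates[best]


def rollout(start, goal, n_steps=40):
    """Steer purely in latent space; the decoder keeps every step bounded."""
    position = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    steps = []
    for _ in range(n_steps):
        step = decode_motion(high_level_policy(position, goal))
        position = position + step
        steps.append(step)
    return position, np.array(steps)
```

The design point this illustrates is the separation of concerns: the goal-seeking code never checks whether a motion is valid, because it can only select among outputs of the decoder.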
That's very clever.
He has a demo in SIGGRAPH this past year where he has these virtual characters, you know,
doing sword fighting and jumping around and so on.
This is what, in my PhD, if someone showed it to me, I would have said this is like science
It was kind of like the dream of the computer graphics community for a long time.
And I think Jason really did a fantastic job of it.
So if anyone listening to this is interested in computer animation, look up
the work of Jason Peng.
I think he'll do really well going forward.
He's part-time at NVIDIA too, so he's doing some mysterious things that he hasn't, he
has been very cagey with the details on it, but I think there might be something big coming
out in the imminent future with that.
That's really interesting.
So you feel like your PhD work is at least solved by Jason?
Yeah, I think he kind of, yeah, he kind of took care of that one.
When you first started in robotics, what did you feel like were the important
Robotics traditionally is thought of as very much like a geometry problem plus a physics
If you open up like the Springer Handbook of Robotics or a more kind of classical advanced
robotics course textbook, a lot of what you will learn about has to do with understanding
the geometries of objects and modeling the mechanics of articulated rigid body systems.
And this approach took us very far from the earliest days of robotics in the 50s and 60s
all the way to the kind of robots that are used in manufacturing all the time today.
In some ways, the history of robotic technology is one of building abstractions, taking those
abstractions as far as we can take them and then hitting some really, really difficult
And the really difficult wall that robotics generally hits with this kind of approach
has to do with situations that are not as fully and cleanly structured as this rigid
body structure would have us believe.
Not just because there are physical phenomena that are outside of this model, but also because
they have challenges having to do with characterization and identifiability.
So if you're, you know, let's say you have a robot in your home that's supposed to just
like clean up your home and put away all the objects, even if those are rigid objects that
are in principle within that abstraction, you don't know exactly what shape they are,
what their mass distribution is and all this stuff and perception and things like that.
So all of those things taken together more or less put us in this place where the clean
abstraction kind of like really doesn't give us anything.
The analogy here is in the earliest days of computer vision, the first thing that people
basically thought of when they thought about how to do computer vision is that, well, computer
vision is like the inverse graphics problem.
So if you believe the world is made out of shapes, you know, they have geometry, let's
figure out their vertices and their edges and so on.
And people kind of tried to do this for a while and it was very reasonable and very
sensible from an engineering perspective until in 2012, Alex Krizhevsky had a solution to
the ImageNet challenge that didn't use any of that stuff whatsoever and just used a
giant neural net.
So I kind of suspect that like the robotics world is kind of just getting to that point
like right around in the last half a decade or so.
And so when you first joined Pieter Abbeel's lab as a postdoc, you kind of saw the world
of robotics where everything was these like rigid body abstractions and what were you
Like were you like, okay, well, it seems like, you know, nobody's really using deep learning.
No one's really doing end to end learning.
So were you like, I'm going to do that or kind of how did you kind of?
So actually I started working with a student who his most recent accomplishment was to
actually take ideas that were basically rooted in this kind of geometric approach to robotics
and extend them somewhat so they could accommodate deformable objects, ropes and cloth, that
sort of thing.
So they had been doing laundry folding and knot tying.
And I won't go too much into the technical details, but it was kind of in the same wheelhouse
as these geometry based methods that have been pioneered for rigid objects and grasping
in the decade prior and with some clever extensions, they could do some knot tying
and things like that.
And I started working on how we could kind of more or less throw all that out and replace
it with end to end learning from deep nets.
And I intentionally wanted to make it like a little bit extreme.
So instead of trying to like gently turn these geometry based methods into ones that use
learning more and more, we decided that we would actually just completely do,
you know, the maximally end to end thing.
The student in question was John Schulman, and he ended up doing his PhD on end to end
deep reinforcement learning and later on developed the most widely used reinforcement learning method
today, which is PPO.
So he now works at OpenAI and perhaps his most recent accomplishment is something that
some of your viewers might have heard about.
It's called ChatGPT.
But that's maybe a story for another time.
So we did some algorithms work there.
And then in terms of robotics applications, I worked with another student that some of
you who are listening might also know, Chelsea Finn. She's now a professor at Stanford.
And there we wanted to see if we could introduce kind of the latest and greatest convolutional
neural network techniques to directly control robot motion.
And again, we chose to have a very end to end design there where we took the PR2 robot
and we basically looked through the PR2 manual and we found the lowest level of control you
could possibly have.
You can't command motor torques exactly, but you can command what's called motor effort,
which apparently is roughly proportional to current on the electric motors.
So I did a little bit of coding to set up a controller that would directly command these
efforts at some reasonable frequency.
Chelsea coded up the ConvNet component, we wired it all together, managed to get it training
end to end.
And then we set up a set of experiments that were really intentionally meant to evaluate
whether the end to end part really mattered.
So this was not, you know, these days, this would be something that people would more
or less take for granted.
It's like, yeah, of course, end to end is better than plugging in a bunch of geometric
But we really wanted to convince people this was true.
So we ran experiments where we separated out localization from control.
We had like more traditional computer vision techniques, geometry based techniques.
And we tried to basically see whether going directly from raw pixels all the way to these
motor effort commands could do better.
And we set up some experiments that would actually validate that.
So we had these experiments where the robot would take a little colored shape and insert
it into a shape sorting cube.
So it's a children's toy.
You're supposed to match the shape to the shape of the hole.
And one of the things that we were able to actually demonstrate is that the end to end
approach was in fact better because essentially it could trade off errors more smartly.
So if you're inserting this shape into a hole, you don't really need to be very, very accurate
and figure out where the hole is vertically because you'll just be pushing it down all
So that's more robust to inaccuracy.
But then in terms of errors in the other direction, there, it's a little more sensitive.
So we could show that we could actually do better with end to end training than if we
had localized the hole separately and then commanded a separate controller.
That work actually resulted in a paper that was basically the first deep reinforcement
learning paper for image-based real world robotic manipulation.
It was also rejected numerous times by robotics reviewers because at the time this was a little
bit of a taboo to do too many neural nets, but eventually it ended up working out.
One thing I'm really curious about with this end to end experiment is, did it just work?
Like you set everything up, you coded the CNN.
Was it really tricky to get working or did it work much better than you expected?
It's always very difficult to disentangle these things in science because like obviously
it didn't just work on the first try, but a big part of why it didn't work on the first
try had to do with a bunch of coding things.
For example, this was sort of before there were really, really nice
clean tools for GPU based acceleration of ConvNets.
So back then Caffe was one of those things that everybody would use and it was very difficult
for us to get this running onboard the robot.
So we actually had some fairly complicated system where the ConvNet would actually
run on one machine.
Then in the middle of the network, it would send the activations over to a different machine
on board the robot for the real time controller.
So it was still an end to end neural net, but like half the neural net was running on
one computer, half of it was running on another computer and then the gradients would have
to get passed back.
So it was like, it was a little complicated and the bulk of the challenges we had had
more to do with systems design and that sort of thing.
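A minimal sketch of the split-network arrangement described above, assuming a two-layer network: one half stands in for the off-board vision machine, the other for the robot's real-time computer. Activations cross the boundary on the forward pass and gradients cross it on the backward pass. The class names, layer sizes, and squared-error loss are hypothetical; no networking or Caffe code is shown, only the handoff of arrays that a real system would send between machines.

```python
import numpy as np

rng = np.random.default_rng(0)


class VisionHalf:
    """First half of the network (imagined on the off-board GPU machine)."""

    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(scale=0.1, size=(n_hidden, n_in))

    def forward(self, x):
        self.x = x
        self.h = np.tanh(self.W @ x)
        return self.h  # activations that would be sent to the robot

    def backward(self, grad_h):
        # grad_h is the gradient received back from the other machine.
        grad_pre = grad_h * (1.0 - self.h ** 2)  # derivative of tanh
        return np.outer(grad_pre, self.x)        # dLoss/dW


class ControlHalf:
    """Second half of the network (imagined on the robot's computer)."""

    def __init__(self, n_hidden, n_out):
        self.V = rng.normal(scale=0.1, size=(n_out, n_hidden))

    def forward(self, h):
        self.h = h
        return self.V @ h  # e.g. motor effort commands

    def backward(self, grad_out):
        grad_V = np.outer(grad_out, self.h)  # dLoss/dV
        grad_h = self.V.T @ grad_out         # gradient sent back upstream
        return grad_V, grad_h


def loss_and_grads(vision, control, x, target):
    """One end-to-end pass: forward across both halves, then backward."""
    h = vision.forward(x)        # machine 1 -> machine 2
    u = control.forward(h)
    err = u - target
    loss = 0.5 * float(err @ err)
    grad_V, grad_h = control.backward(err)
    grad_W = vision.backward(grad_h)  # machine 2 -> machine 1
    return loss, grad_W, grad_V
```

Because the only things exchanged are `h` and `grad_h`, the gradients match what a single-machine implementation of the same network would compute, which is why the split is still end to end.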
But part of why it did basically work once we debugged things was that the algorithm itself
was based on things that I had developed for previous projects just without the computer
So going from low dimensional inputs to actions was something that had already been developed
and basically worked.
This was a continuation of my PhD work.
So a lot of the challenges that we had had to do with getting the systems parts right.
It may also have to do with getting the design of a ConvNet to be effective in relatively
lower data regimes because these robot experiments, they would collect maybe four or five hours
So one of the things that Chelsea had to figure out is how to get a neural net architecture
that could be relatively efficient and she basically used a proxy task that we designed
where instead of actually iterating on the full control task on the real robot, we would
have a little pose detection task that we would use to just pretrain the network and
that she could iterate on just entirely offline.
So she would test out the ConvNet on that, get it working properly.
And then once we knew that it worked for this task, then we kind of knew that it was roughly
good enough in terms of sample efficiency and then we just retrained it end to end.
That makes sense.
So the moral of the story to folks who might be listening and working on these kinds of
robotic learning systems, it does actually help to break it up into components, even
if you're doing end to end in the end, because you can kind of get the individual neural
net components all working nicely and then just redo it with the end to end thing.
And that does tend to take out a lot of the pain.
It sounds like you kind of got the components working first.
It's interesting you made this comment about just making the problem a lot more extreme
when you were talking about a student using thin plate splines.
And I'm curious, is this an approach you've used elsewhere, kind of making the problem
much more extreme and throwing out everything?
I think it's a good approach.
I mean, it depends a little bit on what you want to do, because if you really want to
build a system that works really well, then of course you want to sort of put everything
in the kitchen sink in there and just like use the best tools for every piece of it.
But I do think that in science, it is a really good idea to sometimes see how extreme of
a design can still work because you learn a lot from doing that.
And this is, by the way, something I get a lot of comments on this, like, you know, I'll
be talking to people and they'll be like, well, we know how to do like robotic grasping
and we know how to do inverse kinematics and we know how to do this and this.
So why don't you like use those parts?
And it's like, yeah, you could.
But if you want to understand the utility, the value of some particular new design, it
kind of makes sense to really zoom in on that and really isolate it and really just understand
its value instead of trying to put in all these crutches to compensate for all the parts
where we might have better existing kind of ideas. You know, as an analogy, if you want
to design better engines for electric cars, like maybe you do build just like not like
a fancy hybrid car, but really just like an electric race car or something, just see like
how fast can it go.
And then whatever knowledge you develop there, like, yeah, you can then put it in, you know,
combine it with all these pragmatic and very sober decisions and make it work afterwards.
That's really interesting.
So kind of do the hardest thing first, do the most extreme thing first.
So after you publish this extremely controversial paper that gets rejected everywhere, what
What were you interested in next?
There were a few things that we wanted to do there, but perhaps the most important one
that we came to realize is, and this is going to lead to things that in some ways I'm still
working on, is that, of course, we don't really want end-to-end robotic deep learning systems
that just train with like four or five hours of data.
The full power of learning is really only realized once you have very large amounts
of data that can enable broad generalization.
So this was a nice technology demo in that it showed that deep nets could work with robots
And of course, you know, many people took that up and there's a lot more work on using
deep nets for manipulation now, but it didn't realize the full promise of deep learning
because the full promise of deep learning required large data sets.
And that was really the next big frontier.
So what I ended up working on after this was some work that was done at Google.
So I started at Google in 2015 and there we wanted to basically scale up deep robotic
And what we did is we, again, we took a fairly extreme approach.
We intentionally chose not to do all sorts of fancy transfer learning and so on.
We went for like the pure brute force thing and we put 18 robots in a room and we turned
them on for months and months and months and had them collect enormous amounts of data
And that led to what's sometimes referred to as the arm farm project.
It might've actually been Jeff Dean who coined that term.
At one point we wanted to call it the arm pit, but I think people didn't really like
For this project, we wanted to pick a robotic task that was kind of basic in the sense that
it was something that like everybody would want and it was fairly broad that like all
robots should have that capability.
And it was something that could be applied to large sets of objects or something that
really needed generalization.
So we went with robotic grasping, like basically bin picking, because that's not the most glamorous
thing, but it is something that really needs to generalize because you can pick all sorts
of different objects.
It's something that every robot needs to have and it's something that we could scale up.
So we went for that because that seemed like the right target for this kind of very extreme
purist brute force approach.
And basically what we did is we went down to Costco and Walmart and we bought tons of
plastic junk and we would put it in front of these robots and just like day after day,
we would load up the bins in front of them and they would just run basically as much
One of the things that I spent a lot of time on is just like getting the uptime on the
robots to be as high as it could be.
So Peter Pastor, a roboticist at Google and I, we basically did a lot of work to increase
And of course, with a great team that was all supporting this effort, Peter Pastor was
probably the main one who did a lot of that stuff.
After several months, it got to a point where actually relatively simple techniques could
acquire very effective robotic grasping policies.
The interesting anecdote here is we were doing this work.
It took us a while to do it.
So it came out in 2016, just a few months actually after AlphaGo was announced.
Alex Krizhevsky, who was working with us on the ConvNet design, when AlphaGo was announced,
he actually told me something to the effect of like, oh, you know, for AlphaGo, they have
like a billion something games and you gave me only a hundred thousand grasping episodes.
It seemed like this was going to work.
So I remember I had some snarky retort where I said, well, yeah, they have like a billion
games, but they still can't pick up the Go pieces.
But on a more serious note, like around this time, I was actually starting to get kind
of disappointed because this thing didn't really work very well.
And I, I think some of this robotics wisdom had rubbed off on me.
So I was saying, well, like, OK, maybe we should like put in some more like domain
knowledge about the shapes of objects.
And so, and I remember Alex also told me like, oh, no, no, no, just like be patient, just
like add more data to it.
So I heeded that advice.
And it took a little while, but after a few more
months, basically the same things that he had been trying back then just started
working once there was enough critical mass.
Obviously there were a few careful design decisions in there, but we did more or less
succeed in this fairly extreme kind of purist way of tackling the problem, which again, it
was not by any means the absolute best way to build a grasping system.
And actually, since then, people have developed more hybrid grasping systems that use
depth and 3D and simulation and also use deep learning.
And I think it's fair to say that they do work better, but it was a pretty interesting
experience for us that just getting robots in a room for several months with some
simple, but careful design choices could result in a very effective grasping system.
One of the things that's interesting for me is the scale of that data. To
that point about, you know, like a billion Go games or like GPT-3, compared to the
amount of data those are trained on, the scale of these robotics things is just so much
smaller, like a few months.
And what was the total number of months? Like, the total amount of time in that
data set is only on the order of years, right?
So it's a little hard to judge because obviously the uptime for the robots is not
a hundred percent, but roughly speaking, yeah, it's, if I do a little bit of quick
mental math, it would be on the order of a couple of years of robot time and the total
size of the data set was on the order of several hundred thousand trials, which
amounts to about 10 million images.
But of course, you know, the images are all correlated in time.
So basically it's roughly like ImageNet sized, but not much bigger than that.
And the images are much less diverse than ImageNet.
It's surprising that it worked at all given how small the data set was.
Yeah, that's not a crazy huge data set.
Well, although one thing I will say on this topic is that I think a lot of people are
very concerned that large data sets in robotics might be impractical.
And there's a lot of work, a lot of very good work, I should say, on all sorts of
transfer learning ideas.
But I do think that it's perhaps instructive to think about the problem as a prototype
for a larger system, because if someone actually builds, let's say a home robot.
And let's say that one in a hundred people in America buy this robot and put it in
their homes, that's on the order of 3 million people, 3 million robots.
And if those 3 million robots do things for even one month in those
homes, that is a lot of data.
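The back-of-the-envelope estimate above can be written out as a short calculation. This is a sketch under stated assumptions: the round population figure of 300 million is inferred from the "3 million" in the conversation, the robots are assumed to run continuously for the month, and the arm farm baseline of two robot-years is one reading of "a couple of years of robot time" mentioned earlier.

```python
# Assumptions (not measured values): round numbers taken from the conversation.
US_POPULATION = 300_000_000   # implied by "one in a hundred" giving 3 million
ADOPTION_ONE_IN = 100         # one in a hundred people buys a robot
MONTHS_DEPLOYED = 1           # "even one month"
ARM_FARM_ROBOT_YEARS = 2      # "a couple of years of robot time"


def fleet_robot_years(population, one_in, months):
    """Total robot time, in robot-years, for a deployed fleet."""
    robots = population // one_in
    return robots * months / 12


fleet_years = fleet_robot_years(US_POPULATION, ADOPTION_ONE_IN, MONTHS_DEPLOYED)
ratio = fleet_years / ARM_FARM_ROBOT_YEARS

print(f"fleet: {fleet_years:,.0f} robot-years")
print(f"roughly {ratio:,.0f}x the arm farm data set, by robot time")
```

Under these assumptions a single month of deployment yields 250,000 robot-years, about 125,000 times the robot time behind the grasping data set.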
So the thing is robots, if they're autonomous robots, they should be
collecting data way more cheaply and at a way larger scale than data that we
So for this reason, I actually think that robotics in the long run may actually be
at a huge advantage in terms of its ability to collect data.
We're just not seeing that huge advantage now in robotic manipulation because we're
stuck at the smaller scale, more due to economics rather than, I would say, science.
And by the way, here's an example that maybe hammers this point home.
If you work at Tesla, you probably don't worry about the size of your data set.
You might worry about the number of labels.
You're not going to worry about the number of images you've got because that robot is
actually used by many people.
So if robotic arms get to the same point, we won't worry about how many images we're
I'm curious what your ideal robot to deploy would be.
Like, what do you think about the humanoid robot versus some other robot type?
Yeah, that's a great question.
If I was more practically minded, if I was a little more entrepreneurial, I would
probably give you a more compelling answer.
But to be honest, I actually think that the most interesting kinds of robots to
deploy, especially with reinforcement learning technology, might actually be robots that
are very unlike humans.
Of course, it's very tempting, like from science fiction stories and so on, to think
like, OK, well, robots, they'll be like Rosie from the Jetsons or like, you know,
Commander Data from Star Trek or something.
They'll look like people and they will kind of do things like people and maybe they
will. That's fine.
There's nothing wrong with that. And that's kind of exciting.
But perhaps even more exciting is the possibility that we're going to have
morphologies that are so unlike us that we wouldn't even know how these things could
do stuff. You know, maybe your home robot will be a swarm of a hundred quadrotors
that just like fly around like little flies and like clean up your house.
Right. So they will actually behave in ways that we would not have been able to
design manually and where good reinforcement learning methods would actually figure out
ways to control these bizarre morphologies in ways that are actually really
Huh. That's really interesting.
It'll be interesting to see happen.
I think maybe one other, I mean, there's lots of things against the humanoid
structure. One thing that it does have going for it is most of the world is currently
made for people. Like, to open this door, right?
And this sliding door is like kind of heavy.
It's almost impossible with a quadrotor.
It doesn't matter how clever it is because it just doesn't have enough force.
But yeah, it would be interesting to think about like what kind of crazy strategies
they might come up with.
You worked on the Google Arm Farm project for a while and eventually it seems like
enough data allows you to use relatively simple algorithms to be able to solve the
grasping problem in this kind of extreme.
So what were you thinking about after that?
After that, the next frontier that we need to address is to have systems that can
handle a wide range of tasks.
So grasping is great, but it's a little special.
It's special in the sense that one very compact task definition, which is like, are
you holding an object in your gripper, can encompass a great deal of complexity.
Most tasks aren't like that.
For most tasks, you need to really specify what it is that you want the robot to do.
And it needs to be deliberate about pursuing that specific goal and not some other goal.
So that leads us into things like multitask learning.
It leads us into things like goal specification and instructions.
One of the things that my students and I worked on when I started as a professor at UC
Berkeley is trying to figure out how we can get goal conditioned reinforcement learning
to work really well.
So we sat down and we thought, well, like this grasping thing, that was great because
like one very concise task definition leads to a lot of complexity.
So you can define like a very simple thing, like are you holding an object and lots of
complexity emerges from that just through kind of autonomous interaction.
So can we have something like that, some very compact definition that encompasses a wide
range of different behaviors?
The thing that we settled on to start with was goal conditioned reinforcement learning,
where essentially the robot gets in the early days, literally a picture of what the
environment should be.
And it tries to manipulate the environment until it matches that picture.
Of course, you can do goal conditioned reinforcement learning in other ways.
For example, more recently, the way that we and many others have been approaching it is
by defining the goal through language, but just defining it through pictures is fine to
get started because there you kind of just focus on just the visual and the control aspect
of the problem.
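
A minimal sketch of the goal conditioned idea described above, under toy assumptions: the "picture" of the goal is replaced by a state index on a six-cell chain, and the policy is a table learned by value iteration. The names N_STATES, step, reward, train_goal_conditioned_q and rollout are hypothetical and do not come from the conversation or from any paper mentioned in it.

```python
# Toy goal conditioned value iteration on a 1-D chain (illustrative only).
N_STATES = 6
ACTIONS = (-1, +1)  # move left, move right
GAMMA = 0.9


def step(state, action):
    """Deterministic dynamics: move left or right, clipped to the chain."""
    return min(max(state + action, 0), N_STATES - 1)


def reward(next_state, goal):
    """One compact task definition: 1 if the goal is matched, else 0."""
    return 1.0 if next_state == goal else 0.0


def train_goal_conditioned_q(sweeps=50):
    # q[state][goal][action_index]: a single table covers every goal.
    q = [[[0.0, 0.0] for _ in range(N_STATES)] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(N_STATES):
            for g in range(N_STATES):
                for i, a in enumerate(ACTIONS):
                    s2 = step(s, a)
                    # Reaching the goal ends the episode, so no bootstrap there.
                    future = 0.0 if s2 == g else GAMMA * max(q[s2][g])
                    q[s][g][i] = reward(s2, g) + future
    return q


def rollout(q, start, goal, max_steps=20):
    """Follow the greedy goal conditioned policy and record visited states."""
    state, path = start, [start]
    while state != goal and len(path) <= max_steps:
        best = max(range(len(ACTIONS)), key=lambda i: q[state][goal][i])
        state = step(state, ACTIONS[best])
        path.append(state)
    return path


q_table = train_goal_conditioned_q()
print(rollout(q_table, start=0, goal=4))
print(rollout(q_table, start=5, goal=1))
```

The point of the sketch is only that the same table, queried with different goals, produces different behaviors. A real system would replace the state index with an image and the table with a neural network.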
The very first work that we had on this image goal conditioned reinforcement learning,
this was work done by two students, Vitchyr Pong and Ashvin Nair, who both sit down to
They work at OpenAI now, but back then they were working on this image based robotic
The robot could do very simple things.
It was like put an upside down blue bowl, like five inches across the table, right?
That was the task.
But that was the first ever demonstration of an image based goal conditioned RL system.
Other people had done goal conditioned RL that was not image based, but in the real
world with images, that was the first demonstration.
And yes, pushing an upside down blue bowl five inches across the table is kind of lame,
but it was a milestone that got things rolling.
From there, they did other things that were a little more sophisticated.
One of the experiments that really stands out in my mind that I thought was pretty neat
is we had set up a robot in front of a little cabinet with a door, and Vitchyr and Ashvin
developed an exploration algorithm where the robot would actually directly imagine
It used a generative model, a VAE based model, that would literally hypothesize
the kinds of images that it could accomplish in this environment, attempt to reach them
and then update its model.
So it was like the robot is sort of like dreaming up what it could do, attempting to see
if it actually works.
And if it doesn't work, imagine something else.
And they ran this experiment, it was obviously a smaller scale experiment than the arm farm.
They ran it over just one day, but within 24 hours, it would actually first figure out
how to move the gripper around because it was really interesting.
The gripper moved, but then once it started touching the door, it saw that, oh, actually
like the door starts swinging open.
So now it imagines lots of different angles for the open door.
And from there, it starts actually manipulating it.
And then it learns how to open the door to any desired angle at the end.
And that was entirely autonomous, right?
You just put it in front of the door and wait.
So that was a pretty neat kind of sign of things to come.
Obviously, at a much smaller scale, it suggests that if you have this kind of goal image
thing, then you could push it further and further.
And of course, since then, we and many others have pushed this further.
In terms of more recent work on this topic, there's a very nice paper from Google called
Actionable Models, which actually combines this with offline reinforcement learning,
using a bunch of these large multi-robot data sets that have been collected at Google to
learn very general goal-conditioned policies that could do things like rearrange things
on a table and stuff like that.
So this stuff has come a long way since then.
For goals conditioned on language, like from an image perspective, it's easy to tell, like,
is this image the image that I wanted?
But from language, like, what sort of techniques are you excited about for evaluating whether
this goal has actually been accomplished?
There's a lot of interesting work going on in this area right now, and some of it, my
colleagues and I at Google are working on.
There are many other groups that are working on this, like Dieter Fox's lab is doing
wonderful work in this area with NVIDIA.
And well, so this is something that people have had on their mind for a while.
But I think that most recently, the thing that has really stimulated a lot of research
in this area is the advent of vision language models like CLIP that actually work.
And in some ways, I feel a certain degree of vindication myself in focusing on just
the image part of the problem for so long, because I think one of the things that good
vision language models allow you to do is kind of not worry about the language so much,
because if you have good visual goal models, then you can plug them in with vision language models.
And the visual language model almost acts like a front end for interfacing these
non-linguistic robotic controllers with language.
As a very kind of simple example of this, my student Dhruv Shah has a paper called
LM-Nav that basically does this for navigation.
So Dhruv had been working on just purely image based navigation, kind of in a similar
regime where you specify an image goal.
And then together with Brian Ichter from Google and Błażej Osiński from the University of Warsaw,
they have a recent paper where they basically just kind of do the obvious thing.
They take a vision language model, they take CLIP and they just weld it onto this thing
as a language front end.
So everything underneath is just purely image based.
And then CLIP just says like, OK, among these images, which one matches the
instruction the user provided?
And that basically does the job.
And it's kind of nice that now progress on visual language models, which can take place
entirely outside of robotics, would basically lead to better and better language front
ends for what are really visual goal conditioned systems.
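
A minimal sketch of the "language front end" step described above: given an instruction and a set of candidate goal images, pick the image whose embedding best matches the instruction. The three-dimensional vectors below are made-up stand-ins for the outputs of a vision language model, and cosine and pick_goal_image are hypothetical helper names. No real CLIP API is called here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_goal_image(instruction_embedding, image_embeddings):
    """Return the name of the image that best matches the instruction."""
    return max(
        image_embeddings,
        key=lambda name: cosine(instruction_embedding, image_embeddings[name]),
    )


# Assumed embeddings: in a real system these would come from the image and
# text encoders of a vision language model.
candidate_images = {
    "stop_sign": [0.9, 0.1, 0.0],
    "blue_door": [0.1, 0.9, 0.2],
    "picnic_table": [0.0, 0.2, 0.9],
}
instruction = [0.2, 0.8, 0.1]  # pretend embedding of "go to the blue door"

goal = pick_goal_image(instruction, candidate_images)
print(goal)
```

The selected image would then be handed to the purely image based controller as its goal, so the controller itself never sees language.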
How far do you feel like visual goal conditioned systems can go, especially with
I think they can go pretty far, actually.
And I think that the important thing there, though, is to kind of think about it the right
way. Like, I think we shouldn't take the whole like matching pixels thing a little too
literally. It's really more like the robot's goal.
There's kind of a funny version of this that actually came up in a project on robotic
navigation that Dhruv and I were doing where we had data of robots driving around at
different times of day. And there's almost like a philosophical problem.
You give it a picture of a building at night and it's currently during the day.
So should it drive to the building and wait until it's night or should it like, you know, wait
around until it gets dark?
Because that's most of them.
Yeah. So you kind of have to be able to learn representations that abstract away all of
these kind of non-functional things.
But if you're reaching your goal in a reasonable representation space, then it actually
does make sense. And fortunately, with deep learning, there are a lot of ways to learn good
representations. So as long as we don't take the whole thing too literally and we use
appropriate representation learning methods, it's actually a fairly solid approach.
Right. That makes sense.
That's actually a really interesting question.
Kind of if you give a picture of a building at night and it's daytime, it doesn't matter in some
situations. But in other situations, it really does matter.
It really depends on kind of what the higher level goal is.
But it doesn't have that concept of higher level goal yet.
Yeah. So in reinforcement learning, people have thought about these problems a bit.
So from a very technical standpoint, goal conditioned policies do not represent all possible
tasks that an agent can perform.
But the set of state distributions does define the set of all possible tasks.
So if you can somehow lift it up from just conditioning on a single goal state to
conditioning on a distribution over states, then that provably allows you to represent all
tasks that could possibly be done.
And there are different ways that people have approached this problem that are very
interesting. They've approached it from the standpoint of these things called successor
features, which are based on successor representations.
You can roughly think of these as low dimensional projections of state distributions.
More recently, there's some really interesting work that I've seen out of FAIR.
This is by Ahmed Touati and Yann Ollivier, two researchers at Meta that are
developing techniques for unsupervised acquisition of these kind of feature spaces where
you can project state representations and get policies that are sort of conditional on any
possible tasks. So there's a lot of active research in this area.
It's something I'm really interested in.
I think it's possible to kind of take these goal conditioned things a little further and
really condition on any notion of a task.
When you're thinking about what directions to pursue and especially given the number of
people that you collaborate with, the number of students and things like that, how do you
think about picking which research questions to answer and how has that evolved over the
years? There are a couple of things I could say here.
Obviously, the right way to pick research questions really depends a lot on one's
research values and what they want out of their research.
But for me, I think that something that serves as a really good compass is to think about
some very distant end goal that I would really like to see, like generally capable robotic
systems, generally capable AI systems, AI systems that could do anything that humans can
do. And then when thinking about research questions, I ask myself, if a research project
that I do is wildly successful, the most optimistic sort of upper confidence bound
estimate of success, will it make substantive progress towards this very distant end goal?
You really want to be optimistic when making that gauge because obviously the expected
outcome of any research project is failure.
Like research is failure.
That's kind of the truth of it.
But if the most optimistic outcome for your research project is not making progress on
your long term goals, then something is wrong.
So I always make sure to look at whether the most optimistic guess of the outcome makes
substantial progress towards the most distant and most ambitious goal that I have in mind.
Has your distant end goal changed over time?
Not in a huge way, but I think it's easy to have a goal that doesn't change much over
time if it's distant enough and big enough.
If your end goal is something as broad as like, well, I just want generally capable AI
systems that can do anything a person can do.
It's like, well, yeah, I mean, that may be a very far away target to hit, but it's also
such a big target to hit that it's probably going to be reasonably conservative.
That makes sense.
And that's yours is to make general purpose.
What do you feel like are the most interesting questions to you right now?
One thing I maybe that I can mention here is that I think that especially over the last
one or two years, there has been a lot of advances in machine learning systems, both
in robotics and in other areas like vision and language that do a really good job of
emulating people through imitation learning, through supervised learning.
That's what language models do essentially, right?
They're trained to imitate huge amounts of human produced data.
Imitation learning in robotics has been tremendously successful.
But I think that ultimately we really need machine learning systems that do a good job
of going beyond the best that people can do.
And that's really the promise of reinforcement learning.
And if we were to chart the course of this kind of research, it was something like,
well, about five years back when there was a lot of excitement about reinforcement
learning, things like AlphaGo, a really exciting prospect there was that emerging
capabilities from these algorithms could lead to machines that are superhuman, that
are significantly more capable than people at certain tasks.
But it turned out that it was very difficult to make that recipe by itself scale because
a lot of the most capable RL systems relied in a really strong way on simulation.
So in the last few years, a lot of the major advances have taken a step back from that
and instead focused on ways to bring in even more data, which is great because it leads to
really good generalization.
But when using purely supervised methods with that, you get at best an emulation of
human behavior, which in some cases, like with language models, is tremendously
powerful because, well, if you have the equivalent or even a loose approximation of
human behavior for typing text, that's tremendously useful.
But I do think that we need to figure out how to take these advances and combine them
with reinforcement learning methods, because that's the only way that we'll get to
above human behavior, to actually have emergent behavior that improves on the typical
human. And I think that's actually where there's a major open question:
how to combine not the simulation based, but the data driven approach with reinforcement
learning in a very effective way.
That's interesting. Do you feel like you have any thoughts on how to do that
combination? In my group at Berkeley, we're focusing a lot on what we call offline
reinforcement learning algorithms.
And the idea is that traditionally reinforcement learning is thought of as a very
online and interactive learning regime.
Right. So if you open up the classic Sutton and Barto textbook, the most canonical
diagram that everyone remembers is the cycle where the agent interacts with the
environment: the agent produces an action, the environment produces some state, and it all
goes in a loop. It's a very online, interactive picture of the world.
But the most successful large scale machine learning systems, language models, giant
ConvNets, CLIP, et cetera, they're all trained on data sets that have been collected
and that are stored to disk and then reused repeatedly.
Because if you're going to train on billions and billions of images or billions of
documents of text, you don't want to recollect those interactively each time you
retrain your system. So the idea in offline reinforcement learning is to take a large
data set like that and extract a policy by analyzing the data set, not by interacting
directly with a simulator or a physical process.
You could have some fine tuning afterwards, a little bit of interaction, but the bulk of
your understanding of the world should come from a static data set because that's much
more scalable. That's the premise behind offline reinforcement learning.
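
A minimal sketch of that premise, assuming a tiny deterministic chain and a hand-written log of transitions: the policy is extracted by sweeping over a fixed list, and no environment or simulator is ever called. DATASET, offline_q_iteration and greedy_policy are hypothetical names used only for illustration, and the full (learning rate 1) backup is only appropriate here because the toy dynamics are deterministic.

```python
GAMMA = 0.9

# Logged transitions: (state, action, reward, next_state, done).
# State 2 is a terminal goal state.
DATASET = [
    (0, "right", 0.0, 1, False),
    (0, "left", 0.0, 0, False),
    (1, "left", 0.0, 0, False),
    (1, "right", 1.0, 2, True),
]


def offline_q_iteration(dataset, sweeps=100):
    """Fit Q-values by repeatedly sweeping over a static dataset."""
    # Only (state, action) pairs that appear in the data get an entry.
    q = {(s, a): 0.0 for s, a, _, _, _ in dataset}
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            next_values = [v for (state, _), v in q.items() if state == s2]
            if done or not next_values:
                bootstrap = 0.0
            else:
                bootstrap = GAMMA * max(next_values)
            q[(s, a)] = r + bootstrap
    return q


def greedy_policy(q):
    """Pick, for each state, the logged action with the highest Q-value."""
    policy = {}
    for (s, a), value in q.items():
        if s not in policy or value > q[(s, policy[s])]:
            policy[s] = a
    return policy


q_values = offline_q_iteration(DATASET)
policy = greedy_policy(q_values)
print(policy)
```

Even though half of the logged actions move away from the goal, the extracted policy moves toward it in every state, which is the sense in which the dataset is analyzed rather than copied.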
And we've actually come a long way in developing algorithms that are effective for
this. So when we started on this research in 2019, it was basically like nothing
worked. And you would take algorithms that work great for online RL and in the
offline regime, they just didn't do anything.
Whereas now we have pretty respectable algorithms for doing this, and we're
starting to apply them, including to RL training of language models.
We had a paper called Implicit Language Q-Learning on this earlier this year, as well
as pre-training large models for robotic control.
So that stuff is really just starting to work now.
And I think that's one of the things that we'll see a lot of progress on very soon.
It's interesting when you first started working on offline RL, what were the problems
that you felt like needed to be solved in order to get offline RL to work at all?
So the basic problem with offline RL, which people well, I can step back a little bit.
So in the past, people thought that offline RL really wasn't that different from
kind of traditional value based methods like Q-Learning.
And you just needed to kind of come up with appropriate objectives and
representations. And then, you know, whatever you do to fit Q functions from online
interaction, maybe you could just do the same thing with static data and that would
kind of work. And it actually did work in the olden days when everyone was using
linear function approximators because linear function approximators are fairly
low dimensional and you can run them on offline data.
And they kind of do more or less the same thing that they would do with online data,
which is not much, to be honest.
But then with deep neural nets, when you run them with offline data, you get a problem
because deep nets do a really good job of fitting to the distribution they're trained
on. And the trouble is that if you're doing offline RL, the whole point is to change
your policy. And when you change your policy, then the distribution that you will see
when you run that policy is different from the one you're trained on.
And because neural nets are good at fitting to the training distribution, that strength
becomes a weakness when the distribution changes.
And it turns out this is something that people only started realizing a couple of
years back, but now is a very widely accepted notion that this distributional shift is a
very fundamental challenge in offline reinforcement learning.
And it really deeply connects to counterfactual inference.
Reinforcement learning is really about counterfactuals.
It's about saying, well, I saw you do this and that was the outcome.
I saw you do that and that was the outcome.
What if you did something different?
Would the outcome be better or worse?
That's the basic question that reinforcement learning asks.
And that is a counterfactual question.
And with counterfactual questions, you have to be very careful because some questions
you simply cannot answer.
So if you've only seen cars driving on a road and you've never seen them swerve off the
road and go into the ditch, you actually can't answer the question, what would happen if
you go into the ditch? The data is simply not enough to tell you.
So in offline RL, the correct answer to that is don't do it because you don't know what
will happen. Avoid the distributional shift for which there is no way for you to produce
a reasonable answer.
But at the same time, you still have to permit the model to generalize.
If there's something new that you can do that is sufficiently in-distribution that you
do believe you can produce an accurate estimate of the outcome, then you should do that
because you need generalization to improve over the behavior that you saw in the data
set. And that's a very delicate balance to strike.
Is there a principled answer to that or is it just a sort of like heuristic?
I would just pick something in the middle and it kind of works sometimes.
There are multiple principled answers, but one answer that seems pretty simple and seems
to work very well for us.
It was developed in a few different concurrent papers, but in terms of the
algorithms that people tend to use today, probably one of the most widely used
formulations was in a paper called Conservative Q-Learning by Aviral Kumar, one of
my students here. The answer was, well, be pessimistic.
So essentially, if you are evaluating the value of some action and that action looks a
little bit unfamiliar, give it a lower value than your network thinks it has.
And the more unfamiliar it is, the lower the value you should give it.
And if you're pessimistic in just the right way, that pessimism will cancel out any
erroneous overestimation that you would get from mistakes in your neural network.
And that actually tends to work.
And it's simple.
It doesn't require very sophisticated uncertainty estimation.
And it essentially harnesses the network's own generalization abilities because this
pessimism, it affects the labels for the network and then the network will generalize
from those labels. So in a sense, the degree to which it penalizes unfamiliar actions is
very closely linked to how it's generalizing.
So that actually allows it to still make use of generalization while avoiding the really
weird stuff that it should just not do.
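
A minimal sketch of the pessimism idea, reduced to a single state with three actions. It is a simplified reading of a conservative penalty (push down a soft maximum over all actions, push up the actions that appear in the data, and fit the observed rewards), not code from the Conservative Q-Learning paper. The dataset, the initial values, and the names softmax and conservative_fit are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conservative_fit(q_init, dataset, alpha=1.0, lr=0.1, steps=2000):
    """Gradient descent on the mean over the data of
    alpha * (logsumexp(q) - q[a]) + 0.5 * (q[a] - r) ** 2."""
    q = list(q_init)
    n = len(dataset)
    for _ in range(steps):
        # Gradient of logsumexp: pushes every action down, high values most.
        grad = [alpha * p for p in softmax(q)]
        for action, rew in dataset:
            grad[action] -= alpha / n             # push up actions seen in data
            grad[action] += (q[action] - rew) / n  # fit the observed reward
        q = [qi - lr * gi for qi, gi in zip(q, grad)]
    return q


# Assumed data: actions 0 and 1 were observed, action 2 never was.
dataset = [(0, 1.0), (1, 0.5)]
# Assumed initial network output: a spurious overestimate for the unseen action.
q_init = [0.0, 0.0, 5.0]

q_conservative = conservative_fit(q_init, dataset)
naive_choice = max(range(3), key=lambda a: q_init[a])
conservative_choice = max(range(3), key=lambda a: q_conservative[a])
print(naive_choice, conservative_choice)
```

Acting greedily on the initial values picks the never-observed action, while the conservatively fitted values prefer the best action that the data actually supports. The seen actions end up valued somewhat below their observed rewards, which is the price of the pessimism.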
And then in offline RL, thinking about techniques for going forward, do you feel like
there's a lot left to be done in offline RL or are we sort of at the point where like we
have decent techniques, we're learning a lot from these data sets that we have, and we
sort of need something else to move forward and actually make systems that are
significantly better than what's in the data already?
Yeah, I think we've made a lot of progress on offline RL.
I do think there are major challenges still to address.
And I would say that these major challenges fall into two broad categories.
So the first category has to do with something that's not really unique to offline RL,
actually, like it's a problem for all RL methods.
And that has to do with their stability and scalability.
So RL methods, not just offline RL, all of them, are harder to use than supervised learning.
And a big part of why they're harder to use is that, for example, with value-based
methods like Q-learning, they are not actually equivalent to gradient descent.
So gradient descent is really easy to do.
Gradient descent plus backprop, supervised learning, cross-entropy loss, great.
Like fair to say that that's kind of at a point where it's a turnkey thing.
You just, you know, you code it up, PyTorch, JAX, whatever, it works.
Value-based RL is not gradient descent, it's fixed-point iteration disguised as gradient descent.
And because of that, a lot of the nice things that make gradient descent so simple
and easy to use start going a little awry when you're doing Q-learning or value
iteration type methods.
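
A minimal sketch of that distinction, assuming the smallest possible problem: one state that loops back to itself with reward 1, and a value estimate with a single parameter w. Each round regresses w onto a target computed from the previous w, which is a fixed-point iteration, and the update direction it uses differs from the true gradient of the squared Bellman error. All names here are hypothetical.

```python
GAMMA = 0.9
REWARD = 1.0


def regress_onto_target(w_old):
    """One round of 'supervised learning': the target is frozen at w_old,
    and the minimizer of (w - target) ** 2 is the target itself."""
    return REWARD + GAMMA * w_old


def semi_gradient(w):
    """Update direction used by Q-learning style methods: the target is
    treated as a constant, so only the prediction is differentiated."""
    td_error = w - (REWARD + GAMMA * w)
    return td_error * 1.0


def residual_gradient(w):
    """True gradient of 0.5 * (w - (REWARD + GAMMA * w)) ** 2, where the
    target also depends on w."""
    td_error = w - (REWARD + GAMMA * w)
    return td_error * (1.0 - GAMMA)


w = 0.0
for _ in range(400):
    w = regress_onto_target(w)

print(w)                       # approaches REWARD / (1 - GAMMA)
print(semi_gradient(0.0))      # direction actually followed
print(residual_gradient(0.0))  # direction true gradient descent would follow
```

In this toy case the iteration converges because the backup is a contraction. The conversation's point is that with deep networks the same moving-target structure loses some of the guarantees that plain gradient descent on a fixed loss enjoys.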
And we've actually made some progress in understanding this.
There's work on this in my group.
There's work on this in several other groups, including, for example, Shimon
Whiteson's group at Oxford and many others, that just recently we've sort of
started to scratch the surface for what is it that really goes wrong when you use
Q-learning style methods, these fixed-point iteration methods, rather than gradient descent.
And the answer seems to be, and this is kind of preliminary, but the answer seems
to be that some of the things that make supervised deep learning so easy actually
make RL hard.
So let me unpack this a little bit.
If you told somebody who's like a machine learning theorist in, let's say, early
2000s, that you're going to train a neural net with like a billion parameters
with gradient descent for like image recognition, they would probably tell
you, well, yeah, that's really dumb because you're going to overfit and it's
going to fail.
So like, why are you even doing this?
And based on the theory at the time, they would have been completely right.
The surprising thing that happens is that when we train with supervised learning
with gradient descent, there's some kind of magical, mysterious fairy that comes
in and applies some magic regularization that makes it not overfit.
And in machine learning theory, one of the really active areas of research has been
to understand like, who is that fairy and what is the magic and how does it work out?
And there are a number of hypotheses that have been put forward that are pretty
interesting that all have to do with some kind of regularizing effect that basically
makes it so it's a giant overparameterized neural net that actually somehow comes up
with a simple solution rather than an overly complex one.
And this is sometimes referred to as implicit regularization.
Implicit in the sense that it emerges implicitly from the interplay of deep nets
and stochastic gradient descent.
And it's really good.
Like that's kind of what saves our bacon when we use these giant networks.
And it seems to be that for reinforcement learning, because it's not exactly gradient
descent, that implicit regularization effect actually sometimes doesn't play in our favor.
Like sometimes it's not actually a fairy, it's like an evil demon that comes in and
like screws up your network.
And that's really worrying, right?
Because like we have this like mysterious thing that seems to have been like really
helping us for supervised learning.
And now suddenly doing RL, it comes in and hurts us instead.
And at least to a degree, that seems to be part of what's happening.
So now that there is a slightly better understanding of that question, and I don't
want to overclaim how good our understanding of that is because there's major holes in it.
So there's a lot to do there, but at least we have an inkling.
We have a suspect, so to speak, even if we can't prove that they did it, we can start
trying to solve the problem.
We can try, for example, inserting explicit regularization methods that could counteract
some of the ill effects of this no longer helpful implicit regularization.
We can start designing architectures that are maybe more resilient to these kinds of
effects. So that's something that's happening now.
And it's not by any means like a solved thing, but that's where we could look for
potential solutions to these kind of instability issues that seem to afflict
What's the intuition behind why implicit regularization seems to help in
supervised networks, but be harmful in RL?
The intuition is roughly that given a wide range of possible solutions, a wide range
of different assignments to the weights of a neural net, you would select the one that
is simpler, that results in the simpler function.
So there are many possible values of neural net weights that would all give you a low
training loss, but many of them are bad because they overfit.
And implicit regularization leads to selecting those assignments to the weights that
result in simpler functions and yet still fit your training data and therefore generalize
better. And so the intuition for RL is, OK, for whatever reason, implicit regularization
results in learning simpler functions, but actually those simpler functions are worse
in an RL regime.
Yeah, so in RL, it seems that you get one of two things.
You either get that whole thing kind of fails entirely and you get really, really
And roughly speaking, that's like overfitting to your target values.
Basically, your target values are incorrect in the early stages.
So you overfit to them and you get some crazy function.
Essentially, you get like a little bit of noise in your value estimates and that noise
gets exacerbated more and more and more until all you've got is noise.
Or on the other hand, the other thing that seems to sometimes happen and
experimentally, this actually seems fairly common, is that this thing goes into
overdrive and you actually discard too much of the detail and then you get an overly
simple function. But somehow, you know, it seems hard to hit that sweet spot.
The kind of sweet spot that you hit every time with supervised learning seems
annoyingly hard with reinforcement learning.
That's interesting. How much does data diversity help?
Like if you were to add a lot more offline data of various types, does that seem
to do anything to this problem or not really?
We actually have a recent study on this.
This was done by some of my students together, actually in collaboration with Google on
large scale offline RL, actually for Atari games.
And there we study what happens when you have lots of data and also large networks.
And it seems like the conclusion that we reached is actually that if you're careful in
your choice of architecture, basically select architectures that are very easy to
optimize, like ResNets, for example, and you use larger models than you think would be
appropriate, larger than what you would need even for supervised learning, then things
actually seem to work out a lot better.
And in that paper, kind of our takeaway was that actually a lot of reasons why large
scale RL efforts were so difficult before is that people were sort of applying their
supervised learning intuition and selecting architectures according to that, when in
fact, if you go like somewhat larger than that, maybe two times larger in terms of
architecture size, that actually seems to mitigate some of the issues.
It probably doesn't fully solve them, but it does make things a lot easier.
And it's not clear why that's true.
But one guess might be that when you're doing reinforcement learning, you don't just
need to represent the final solution at the end.
You don't just need to represent the optimal solution.
You also need to represent everything in between.
You need to represent all those suboptimal behaviors on the way there.
And those suboptimal behaviors might be a lot more complicated.
Like the final optimal behavior might be hard to find, but it might be actually a
fairly simple parsimonious behavior.
The suboptimal things where you're all kind of OK here, kind of OK there, maybe kind of
optimal over there, those might actually be more complicated and you might require more
representational capacity to go on that journey and ultimately reach the optimal solution.
That's really interesting that in RL, you need to do this counterfactual reasoning pretty
explicitly, and so you'd need to represent these suboptimal behaviors.
But in, let's say, a language model, you don't need to.
They're often quite bad at counterfactual reasoning.
And we do see that they get better at that as they get larger.
So there's something interesting here.
And actually trying to improve language models through reinforcement learning,
particularly value-based reinforcement learning, is something that my students and I are doing
quite a bit of work on these days.
So obviously, many of your listeners are probably familiar with the success of RL with
human preferences and recent language models work.
But one of the things that's kind of a, well, one of the ways in which that falls short
is that a lot of the ways that people do RL with language models now treat the language
model's task as a one-step problem.
So it's just supposed to generate like one response and that response should get the
But if we're thinking about counterfactuals, that is typically situated in a multi-step
process. So maybe I would like to help you debug some kind of a technical problem.
Like maybe you're having trouble reinstalling your graphics driver.
Maybe I might ask you a question like, well, what kind of operating system do you have?
Have you tried running this diagnostic?
Now, in order to learn how to ask those questions appropriately, the system needs to
understand that if it has some piece of information, then it can produce the right answer.
And if it asks the question that can get that piece of information, it's a multi-step
process. And if it has suboptimal data of humans that were doing this task, maybe not
so well, then it needs to do this counterfactual reasoning to figure out what the best
questions to ask are and so on.
And that's stuff that you're not going to get with these kind of one-step human
And certainly it's not what you're going to get with regular supervised learning
formulations, which will simply copy the behavior of the typical human.
So I think there's actually a lot of potential to get much more powerful language
models with appropriate value-based reinforcement learning, the kind of reinforcement learning
that we do in robotics and other RL applications.
Digging into that a little bit, like how does that work tactically for you and for students
at your lab, given that the larger you make these language models, the more capable
they are, and, you know, it's kind of hard to run even inference for these things on the
kind of compute that's usually available at an academic institution.
I mean, you guys have a decent amount of compute for universities, but still not quite
the same as say Google or OpenAI.
Yeah, it's certainly not easy, but I think it's entirely possible to take that problem
and divide it into its constituent parts so that if we're developing an algorithm that
is supposed to enable reinforcement learning with language models, well, that can be
done with a smaller model, evaluating the algorithm appropriately to just make sure
that it's like doing what it's supposed to be doing.
And that's a separate piece of work from the question of how it's going to be scaled up
to the largest size to really see how far it can be pushed.
So subdividing the problem appropriately can make this quite feasible.
And I don't think that's actually something that is uniquely demanded in academia.
Like even if you work for a large company, even if you have all the TPUs and GPUs that
you could wish for at your fingertips, which, by the way, researchers at large companies
don't always have, even then it's a good idea to chop up your problem into parts because
you don't want to be waiting three weeks just to see that you implemented something
incorrectly in your algorithm.
So in some ways, it's not actually that different, just that there's that last stage of
really fully scaling it up.
But, you know, hey, I mean, I think for graduate students that want to finish their PhD, in
many cases, they're happy to leave that to somebody who is more engineering focused to
get that last mile anyway.
So as long as we have good ways to get things, good benchmarks and good research
practices, we can make a lot of progress on this stuff.
Is there any worry that emergent behaviors that you see at much larger scales would kind
of cause you to make the wrong conclusion on a larger scale with some of these
Yes, that's definitely a really important thing to keep in mind.
So I think that it is important to have a loop, not just a one directional pipeline,
but there's a middle ground to this.
So we have to kind of hit that middle ground.
We don't want to commit the same sin that all too often
people committed in the olden days of reinforcement learning research, where we do
things at too small of a scale to see the truth, so to speak.
But at the same time, we want to do it at a small enough scale that we can make progress,
get some kind of turnaround, maybe find the right collaborators in an industrial setting
once we do have something working so that we can work together to scale it up and
complete the life cycle that way.
Yeah, actually, that brings me back to another question I was going to ask earlier when
you were talking about the examination of performance on Atari games as you made the
models just much larger.
It does seem like in reinforcement learning, the models are much, much smaller than they
are in many other parts of machine learning.
Do you have any sense for exactly why that is?
Is it just historical?
Is it merely a performance thing?
It just seems like, you know, I see a lot of like three-layer ConvNets or something
like not even a ResNet or like two layer MLP or something that's just much, much simpler
and very small dimensions.
Well, that has to do with the problems that people are working on.
So it's quite reasonable to say that if your images are Atari game images, it's a
reasonable guess that the visual representations that you would need for that are less
complex than what you would need for realistic images.
And when you start attacking more realistic problems, more or less exactly what you
expect happens, that the more modern architectures do become tremendously useful as
the problem becomes more realistic.
And certainly in our robotics work, the kind of architectures we use generally are
much closer to the latest architectures in computer vision.
So it's really just with relation to the problem, like as you get closer to the real
world, the more the larger networks start to pay off quite a bit.
Although I guess the interesting thing about the Atari thing was like, as you made
these larger, they seem to help anyway, right?
So that was kind of the surprising thing.
So certainly in robotics, this was not news that, you know, in robotics, people, us and
many others have used larger models.
And yes, it was helping.
But the fact that for these Atari games, where if you just wanted to, let's say,
imitate good behavior, you get away with a very small network, learning that good
behavior with offline value-based reinforcement learning really benefited from the
And it seems to have more to do with the kind of optimization benefits rather than just
being able to represent the final answer.
In terms of the goal of getting to more general intelligence, some people, they feel,
okay, if we just keep scaling up language models and adding things onto them, so
doing, you know, multi-step human preferences formulations and finding some way to
spend compute at inference so that it can do reasoning, then we'll be able to get all
the way with just these language-based formulations.
What are your thoughts on that and kind of like the importance of robotics versus not?
There are a couple of things that I could say on this topic.
So first, let's just keep the discussion just to language models to start with.
So let's say that we believe that doing all the language tasks somebody would want to
do is kind of, that's good enough and that's fine.
Like, there's a lot you can do that way.
Is it sufficient to simply build larger language models?
And I think that the answer there, in my opinion, would be no, because there are
really two things that you need.
You need the ability to learn patterns in data, and you need the ability to plan.
Now, plan is a very loaded word, and I use that term in the same sense that, for
example, like Rich Sutton would use it, where plan really refers to some kind of
computational process that determines the course of action.
It doesn't necessarily need to be literally like you think of individual steps in a
plan. It could be reinforcement learning.
Reinforcement learning is a kind of amortized planning, but there's some kind of
process that you need where you're actually reflecting on the patterns you learned
through some sort of optimization to find good actions rather than merely average
actions. And that could be at training time, so that could be like the value-based
RL. It could also be done at test time.
It can simply be that all you learn from your data is a predictive language model.
But then at test time, instead of simply doing maximum a posteriori decoding,
instead of simply finding the most likely answer, you actually do some kind of
optimization to find an answer that actually leads to an outcome that you want to
see. So maybe I'm trying to debug your graphics driver problem, and what I want is
I want you to say at the end, thank you so much.
You did a good job. You fixed my graphics driver.
So I might ask the model, well, what could I say now that would maximize the
probability that I'll actually fix your graphics driver?
And if the model can answer that question, maybe some kind of optimization procedure
can answer that question. That's planning.
Planning could also mean just running Q-learning.
That's fine, too. So whatever thing it is, that's actually very important.
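A small sketch of the test-time version described above, not taken from the episode: instead of returning the most likely response, score each candidate by the outcome it is predicted to lead to and return the best one. The candidate strings and every probability below are made-up placeholders; in a real system they would have to come from a learned model.

```python
# Hypothetical candidates. "prior" stands in for the language model's
# likelihood of each response; "outcomes" lists pairs of (probability of
# a user reply, probability the driver ends up fixed after that reply).
CANDIDATES = {
    "Try rebooting.": {
        "prior": 0.6,
        "outcomes": [(0.5, 0.3), (0.5, 0.1)],
    },
    "Which operating system do you have?": {
        "prior": 0.3,
        "outcomes": [(0.8, 0.8), (0.2, 0.3)],
    },
    "Reinstall everything.": {
        "prior": 0.1,
        "outcomes": [(1.0, 0.4)],
    },
}


def most_likely_response(candidates):
    """Plain decoding: return the response the model finds most probable."""
    return max(candidates, key=lambda r: candidates[r]["prior"])


def success_probability(outcomes):
    """Expected chance of a fix, averaged over the predicted user replies."""
    return sum(p_reply * p_fixed for p_reply, p_fixed in outcomes)


def planned_response(candidates):
    """One-step lookahead: return the response with the best expected outcome."""
    return max(
        candidates,
        key=lambda r: success_probability(candidates[r]["outcomes"]),
    )


print("most likely:", most_likely_response(CANDIDATES))
print("planned:    ", planned_response(CANDIDATES))
```

With these placeholder numbers the most likely response and the planned response differ, which is the gap between copying typical behavior and optimizing for the outcome.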
And I will say something here that a lot of people, when they appeal to the possibility
that you can simply build larger and larger models, they often reference Rich
Sutton's bitter lesson essay.
It's a great essay. I would actually strongly recommend to everybody to read it, but to
actually read it, because he doesn't say that you should use big models and lots of
data. He says you should use learning and planning.
And that's very, very important because learning is what gets you the patterns and
planning is what gets you to be better than the average thing in those patterns.
Yeah. So that's what we need.
We need the planning. I've been telling people to actually read the bitter lesson.
There's also Josh's takeaway.
Yeah, but I guess to push back on that just slightly as a devil's advocate for a
second, like it might be the case that I think, you know, some of these people saying
the large language model maximalists are saying maybe we can get away with sort of
simple types of planning in language.
So, for example, chain of thought ensembling or asking the language model, like, what
would you do next? Or just sort of like kind of heuristic, simple, kind of bolted on
planning in language afterwards.
I think that's a perfectly reasonable hypothesis, for what it's worth.
I think that the part that I might actually take issue with is that that's actually an
easier way to do it.
I think it might actually be more complex.
Ultimately, what we want is simplicity, because
simplicity makes it easy to make things work at a large scale.
Like, you know, if your method is simple, there's essentially fewer ways that it could
go wrong. So.
I don't think the problem with clever prompting is that it's too simple or primitive.
I think the problem might actually be that it might be too complex and then developing a
good, effective reinforcement learning or planning method might actually be a simpler,
more general solution.
What do you think of other types of reinforcement learning setups?
Like, I'm not sure if you saw the work by Anthropic, I think, maybe earlier this week
or very recently, basically, instead of doing RL with human feedback, they propose
doing RL with AI feedback.
It's like, OK, we'll train this on the preference model and then sort of use that to do
a feedback loop as a way of sort of automating this and getting the human out of the loop
as maybe an alternative to offline RL or yeah.
Yeah, I like that work very much.
I think that the part I might slightly disagree with is that I don't think it's an
alternative to offline RL. I think it's actually a very clever way to do offline RL.
I like that line of work very much because I think it gets at a similar goal of trying
to essentially do planning and optimization procedure at training time using what is in
effect a model.
The language model is being used as a model, and that's great because then you can get
And I think it's actually in my mind, it's actually more interesting than leveraging
human feedback, because with human feedback, you're essentially relying on human teachers
to hammer this into you, which is pragmatic.
Like if you want to build a company and you really want things to work today, like,
yeah, it's great to leverage humans because hire lots of humans and get them to hammer
your model until it does what you want.
But the prospect of having an autonomous improvement procedure, that's essentially the
dream of reinforcement learning, an autonomous improvement procedure where the more
compute you throw at it, the better it gets.
So, yeah, I read that paper.
I think it's great in terms of technical details.
I think a multi-step decision making process would be better than a single step decision
making process. But I think a lot of the ideas in terms of leveraging the language
models themselves to facilitate that improvement are great.
And I think that is actually a reinforcement learning algorithm, an offline
reinforcement learning algorithm in disguise, actually a very thin disguise.
These language models, aside from what we talked about earlier with translating images
into language, can we use the embeddings that are learned or anything like that for
robotics type problems?
Yeah, so I think that perhaps one of the most immediate things that we get out of that is
a kind of human front end, in effect, where we can build robotic systems that understand
visuomotor control, basically how to manipulate the world and how to change things in
the environment. We can build those kinds of systems and then we can hook them up to an
interface that humans can talk to by using these visual language models.
So that's kind of the most obvious, most immediate application.
I do think a really interesting potential is for it to not simply be a front end, but
actually have it be a bidirectional thing where potentially these things can also take
knowledge contained in language models and import it into robotic behavior.
Because one of the things that language models are very good at is acting as almost like
really, really fancy like relational databases, like the kind of stuff that people were
doing in the 80s and 90s, where you come up with a bunch of logical propositions and you
can say, well, like, is A true?
And you look up some facts and you figure out, you know, A is like B, etc.
Language models are great at essentially doing that.
So if you want the robot to figure out like, oh, I'm in this building, where do I go if I
want to get a glass of milk?
It's like, well, the milk is probably in the fridge, the fridge is probably in the kitchen,
the kitchen is probably down the hallway in the open area because kitchens tend to be near a
break area, it's an office building, like all this kind of factual stuff about the world, you can
probably get a language model to just tell you that.
And if you have a vision language model that acts as an interface between the symbolic
linguistic world and the physical world, then you can import that knowledge into your
And now for all this factual stuff, it'll kind of take care of it.
It won't take care of all the low level stuff, it won't tell the robot how to like move its
fingers. So the robot still has to do that.
But it does a great job of taking care of these kind of factual semantics.
Right, right. And there's a bunch of work using these language models for higher level
planning and then telling the instructions to the robot.
What do you think about the approach of collecting a lot of robotic data sets and then
making a much larger RL model and then training on this diversity of data sets to get kind
of quote unquote simulate the generality of something that you would get from one of these
large scale self-supervised models?
That's a great direction, and I should say that my students and I have been doing a lot of
work and a lot of planning on how to build general and reusable robotic control models.
So far, one of our results that's kind of closest to this is a paper by Dhruv Shah called General
Navigation Models, which deals with the problem of robotic navigation.
And what Dhruv did is basically he went to all of his friends who work on robotic
navigation and borrowed their data sets.
So we put together a data set with eight different robots.
So it's not a huge number. It's only eight.
But they really run the gamut all the way from small scale RC cars.
So these are all mobile robots.
So small scale RC car things like, you know, something that's like 10 inches long all the way
to full scale ATVs.
So these are off-road vehicles that are used for research.
Like you can actually sit in it.
So there's a large kind of car and everything in between.
I think there's like a Spot Mini in there.
There's a bunch of other stuff.
And he trained a single model that does goal-based navigation just using data from all
these robots. And the model is not actually told which robot it's driving.
It's given a little context.
So it has a little bit of memory.
And basically just by looking at that memory, you can sort of guess roughly what the
properties of the robot it's currently driving are.
And the model will actually generalize to drive new robots.
So we actually got it, for example, to fly a quadrotor.
Now, the quadrotor had to pretend to be a car.
So the quadrotor is still controlled only in two dimensions because there were no flying
vehicles in the data set. But it has a totally different camera.
It has like a fisheye lens.
Obviously it flies.
So it wobbles a bit. And the model could just in zero shot immediately fly this quadrotor.
In fact, we put that demo together before a deadline.
So the model worked on the first try.
What took us the most time is figuring out how to replace the battery in the quadrotor
because we hadn't used it for a year.
Once we figured out how to replace the battery, the model could actually figure out how to
fly the drone immediately.
So, I mean, navigation obviously is simpler in some ways than robotic manipulation because
you're not making contact with the environment, at least if everything's going well.
So in that sense, it's a simpler problem.
But it does seem like multi-robot generalization there was very effective for us.
We're certainly exploring multi-robot generalization for manipulation right now.
We're trying to collaborate with a number of other folks that have different kinds of
robots. There's a large collection effort from Chelsea Finn's group at Stanford that
we're also partnering up with.
So I think we'll see a lot more of that coming in the future.
And I'm really hopeful that a few years from now, the standard way that people approach
robotics research will be just like in vision and NLP, to start with a pre-trained
multi-robot model that has basic capability and really build their stuff on top of it.
That's really interesting. In terms of thinking about the next few years, let's say
the next five years, do you have a sense of what kind of developments you'd be most
excited to see that you kind of expect will happen aside from pre-trained models for
I mean, obviously, the pre-trained models one is a very pragmatic thing.
That's something that's super important.
But the thing that I would really hope to see is something that makes lifelong robotic
learning really the norm.
I think we've made a lot of progress on figuring out how to do large scale
imitation learning. We developed good RL methods.
We've built a lot of building blocks.
But to me, the real promise of robotic learning is that you can turn on a robot, leave
it alone for a month, come back, and suddenly it's like figured out something amazing
that you wouldn't have thought of yourself.
And I think to get there, you really need to get in the mindset of robotic learning
being an autonomous, continual and largely unattended process.
If I can get to the point where I can walk into the lab, turn on my robot and come back
in a few days and it's actually spent the intervening time productively, I would
consider that to be a really major success.
How much of that do you think is important to focus on the actual lifetime of the
individual robot, like treating it as an individual versus like, well, it's just like a
data collector for the offline RL data set and it just sends it up there and gets
whatever coming back down afterwards?
Oh, I think that's perfectly fine.
Yeah. And I think that in reality, for any practical deployment of these kinds of ideas
that scale, it would actually be many robots all collecting data, sharing it and
exchanging their brains over a network and all that.
That's the more scalable way to think about it on the learning side.
But I do think that also on the physical side, there are a lot of practical challenges
and just like, you know, what kind of methods should we even have if we want the robot
in your home to practice, you know, cleaning your dishes for three days.
Like if you just run a learning algorithm for a robot in your home, probably the
first thing it'll do is wave its arm around, break your window, then break all of your
dishes, then break itself and then spend the remaining time it has just sitting there broken in a
corner. So there's a lot of practicalities.
That's right. And it won't go out and buy more dishes, which is what you'd want it to
No, no, I don't think you'd want that to go outside and buy more dishes.
It would go outside and fall down the steps, hurt someone, get in the middle of the
road and cause an accident.
In all seriousness, that's where I think a lot of these challenges are wrapped up,
because in some ways, all of these difficulties that happen in the real world, they're
also opportunities like, OK, OK, maybe the breaking of the dishes is extreme, but if
it drops something on the ground, well, great, like figure out how to pick it up off the
ground. If it spills something, great, good time to figure out how to get out the sponge
and clean up your spill. Like robots should be able to treat all these unexpected events
that happen as new learning opportunities rather than things that just cause them to
fail. And I think that there's a lot of interesting research wrapped up in that.
It's just hard to attack that research because it always kind of falls in between
different disciplines. Like it doesn't slot neatly into just developing a better RL
method or just developing a better controller or something.
That's really interesting, huh?
That, yeah, it's kind of like somewhere between continual learning and robotics and
some other stuff.
And it's all about the messy deployment parts, like the part about the quadcopter
taking longer to replace the battery than to train probably wasn't even in the
paper. It's just a thing that's elided.
It wasn't even in the appendix.
No, it wasn't in the appendix.
It might be in the undergraduate student's grad school application essay.
Looking to the past, whose work do you feel like has impacted you the most?
It's an interesting question.
There's some kind of like very standard answers I could give, but I actually think
that one body of work that I want to highlight that maybe not many people are
familiar with that was actually quite influential on me is work by Emanuel
Todorov. So most people know about Professor Todorov from his work in developing
the MuJoCo simulator.
But before he did that, he actually did a lot of research at the intersections
between control theory, reinforcement learning and neuroscience.
And in many ways, the work that he did was quite ahead of its time in terms of
combining reinforcement learning ideas with probabilistic inference concepts and
controls. And besides that, you know, at some low, low technical level, a lot of
the ideas that I capitalized on in developing new RL algorithms were based on
some of these controls inference concepts that his work, as well as the work of
other people in that era, kind of pioneered.
But also, I think the general approach and philosophy of combining very technical
ideas in probabilistic inference, RL and neuroscience and controls all together like
that was something that I would say really shaped my approach to research, because
essentially, I think one of the things that he and others in that kind of neck of the
woods did really well is really tear down the boundaries between these things.
As an example of something like this, there's this idea that's sometimes referred to
as Kalman duality, which is basically the concept that a forward-backward message
passing algorithm, like what you would use in a hidden Markov model, is more or less
the same thing as a control algorithm.
So inferring the most likely state to get, you know, given a sequence of
observations kind of looks an awful lot like inferring the optimal action given some
reward function. And that could be made into a mathematically precise statement.
So it's not merely interdisciplinary, it's really tearing down the boundaries between
these areas and showing the underlying commonality that emerges when you basically
reason about sequential processes.
And I think that was actually very influential on me in terms of how I thought
about the technical concepts in these areas.
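A small numerical sketch of the duality in its classical linear-Gaussian form, not taken from the episode: the covariance recursion of a Kalman filter is the same Riccati recursion that appears in LQR control, with the dynamics and observation matrices transposed (the filter runs it forward in time, the controller backward). The matrices below are arbitrary illustrative values, and the helper names are hypothetical. The observation and control are scalar so the matrix inverse reduces to a division.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combine(a, b, sign=1.0):
    """Elementwise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, k):
    return [[k * x for x in row] for row in a]


def lqr_riccati_step(P, A, B, Q, r):
    """Control: P <- Q + A'PA - A'PB (r + B'PB)^-1 B'PA."""
    AtP = matmul(transpose(A), P)
    BtP = matmul(transpose(B), P)
    s = r + matmul(BtP, B)[0][0]
    correction = scale(matmul(matmul(AtP, B), matmul(BtP, A)), 1.0 / s)
    return combine(combine(Q, matmul(AtP, A)), correction, sign=-1.0)


def kalman_covariance_step(S, A, C, W, v):
    """Estimation: S <- W + ASA' - ASC' (v + CSC')^-1 CSA'."""
    AS = matmul(A, S)
    CS = matmul(C, S)
    s = v + matmul(CS, transpose(C))[0][0]
    correction = scale(matmul(matmul(AS, transpose(C)), matmul(CS, transpose(A))), 1.0 / s)
    return combine(combine(W, matmul(AS, transpose(A))), correction, sign=-1.0)


A = [[1.0, 0.1], [0.0, 1.0]]    # dynamics (position, velocity)
C = [[1.0, 0.0]]                # observe position only
W = [[0.01, 0.0], [0.0, 0.01]]  # process noise covariance / state cost
V = 0.1                         # observation noise variance / control cost

sigma = [[1.0, 0.0], [0.0, 1.0]]
p = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(20):
    sigma = kalman_covariance_step(sigma, A, C, W, V)
    # Same numbers, read as a control problem: A -> A', B -> C'.
    p = lqr_riccati_step(p, transpose(A), transpose(C), W, V)

max_gap = max(abs(x - y) for rs, rp in zip(sigma, p) for x, y in zip(rs, rp))
print("largest difference after 20 steps:", max_gap)
```

The estimation error covariance and the control cost-to-go matrix stay identical at every step, which is one concrete way to see the "same thing" claim for this special case.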
That's really interesting. It reminds me of a lot of folks are really interested in or
maybe not a lot, but a few people are very interested in formulating RL, the RL
formulation as kind of a sequence model formulation.
And so it feels like there's maybe a similar thing going on here.
I'm curious what you think about this formulation.
Yeah, I think to a degree that's true.
So certainly the idea that inference and sequence models looks a lot like control is a
very old idea. And the reason the Kalman duality is called the Kalman duality is because
it actually did show up in Kalman's original papers.
That's not what most people took away from it.
Most people took away that it's a good way to do state estimation.
And, you know, that was in the age of the space race and people used it for state
estimation for like the Apollo program.
But buried in there is the relationship between control and inference and sequence
models that the same way that you would figure out what state that you're in given a
sequence of observations could be used to figure out what action to take to achieve
some outcome. And yeah, it's probably fair to say that the relationship between sequence
models and control is an extremely old one.
And there's still more to be gained from that connection.
Do you feel like you've read any papers or work recently that you were really surprised
by? There are a few things.
I mean, this is maybe a little bit tangential to what we discussed so far, but I have been
a bit surprised by some of the investigations into how language models act as a few
shot learners. So I worked a lot on meta-learning, kind of, I would say at this point,
really the previous generation of meta-learning algorithms, kind of the few shot stuff
that was in the, you know, 2018, 2019.
But with language models, there's a very interesting question as to the degree to which
they actually act as meta-learners or not.
And there's been somewhat contradictory evidence, like one way or the other.
And some of that was kind of surprising to me.
Like, for example, you can take a few shot prompt and attach incorrect labels to it.
And then the model will look at it and start producing correct labels, which maybe kind
of suggests that perhaps it's not paying attention to the labels, but more of the format
of the problem. But of course, all of these studies are empirical and it's always a
question as to whether the next generation of models still exhibits the same behavior
or not. So you have to take it with a grain of salt.
But I have found some of the conclusions there to be kind of surprising that maybe these
things aren't really meta-learners, rather they're just formats, getting like format
specification out of problems.
Yeah, they're like really, really, really good pattern matchers.
Like, yeah, interesting.
Also, as they get bigger, some people say they take less data to fine tune.
And so maybe they're doing some kind of few shot learning during training as well.
There's an interesting tension there because you would really like, I think, in the end
for the ideal meta-learning method to have something that can get a little bit of data
for a new problem, use that to solve that problem, but also use it to improve the model.
And that's something that's always been a little tough with meta-learning algorithms,
because typically the process of adapting to a new problem is very, very separate from
the process of training the model itself.
And certainly that's true in the classic way of using language models with prompts as
well. But it's very appealing to have a model that can fine tune on small amounts of
data because then the process of adapting to a task is the same as the process of
improving the model and the model actually gets better with every task.
You could imagine, for example, that the logical conclusion of this kind of stuff is a
kind of a lifelong online meta-learning procedure where every new task that you're
exposed to, you can adapt to it more quickly and you can use it to improve your model so
that it can adapt to the next task even more quickly.
So I think that's in the world of meta-learning, that's actually kind of an important
open problem is how to move towards lifelong and online meta-learning procedures that
really do get better at both the meta and the low level.
And it's not actually obvious how to do that or whether the advent of large language
models makes that easier or harder, but it's an important problem.
What do you feel like are some underrated approaches or overlooked approaches that you
don't see many people working at today or it's not very popular, but you think it might
One thing that comes to mind, I don't know how much this counts as overlooked or
underrated, but I do think that it might be that to some degree, model based RL is a
little bit underutilized to some degree because, well, it sort of makes sense if we've
seen big advances in generative models, then more explicit model based RL techniques
perhaps can do better than they do now.
And it may also be that there's room for very effective methods to be developed that
hybridize model based and model free RL in interesting ways that could do a lot better
than either one individually, perhaps by leveraging the latest ideas from building very
effective generative models.
Just as one point about what these things could look like, model based reinforcement
learning at its core uses some mechanism that predicts the future.
But typically we think of predicting the future kind of the way we think about like
movies and videos, like you predict the world like one frame at a time.
But there isn't really any reason to think about it that way.
Like all you really need to predict is what will happen in the future
if you do something. That doesn't have to be one time step or frame at a time, it could
be that you predict something that will happen at some future point.
Maybe you don't even know which future point in particular, like soon or not so soon.
And it may be that this kind of more flexible way of looking at prediction could
provide for models that are easier to train that maybe leverage ideas from current
generative models, sufficient to do control, to do decision making, but not as
complicated as like full on frame by frame, pixel by pixel prediction of everything
that your robot will see for the next hour.
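One existing way to make "predict something that will happen at some future point, without knowing which point" concrete is a discounted future-state distribution, where the prediction time is drawn from a geometric distribution. This is named here as an assumption, not as the method the speaker has in mind, and the 3-state chain and its probabilities are hypothetical. The sketch computes that distribution directly, with no step-by-step rollout, and uses it to compare two behaviors.

```python
GAMMA = 0.9  # the random "future point" extends one more step with probability GAMMA

# Hypothetical 3-state chain; state 2 is the goal and is absorbing.
P_EAGER = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]
P_LAZY = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]


def future_state_distribution(P, gamma=GAMMA, iters=300):
    """Return M where M[s][t] is the probability of being in state t at a
    geometrically distributed future time, starting from state s.

    Fixed-point iteration of M = (1 - gamma) * P + gamma * P @ M.
    """
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        PM = [[sum(P[i][k] * M[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        M = [[(1 - gamma) * P[i][j] + gamma * PM[i][j] for j in range(n)]
             for i in range(n)]
    return M


GOAL = 2
eager = future_state_distribution(P_EAGER)
lazy = future_state_distribution(P_LAZY)
print("goal mass from state 0, eager behavior:", round(eager[0][GOAL], 3))
print("goal mass from state 0, lazy behavior: ", round(lazy[0][GOAL], 3))
```

Comparing the goal mass under each behavior is enough to choose between them here, even though no individual future frame was ever predicted.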
Why do you think that we haven't seen more advances there on the model based
reinforcement learning side, given the success of these large generative models?
I mean, people have been making large generative models really good for more than a few
years now, but we haven't really seen, I feel like, them applied in the RL setting directly.
Well, there is a big challenge there.
The challenge is that actually prediction is often much harder than generation.
One way you can think about it is if your task is to generate, let's say a picture of
an open door, right?
That you can control any door you want.
It can be any color as long as it's open.
But if your goal is to predict this particular door in my office, what it would
look like if I were to open it, now you really have to get all the other details
right, and you really have to get them right if you want to use that for control
because you want to get the system to figure out what thing in the scene needs to
So if you messed up a bunch of other parts or it's like, it's not open in the same
way that this particular door opens, that's actually much less useful to you.
So prediction can be a lot harder than generation because with just straight up
generation, you kind of have a lot of freedom to fudge a lot of the details.
When you get the freedom to fudge the details, you can basically do the easiest
thing you know how to do for all the stuff except for the main subject of the
Yeah, it's like once you have to do prediction, you need consistency, you need
it over long time horizons.
There are like all of these other things to work on.
Why do you think we still see a lot of model based RL that does these kind of
frame by frame rollout versus predicting point in the future or something like
Or also versus predicting some aspects of the future, as you were mentioning
Like maybe this thing will happen or maybe this attribute will change or maybe I
expect this particular piece of the future.
Well, I do think that the decomposition into a predictive model and a planning
method is very clean.
So it's very tempting to say, well, like we know how to run RL against a
So as long as we get a model that basically acts as a slot-in replacement for a
simulator, then we know exactly how to use it.
So it's a very clean and tempting decomposition.
And part of why I think we should think about breaking that decomposition is
because this notion of a very clean decomposition, it makes me hearken back to
the end to end stuff.
Like, you know, in robotics, we used to have another very clean decomposition,
which is the decomposition between estimation and control.
It used to be that perception and control were kept very separate because it's such
a clean decomposition.
And maybe here too, prediction and planning are kept very separate because it's
such a clean decomposition.
But just because it's clean doesn't mean it's right.
That's a notion that we ought to challenge.
And it kind of just feels like it hasn't been challenged so extremely so far.
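As a rough illustration of the decomposition discussed above, here is a minimal sketch in which a one-step predictive model is used exactly like a simulator by a planner that rolls it out step by step. Everything in it is an assumption made for illustration: the one-dimensional state, the discrete action set `ACTIONS`, the stand-in `learned_model` (a real system would learn it from data), and the exhaustive-search planner are not from the conversation.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def learned_model(state, action):
    """Stand-in for a learned one-step predictor; here simply s' = s + a."""
    return state + action


def rollout_cost(model, state, actions, goal):
    """Roll the model forward one step at a time and sum distance to the goal."""
    cost = 0.0
    for a in actions:
        state = model(state, a)
        cost += abs(state - goal)
    return cost


def plan(model, state, goal, horizon=3):
    """Exhaustively score every action sequence; return the best first action."""
    best = min(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: rollout_cost(model, state, seq, goal),
    )
    return best[0]
```

Because `plan` only ever calls `model(state, action)`, a real simulator and a learned model are interchangeable in this sketch, which is what makes the decomposition so tempting. It also shows the cost: every planning step depends on the model getting each intermediate prediction right.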
One question, just going back to the importance of making robots that don't
smash all your dishes and smash all the windows and everything like that, which
does seem like a very useful thing for people to be working on, it does seem a
little bit underserved by existing incentives.
Like, do you have any ideas how to fix that?
Is it like a new conference?
Is that a new way of judging papers?
Is it just people being open to the importance of this problem?
Like, how do we actually make progress on that besides industry?
Like industry can certainly make progress, but in academia.
And it's something that I think about a lot.
I think one great way to approach that problem is to actually like set your goal
to build a robot that has some kind of existence that has some kind of life
I spend part of my time hanging out with Robotics at Google, the Google
Brain robotics research lab.
And there, I think we've actually done a pretty good job of this where if you walk
into our office, we'll get a Googler to escort you, obviously, like, you know,
don't break into our office.
But if you walk into our office legally, you will see robots just driving around
and if you walk into the micro kitchen there, where people go
to get their snacks, you might be standing in line behind a robot that's
getting a snack. People have gotten into this habit of, well, the
robotics experiment is continual, it's ongoing, and it will live in the world
that you live in and you better deal with it.
And you deal with that as a researcher.
And that actually gets you into this mindset where things do need to be more
robust and they need to be configured in such a way that they support this
continual process and they don't break the dishes.
On a technical side, there's still a lot to do there, but just getting into that
mode of thinking about the research process, I think helps a ton.
And we're starting to move in that direction here at UC Berkeley, too.
We've got our little mobile robot roving around the building on a regular basis.
We've got our robotic arm in the corner, constantly trying to pick up objects.
And once you start doing research that way, now it becomes much more natural to
be thinking about these kinds of challenges.
It's another example of breaking down a barrier.
In this case, it's between the experimental environment and
your real life environment.
Do you feel like there was a work of yours that was most overlooked?
I think every researcher thinks that a work of theirs has been most overlooked,
but one thing maybe I could talk about a little bit is some work that two of my
postdocs, Nick Rhinehart and Glen Berseth, did recently with me and a number of
other collaborators, studying intrinsic motivation from a very different
So intrinsic motivation in reinforcement learning is often thought of as the
problem of seeking out novelty in the absence of supervision.
So people formulate it in different ways, like, you know, find something that's
surprising, find something that your model doesn't fit, et cetera.
And Nick and Glen took a very different approach to it that was actually
inspired by some neuroscience and cognitive science work from a
gentleman named Karl Friston from the UK.
There was this idea that perhaps intrinsic motivation can actually be driven by the
opposite objective, the objective of minimizing surprise.
And the intuition for why this might be true is that if you imagine kind of a
very ecological view of intelligence, let's say you're a creature in the
jungle and you're hanging out there and you want to survive, well, maybe you
actually don't want to find surprising things like, you know, a tiger eating you
would be very surprising and you would rather that not happen.
So you'd rather like kind of find your niche, hang out there and be safe and
comfortable and that actually requires minimizing surprise, but minimizing
surprise might require taking some kind of coordinated action.
So you might think, well, it might rain tomorrow and then I'll get wet and that
kind of kicks me out of my comfortable niche.
So maybe I'll actually go on a little adventure and find some materials to
build shelter, which, you know, that might be a very uncomfortable thing to do.
It might be very surprising, but once I've built that shelter, I'll have put myself
in a more stable niche where I'm less likely to get surprised by something.
So perhaps paradoxically minimizing surprise might actually lead to some
behavior that looks like curiosity or novelty seeking in service to getting
yourself to be more comfortable later.
It's a very kind of strange idea in some ways, but perhaps a really powerful one.
If we want to situate agents in open world settings where we want them to
explore without human supervision, but at the same time, not get distracted by the
million different things that could happen, like, you know, they should
explore, but they should explore in a way that kind of gets them to be more
capable, sort of accumulates capabilities and things like that, accumulates some
ability to affect their world.
So we had several papers that studied this. One is called SMiRL, surprise
minimizing reinforcement learning.
Another one is called IC2, which is information capture for intrinsic control.
And both these papers looked at how minimizing novelty, either by minimizing
the entropy of your own beliefs, meaning manipulating the world so that you're more
certain about how the world works, or by simply minimizing the entropy of your state,
meaning manipulating the world so that you occupy a narrow range of states, can actually
lead to emergent behavior.
And this was like very experimental, preliminary, half-baked kind of stuff.
But I think that's maybe a direction that has some interesting
implications in the future.
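To make the state-entropy version of this idea concrete, here is a minimal sketch of a surprise-minimizing reward: the agent fits a density model to the states it has visited and is rewarded with the log-probability of each new state under that model, so familiar states score high and novel ones score low. The diagonal Gaussian density, the variance floor `min_var`, and the names `RunningGaussian` and `surprise_minimizing_reward` are assumptions for illustration, not the implementation from the papers mentioned.

```python
import math


class RunningGaussian:
    """Per-dimension Gaussian fit to the states seen so far (hypothetical helper)."""

    def __init__(self, dim, min_var=1e-2):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim    # running sum of squared deviations (Welford)
        self.min_var = min_var   # assumed variance floor keeps log-density finite

    def update(self, state):
        """Fold one observed state into the running mean and variance."""
        self.n += 1
        for i, x in enumerate(state):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def log_prob(self, state):
        """Log-density of a state under the current diagonal Gaussian."""
        total = 0.0
        for i, x in enumerate(state):
            var = max(self.m2[i] / self.n, self.min_var) if self.n > 0 else 1.0
            total += -0.5 * (math.log(2 * math.pi * var)
                             + (x - self.mean[i]) ** 2 / var)
        return total


def surprise_minimizing_reward(model, next_state):
    """Score the next state under the model, then add it to the model."""
    reward = model.log_prob(next_state)
    model.update(next_state)
    return reward
```

In this sketch, an agent that keeps returning to a narrow set of states makes the fitted density sharper and its own rewards higher. That is the sense in which minimizing surprise pushes toward a stable niche.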
That's really interesting.
That's a very, yeah, unusual formulation.
What controversial or unusual research opinions do you feel like you have that
other people don't seem to agree with?
Well, I have quite a few, although I should say that I do tend to be open
minded and pragmatic about these things.
So I'm more than happy to work with people, even on projects that might
invalidate some of these opinions.
But some of the things that I think many people don't entirely agree with is for
one thing, there's a lot of activity in robotic learning around using simulation
to learn policies for real world robots.
And I think that's very pragmatic.
I think if I were to like start a company today, that's an approach that
I might explore.
The controversial part is I think in the long run, we're not going to do that.
And the reason that I think we're not going to do that in the long run is that
ultimately it'll be much easier to use data rather than simulation
to enable robots to do things.
And I think that'll be true for several reasons.
One of the reasons is that once we get the robots out there, data is much more
available and there's a lot less reason to use simulation.
So if you're in the Tesla regime, if you have a million robots out there, now
suddenly simulation doesn't look as appealing because hey, getting lots of
data is easy.
The other reason is that I think that the places where we'll really want learning
to attain really superhuman performance will be ones where the robot will need to
figure things out sort of in tight coupling with the world.
So if we understand something well enough to simulate it really accurately, maybe
that's actually not the place where we most need learning.
And the other reason is that, well, if you look at other domains, if you look at
like NLP or computer vision, I mean, nobody in NLP thinks about coding up a
simulator to simulate how people produce language.
Like that sounds ridiculous.
Using data is the way to go.
I mean, you might use like synthetic data from a language model, but you're not
going to like write a computer program that like simulates how human fingers and
vocal cords work and create text, you know, type on keyboards or emit sounds
like that just sounds crazy.
You use data.
And in computer vision, maybe there's a little bit more simulation, but still like
using real images is just so much easier than generating synthetic images.
Some people do work on synthetic images, but the data-driven paradigm is so
powerful and relatively easy to use that most people just do that.
And I think that we'll get to that point in robotics too.
Another one that I might say is that I think that, well, this is maybe coming
back to something that we discussed already, but I think there's a lot of
activity in robotics and also in other areas around using essentially
imitation learning style approaches.
So get humans to perform some tasks, maybe robotic tasks, or maybe they're
not, maybe they're booking flights on the internet or something, whatever tasks
you want to do, get humans to generate lots of data for it, and then basically
do a really good job of emulating that behavior.
And again, I think this is like one of those things that I would put into the
category of very pragmatic approaches that would be very good to leverage if
you're starting like a company right now.
But if you want to really get general purpose, highly effective AI systems, I
think we really need to go beyond that.
And there's a really cute quote that my former postdoc, Glen, he posted this
on Twitter after a recent conference.
He said something like, oh, I saw a lot of papers on imitation learning, but
perhaps it harkens back to an earlier quote by Rodney Brooks that imitation
learning is doomed to succeed.
So Rodney Brooks had a quote years ago where he said simulation is doomed to
succeed.
What he meant by that is that when people do robotics research in simulation, it
always works.
So it always succeeds, but then it's hard to make that same thing work in the real
world.
And I think Glen's point was that, well, with imitation learning, it's easy to get
it to work, but then it kind of hits a wall, where it's really
good for the thing that imitation learning is good for.
So it looks deceptively effective, but then if you want to go beyond that,
if you really want to do something that people are not good at, then you just hit
a wall.
And I think that that's a really big deal.
I think that in robotics and in other areas where we want rational,
intelligent decision making, we should really be thinking hard about planning, reinforcement
learning, things that go beyond just copying humans.
Yeah, that's really interesting.
I love this.
Imitation is doomed to succeed.
The third one, and maybe this is the last one that's big enough to be interesting
is, to be honest, I'm actually very skeptical about the utility of language in
the long run as a driving force for artificial intelligence.
I think that language is very, very useful right now.
I think there's like a kind of a very cognitive science view of language, which
says, well, people think in symbolic terms and language is sort of our expression of
those symbolic concepts and therefore language is like kind of the fundamental
substrate of thought.
And I think that's a very reasonable idea.
What I'm skeptical about is the degree to which that is really a prerequisite for
intelligence, because there are a lot of animals that are much more intelligent
than our robots that do not possess language.
They might possess some kind of symbolic rational thought, but they certainly don't
speak to us and they certainly don't express their thoughts in language.
And because of that, my suspicion is actually that the success of things like
language models has less to do with the fact that it's a language and more to do
with the fact that we've got an internet full of language data and that perhaps it's
really not so much about language.
It's really about the fact that there is this structured repository that happens to
be written in language and that perhaps in the long run, we'll figure out how to do
all the wonderful things that we do with language models, but without the language
using, for example, sensorimotor streams, videos, whatever, and we'll get that
generality, we'll get that power and it will come more from understanding the
physical and visual concepts in the world rather than necessarily parsing words in
English or something.
Earlier, we talked about hitting walls, methods that hit walls.
Do you think that the language-based method, when we think about artificial
general intelligence, would at some point hit a wall?
I do think though that we should be a little careful with that because language
models hit walls, but you can build ladders over those walls using other
mechanisms, and certainly in recent robotics research, including robotics
research that the team I work with at Google has done, as well as many others,
we've seen a lot of really excellent innovations from people where they use
visual or visuomotor models that understand actions and understand images to
bridge the gap between the symbolic world of language models and
the physical world, and I think that we've come a long way on that, but I do
think that purely language-based systems by themselves, they do have a major
limitation in terms of the inability to really ground out things into the lowest
level of perception and action, and that's very problematic. Actually,
the reason that we don't have a lot of text on the internet saying, oh, if you
want to throw a football, then you should fire this neuron and actuate this muscle
and so on, is that it's so easy for us.
It's so easy for us, but that doesn't mean that it's easy for our machines.
The thing where the gap between human capability and machine
capability is largest is exactly the thing that we're not going to express online.
So basically the way in which the internet data set is skewed is that all of the
easy stuff is not on there, and so it doesn't get that.
What do you think about the idea that we might get an AGI that is able to solve all
digital tasks on your computer, like do everything digitally that a human can do,
but we'll still be many, many, many years away in robotics?
Well, maybe there's something comforting about that because then it can't like go
out into the world and start doing things that are too nefarious, but I think that
kind of stuff is possible.
In research, I do tend to be a little bit of an optimist and I do think that we can
figure out many of the nitty gritty physical robotic things.
I'm not sure how long that will take exactly, but I'm also kind of hopeful
that if we figure them out, we'll actually get a better solution for some of the
symbolic things. Like, you know, if your model understands how the physical world
works, you can probably do a better job in the digital world because the digital
world influences the physical world.
And a lot of the most important things there really do have a physical kind of
connection. So maybe it'll actually go the other way. I do think that in science,
it's a really good idea to sometimes see how extreme of a design can still
work because you learn a lot from doing that.
And this is, by the way, something I get a lot of comments on. Like, you know,
I'll be talking to people and they'll be like, well, yeah, we know how to do robotic
grasping and we know how to do inverse kinematics and we know how to do this and
this. So why don't you like use those parts?
And it's like, yeah, you could.
But if you want to understand the utility, the value of some particular new design, it
kind of makes sense to really zoom in on that and really isolate it and really just
understand its value instead of trying to put in all these crutches to compensate
for all the parts where we might have better existing kind of ideas.
Thanks for listening to the Generally Intelligent podcast.
If you like this, please consider giving us a rating and leaving a review on Apple
Podcasts. On Twitter, I'm @Kanjun, K-A-N-J-U-N, and our lab is @GenIntelligent.
Until next time.