For an entire generation of engineers, scientists, sci-fi enthusiasts and members of the public who grew up watching Star Wars, there’s a disappointing lack of C-3PO-like droids wandering around our cities and homes. Where are the humanoid robots, fuelled with common sense, that can help around the house and workplace?
With rapid advances in artificial intelligence (AI), we might just be staring at this new reality. Alexander Khazatsky, a machine-learning and robotics researcher at Stanford University in California, has said that he wouldn’t be surprised if we are the last generation for which those sci-fi scenes are not a reality.
From OpenAI to Google DeepMind, almost every big technology firm with AI expertise is now working on bringing the versatile learning algorithms that power chatbots, known as foundation models, to robotics. The idea is to imbue robots with common-sense knowledge, letting them tackle a wide range of tasks. Many researchers think that robots could become really good, really fast. “We believe we are at the point of a step change in robotics,” says Gerard Andrews, a marketing manager focused on robotics at technology company Nvidia in Santa Clara, California, which in March launched a general-purpose AI model designed for humanoid robots.
At the same time, robots could help to improve AI. Many researchers hope that bringing an embodied experience to AI training could take them closer to the dream of ‘artificial general intelligence’ — AI that has human-like cognitive abilities across any task. “The last step to true intelligence has to be physical intelligence,” says Akshara Rai, an AI researcher at Meta in Menlo Park, California.
But although many researchers are excited about the latest injection of AI into robotics, they also caution that some of the more impressive demonstrations are just that — demonstrations, often by companies that are eager to generate buzz. It can be a long road from demonstration to deployment, says Rodney Brooks, a roboticist at the Massachusetts Institute of Technology in Cambridge, whose company iRobot invented the Roomba autonomous vacuum cleaner.
There are plenty of hurdles on this road, including scraping together enough of the right data for robots to learn from, dealing with temperamental hardware and tackling concerns about safety. Foundation models for robotics “should be explored”, says Harold Soh, a specialist in human–robot interactions at the National University of Singapore. But he is sceptical, he says, that this strategy will lead to the revolution in robotics that some researchers predict.
Firm foundations
The term robot covers a wide range of automated devices, from the robotic arms widely used in manufacturing, to self-driving cars and drones used in warfare and rescue missions. Most incorporate some sort of AI — to recognize objects, for example. But they are also programmed to carry out specific tasks, work in particular environments or rely on some level of human supervision, says Joyce Sidopoulos, co-founder of MassRobotics, an innovation hub for robotics companies in Boston, Massachusetts. Even Atlas — a robot made by Boston Dynamics, a robotics company in Waltham, Massachusetts, which famously showed off its parkour skills in 2018 — works by carefully mapping its environment and choosing the best actions to execute from a library of built-in templates.
For most AI researchers branching into robotics, the goal is to create something much more autonomous and adaptable across a wider range of circumstances. This might start with robot arms that can ‘pick and place’ any factory product, but evolve into humanoid robots that provide company and support for older people, for example. “There are so many applications,” says Sidopoulos.
The human form is complicated and not always optimized for specific physical tasks, but it has the huge benefit of being perfectly suited to the world that people have built. A human-shaped robot would be able to physically interact with the world in much the same way that a person does.
However, controlling any robot — let alone a human-shaped one — is incredibly hard. Apparently simple tasks, such as opening a door, are actually hugely complex, requiring a robot to understand how different door mechanisms work, how much force to apply to a handle and how to maintain balance while doing so. The real world is extremely varied and constantly changing.
The approach now gathering steam is to control a robot using the same type of AI foundation models that power image generators and chatbots such as ChatGPT. These models use brain-inspired neural networks to learn from huge swathes of generic data. They build associations between elements of their training data and, when asked for an output, tap these connections to generate appropriate words or images, often with uncannily good results.
Likewise, a robot foundation model is trained on text and images from the Internet, providing it with information about the nature of various objects and their contexts. It also learns from examples of robotic operations. It can be trained, for example, on videos of robot trial and error, or videos of robots that are being remotely operated by humans, alongside the instructions that pair with those actions. A trained robot foundation model can then observe a scenario and use its learnt associations to predict what action will lead to the best outcome.
Google DeepMind has built one of the most advanced robotic foundation models, known as Robotic Transformer 2 (RT-2), that can operate a mobile robot arm built by its sister company Everyday Robots in Mountain View, California. Like other robotic foundation models, it was trained on both the Internet and videos of robotic operation. Thanks to the online training, RT-2 can follow instructions even when those commands go beyond what the robot has seen another robot do before. For example, it can move a drink can onto a picture of Taylor Swift when asked to do so — even though Swift’s image was not in any of the 130,000 demonstrations that RT-2 had been trained on.
In other words, knowledge gleaned from Internet trawling (such as what the singer Taylor Swift looks like) is being carried over into the robot’s actions. “A lot of Internet concepts just transfer,” says Keerthana Gopalakrishnan, an AI and robotics researcher at Google DeepMind in San Francisco, California. This radically reduces the amount of physical data that a robot needs to have absorbed to cope in different situations, she says.
But to fully understand the basics of movements and their consequences, robots still need to learn from lots of physical data. And therein lies a problem.
Data dearth
Although chatbots are being trained on billions of words from the Internet, there is no equivalently large data set for robotic activity. This lack of data has left robotics “in the dust”, says Khazatsky.
Pooling data is one way around this. Khazatsky and his colleagues have created DROID (ref. 2), an open-source data set that brings together around 350 hours of video data from one type of robot arm (the Franka Panda 7DoF robot arm, built by Franka Robotics in Munich, Germany), as it was being remotely operated by people in 18 laboratories around the world. The robot-eye-view camera has recorded visual data in hundreds of environments, including bathrooms, laundry rooms, bedrooms and kitchens. This diversity helps robots to perform well on tasks with previously unencountered elements, says Khazatsky.
Gopalakrishnan is part of a collaboration of more than a dozen academic labs that is also bringing together robotic data, in its case from a diversity of robot forms, from single arms to quadrupeds. The collaborators’ theory is that learning about the physical world in one robot body should help an AI to operate another — in the same way that learning in English can help a language model to generate Chinese, because the underlying concepts about the world that the words describe are the same. This seems to work. The collaboration’s resulting foundation model, called RT-X, which was released in October 2023 (ref. 3), performed better on real-world tasks than did models the researchers trained on one robot architecture.
Many researchers say that having this kind of diversity is essential. “We believe that a true robotics foundation model should not be tied to only one embodiment,” says Peter Chen, an AI researcher and co-founder of Covariant, an AI firm in Emeryville, California.
Covariant is also working hard on scaling up robot data. The company, which was set up in part by former OpenAI researchers, began collecting data in 2018 from 30 variations of robot arms in warehouses across the world, which all run using Covariant software. Covariant’s Robotics Foundation Model 1 (RFM-1) goes beyond collecting video data to encompass sensor readings, such as how much weight was lifted or force applied. This kind of data should help a robot to perform tasks such as manipulating a squishy object, says Gopalakrishnan — in theory, helping a robot to know, for example, how not to bruise a banana.
Covariant has built up a proprietary database that includes hundreds of billions of ‘tokens’ — units of real-world robotic information — which Chen says is roughly on a par with the scale of data that trained GPT-3, the 2020 version of OpenAI’s large language model. “We have way more real-world data than other people, because that’s what we have been focused on,” Chen says. RFM-1 is poised to roll out soon, says Chen, and should allow operators of robots running Covariant’s software to type or speak general instructions, such as “pick up apples from the bin”.
Another way to access large databases of movement is to focus on a humanoid robot form so that an AI can learn by watching videos of people — of which there are billions online. Nvidia’s Project GR00T foundation model, for example, is ingesting videos of people performing tasks, says Andrews. Although copying humans has huge potential for boosting robot skills, doing so well is hard, says Gopalakrishnan. For example, robot videos generally come with data about context and commands — the same isn’t true for human videos, she says.
Virtual reality
A final and promising way to find limitless supplies of physical data, researchers say, is through simulation. Many roboticists are working on building 3D virtual-reality environments, the physics of which mimic the real world, and then wiring those up to a robotic brain for training. Simulators can churn out huge quantities of data and allow humans and robots to interact virtually, without risk, in rare or dangerous situations, all without wearing out the mechanics. “If you had to get a farm of robotic hands and exercise them until they achieve [a high] level of dexterity, you will blow the motors,” says Nvidia’s Andrews.
But making a good simulator is a difficult task. “Simulators have good physics, but not perfect physics, and making diverse simulated environments is almost as hard as just collecting diverse data,” says Khazatsky.
Meta and Nvidia are both betting big on simulation to scale up robot data, and have built sophisticated simulated worlds: Habitat from Meta and Isaac Lab from Nvidia. In them, robots gain the equivalent of years of experience in a few hours, and, in trials, they then successfully apply what they have learnt to situations they have never encountered in the real world. “Simulation is an extremely powerful but under-rated tool in robotics, and I am excited to see it gaining momentum,” says Rai.
Many researchers are optimistic that foundation models will help to create general-purpose robots that can replace human labour. In February, Figure, a robotics company in Sunnyvale, California, raised US$675 million in investment for its plan to use language and vision models developed by OpenAI in its general-purpose humanoid robot. A demonstration video shows a robot giving a person an apple in response to a general request for ‘something to eat’. The video on X (the platform formerly known as Twitter) has racked up 4.8 million views.
Exactly how this robot’s foundation model has been trained, along with any details about its performance across various settings, is unclear.
Such demos should be taken with a pinch of salt, says Soh. The environment in the video is conspicuously sparse, he says. Adding a more complex environment could potentially confuse the robot — in the same way that such environments have fooled self-driving cars. “Roboticists are very sceptical of robot videos for good reason, because we make them and we know that out of 100 shots, there’s usually only one that works,” Soh says.