AI for the physical world (a.k.a. Embodied AI) is multimodal and built on a tightly coupled perception-planning-action loop. Think of Wall-E, that lovable Pixar creation, deftly navigating intricate landscapes, making shrewd decisions, and executing tasks with boundless enthusiasm. Bringing such AI magic to life is not a Hollywood endeavor; it is a data-driven quest led by bright minds in AI, ML, computer vision, and robotics. To embark on this journey, we need a treasure trove of training data spanning perception, planning, and action.
Perception, Planning, and Action: The Robotics Trinity
Perception: Wall-E’s workday begins with perceiving his surroundings. The types of data involved, sketched in code after the list, include:
- Visual (Image and Video): Cameras capture visual data, enabling Wall-E to identify objects, obstacles, motion, and sequences of events.
- Auditory (Audio): Microphones detect sounds, from spoken commands to environmental cues, providing awareness of changes or potential threats.
- Tactile (Haptic): Sensors detect forces and vibrations, offering touch and texture feedback for object interaction and temperature assessment.
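To make the perception stream concrete, here is a minimal sketch of how a single multimodal sample might be bundled. The class, field names, and array shapes are illustrative assumptions rather than any standard robotics schema.

```python
# A minimal sketch of one multimodal perception sample; the field names and
# array shapes are illustrative assumptions, not a standard robotics format.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class PerceptionSample:
    """One time step of raw sensor data as a robot like Wall-E might record it."""
    timestamp: float               # seconds since episode start
    rgb_image: np.ndarray          # H x W x 3 camera frame
    audio_chunk: np.ndarray        # mono PCM samples for this step
    tactile_forces: np.ndarray     # per-taxel force readings (N)
    metadata: dict = field(default_factory=dict)


# Example: a fabricated sample with plausible shapes.
sample = PerceptionSample(
    timestamp=0.033,
    rgb_image=np.zeros((480, 640, 3), dtype=np.uint8),
    audio_chunk=np.zeros(1600, dtype=np.int16),      # ~0.1 s at 16 kHz
    tactile_forces=np.zeros(16, dtype=np.float32),   # e.g., a 4x4 tactile pad
)
print(sample.rgb_image.shape, sample.audio_chunk.shape, sample.tactile_forces.shape)
```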
Planning: Once Wall-E knows what an object is, he needs to decide what to do with it. His plan draws on the following (a toy planning sketch follows the list):
- Language (Natural Language Processing): Understanding and translating commands like “move to the left” into actionable steps, determining position changes and directions.
- Visual (Image and Video Analysis): Predicting future environmental states, such as foreseeing object trajectories to plan movement and avoid collisions.
- Motion: Utilizing sequential data for potential trajectories and movement patterns to chart optimal paths around obstacles or toward destinations.
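As a toy illustration of this planning step, the sketch below maps a command like “move to the left” onto a position delta and a short straight-line trajectory. The command vocabulary, coordinate convention, step size, and waypoint count are invented for the example; a real planner would also fold in perception and collision checks.

```python
# A toy planning sketch: turn a simple natural-language command into a
# position delta and a few waypoints. Vocabulary and parameters are invented.
import numpy as np

# Assumed convention: x = forward, y = lateral (left is negative y).
DIRECTIONS = {
    "left":     np.array([0.0, -1.0, 0.0]),
    "right":    np.array([0.0,  1.0, 0.0]),
    "forward":  np.array([1.0,  0.0, 0.0]),
    "backward": np.array([-1.0, 0.0, 0.0]),
}


def plan_from_command(command: str, current_pos: np.ndarray,
                      step_m: float = 0.5, n_waypoints: int = 5) -> np.ndarray:
    """Map a simple command to a short sequence of waypoints."""
    for word, unit in DIRECTIONS.items():
        if word in command.lower():
            target = current_pos + step_m * unit
            # Linearly interpolate intermediate waypoints toward the target.
            return np.linspace(current_pos, target, n_waypoints + 1)[1:]
    raise ValueError(f"Unrecognized command: {command!r}")


waypoints = plan_from_command("move to the left", current_pos=np.zeros(3))
print(waypoints)
```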
Action: Wall-E in action, whether compacting trash, stacking cubes, or holding hands, requires interacting with and receiving constant feedback from the objects around him. His cute robot body and sensors deal with many types of data:
- Control Signals: Directing actuators and motors for movements like lifting an arm in response to a command.
- Locomotion and Manipulation: Actions carried out by the robot’s body in response to those control signals.
At the same time, planning and action are influenced by feedback through perception data, such as tactile, visual, and auditory signals; the short loop sketch below ties the three stages together.
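A minimal closed-loop sketch of that interplay might look like the following: read a tactile sensor, compare it to a goal, and send a control signal to an actuator. The sensor and actuator functions are stand-ins made up for this example, not a real robot API.

```python
# A minimal perceive-plan-act loop with tactile feedback.
# read_gripper_force and send_gripper_command are invented stand-ins.
import random


def read_gripper_force() -> float:
    """Stand-in for a tactile sensor reading, in newtons."""
    return random.uniform(0.0, 10.0)


def send_gripper_command(delta: float) -> None:
    """Stand-in for a control signal sent to the gripper motor."""
    print(f"adjust gripper by {delta:+.2f}")


TARGET_FORCE = 5.0   # desired grasp force in newtons (illustrative)
GAIN = 0.1           # proportional gain (illustrative)

for step in range(10):
    force = read_gripper_force()          # perception: tactile feedback
    error = TARGET_FORCE - force          # planning: compare to the goal
    send_gripper_command(GAIN * error)    # action: control signal to actuator
```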
Embodied AI Data Challenge
In the era of large language models (LLMs), training a robot like Wall-E might seem straightforward. However, the reality is far more intricate. While certain perception and planning modules might leverage existing language and 2D databases, or pre-trained LLMs, the distinct modules in Embodied AI and their integration introduce a multifaceted challenge.
Data Collection Expense
LLMs primarily leverage text and 2D data for training, much of which has been organically generated and accumulated on the Internet over decades. Leading models are trained on trillions of tokens, drawing on billion-scale datasets (e.g., LAION-5B) and entire websites (e.g., Wikipedia).
In contrast, robotics operates in the real-world physical environment and demands diverse data types (images, sounds, tactile feedback, internal signals, and more) for effective functioning. All these data types necessitate specific collection methodologies, often manual or semi-manual. Moreover, to ensure a robot can handle real-world scenarios, data must be gathered under numerous conditions, often repeating tasks with slight variations.
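A quick back-of-the-envelope sketch shows how fast those “slight variations” multiply into collection effort. The condition axes and counts below are invented purely for illustration.

```python
# Rough illustration of the combinatorial cost of collecting one task
# under varied conditions. Axes and counts are made up for this example.
from itertools import product

conditions = {
    "lighting":    ["bright", "dim", "backlit"],
    "object_pose": [f"pose_{i}" for i in range(10)],
    "surface":     ["table", "carpet", "tile"],
    "distractors": ["none", "few", "cluttered"],
    "repeats":     list(range(5)),   # repeat each setup for robustness
}

episodes = list(product(*conditions.values()))
print(f"{len(episodes)} episodes for a single task")   # 3*10*3*3*5 = 1350
```

Every one of those episodes is a physical run that someone has to set up, execute, and record, which is where much of the expense comes from.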
The 3D datasets used in Embodied AI lag far behind their 2D counterparts. For example, Objaverse-XL, a major release, contains only about 10M 3D objects, and only a fraction of those are relevant for any given training context.
Data Labeling Complexity
Multimodal data leads to more diversified labeling. Consider how robots navigate a three-dimensional space: they need labeling that goes beyond a 2D understanding of an object, delving into 3D context such as depth and spatial orientation. Labeling 3D data for robotics or autonomous driving often requires domain-specific expertise.
In addition to labeling what we can “see”, the annotations include the following (a sketch of one such labeled frame follows the list):
- Task-specific labeling, such as weight, texture, and friction for object handling or terrain navigation.
- Time sequence labeling, essential but more complex than static frames.
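As a hedged illustration, here is what a single labeled frame in a 3D episode might carry: a 3D box with orientation rather than a flat rectangle, task-specific physical attributes, and a track ID that links the object across the time sequence. The schema and values are invented; real datasets define their own formats.

```python
# An illustrative (invented) annotation for one frame of a 3D episode.
import json

frame_annotation = {
    "episode_id": "demo_000123",
    "frame_index": 42,
    "timestamp": 1.40,                       # seconds since episode start
    "objects": [
        {
            "label": "soda_can",
            "bbox_3d": {                     # 3D box instead of a 2D rectangle
                "center_m": [0.42, -0.10, 0.85],
                "size_m": [0.066, 0.066, 0.12],
                "orientation_quat": [0.0, 0.0, 0.0, 1.0],
            },
            "physical": {                    # task-specific labels
                "mass_kg": 0.39,
                "texture": "smooth_metal",
                "friction_coeff": 0.3,
            },
            "track_id": 7,                   # links the object across frames
        }
    ],
}

print(json.dumps(frame_annotation, indent=2))
```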
The margin of error is also much smaller. While an error in an LLM might lead to misinformation, a misstep in robotics could damage property or endanger lives.
It’s no wonder that Google DeepMind invested nearly 17 months to gather and label a mere 130,000 episodes of robot demonstration data for training RT-1 and RT-2, highlighting the immense complexity involved in Embodied AI training.
Conclusion
In the quest to bring Wall-E to life, we’ve uncovered the formidable data challenge posed by diverse data types and their associated costs. In our next post, we’ll take a deep dive into high-quality 3D objects for Embodied AI and explore innovative strategies for expanding 3D datasets, propelling us closer to the future of robotics.