Mobility VLA: Google’s Approach to Making Robots Navigate Autonomously
Instructing a robot to autonomously execute user commands is a complex task involving multiple steps. First, the robot must comprehend the user's command (whether verbal or visual); next, it must identify the corresponding action and determine the steps needed to carry it out; finally, it must navigate the surrounding environment to achieve the goal.
Visual Language Models (VLMs) empower robots to understand their surroundings by analyzing video footage captured by their cameras. These models can respond to user queries about objects or actions within the video. However, for a robot to truly execute commands, it must possess the ability to navigate and interact with the physical environment.
Consider the task of instructing a robot to discard an empty cola can into a trash bin. Since the bin might be situated in a different room and beyond the camera’s range, the robot must have previously explored the environment and mapped it for the VLM. Upon identifying an image of the trash bin, the robot must then navigate towards it.
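To make this concrete, the exploration phase can be pictured as recording a tour of the environment and turning its frames into a topological graph, where consecutive (or revisited) frames are connected; once the VLM points to the frame showing the trash bin, getting there reduces to a shortest-path query over that graph. The sketch below illustrates the idea with networkx; the frame-adjacency heuristic, the loop-closure edges, and the unit edge weights are assumptions made for illustration, not the paper's actual graph construction.

```python
# Sketch: build a topological graph over tour frames and plan a path to a goal frame.
# Assumes frames were sampled during an exploration tour and that we know which
# frames are "adjacent" (e.g., consecutive in time or close according to odometry).
import networkx as nx

def build_topological_graph(num_frames, extra_edges=()):
    """Connect consecutive tour frames; extra_edges close loops (revisited places)."""
    graph = nx.Graph()
    for i in range(num_frames - 1):
        graph.add_edge(i, i + 1, weight=1.0)      # consecutive frames are reachable
    for u, v, dist in extra_edges:                # loop closures from odometry / visual matching
        graph.add_edge(u, v, weight=dist)
    return graph

def plan_to_goal_frame(graph, current_frame, goal_frame):
    """Return the sequence of tour frames the robot should traverse."""
    return nx.shortest_path(graph, source=current_frame, target=goal_frame, weight="weight")

# Example: a 300-frame tour where frames 40 and 250 show the same hallway junction.
graph = build_topological_graph(300, extra_edges=[(40, 250, 1.0)])
print(plan_to_goal_frame(graph, current_frame=10, goal_frame=270))
# -> walks 10..40, takes the loop-closure edge to 250, then 250..270
```

In this framing, "navigating to the trash bin" means driving from the node matching the robot's current view to the node the VLM selected, following the planned frame sequence.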
Google’s recent research proposes a novel approach that involves capturing a comprehensive video of the environment and feeding it to a VLM. The company utilized Gemini 1.5 Pro, a VLM with an extensive context window capable of handling large…
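As a rough illustration of that step, the snippet below uploads a tour video and asks a long-context Gemini model which part of the tour satisfies a user instruction, using the public google-generativeai Python SDK. This is a minimal sketch, not the paper's implementation: the model name, the prompt wording, and the idea of returning a timestamp are assumptions for the example.

```python
# Sketch: ask a long-context VLM to locate the goal inside a previously recorded
# environment tour. Assumes the google-generativeai SDK and an API key are available.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")  # long-context multimodal model

# Upload the exploration tour recorded by the robot and wait until it is processed.
tour = genai.upload_file(path="environment_tour.mp4")
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

instruction = "Where should I throw away an empty cola can?"
prompt = (
    "This video is a tour of a building. "
    f"User instruction: {instruction} "
    "Reply with the timestamp of the frame that shows the goal location."
)

response = model.generate_content([tour, prompt])
print(response.text)  # e.g., a timestamp the robot maps back to a tour frame / map node
```

The appeal of a long-context model here is that the entire tour can be passed in at once, so the model can reason over the whole environment rather than a single camera view.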