### Google DeepMind’s Gemini: Revolutionizing Office Robotics
#### Introduction
In a bustling open-plan office in Mountain View, California, a sleek, wheeled robot has been making waves as a tour guide and office assistant. This is all thanks to a significant upgrade from Google DeepMind, which has equipped the robot with a large language model to understand commands and navigate its surroundings.
#### Gemini’s Capabilities
When a human asks, “Find me somewhere to write,” the robot efficiently leads them to a whiteboard within the building. This is possible because Gemini can handle both video and text, allowing it to process large amounts of information from previously recorded video tours of the office. The robot combines Gemini with an algorithm that translates commands into specific actions, such as turning, based on what it sees.
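The pipeline described above can be sketched in miniature: a language-model call maps a request to a landmark the robot has seen, and a separate planner converts that landmark into low-level motions. This is a hypothetical illustration, not Google's actual system; the names `query_vlm`, `LANDMARKS`, and `plan_route`, and the grid-world office, are all invented for the example.

```python
from typing import List, Tuple

# Landmarks the robot might recall from its recorded video tour,
# mapped to (x, y) positions on a toy office grid (illustrative only).
LANDMARKS = {
    "whiteboard": (4, 2),
    "kitchen": (0, 5),
    "desk": (2, 1),
}

def query_vlm(command: str) -> str:
    """Stand-in for a vision language model call. A real system would
    send the spoken command plus tour footage to the model; here we
    fake the mapping with keyword matching."""
    if "write" in command.lower():
        return "whiteboard"
    if "drink" in command.lower():
        return "kitchen"
    return "desk"

def plan_route(start: Tuple[int, int], goal: Tuple[int, int]) -> List[str]:
    """Translate a goal position into discrete actions — the role of the
    navigation algorithm that works alongside the language model."""
    dx, dy = goal[0] - start[0], goal[1] - start[1]
    actions = ["move_east"] * max(dx, 0) + ["move_west"] * max(-dx, 0)
    actions += ["move_north"] * max(dy, 0) + ["move_south"] * max(-dy, 0)
    return actions

landmark = query_vlm("Find me somewhere to write")
route = plan_route((0, 0), LANDMARKS[landmark])
```

The key design point the article describes is this division of labor: the model handles open-ended language and perception, while a conventional algorithm emits the concrete motor commands.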
#### Vision Language Models
Gemini’s introduction in December marked a significant milestone. Demis Hassabis, CEO of Google DeepMind, emphasized its potential to interact with the physical world and perform useful tasks. Vision language models such as Gemini can learn an office’s layout from video captured with a smartphone camera.
#### Research and Development
Academic and industry research labs are in a race to enhance robots’ abilities using language models. The May [program](https://ras.papercept.net/conferences/conferences/ICRA24/program/) for the International Conference on Robotics and Automation lists nearly two dozen papers on vision language models.
#### Investment in Robotics
Investors are [pouring money](https://www.theinformation.com/articles/ai-investors-turn-their-attention-and-deep-pockets-to-robotics) into startups that aim to apply AI advances to robotics. Researchers from the Google project have founded a startup called [Physical Intelligence](https://physicalintelligence.company/), which received $70 million in initial funding. They aim to combine large language models with real-world training for general problem-solving abilities. Similarly, [Skild AI](https://www.skild.ai/), founded by Carnegie Mellon University roboticists, recently announced $300 million in funding.
#### Evolution of Robot Navigation
A few years ago, robots required detailed maps and carefully scripted commands to navigate. Today, large language models encode useful knowledge about the physical world, and vision language models, trained on images and video as well as text, can answer questions that require perception. Gemini enables Google’s robot to follow visual instructions, such as a sketch on a whiteboard showing a route.
#### Future Plans
Researchers plan to test Gemini on various types of robots. They believe it should handle more complex queries, like “Do they have my favorite drink today?” from a user with many empty Coke cans on their desk.