Visual Language Models: when robots understand their surroundings
One of the main challenges in advanced industrial robotics is giving robots the ability not just to capture information from their environment, but to interpret it in a coherent, contextual way. Understanding requires more than seeing: to operate autonomously and reliably in real-world settings, robots must integrate data from multiple sources, such as cameras, proximity sensors, LiDAR, microphones, and other systems, and transform that information into actionable knowledge in real time.
Machine vision has traditionally relied on specialized models trained for specific tasks, with a heavy dependence on labeled data and controlled scenarios. Although these methods have proven effective in well-defined industrial contexts, they show clear limitations when faced with dynamic environments, operational variability, or situations that were not covered during training.
In this context, the emergence of Visual Language Models (VLMs) represents a paradigm shift. These models combine the capabilities of machine vision and natural language processing into a unified architecture, making it possible to associate visual elements with high-level linguistic concepts. The result is a deeper understanding of the environment, which is based not only on visual patterns, but also on semantics, context, and relationships between objects and actions.
From a technical perspective, VLMs enable better cross-domain generalization, which reduces the need for task-specific training on each use case while also facilitating knowledge transfer between scenarios. Models of this type have now been widely studied and have demonstrated a remarkable capacity for understanding images from natural language descriptions, and vice versa.
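As a purely illustrative example of this zero-shot, language-grounded behavior (a sketch using publicly available tooling, not a description of GMV's stack), a CLIP-style model from the Hugging Face transformers library can score a robot-captured image against natural-language descriptions with no task-specific training. The checkpoint and file name below are assumptions chosen for the example:

```python
# Minimal sketch: zero-shot image-text matching with a CLIP-style VLM.
# Assumes the Hugging Face transformers library and the public
# openai/clip-vit-base-patch32 checkpoint; not GMV's actual implementation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical frame captured by a robot's onboard camera.
image = Image.open("warehouse_frame.jpg")

# Candidate descriptions expressed in natural language, with no retraining.
descriptions = [
    "a pallet correctly placed on a shelf",
    "a pallet blocking an aisle",
    "an empty loading bay",
]

inputs = processor(text=descriptions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, converted to probabilities over the descriptions.
probs = outputs.logits_per_image.softmax(dim=1)
for text, p in zip(descriptions, probs[0].tolist()):
    print(f"{p:.2f}  {text}")
```

Because the candidate descriptions are plain text, the same model can be repointed at a new scenario by simply rewriting them, which is the practical meaning of the cross-domain generalization mentioned above.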
At GMV, these capabilities are being brought into operational use and to market through their integration into uPathWay, the company's intelligent platform for managing, orchestrating, and optimizing heterogeneous fleets of robots and autonomous vehicles in industrial settings. Incorporating VLMs opens the door to new scenarios for interaction and supervision, adding a layer of contextual intelligence on top of more traditional perception.
Some of the most notable use cases now include:
- Monitoring of robots by using natural language supported by visual information, which facilitates more intuitive human-robot interactions while also reducing technical obstacles for operators and supervisors.
- Automatic generation of descriptions for operating conditions and incidents, based on images or video sequences captured by the robots themselves.
- Visual validation of tasks, such as automated confirmation that a load, a pallet, or an inspected element is correctly positioned or in its expected state (illustrated in the sketch after this list).
- Context-based detection of anomalies, to identify unexpected situations that were not expressly defined in advance through rules or models.
- More natural and flexible interfaces that can support decision-making, by combining natural language prompts and visual information from the environment.
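To make the visual validation and automatic description use cases above more concrete, here is a rough sketch built on publicly available Hugging Face transformers pipelines and checkpoints; the models, the question, and the file name are assumptions for illustration, not uPathWay's internal components:

```python
# Sketch of VLM-backed checks on a robot-captured frame, assuming the
# Hugging Face transformers pipelines and the public checkpoints named
# below; model choice and file name are illustrative only.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

frame = "dock_camera_frame.jpg"  # hypothetical image from a robot's camera

# Automatic description of the operating condition captured in the frame.
caption = captioner(frame)[0]["generated_text"]
print("Scene description:", caption)

# Visual validation phrased as a natural-language question.
answers = vqa(image=frame, question="Is the pallet placed on the shelf?")
print("Validation answer:", answers[0]["answer"], f"(score {answers[0]['score']:.2f})")
```

The same pattern extends to anomaly flagging: if the answer or its confidence falls outside an expected range, the frame can be escalated to a human supervisor together with the generated description.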
These capabilities are all contributing to a form of robotics that is more autonomous, explainable, and scalable, able to adapt to industrial environments that are complex, dynamic, or highly uncertain. Beyond improving task automation, VLMs are enabling progress toward systems that not only execute instructions, but also interpret and communicate what is happening around them.
GMV is continuing its work on integrating advanced perception and contextual intelligence as key drivers of the automation of the future, with the aim of bringing these technologies out of the research phase and into real operational applications.
Author: Ángel C. Lázaro