In robotics, vision-language models (VLMs) are the go-to choice because of their zero-shot capabilities: they can look for new objects based on textual prompts, which often works surprisingly well. But not always…
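To give an idea of what prompt-based zero-shot detection looks like in practice, here is a minimal sketch. It uses OWL-ViT from Hugging Face transformers as a stand-in, since GLIP's own API is different; the checkpoint name, image file, prompts, and threshold are illustrative assumptions, not what we used in the paper.

```python
# Sketch of prompt-based zero-shot detection. OWL-ViT stands in for GLIP here;
# the checkpoint, image, prompts, and threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("hallway.jpg")  # hypothetical input image
prompts = [["a door", "a door handle", "a push button"]]

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (score, label, box) triples above a threshold.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(prompts[0][label], round(score.item(), 2), box.tolist())
```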

We investigated whether a VLM, GLIP, can be used to find doors and their openers. It does a pretty good job, but it is far from perfect (see the second image).

So, we let the robot do its best and find as much as it can. Next, we involve the user to correct the wrong predictions. Finally, the model is retrained with the corrections (third image). With a spatial reasoner (openers are close to the door), the predictions can be improved further (fourth image).
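A minimal version of such a spatial prior could look like the sketch below: opener detections gain confidence when they sit close to a detected door and lose it when they are far away. The distance threshold and score adjustments are made-up numbers; the reasoner in the paper may work differently.

```python
# Sketch of a spatial prior over detections. All numbers (distance threshold,
# boost, penalty) are illustrative assumptions, not values from the paper.
import math

def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def rescore_openers(doors, openers, max_dist=150.0, boost=0.15, penalty=0.15):
    """Adjust opener scores based on the distance to the nearest door.

    doors, openers: lists of dicts with "box" and "score" keys.
    """
    rescored = []
    for op in openers:
        cx, cy = center(op["box"])
        dists = [math.dist((cx, cy), center(d["box"])) for d in doors]
        near_door = doors and min(dists) <= max_dist
        score = op["score"] + (boost if near_door else -penalty)
        rescored.append({**op, "score": max(0.0, min(1.0, score))})
    return rescored
```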

This strategy is both effective and efficient: the user only needs to label a few instances, namely the mistakes.

The paper was accepted at the International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI).