REF-VLM: A New Visual Decoding Approach Using a Triplet-Based Referring Paradigm

Multimodal large language models (MLLMs) trained on massive datasets demonstrate strong zero-shot capabilities across a range of vision-language tasks. Dense prediction tasks such as semantic segmentation and keypoint detection, however, remain difficult for MLLMs when the results are represented purely as text. At the same time, current MLLMs that decode visual tasks through latent embeddings tend to adapt poorly to multi-task learning and to tasks of varying granularity.
A new research approach called REF-VLM (Referring Visual Language Model) aims to address these challenges. REF-VLM is an end-to-end framework for unified training across a variety of visual decoding tasks. At its core is the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions of visual decoding tasks through a triplet structure: concepts, decoding types, and targets. By using symbolic separators, TRP enforces structured representation learning, which improves the analyzability and interpretability of the model's outputs.
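To make this concrete, the sketch below shows how such a triplet might be serialized into, and parsed back out of, a structured output string. The tag names used as separators here are placeholders chosen for illustration only; REF-VLM defines its own special tokens, which are not reproduced in this article.

```python
# Minimal sketch of a TRP-style triplet and its serialization with symbolic
# separators. The tags <Phrase>, <Unit>, and <Ref> are assumptions made for
# illustration; they are not REF-VLM's actual separator tokens.
import re
from dataclasses import dataclass


@dataclass
class Triplet:
    concept: str        # semantic unit, e.g. "person"
    decoding_type: str  # e.g. "segmentation" or "keypoint detection"
    target: str         # referring expression, e.g. "the person on the left"


def serialize(t: Triplet) -> str:
    """Render a triplet as a structured string with explicit separators."""
    return f"<Phrase>{t.concept}</Phrase><Unit>{t.decoding_type}</Unit><Ref>{t.target}</Ref>"


def parse(s: str) -> Triplet:
    """Recover the three fields from a serialized triplet string."""
    m = re.fullmatch(r"<Phrase>(.*?)</Phrase><Unit>(.*?)</Unit><Ref>(.*?)</Ref>", s)
    if m is None:
        raise ValueError("not a valid TRP string")
    return Triplet(*m.groups())


example = Triplet("person", "segmentation", "the person on the left")
assert parse(serialize(example)) == example
```

Because the three fields are separated explicitly, a downstream component can inspect each one independently, which is what makes the outputs easier to analyze.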
The developers of REF-VLM have also created a new dataset called VT-Instruct (Visual-Task Instruction Following Dataset). This large-scale multi-task dataset contains over 100 million multimodal dialogue samples spanning 25 task types. VT-Instruct goes beyond plain text input and output by incorporating visual prompts such as points, boxes, scribbles, and masks; the outputs combine text with visual units such as boxes, keypoints, depth information, and masks. Combining different visual prompts and visual units yields a wide variety of task types and significantly broadens the range of tasks REF-VLM can address.
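As a rough illustration of what such a multi-task dialogue sample could look like, the sketch below defines a hypothetical record schema. The field names and structure are assumptions for this article and do not reflect VT-Instruct's actual data format.

```python
# Hypothetical schema for one multimodal dialogue sample, loosely inspired by
# the description of VT-Instruct. All field names are assumptions, not the
# dataset's real format.
from dataclasses import dataclass, field
from typing import Literal

VisualPromptType = Literal["point", "box", "scribble", "mask"]
VisualUnitType = Literal["box", "keypoint", "depth", "mask"]


@dataclass
class VisualPrompt:
    kind: VisualPromptType
    data: list[float]          # e.g. [x, y] for a point, [x1, y1, x2, y2] for a box


@dataclass
class VisualUnit:
    kind: VisualUnitType
    payload: dict              # e.g. an encoded mask or a list of keypoint coordinates


@dataclass
class DialogueTurn:
    role: Literal["user", "assistant"]
    text: str
    visual_prompts: list[VisualPrompt] = field(default_factory=list)
    visual_units: list[VisualUnit] = field(default_factory=list)


@dataclass
class VTExample:
    image_path: str
    task_type: str             # one of the 25 task types
    turns: list[DialogueTurn]


example = VTExample(
    image_path="images/0001.jpg",
    task_type="referring_segmentation",
    turns=[
        DialogueTurn("user", "Segment the person on the left.",
                     visual_prompts=[VisualPrompt("box", [12.0, 30.0, 180.0, 420.0])]),
        DialogueTurn("assistant", "Here is the requested mask.",
                     visual_units=[VisualUnit("mask", {"encoding": "rle"})]),
    ],
)
```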
The Triplet-Based Referring Paradigm in Detail
The TRP allows REF-VLM to handle complex visual decoding tasks by breaking down the tasks into three core components:
1. Concepts: These represent the semantic units that should be recognized in the visual scene, e.g., "person," "car," or "tree."
2. Decoding Types: These specify the type of visual decoding to be performed, e.g., "segmentation," "keypoint detection," or "object detection."
3. Targets: These define the specific regions or points in the image to which the decoding refers, e.g., "the person on the left" or "the red car."
By explicitly separating these three dimensions, REF-VLM can adapt flexibly to a wide range of visual decoding tasks while also improving the interpretability of its results.
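One practical consequence of this decoupling is that the decoding type can be used to route each triplet to a matching task-specific decoder. The sketch below illustrates that routing idea with a simple registry of decoder functions; this is an assumption for illustration, not REF-VLM's actual decoder interface, whose decoding heads operate on latent embeddings rather than plain strings.

```python
# Illustrative routing of parsed triplets to task-specific decoders.
# The registry and the function signatures are assumptions for this sketch.
from typing import Callable

# Maps a decoding type to a function that turns (concept, target) into a
# task-specific result such as a mask, a box, or a set of keypoints.
DECODERS: dict[str, Callable[[str, str], dict]] = {}


def register(decoding_type: str):
    """Decorator that adds a decoder function to the registry."""
    def wrap(fn: Callable[[str, str], dict]):
        DECODERS[decoding_type] = fn
        return fn
    return wrap


@register("segmentation")
def decode_mask(concept: str, target: str) -> dict:
    # Placeholder: a real decoder would produce a segmentation mask here.
    return {"task": "segmentation", "concept": concept, "target": target}


@register("keypoint detection")
def decode_keypoints(concept: str, target: str) -> dict:
    # Placeholder: a real decoder would produce keypoint coordinates here.
    return {"task": "keypoint detection", "concept": concept, "target": target}


def run(triplet: tuple[str, str, str]) -> dict:
    concept, decoding_type, target = triplet
    return DECODERS[decoding_type](concept, target)


print(run(("person", "segmentation", "the person on the left")))
```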
Potential and Outlook
Both qualitative and quantitative experiments show that REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The ability to handle diverse visual decoding tasks in a unified framework opens up new possibilities for developing more powerful and flexible MLLMs. The combination of TRP and the extensive VT-Instruct dataset could lead to further advances in multimodal AI.
The research findings and the code for REF-VLM are publicly available, which promotes further research and development in this promising area.