In a new paper, Microsoft researchers introduce KOSMOS-2, a multimodal large language model designed to serve as a general-purpose interface. By adding grounding capabilities, KOSMOS-2 aims to change how humans interact with AI across language, vision, and vision-language tasks.
Multimodal large language models, or MLLMs for short, have emerged as a versatile interface thanks to their strong performance across a wide range of tasks. Their value comes from the ability to comprehend and generate responses across modalities such as text, images, and audio. KOSMOS-2 extends this capability by adding grounding to multimodal large language models.
Grounding capabilities are particularly important in vision-language tasks, as they offer a more practical and effective human-AI interface. KOSMOS-2 can interpret specific regions of an image based on their spatial coordinates, allowing users to point directly to objects or regions of interest instead of relying on lengthy text descriptions.
One of the notable features of KOSMOS-2 is its ability to provide visual responses, such as bounding boxes. This capability aids vision-language tasks by resolving coreference ambiguity and offering precise visual references. By linking noun phrases and referring expressions to specific image regions, KOSMOS-2 can generate more accurate, informative, and comprehensive responses.
To give KOSMOS-2 its grounding capabilities, the team at Microsoft Research constructed a web-scale dataset of grounded image-text pairs. The model was then trained on this dataset together with the multimodal corpora used for KOSMOS-1. Building the dataset involved extracting relevant text spans, such as noun phrases and referring expressions, and linking them to spatial positions represented by bounding boxes.
These spatial coordinates were then translated into location tokens, creating a data format that acts as a “hyperlink” connecting the image elements to the caption. Experimental results demonstrate that KOSMOS-2 excels in grounding tasks such as phrase grounding and referring expression comprehension.
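The conversion from bounding boxes to location tokens can be illustrated with a minimal Python sketch. It assumes each box is given in coordinates normalized to [0, 1], that the image is divided into a square grid of bins, and that a box is written as the tokens for its top-left and bottom-right bins. The grid size of 32, the `<p>`, `<box>`, and `<loc…>` token spellings, and the names `GroundedSpan`, `location_index`, `box_to_tokens`, and `ground_caption` are assumptions made for illustration, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRID_SIZE = 32  # assumed number of bins per image side (illustrative)


@dataclass(frozen=True)
class GroundedSpan:
    """A caption span linked to a bounding box (hypothetical helper type)."""
    start: int  # character offset where the phrase begins
    end: int    # character offset one past the end of the phrase
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def location_index(x: float, y: float, grid: int = GRID_SIZE) -> int:
    """Map a normalized point to the index of the grid bin that contains it."""
    col = min(int(x * grid), grid - 1)  # clamp so x == 1.0 stays in the last bin
    row = min(int(y * grid), grid - 1)
    return row * grid + col


def box_to_tokens(box: Tuple[float, float, float, float], grid: int = GRID_SIZE) -> str:
    """Write a box as two location tokens: top-left bin, then bottom-right bin."""
    x0, y0, x1, y1 = box
    top_left = location_index(x0, y0, grid)
    bottom_right = location_index(x1, y1, grid)
    return f"<box><loc{top_left}><loc{bottom_right}></box>"


def ground_caption(caption: str, spans: List[GroundedSpan], grid: int = GRID_SIZE) -> str:
    """Wrap each grounded phrase in markup and append its location tokens."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(caption[cursor:span.start])  # plain text before the phrase
        phrase = caption[span.start:span.end]
        pieces.append(f"<p>{phrase}</p>{box_to_tokens(span.box, grid)}")
        cursor = span.end
    pieces.append(caption[cursor:])  # plain text after the last phrase
    return "".join(pieces)


example = ground_caption(
    "a dog on the grass",
    [GroundedSpan(start=0, end=5, box=(0.0, 0.0, 0.5, 0.5))],
)
print(example)
```

In this sketch the phrase "a dog" ends up followed by two discrete tokens that point at its region of the image, which is the sense in which the markup behaves like a hyperlink from text to pixels.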
Finally, according to the paper, KOSMOS-2 performs competitively on the language and vision-language tasks evaluated in KOSMOS-1. The inclusion of grounding capabilities opens up additional downstream applications for KOSMOS-2, including grounded image captioning and grounded visual question answering.
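For downstream uses such as grounded captioning, a response that contains location tokens has to be turned back into phrases and boxes before anything can be drawn on the image. The Python sketch below shows one way this could be done. It assumes a hypothetical output markup in which each grounded phrase is wrapped in `<p>…</p>` and followed by `<box><locA><locB></box>`, where the two indices name the top-left and bottom-right bins of a 32 x 32 grid. The markup, the grid size, and the function names are illustrative assumptions, not the model's documented output format.

```python
import re
from typing import List, Tuple

GRID_SIZE = 32  # assumed number of bins per image side (illustrative)

# Assumed markup: <p>phrase</p><box><locA><locB></box>
GROUNDED_PHRASE = re.compile(r"<p>(.*?)</p><box><loc(\d+)><loc(\d+)></box>")

Box = Tuple[float, float, float, float]


def cell_center(index: int, grid: int = GRID_SIZE) -> Tuple[float, float]:
    """Return the normalized (x, y) center of a grid bin."""
    row, col = divmod(index, grid)
    return ((col + 0.5) / grid, (row + 0.5) / grid)


def parse_grounded_text(text: str, grid: int = GRID_SIZE) -> List[Tuple[str, Box]]:
    """Extract (phrase, box) pairs, with boxes as normalized (x0, y0, x1, y1)."""
    results = []
    for match in GROUNDED_PHRASE.finditer(text):
        phrase = match.group(1).strip()
        x0, y0 = cell_center(int(match.group(2)), grid)
        x1, y1 = cell_center(int(match.group(3)), grid)
        results.append((phrase, (x0, y0, x1, y1)))
    return results


def strip_markup(text: str) -> str:
    """Remove the grounding markup and keep only the readable caption."""
    return re.sub(r"</?p>|<box>.*?</box>", "", text)


response = "<p>a dog</p><box><loc0><loc528></box> on the grass"
pairs = parse_grounded_text(response)
caption = strip_markup(response)
print(caption, pairs)
```

Because each token names a bin and not an exact pixel, the recovered box is only accurate to within one grid cell. Using bin centers, as this sketch does, is one reasonable choice among several.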
If you’re interested, you can explore the capabilities of KOSMOS-2 through an online demo available on GitHub.