OmniParser V2 is trained on a larger set of interactive element detection data and icon functional caption data. By decreasing the input image size for the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version.
Published Date – 16 February 2025, 07:45 PM

Hyderabad: Microsoft has unveiled a new AI model, OmniParser V2. The open-source model allows large language models (LLMs), which are deep-learning models pre-trained on vast amounts of data, to act as agents capable of using a computer.
According to Microsoft, graphical user interface (GUI) automation requires agents that can understand and interact with user screens.
However, using general-purpose LLMs as GUI agents poses two main challenges: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of the various elements in a screenshot and accurately associating the intended action with the corresponding region of the screen.
OmniParser closes this gap by ‘tokenising’ UI screenshots: it converts raw pixels into structured elements that LLMs can interpret.
This enables an LLM to perform retrieval-based next-action prediction given the set of parsed interactable elements.
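To make this concrete, the sketch below shows, in Python, the kind of structured representation such a parser might hand to an LLM. The schema and field names (element_id, kind, bbox, interactable, caption) are illustrative assumptions for this article, not OmniParser’s actual output format.

```python
from dataclasses import dataclass

# Illustrative schema for a parsed UI element. The field names are
# assumptions for demonstration, not OmniParser's real output format.
@dataclass
class ParsedElement:
    element_id: int        # stable index the LLM can refer to
    kind: str              # e.g. "icon", "text", "button"
    bbox: tuple[float, float, float, float]  # normalised (x1, y1, x2, y2)
    interactable: bool     # whether the detector marked it as clickable
    caption: str           # functional description from the caption model

def to_prompt(elements: list[ParsedElement]) -> str:
    """Serialise parsed elements into plain text an LLM can reason over."""
    lines = [
        f"[{e.element_id}] {e.kind} '{e.caption}' at {e.bbox} "
        f"(interactable={e.interactable})"
        for e in elements
    ]
    return "Screen elements:\n" + "\n".join(lines)

elements = [
    ParsedElement(0, "icon", (0.02, 0.01, 0.06, 0.05), True, "Close window"),
    ParsedElement(1, "button", (0.40, 0.80, 0.60, 0.88), True, "Submit form"),
]
print(to_prompt(elements))
```

Once the screen is expressed in this element-by-index form, the LLM no longer needs to reason over pixels; it can simply name which element to act on.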
OmniParser V2 takes this capability to the next level. Compared to its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a useful tool for GUI automation.
In particular, OmniParser V2 is trained on a larger set of interactive element detection data and icon functional caption data. By decreasing the input image size for the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version.
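The announcement does not spell out the resizing step, but the mechanism can be sketched: each detected element is cropped from the screenshot and downscaled before captioning, so the caption model encodes fewer pixels per icon. The snippet below illustrates this with Pillow; the target size is a hypothetical value, not the one OmniParser V2 actually uses.

```python
from PIL import Image

# Hypothetical target size for the icon caption model's input; the real
# value used by OmniParser V2 is not stated in the announcement.
CAPTION_INPUT_SIZE = (64, 64)

def prepare_icon_crop(screenshot: Image.Image,
                      bbox: tuple[int, int, int, int]) -> Image.Image:
    """Crop one detected element and downscale it for the caption model.

    Smaller inputs mean fewer pixels for the caption model to encode,
    which is the mechanism behind the reported latency reduction.
    """
    crop = screenshot.crop(bbox)  # bbox = (left, top, right, bottom) in pixels
    return crop.resize(CAPTION_INPUT_SIZE, Image.LANCZOS)

demo = Image.new("RGB", (1920, 1080), "white")   # stand-in screenshot
icon = prepare_icon_crop(demo, (100, 100, 148, 148))
print(icon.size)  # (64, 64)
```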
Notably, OmniParser+GPT-4o achieves a state-of-the-art average accuracy of 39.6% on ScreenSpot Pro, a recently released grounding benchmark featuring high-resolution screens and tiny target icons. This is a substantial improvement over GPT-4o’s original score of 0.8%.
In simple terms, OmniParser V2 is a tool designed to help AI models interact with graphical user interfaces (GUIs), like the ones you see on your computer screen. When AI models are asked to automate tasks in a GUI, they face two main problems:
1. Recognising which parts of the screen can be interacted with (such as buttons and icons).
2. Understanding what each part of the screen means and knowing what action should be taken on it (like clicking a button or entering text).
OmniParser V2 solves these problems by taking a screenshot of the GUI and breaking it down into structured, understandable elements.
It converts the visual information (the pixels) into parts that AI models can easily interpret.
This makes it possible for AI to predict what the next action should be based on the parsed elements, such as which button to press or field to fill in.
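A minimal sketch of such an agent step is shown below, with the parser and LLM calls stubbed out. In a real system those stubs would be OmniParser and a model such as GPT-4o; the function names here are hypothetical placeholders, not OmniParser’s actual API.

```python
# Minimal agent-step sketch. `parse_screenshot` and `ask_llm` are stubs
# standing in for the real parser and LLM; they are not an actual API.

def parse_screenshot(path: str) -> list[dict]:
    """Stub for the parser: returns canned structured elements."""
    return [
        {"id": 0, "caption": "Search box", "interactable": True},
        {"id": 1, "caption": "Submit button", "interactable": True},
    ]

def ask_llm(prompt: str) -> str:
    """Stub for the LLM call (e.g. a chat-completion request)."""
    return "type 'weather Hyderabad' into element 0, then click element 1"

def agent_step(screenshot_path: str, goal: str) -> str:
    """Parse the screen, describe it to the LLM, return the predicted action."""
    elements = parse_screenshot(screenshot_path)
    listing = "\n".join(
        f"[{e['id']}] {e['caption']} (interactable={e['interactable']})"
        for e in elements
    )
    prompt = f"Goal: {goal}\nScreen elements:\n{listing}\nWhat is the next action?"
    return ask_llm(prompt)

print(agent_step("screen.png", "search for the weather in Hyderabad"))
```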
(Source: Microsoft.com)