A new framework from researchers at The University of Hong Kong (HKU) and collaborating institutions provides an open source foundation for developing robust AI agents that can operate computers. The framework, called OpenCUA, includes the tools, data, and recipes for scaling the development of computer-use agents (CUAs).
Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open source models and competing closely with closed agents from leading AI labs like OpenAI and Anthropic.
The challenge of building computer-use agents
Computer-use agents are designed to autonomously complete tasks on a computer, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with critical details about their training data, architectures, and development processes kept private.
“As the lack of transparency limits technical advancements and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations, and risks,” the researchers state in their paper.
At the same time, open source efforts face their own set of hurdles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects provide insufficient detail about their methods, making it difficult for others to replicate their work.
According to the paper, “These limitations collectively hinder advances in general-purpose CUAs and restrict meaningful exploration of their scalability, generalizability, and potential learning approaches.”
Introducing OpenCUA

OpenCUA is an open source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is the AgentNet Tool for recording human demonstrations of computer tasks on different operating systems.
The tool streamlines data collection by running in the background on an annotator’s personal computer, capturing screen video, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories,” pairing a screenshot of the computer (the state) with the user’s corresponding action (a click, key press, and so on). Annotators can then review, edit, and submit these demonstrations.
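To make the state-action trajectory format concrete, here is a minimal sketch in Python of how one demonstration could be represented. The class and field names are illustrative assumptions, not the AgentNet dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    """A single user action recorded during a demonstration."""
    kind: str                                       # e.g. "click", "type", "scroll", "hotkey"
    coordinates: Optional[tuple[int, int]] = None   # screen position for clicks
    text: Optional[str] = None                      # typed text, if any

@dataclass
class TrajectoryStep:
    """One state-action pair: what the screen showed, and what the annotator did."""
    screenshot_path: str        # the observed state (screen capture)
    accessibility_tree: dict    # structured description of on-screen elements
    action: Action              # the annotator's action in this state

@dataclass
class Demonstration:
    """A full task demonstration: an instruction plus its state-action trajectory."""
    instruction: str            # natural-language description of the task
    os: str                     # "windows", "macos", or "ubuntu"
    steps: list[TrajectoryStep] = field(default_factory=list)
```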

Using this tool, the researchers collected the AgentNet dataset, which contains over 22,600 task demonstrations across Windows, macOS, and Ubuntu, spanning more than 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and environmental dynamics from users’ personal computing environments,” the paper notes.
Recognizing that screen-recording tools raise serious data privacy concerns for enterprises, the researchers designed the AgentNet Tool with security in mind. Xinyuan Wang, co-author of the paper and a PhD student at HKU, explained that the team implemented a multi-layer privacy protection framework. “First, annotators themselves can fully observe the data they generate… before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments handling sensitive customer or financial data,” Wang added.
To accelerate evaluation, the team also curated AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
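The idea of accepting several valid actions per step can be illustrated with a short sketch. The matching rules and pixel tolerance below are assumptions for illustration, not AgentNetBench's actual scoring logic.

```python
def action_matches(predicted: dict, reference: dict, pixel_tolerance: int = 20) -> bool:
    """Loose match: same action type; clicks must land within a small pixel radius."""
    if predicted["kind"] != reference["kind"]:
        return False
    if predicted.get("coordinates") and reference.get("coordinates"):
        (px, py), (rx, ry) = predicted["coordinates"], reference["coordinates"]
        return ((px - rx) ** 2 + (py - ry) ** 2) ** 0.5 <= pixel_tolerance
    return predicted.get("text") == reference.get("text")

def score_step(predicted: dict, acceptable: list[dict]) -> float:
    """A step counts as correct if the prediction matches any acceptable reference action."""
    return 1.0 if any(action_matches(predicted, ref) for ref in acceptable) else 0.0

# Example: a click near either of two equivalent "Submit" buttons counts as correct.
acceptable = [
    {"kind": "click", "coordinates": (412, 630)},
    {"kind": "click", "coordinates": (980, 630)},
]
print(score_step({"kind": "click", "coordinates": (405, 628)}, acceptable))  # 1.0
```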
A new recipe for training agents
The OpenCUA framework introduces a novel pipeline for processing data and training computer-use agents. The first step converts the raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, covering planning, memory, and reflection. The structured reasoning is organized into three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally, the concise, executable action. This approach helps the agent develop a deeper understanding of the tasks.
“We find natural language reasoning essential for generalizable computer-use foundation models, helping CUAs internalize cognitive capabilities,” the researchers write.
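In a training example, that three-level structure might look something like the following sketch. The field names and contents are illustrative assumptions rather than OpenCUA's exact annotation format.

```python
# One CoT-augmented step pairs the screen state with layered reasoning before the action.
augmented_step = {
    "instruction": "Export the quarterly report as a PDF",
    "screenshot": "step_03.png",
    # Level 1: high-level observation of the screen
    "observation": "The report is open; a 'File' menu is visible in the top-left menu bar.",
    # Level 2: reflective thought that analyzes the situation and plans the next step
    "thought": (
        "The export option usually lives under the File menu. The previous step already "
        "opened the report, so the next move is to open File and look for 'Export as PDF'."
    ),
    # Level 3: the concise, executable action
    "action": "click(x=34, y=12)",
}
```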
This data synthesis pipeline is a general framework that companies can adapt to train agents on their own unique internal tools. According to Wang, an enterprise can record demonstrations of its proprietary workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap a high-performing agent tailored to their internal tools without needing to handcraft reasoning traces manually,” he explained.
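One way such a generator/reflector loop could be wired up is sketched below. The function names and retry logic are assumptions based only on the component names mentioned above, not the pipeline's actual implementation.

```python
def synthesize_reasoning(step: dict, generate, reflect, max_attempts: int = 3) -> str:
    """Hypothetical generator/reflector loop for annotating one recorded step.

    `generate(step)` stands in for a VLM call that drafts an inner monologue
    explaining the recorded action; `reflect(step, reasoning)` stands in for a
    second pass that checks the monologue is consistent with the screenshot
    and the action actually taken.
    """
    for _ in range(max_attempts):
        reasoning = generate(step)       # propose observation + thought for this step
        if reflect(step, reasoning):     # keep only reasoning consistent with the action
            return reasoning
    raise ValueError("No consistent reasoning trace found for this step")
```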
Putting OpenCUA to the test
The researchers used the OpenCUA framework to train a range of open source VLMs, including variants of Qwen and Kimi-VL, with parameter counts from 3 billion to 32 billion. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand GUIs.
The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open source models on the OSWorld-Verified benchmark. It also surpassed OpenAI’s GPT-4o-based CUA and significantly narrowed the performance gap with Anthropic’s leading proprietary models.

For enterprise developers and product leaders, the research offers several key findings. The OpenCUA method is broadly applicable, improving performance on models with different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across a diverse range of tasks and operating systems.
According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. “For example, in the AgentNet dataset, we already capture a few demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters on MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repeatable patterns.”
However, Wang noted that bridging the gap to live deployment requires addressing key challenges around safety and reliability. “The biggest challenge in real deployment is safety and reliability: the agent must avoid mistakes that could inadvertently alter system settings or trigger harmful side effects beyond the intended task,” he said.
The researchers have released the code, dataset, and weights for their models.
As open source agents built on frameworks like OpenCUA become more capable, they may fundamentally change the relationship between knowledge workers and their computers. Wang envisions a future in which proficiency with complex software matters less than the ability to clearly articulate goals to an AI agent.
He described two primary modes of work: “offline automation, where the agent leverages its broader software knowledge to pursue a task end-to-end,” and “online collaboration, where the agent responds in real time and works side by side with the human, much like a colleague.” Essentially, humans will provide the strategic “what,” while increasingly sophisticated AI agents handle the operational “how.”