
Imaginative and prescient-based brokers
Imaginative and prescient-based brokers deal with the browser as a visible canvas. They take a look at screenshots, interpret them utilizing multimodal fashions, and output low-level actions like “click on (210,260)” or “sort “Peter Pan”.” This mimics how a human would use a pc—studying seen textual content, finding buttons visually, and clicking the place wanted.
The upside is universality: the mannequin doesn’t want structured knowledge, simply pixels. The draw back is precision and efficiency: visible fashions are slower, require scrolling by way of the whole web page, and battle with delicate state adjustments between screenshots (“Is that this button clickable but?”).
DOM-based brokers
DOM-based brokers, against this, function straight on the Doc Object Mannequin (DOM), the structured tree that defines each webpage. As a substitute of decoding pixels, they cause over textual representations of the web page: component tags, attributes, ARIA roles, and labels.

