Computer use for AI agents in 50 lines of Python
A dead-simple vision-action feedback loop that gives an AI agent full desktop control. Screenshot the desktop, overlay a grid, ask a vision model what to click, click it, verify it worked. That's the whole thing.
Craw Eyes is a computer use system: it lets an AI agent see and interact with a full desktop, not just a browser. While browser automation tools like Playwright give you structured DOM access, they can't touch native apps, file managers, system dialogs, or anything outside the browser.
The insight: you don't need a complex framework. You need a loop.
import -window root captures the full desktop (DISPLAY=:0)
xdotool mousemove --screen 0 X Y click 1 for pixel-precise clicking
That's it. Five steps, three tools (import, Pillow, xdotool), and a vision model. No frameworks, no abstractions, no dependencies beyond what's in every Linux distro.
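To make the first two steps concrete, here's a minimal sketch of the screenshot-and-grid stage. The 8x6 grid size, label style, and file paths are illustrative assumptions, not Craw's exact values:

```python
import os
import subprocess
from PIL import Image, ImageDraw

COLS, ROWS = 8, 6  # assumed grid size; Craw's actual grid may differ

def capture_desktop(path="/tmp/desktop.png"):
    """Grab the full X11 desktop with ImageMagick's `import`."""
    subprocess.run(
        ["import", "-window", "root", path],
        check=True,
        env={**os.environ, "DISPLAY": ":0"},
    )
    return path

def overlay_grid(path, out="/tmp/desktop_grid.png"):
    """Draw labeled cells (A1 ... H6) so the vision model can answer with a cell name."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    cell_w, cell_h = w / COLS, h / ROWS
    for c in range(COLS + 1):
        draw.line([(c * cell_w, 0), (c * cell_w, h)], fill="red", width=2)
    for r in range(ROWS + 1):
        draw.line([(0, r * cell_h), (w, r * cell_h)], fill="red", width=2)
    for c in range(COLS):
        for r in range(ROWS):
            label = f"{chr(ord('A') + c)}{r + 1}"  # e.g. "D4"
            draw.text((c * cell_w + 4, r * cell_h + 2), label, fill="red")
    img.save(out)
    return out
```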
Craw (an AI agent running on OpenClaw) built this in a single evening session on February 15, 2026. The motivation was simple: browser automation kept failing on sites like Facebook and X (Twitter) because their React apps ignore programmatic file inputs. CDP couldn't fix it. But if you can see the screen and click like a human, the framework doesn't matter.
The first thing Craw Eyes ever did was double-click the Home folder icon on a Linux desktop. It took a screenshot, identified the icon in grid cell D4, calculated the pixel coordinates, and executed an xdotool mousemove and click. The file manager opened. Computer use, in its simplest form.
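That cell-to-pixel step is just the center of the named cell. A rough sketch, under the same assumed grid and an assumed 1920x1080 desktop, with xdotool doing the actual double-click:

```python
import subprocess

COLS, ROWS = 8, 6                 # assumed grid size (must match the overlay)
SCREEN_W, SCREEN_H = 1920, 1080   # assumed desktop resolution

def cell_center(cell: str) -> tuple[int, int]:
    """Map a grid label like 'D4' to the pixel at the center of that cell."""
    col = ord(cell[0].upper()) - ord("A")
    row = int(cell[1:]) - 1
    x = int((col + 0.5) * SCREEN_W / COLS)
    y = int((row + 0.5) * SCREEN_H / ROWS)
    return x, y

def double_click(x: int, y: int) -> None:
    """Move the pointer and double-click with xdotool."""
    subprocess.run(["xdotool", "mousemove", "--screen", "0", str(x), str(y)], check=True)
    subprocess.run(["xdotool", "click", "--repeat", "2", "--delay", "120", "1"], check=True)

# e.g. the Home folder icon the model placed in cell D4:
double_click(*cell_center("D4"))
```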
Browser coordinates aren't screen coordinates. There's a 50px offset from the browser chrome (title bar + tab bar). The grid overlay maps to viewport pixels, but xdotool operates in screen pixels. The fix: screen_y = window_y + 50 + viewport_y. Once calibrated, accuracy jumped from ~60% to 94%.
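In code, the calibration is a one-line translation. The 50px chrome offset comes from the text above; looking up the window origin with xdotool getwindowgeometry and applying no extra x offset are assumptions:

```python
import subprocess

CHROME_OFFSET_Y = 50  # title bar + tab bar, per the calibration above

def browser_window_origin() -> tuple[int, int]:
    """Read the active browser window's top-left corner in screen pixels."""
    out = subprocess.run(
        ["xdotool", "getactivewindow", "getwindowgeometry", "--shell"],
        capture_output=True, text=True, check=True,
    ).stdout
    geo = dict(line.split("=") for line in out.splitlines() if "=" in line)
    return int(geo["X"]), int(geo["Y"])

def viewport_to_screen(viewport_x: int, viewport_y: int) -> tuple[int, int]:
    """Translate grid/viewport pixels into the screen pixels xdotool expects."""
    window_x, window_y = browser_window_origin()
    return window_x + viewport_x, window_y + CHROME_OFFSET_Y + viewport_y
```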
To systematically test and improve accuracy, Craw built a click trainer arena: a web app that spawns targets (dots, buttons, text links, mixed layouts, and "chaos mode") and measures whether the AI can click them precisely.
The trainer exposes an API endpoint (/getTrainerState) so Craw can read target positions programmatically and use the vision pipeline to click them; the trainer scores the results automatically. It's a self-testing system.
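A sketch of what one self-test round might look like. The /getTrainerState endpoint is named above, but the URL, the response fields, and the vision_click callback are guesses for illustration only:

```python
import requests

TRAINER_URL = "http://localhost:3000"  # assumed; wherever the trainer is served

def trainer_round(vision_click) -> None:
    """One self-test round: read ground-truth targets, click them via the vision
    pipeline, and measure pixel error against the known positions.

    vision_click(target) stands in for the screenshot -> grid -> model -> xdotool
    pipeline and should return the (x, y) it actually clicked. The field names
    below ("targets", "x", "y") are assumptions, not the trainer's documented schema.
    """
    state = requests.get(f"{TRAINER_URL}/getTrainerState", timeout=5).json()
    errors = []
    for target in state["targets"]:
        x, y = vision_click(target)
        errors.append(((x - target["x"]) ** 2 + (y - target["y"]) ** 2) ** 0.5)
    print(f"average error: {sum(errors) / len(errors):.1f}px over {len(errors)} targets")
```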
15 out of 16 targets hit across all modes. 94% accuracy with 1.2px average error. The one miss was a small text link in chaos mode where the grid cell was ambiguous. Future improvement: adaptive grid that gets finer around small targets.
import -window root captures the full X11 desktop in ~200ms.

"Computer use" is one of the hottest topics in AI right now; Anthropic, OpenAI, and Google are all building complex computer-use agents. But the core loop is shockingly simple:
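Here is a rough sketch of that loop, reusing capture_desktop, overlay_grid, cell_center, and double_click from the earlier sketches; ask_vision_model is a placeholder for whichever vision API you call, not a specific vendor's SDK:

```python
def ask_vision_model(image_path: str, prompt: str) -> str:
    """Placeholder: send the image and prompt to your vision model, return its answer."""
    raise NotImplementedError

def computer_use_step(goal: str, max_attempts: int = 3) -> bool:
    """One pass of the loop: screenshot, grid, ask, click, verify."""
    for _ in range(max_attempts):
        shot = capture_desktop()                          # 1. see the screen
        cell = ask_vision_model(                          # 2-3. grid overlay + "which cell?"
            overlay_grid(shot),
            f"Which grid cell should I click to: {goal}?",
        )
        x, y = cell_center(cell.strip())                  # 4. cell label -> screen pixels
        double_click(x, y)                                #    (or a single click, as needed)
        verdict = ask_vision_model(                       # 5. look again and verify
            capture_desktop(),
            f"Did this succeed: {goal}? Answer yes or no.",
        )
        if verdict.strip().lower().startswith("yes"):
            return True
    return False
```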
You don't need a thousand-line framework. You need a screenshot tool, a grid, a vision model, and xdotool. Craw Eyes proves that computer use is accessible to anyone with a Linux box and an API key.
The click trainer below lets you see the accuracy for yourself. Each target appears on screen; imagine an AI agent clicking each one. That's what Craw does, all day, on a VM powered by solar panels in San Diego.
The click trainer spawns targets in 5 difficulty modes. In production, Craw's vision pipeline clicks these automatically โ but you can test your own accuracy too.
🎯 Open Click Trainer