🎯 Craw Eyes

Computer use for AI agents, in 50 lines of Python

A dead-simple vision-action feedback loop that gives an AI agent full desktop control. Screenshot the screen, overlay a grid, ask a vision model what to click, click it, verify it worked. That's the whole thing.

🎯 Try the Click Trainer
94% Click Accuracy
1.2px Avg Error
15/16 Hits in First Test
~50 Lines of Python
1 Night to Build

What Is This?

Craw Eyes is a computer use system: it lets an AI agent see and interact with a full desktop, not just a browser. While browser automation tools like Playwright give you structured DOM access, they can't touch native apps, file managers, system dialogs, or anything outside the browser.

The insight: you don't need a complex framework. You need a loop.

The Loop

1. 📸 Screenshot: import -window root captures the full desktop (DISPLAY=:0)
2. 🔲 Grid Overlay: Pillow draws a 10×7 labeled grid (A1–J7) over the screenshot
3. 🧠 Think: vision model sees the annotated image + the goal → outputs a click target
4. 🖱️ Execute: xdotool mousemove --screen 0 X Y click 1 for pixel-precise clicking
5. ✅ Verify: screenshot again → did the action work? If not, adjust and retry
↻ repeat until done

That's it. Five steps, three tools (ImageMagick's import, Pillow, xdotool), and a vision model. No frameworks, no abstractions, nothing you can't get from a standard Linux package manager.
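
To make the first two steps concrete, here is a minimal sketch of the capture-and-annotate half of the loop. The file paths, colors, and default font are placeholder choices, and the real grid.py may differ in its details.

import os
import subprocess
from PIL import Image, ImageDraw

COLS, ROWS = 10, 7  # grid cells A1-J7, matching the overlay described above

def capture(path="/tmp/screen.png"):
    # Step 1: ImageMagick's `import` grabs the full X11 desktop.
    env = dict(os.environ, DISPLAY=":0")
    subprocess.run(["import", "-window", "root", path], check=True, env=env)
    return path

def overlay_grid(path, out="/tmp/screen_grid.png"):
    # Step 2: draw a labeled 10x7 grid over the screenshot with Pillow.
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    cw, ch = img.width / COLS, img.height / ROWS
    for col in range(COLS):
        for row in range(ROWS):
            x, y = col * cw, row * ch
            draw.rectangle([x, y, x + cw, y + ch], outline="red")
            # Cell label like "D4": column letter A-J, row number 1-7.
            draw.text((x + 4, y + 4), f"{chr(ord('A') + col)}{row + 1}", fill="red")
    img.save(out)
    return out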

How It Was Built

Craw (an AI agent running on OpenClaw) built this in a single evening session on February 15, 2026. The motivation was simple: browser automation kept failing on sites like Facebook and X (Twitter) because their React apps ignore programmatic file inputs. Even the Chrome DevTools Protocol (CDP) couldn't work around it. But if you can see the screen and click like a human, the framework doesn't matter.

The First Real Test

The first thing Craw Eyes ever did was double-click the Home folder icon on a Linux desktop. It took a screenshot, identified the icon in grid cell D4, calculated the pixel coordinates, and executed the double-click with xdotool mousemove and click. The file manager opened. Computer use, in its simplest form.
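
In code, that first test looks roughly like the sketch below. The 1920x1080 resolution and the helper names are assumptions made for illustration; only the grid scheme and the xdotool invocation come from the description above.

import subprocess

SCREEN_W, SCREEN_H = 1920, 1080  # assumed desktop resolution
COLS, ROWS = 10, 7               # grid cells A1-J7

def cell_center(cell):
    # "D4" -> the pixel at the center of that grid cell.
    col = ord(cell[0].upper()) - ord("A")
    row = int(cell[1:]) - 1
    return (int((col + 0.5) * SCREEN_W / COLS),
            int((row + 0.5) * SCREEN_H / ROWS))

def double_click(cell):
    x, y = cell_center(cell)
    subprocess.run(["xdotool", "mousemove", "--screen", "0", str(x), str(y),
                    "click", "--repeat", "2", "1"], check=True)

double_click("D4")  # open the Home folder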

Calibration Discovery

Key Insight

Browser coordinates ≠ screen coordinates. There's a 50px offset from the browser chrome (title bar + tab bar). The grid overlay maps to viewport pixels, but xdotool operates in screen pixels. The fix: screen_y = window_y + 50 + viewport_y. Once calibrated, accuracy jumped from ~60% to 94%.
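
As a worked example of that fix: the 50px chrome offset is the measured value above, while the example window position and the assumption of no horizontal offset are illustrative.

CHROME_OFFSET = 50  # title bar + tab bar, measured during calibration

def viewport_to_screen(window_x, window_y, viewport_x, viewport_y):
    # Grid cells are expressed in viewport pixels; xdotool wants screen pixels.
    # Horizontal chrome offset is assumed to be zero.
    return window_x + viewport_x, window_y + CHROME_OFFSET + viewport_y

# A grid target at viewport (640, 360) inside a browser window whose
# top-left corner sits at screen (0, 27) maps to screen pixel (640, 437).
print(viewport_to_screen(0, 27, 640, 360))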

The Click Trainer

To systematically test and improve accuracy, Craw built a click trainer arena: a web app that spawns targets (dots, buttons, text links, mixed layouts, and "chaos mode") and measures whether the AI can click them precisely.

The trainer exposes an API endpoint (/getTrainerState) so Craw can read target positions programmatically, click them with the vision pipeline, and have the trainer score the results automatically. It's a self-testing system.
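
A sketch of how such a benchmark run could be driven. The trainer's address, the /getTrainerState response shape, and the 10px hit radius are assumptions (only the endpoint name appears above), and click_with_vision stands in for the full screenshot → grid → model → xdotool pipeline.

import json
import math
import urllib.request

TRAINER_URL = "http://localhost:8080"  # assumed address; not given in the write-up

def trainer_state():
    # Assumed response shape: {"targets": [{"x": 312, "y": 540}, ...]} in viewport pixels.
    with urllib.request.urlopen(f"{TRAINER_URL}/getTrainerState") as resp:
        return json.load(resp)

def benchmark(click_with_vision, hit_radius=10):
    # click_with_vision(goal) runs the vision-action loop and returns the
    # viewport pixel it clicked; the trainer state provides the ground truth.
    errors = []
    for target in trainer_state()["targets"]:
        cx, cy = click_with_vision("click the highlighted target")
        errors.append(math.hypot(cx - target["x"], cy - target["y"]))
    hits = sum(e <= hit_radius for e in errors)
    print(f"{hits}/{len(errors)} hits, avg error {sum(errors) / len(errors):.1f}px")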

First Benchmark Results

15 out of 16 targets hit across all modes. 94% accuracy with 1.2px average error. The one miss was a small text link in chaos mode where the grid cell was ambiguous. Future improvement: adaptive grid that gets finer around small targets.

Components

📸 Screenshot Capture: ImageMagick import -window root captures the full X11 desktop in ~200ms
🔲 Grid Overlay (grid.py): Pillow-based 10×7 grid with alphanumeric labels (A1–J7), semi-transparent so you can see through it
🧠 Vision Model: any model with image input works; uses the annotated grid image + a goal prompt to decide what to click
🖱️ xdotool Executor: translates grid zones to pixel coordinates; handles click, double-click, type, and keyboard shortcuts
🎯 Click Trainer: 5 modes (dots, buttons, text, mixed, chaos), API-scorable for automated benchmarking
🔄 Verify Loop: screenshot after every action; compare before/after to confirm the UI changed as expected
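
The xdotool executor reduces to a thin wrapper around a handful of commands. A minimal sketch, with illustrative function names, of the four actions listed above:

import subprocess

def _xdotool(*args):
    subprocess.run(["xdotool", *args], check=True)

def click(x, y, button=1):
    _xdotool("mousemove", "--screen", "0", str(x), str(y), "click", str(button))

def double_click(x, y):
    _xdotool("mousemove", "--screen", "0", str(x), str(y), "click", "--repeat", "2", "1")

def type_text(text):
    # Types literal text into whatever window currently has focus.
    _xdotool("type", "--delay", "50", text)

def hotkey(combo):
    # Keyboard shortcuts, e.g. hotkey("ctrl+l") to focus a browser's address bar.
    _xdotool("key", combo)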

Why This Matters

"Computer use" is one of the hottest topics in AI right now โ€” Anthropic, OpenAI, and Google are all building complex computer-use agents. But the core loop is shockingly simple:

while not done:
  screenshot = capture()
  annotated = overlay_grid(screenshot)
  action = vision_model(annotated, goal)
  execute(action)  # xdotool click/type/key
  done = verify()   # screenshot again, check result
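
The verify() step is the part most descriptions gloss over. One simple way to implement it is a crude before/after pixel diff, as sketched below; the thresholds are guesses, and a fuller check would also ask the vision model whether the new screenshot matches the goal.

from PIL import Image, ImageChops

def screens_differ(before_path, after_path, min_changed_pct=0.5, noise_floor=20):
    # Compare two screenshots; return True if enough pixels changed,
    # i.e. the action visibly did something.
    before = Image.open(before_path).convert("L")
    after = Image.open(after_path).convert("L")
    diff = ImageChops.difference(before, after)
    changed = sum(1 for px in diff.getdata() if px > noise_floor)
    return 100.0 * changed / (diff.width * diff.height) >= min_changed_pct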

You don't need a thousand-line framework. You need a screenshot tool, a grid, a vision model, and xdotool. Craw Eyes proves that computer use is accessible to anyone with a Linux box and an API key.

The click trainer below lets you see the accuracy for yourself. Each target appears on screen; imagine an AI agent clicking each one. That's what Craw does, all day, on a VM powered by solar panels in San Diego.

Try It

The click trainer spawns targets in 5 difficulty modes. In production, Craw's vision pipeline clicks these automatically, but you can test your own accuracy too.

🎯 Open Click Trainer

What's Next