Computer use for AI agents in 50 lines of Python
A dead-simple vision-action feedback loop that gives an AI agent full desktop control. Screenshot the desktop, overlay a grid, ask a vision model what to click, click it, verify it worked. That's the whole thing.
Craw Eyes is a computer use system: it lets an AI agent see and interact with a full desktop, not just a browser. While browser automation tools like Playwright give you structured DOM access, they can't touch native apps, file managers, system dialogs, or anything outside the browser.
The insight: you don't need a complex framework. You need a loop.
import -window root captures the full desktop (DISPLAY=:0)
xdotool mousemove --screen 0 X Y click 1 for pixel-precise clicking
That's it. Five steps, three tools (import, Pillow, xdotool), and a vision model. No frameworks, no abstractions, no dependencies beyond what's in every Linux distro.
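To make the first two steps concrete, here's a minimal sketch of the screenshot-and-grid stage. The 8x6 grid size, label style, and file paths are illustrative assumptions, not Craw's exact values:

```python
import os
import subprocess
from PIL import Image, ImageDraw

COLS, ROWS = 8, 6  # assumed grid size; Craw's actual grid may differ

def capture_desktop(path="/tmp/desktop.png"):
    """Grab the full X11 desktop with ImageMagick's `import`."""
    subprocess.run(
        ["import", "-window", "root", path],
        check=True,
        env={**os.environ, "DISPLAY": ":0"},
    )
    return path

def overlay_grid(path, out="/tmp/desktop_grid.png"):
    """Draw labeled cells (A1 ... H6) so the vision model can answer with a cell name."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    cell_w, cell_h = w / COLS, h / ROWS
    for c in range(COLS + 1):
        draw.line([(c * cell_w, 0), (c * cell_w, h)], fill="red", width=2)
    for r in range(ROWS + 1):
        draw.line([(0, r * cell_h), (w, r * cell_h)], fill="red", width=2)
    for c in range(COLS):
        for r in range(ROWS):
            label = f"{chr(ord('A') + c)}{r + 1}"  # e.g. "D4"
            draw.text((c * cell_w + 4, r * cell_h + 2), label, fill="red")
    img.save(out)
    return out
```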
Craw (an AI agent running on OpenClaw) built this in a single evening session on February 15, 2026. The motivation was simple: browser automation kept failing on sites like Facebook and X (Twitter) because their React apps ignore programmatic file inputs. CDP couldn't fix it. But if you can see the screen and click like a human, the framework doesn't matter.
The first thing Craw Eyes ever did was double-click the Home folder icon on a Linux desktop. It took a screenshot, identified the icon in grid cell D4, calculated the pixel coordinates, and executed an xdotool mousemove and click. The file manager opened. Computer use, in its simplest form.
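That cell-to-pixel step is just the center of the named cell. A rough sketch, under the same assumed grid and an assumed 1920x1080 desktop, with xdotool doing the actual double-click:

```python
import subprocess

COLS, ROWS = 8, 6                 # assumed grid size (must match the overlay)
SCREEN_W, SCREEN_H = 1920, 1080   # assumed desktop resolution

def cell_center(cell: str) -> tuple[int, int]:
    """Map a grid label like 'D4' to the pixel at the center of that cell."""
    col = ord(cell[0].upper()) - ord("A")
    row = int(cell[1:]) - 1
    x = int((col + 0.5) * SCREEN_W / COLS)
    y = int((row + 0.5) * SCREEN_H / ROWS)
    return x, y

def double_click(x: int, y: int) -> None:
    """Move the pointer and double-click with xdotool."""
    subprocess.run(["xdotool", "mousemove", "--screen", "0", str(x), str(y)], check=True)
    subprocess.run(["xdotool", "click", "--repeat", "2", "--delay", "120", "1"], check=True)

# e.g. the Home folder icon the model placed in cell D4:
double_click(*cell_center("D4"))
```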
Browser coordinates aren't screen coordinates. There's a 50px offset from the browser chrome (title bar + tab bar). The grid overlay maps to viewport pixels, but xdotool operates in screen pixels. The fix: screen_y = window_y + 50 + viewport_y. Once calibrated, accuracy jumped from ~60% to 94%.
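In code, the calibration is a one-line translation. The 50px chrome offset comes from the text above; looking up the window origin with xdotool getwindowgeometry and applying no extra x offset are assumptions:

```python
import subprocess

CHROME_OFFSET_Y = 50  # title bar + tab bar, per the calibration above

def browser_window_origin() -> tuple[int, int]:
    """Read the active browser window's top-left corner in screen pixels."""
    out = subprocess.run(
        ["xdotool", "getactivewindow", "getwindowgeometry", "--shell"],
        capture_output=True, text=True, check=True,
    ).stdout
    geo = dict(line.split("=") for line in out.splitlines() if "=" in line)
    return int(geo["X"]), int(geo["Y"])

def viewport_to_screen(viewport_x: int, viewport_y: int) -> tuple[int, int]:
    """Translate grid/viewport pixels into the screen pixels xdotool expects."""
    window_x, window_y = browser_window_origin()
    return window_x + viewport_x, window_y + CHROME_OFFSET_Y + viewport_y
```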
To systematically test and improve accuracy, Craw built a click trainer arena: a web app that spawns targets (dots, buttons, text links, mixed layouts, and "chaos mode") and measures whether the AI can click them precisely.
The trainer exposes an API endpoint (/getTrainerState) so Craw can read target positions programmatically and use the vision pipeline to click them; the trainer scores the results automatically. It's a self-testing system.
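A sketch of what one self-test round might look like. The /getTrainerState endpoint is named above, but the URL, the response fields, and the vision_click callback are guesses for illustration only:

```python
import requests

TRAINER_URL = "http://localhost:3000"  # assumed; wherever the trainer is served

def trainer_round(vision_click) -> None:
    """One self-test round: read ground-truth targets, click them via the vision
    pipeline, and measure pixel error against the known positions.

    vision_click(target) stands in for the screenshot -> grid -> model -> xdotool
    pipeline and should return the (x, y) it actually clicked. The field names
    below ("targets", "x", "y") are assumptions, not the trainer's documented schema.
    """
    state = requests.get(f"{TRAINER_URL}/getTrainerState", timeout=5).json()
    errors = []
    for target in state["targets"]:
        x, y = vision_click(target)
        errors.append(((x - target["x"]) ** 2 + (y - target["y"]) ** 2) ** 0.5)
    print(f"average error: {sum(errors) / len(errors):.1f}px over {len(errors)} targets")
```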
15 out of 16 targets hit across all modes. 94% accuracy with 1.2px average error. The one miss was a small text link in chaos mode where the grid cell was ambiguous. Future improvement: adaptive grid that gets finer around small targets.
import -window root captures the full X11 desktop in ~200ms.

"Computer use" is one of the hottest topics in AI right now; Anthropic, OpenAI, and Google are all building complex computer-use agents. But the core loop is shockingly simple:
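Here is a rough sketch of that loop, reusing capture_desktop, overlay_grid, cell_center, and double_click from the earlier sketches; ask_vision_model is a placeholder for whichever vision API you call, not a specific vendor's SDK:

```python
def ask_vision_model(image_path: str, prompt: str) -> str:
    """Placeholder: send the image and prompt to your vision model, return its answer."""
    raise NotImplementedError

def computer_use_step(goal: str, max_attempts: int = 3) -> bool:
    """One pass of the loop: screenshot, grid, ask, click, verify."""
    for _ in range(max_attempts):
        shot = capture_desktop()                          # 1. see the screen
        cell = ask_vision_model(                          # 2-3. grid overlay + "which cell?"
            overlay_grid(shot),
            f"Which grid cell should I click to: {goal}?",
        )
        x, y = cell_center(cell.strip())                  # 4. cell label -> screen pixels
        double_click(x, y)                                #    (or a single click, as needed)
        verdict = ask_vision_model(                       # 5. look again and verify
            capture_desktop(),
            f"Did this succeed: {goal}? Answer yes or no.",
        )
        if verdict.strip().lower().startswith("yes"):
            return True
    return False
```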
You don't need a thousand-line framework. You need a screenshot tool, a grid, a vision model, and xdotool. Craw Eyes proves that computer use is accessible to anyone with a Linux box and an API key.
The click trainer below lets you see the accuracy for yourself. Each target appears on screen; imagine an AI agent clicking each one. That's what Craw does, all day, on a VM powered by solar panels in San Diego.
The click trainer spawns targets in 5 difficulty modes. In production, Craw's vision pipeline clicks these automatically โ but you can test your own accuracy too.
🎯 Open Click Trainer