Inside Clawdbot: Browser Automation, Media Processing, and the Canvas System

By Avasdream (@avasdream_)
This is the seventh post in my Clawdbot deep dive series. Here I'm covering the systems that let Clawdbot interact with the visual and audible world — browser automation, media processing, AI-powered media understanding, the canvas rendering system, and text-to-speech.
These subsystems span ~38,000 lines of TypeScript across ~120 source files in the open-source codebase. They handle everything from navigating a web page and clicking a button to transcribing voice messages and rendering interactive UIs on mobile devices.
Browser Control Architecture
The browser subsystem is Clawdbot's most architecturally complex module. It provides full programmatic browser control through a layered stack: an Express HTTP API server at the top, profile management in the middle, and Playwright/CDP connections at the bottom driving actual Chrome instances.
┌─────────────────────────────────────────┐
│ Agent Tool Calls (browser) │
├─────────────────────────────────────────┤
│ client.ts / client-actions.ts │ HTTP client library
├─────────────────────────────────────────┤
│ server.ts │ Express HTTP server
│ routes/ (dispatcher) │ Route handlers
├─────────────────────────────────────────┤
│ server-context.ts │ Profile management
│ profiles-service.ts │ Multi-profile orchestration
├─────────────────────────────────────────┤
│ pw-session.ts │ Playwright connection mgmt
│ pw-tools-core.*.ts │ Playwright operations
│ pw-role-snapshot.ts │ Accessibility tree → refs
├─────────────────────────────────────────┤
│ cdp.ts │ Raw CDP protocol
│ chrome.ts │ Chrome process lifecycle
│ extension-relay.ts │ Chrome Extension bridge
└─────────────────────────────────────────┘
The server is an Express app that binds to 127.0.0.1 on a port derived from the gateway port. Three route groups handle different concerns:
- Basic routes — server status, start/stop, profile management
- Tab routes — listing, opening, closing, and focusing browser tabs
- Agent routes — the full spectrum of AI-driven browser interactions
The agent routes expose a POST /act endpoint with a kind parameter that dispatches to specific browser operations:
| Kind | Description | Key Params |
|---|---|---|
| click | Click an element | ref, doubleClick, button, modifiers |
| type | Type text into an element | ref, text, submit, slowly |
| press | Press a keyboard key | key |
| hover | Hover over an element | ref |
| drag | Drag between elements | startRef, endRef |
| select | Select dropdown option | ref, values |
| fill | Fill multiple form fields | fields[] (ref, type, value) |
| evaluate | Execute JavaScript | fn, optional ref |
| wait | Wait for condition | timeMs, text, textGone, selector |
Additional endpoints handle navigation, screenshots, PDF generation, file upload dialogs, and download management.
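As a sketch, a client-side call against this dispatcher might look like the following. The `kind` discriminator and per-kind parameters come from the table above; the helper names and the exact request shape beyond `?profile=` are my assumptions, not the real client library.

```typescript
// Hypothetical sketch of a POST /act call. Only the kind/params shape and the
// ?profile= query parameter are from the post; the rest is assumed.
type ActRequest =
  | { kind: "click"; ref: string; doubleClick?: boolean; button?: string }
  | { kind: "type"; ref: string; text: string; submit?: boolean }
  | { kind: "wait"; timeMs?: number; text?: string; selector?: string };

function buildActUrl(base: string, profile: string): string {
  // Profile resolution happens on every request via ?profile=
  return `${base}/act?profile=${encodeURIComponent(profile)}`;
}

async function act(base: string, profile: string, req: ActRequest): Promise<unknown> {
  const res = await fetch(buildActUrl(base, profile), {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`act(${req.kind}) failed: HTTP ${res.status}`);
  return res.json();
}
```

A click on a snapshot ref would then be something like `act(base, "clawd", { kind: "click", ref: "e2" })`.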
Multi-Profile System
The browser supports named profiles, each with its own Chrome instance or connection. Two default profiles exist:
clawd profile — A managed Chromium instance launched and controlled entirely by Clawdbot. Chrome is spawned as a child process with CDP debugging enabled:
const args = [
`--remote-debugging-port=${profile.cdpPort}`,
`--user-data-dir=${userDataDir}`,
"--no-first-run", "--no-default-browser-check",
"--disable-sync", "--disable-background-networking",
"--disable-component-update",
"--disable-features=Translate,MediaRouter",
];
if (resolved.headless) args.push("--headless=new");
if (resolved.noSandbox) args.push("--no-sandbox");
The executable resolver auto-detects Chrome, Brave, Edge, or Chromium across macOS, Linux, and Windows. After launch, a polling loop waits up to 15 seconds for the CDP port to become reachable.
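The readiness poll can be sketched as a generic retry loop probing Chrome's standard DevTools JSON endpoint. The 15-second budget is from the post; the function names, interval, and probe are my assumptions.

```typescript
// Hypothetical sketch of the post-launch CDP readiness poll.
async function waitFor(
  probe: () => Promise<boolean>,
  timeoutMs = 15_000,
  intervalMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe()) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // port never became reachable within the budget
}

async function cdpReachable(port: number): Promise<boolean> {
  try {
    // Chrome answers /json/version once its DevTools socket is up.
    const res = await fetch(`http://127.0.0.1:${port}/json/version`);
    return res.ok;
  } catch {
    return false;
  }
}
```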
chrome profile — A Chrome Extension relay that bridges the user's existing Chrome browser. This is an ~480-line WebSocket-based CDP proxy with two endpoints:
- /extension — where the Chrome extension connects
- /cdp — where Playwright connects, as if it were a real CDP endpoint
The relay intercepts CDP commands, handles browser-level commands locally (like Browser.getVersion and Target.getTargets), and forwards everything else to the extension. It also exposes HTTP endpoints mimicking Chrome's DevTools JSON API (/json/list, /json/version) so Playwright sees a standard Chrome instance.
┌────────────┐ WebSocket ┌──────────────────┐ WebSocket ┌───────────┐
│ Chrome │ ◄──────────────► │ Extension Relay │ ◄──────────────► │ Playwright │
│ Extension │ /extension │ (CDP Proxy) │ /cdp │ │
└────────────┘ └──────────────────┘ └───────────┘
Each profile carries metadata — a CDP port, a toolbar color for visual differentiation, and a driver type (clawd or extension). Profile resolution happens on every request via a ?profile= query parameter.
Accessibility Snapshots and Element References
This is the key innovation that makes browser control work for an LLM agent. Instead of dealing with CSS selectors or XPath, Clawdbot generates role-based element references from the browser's accessibility tree.
Two snapshot systems exist:
ARIA snapshots (raw CDP) use Accessibility.getFullAXTree to pull the complete accessibility tree directly over the CDP protocol. This is the lower-level approach.
Role snapshots (Playwright, the primary mode) parse Playwright's ariaSnapshot() output and assign short refs (e1, e2, e3...) to interactive elements. The system classifies elements into three categories:
- Interactive — buttons, links, textboxes, checkboxes, radio buttons, comboboxes
- Content — headings, cells, list items, articles, regions, navigation landmarks
- Structural — generic containers, groups, lists, tables, rows
When the agent receives a snapshot, it sees something like:
- navigation:
- link "Home" [e1]
- link "Products" [e2]
- link "About" [e3]
- main:
- heading "Welcome" (level=1)
- textbox "Search..." [e4]
- button "Go" [e5]
The agent can then issue { kind: "click", ref: "e2" } to click the "Products" link. Under the hood, refLocator() resolves that ref back to a Playwright locator using the stored role and name:
// Excerpt, lightly condensed: `state` is the per-page snapshot cache and
// `scope` the page/frame locator root (setup elided here).
export function refLocator(page: Page, ref: string) {
  const normalized = ref.trim();
  if (/^e\d+$/.test(normalized)) {
    // Raw ARIA mode: Playwright resolves aria-ref directly.
    if (state?.roleRefsMode === "aria") {
      return scope.locator(`aria-ref=${normalized}`);
    }
    // Role mode: look up the cached role/name for this ref.
    const info = state?.roleRefs?.[normalized];
    if (!info) throw new Error(`Unknown ref: ${normalized}`);
    const locator = info.name
      ? scope.getByRole(info.role, { name: info.name, exact: true })
      : scope.getByRole(info.role);
    return info.nth !== undefined ? locator.nth(info.nth) : locator;
  }
  // non-eN refs fall through to plain selector handling (elided)
}
Role+name deduplication uses nth indices when multiple elements share the same role and accessible name. Refs are cached per target ID, so they remain stable across requests within the same page.
For visual debugging, a labeled screenshot feature overlays orange bounding boxes and ref labels directly onto the page DOM, takes a screenshot, then cleans up the injected elements. This gives the agent (or a human reviewing its actions) a visual map of what each ref points to.
Playwright Session Management
The Playwright connection layer (pw-session.ts) manages persistent connections over CDP with retry logic — 3 attempts with progressive backoff (5s, 7s, 9s timeouts). Once connected, every page gets comprehensive state tracking:
- Console messages — buffered up to 500 entries
- Page errors — buffered up to 200 entries
- Network requests — buffered up to 500, matched with responses and failures
- Role refs — the e1/e2 mapping for element references
- Armed hooks — file upload dialogs, JS dialogs, and download interceptors
Errors from Playwright are converted to AI-friendly messages via toAIFriendlyError(), turning stack traces into descriptions the agent can reason about and retry.
For cases where the full Playwright layer isn't needed, a raw CDP module provides direct protocol access — screenshots via Page.captureScreenshot, JavaScript evaluation via Runtime.evaluate, and DOM snapshots through injected JavaScript that walks the tree.
Media Processing Pipeline
All media in Clawdbot flows through a central store at ~/.config/clawdbot/media/. The store handles ingestion, MIME detection, image processing, and serving.
Storage and Lifecycle
Media files get a UUID-based filename with the original name embedded for traceability:
{sanitized-original-name}---{uuid}.{ext}
Files are written with 0o600 permissions (owner-only read/write). A cleanup process runs periodically, deleting files older than the configurable TTL — 2 minutes by default. This aggressive cleanup prevents disk accumulation since most media is transient (processed and forwarded within seconds).
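Ingestion under that naming scheme might look like the following sketch. The `{sanitized-original-name}---{uuid}.{ext}` pattern is from the post; the exact sanitization rules and the helper name are my assumptions.

```typescript
import { randomUUID } from "node:crypto";
import * as path from "node:path";

// Hypothetical sketch of media filename generation. Sanitization (allowed
// characters, length cap) is assumed, not taken from the real implementation.
function mediaFileName(originalName: string): string {
  const rawExt = path.extname(originalName);
  const ext = rawExt.replace(".", "") || "bin"; // fall back for extensionless files
  const base = path.basename(originalName, rawExt);
  const sanitized = base.replace(/[^a-zA-Z0-9._-]+/g, "-").slice(0, 64);
  return `${sanitized}---${randomUUID()}.${ext}`;
}
```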
MIME Detection
A three-stage detection system with priority ordering:
- Magic bytes — sniffing the file header via file-type
- File extension — mapping from a built-in extension table
- HTTP Content-Type — from the response headers when fetching remotely
The priority logic prefers sniffed MIME (unless it's generic like application/octet-stream), then falls back to extension, then to headers. This handles cases where servers send wrong Content-Types or files have wrong extensions.
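That priority logic condenses to a few lines. The ordering is from the description above; the function name and the exact set of "generic" types are my assumptions.

```typescript
// Hypothetical sketch of the three-stage MIME priority: sniffed (unless
// generic) → extension table → HTTP headers.
const GENERIC = new Set(["application/octet-stream", "binary/octet-stream"]);

function pickMime(opts: {
  sniffed?: string;       // from magic bytes (file-type)
  fromExtension?: string; // from the built-in extension table
  fromHeaders?: string;   // from the HTTP Content-Type header
}): string | undefined {
  if (opts.sniffed && !GENERIC.has(opts.sniffed)) return opts.sniffed;
  return opts.fromExtension ?? opts.fromHeaders ?? opts.sniffed;
}
```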
Image Operations
Two backends are supported — Sharp (Node.js, default) and sips (macOS native). The system auto-selects sips when running under Bun on macOS, since Sharp's native bindings can be problematic there.
Key operations:
- EXIF orientation normalization — reads JPEG EXIF bytes directly and applies rotation
- Resize to JPEG — progressive quality/size grid search to fit under byte limits
- Resize to PNG — preserves alpha channel when transparency is detected
- HEIC → JPEG conversion — via Sharp or sips
- Alpha channel detection — determines whether to use the PNG or JPEG path
The JPEG resize uses a grid search across size and quality parameters to find the best fit under a given byte limit:
// sizes: [2048, 1536, 1280, 1024, 800]
// qualities: [80, 70, 60, 50, 40]
// Tries combinations until output fits under maxBytes
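That search can be sketched generically. The size and quality grids are from the comments above; the loop structure is my assumption, with the image backend (Sharp or sips) abstracted behind an `encode` callback so the control flow is visible.

```typescript
// Hypothetical sketch of the JPEG fit-under-limit grid search.
const SIZES = [2048, 1536, 1280, 1024, 800];
const QUALITIES = [80, 70, 60, 50, 40];

type Encode = (size: number, quality: number) => Promise<Buffer>;

async function fitUnder(maxBytes: number, encode: Encode): Promise<Buffer | null> {
  // Walk the grid from largest/best toward smallest/worst; first fit wins,
  // so the result is the highest-fidelity output under the byte budget.
  for (const size of SIZES) {
    for (const quality of QUALITIES) {
      const out = await encode(size, quality);
      if (out.byteLength <= maxBytes) return out;
    }
  }
  return null; // nothing in the grid fit under the limit
}
```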
Media Server
An Express server serves media files with multiple security controls:
- ID validation — alphanumeric plus dots/hyphens/underscores, max 200 chars
- Path traversal protection — openFileWithinRoot() validates real paths
- Size limits — 5MB maximum per file
- TTL enforcement — files past their expiry are rejected
- Single-use cleanup — files are deleted after being served
For external access (e.g., sending media URLs to messaging platforms), media can be hosted via a Tailscale hostname, generating URLs like https://{tailnet-hostname}/media/{id}.
AI-Powered Media Understanding
The media understanding module provides AI analysis of attachments — transcribing audio, describing images, and analyzing video. Six providers are built in:
| Provider | Capabilities | Default Model |
|---|---|---|
| Groq | Audio transcription | Whisper |
| OpenAI | Audio transcription | gpt-4o-mini-transcribe |
| Google | Audio + video description | Gemini |
| Anthropic | Image description | Claude |
| MiniMax | Image description | MiniMax-VL-01 |
| Deepgram | Audio transcription | nova-3 |
The Runner Pipeline
The runner orchestrates a multi-step pipeline for each media attachment:
- Check if capability is enabled — per-capability (audio/image/video) toggle
- Select relevant attachments — based on policy rules
- Check scope — session, channel, and chat-type restrictions
- Skip if redundant — image understanding is skipped when the primary LLM already supports vision natively
- Resolve model entries — from config or auto-detection
- Run with fallback chain — tries each provider in order until one succeeds
Auto-detection checks for API keys in the environment and available local binaries:
// Priority for audio:
// 1. Active model provider (if it supports audio)
// 2. Local tools: sherpa-onnx → whisper-cli → whisper
// 3. Gemini CLI
// 4. API key providers: openai, groq, deepgram, google
Local CLI tools are supported through template substitution — variables like {{MediaPath}}, {{OutputDir}}, and {{Prompt}} get replaced in command arguments before execution.
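The substitution itself is straightforward. The variable names come from the post; the helper name and the behavior for unknown variables (left in place) are my assumptions.

```typescript
// Hypothetical sketch of {{Var}} substitution in CLI tool arguments.
function substituteArgs(args: string[], vars: Record<string, string>): string[] {
  return args.map((arg) =>
    // Replace each {{Key}} with its value; unknown keys are left untouched.
    arg.replace(/\{\{(\w+)\}\}/g, (match, key) => vars[key] ?? match),
  );
}

// e.g. assembling a hypothetical whisper-cli invocation:
const argv = substituteArgs(
  ["-f", "{{MediaPath}}", "-of", "{{OutputDir}}/transcript"],
  { MediaPath: "/tmp/voice.ogg", OutputDir: "/tmp/out" },
);
```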
Provider Implementations
Audio providers use standard REST APIs. OpenAI-compatible endpoints accept multipart form uploads; Deepgram takes raw binary with a Content-Type header. Video description sends the full video as inline base64 to Google's Gemini API. Image description uses the pi-ai model registry for multi-provider support, falling back through available providers.
Errors use a MediaUnderstandingSkipError type that allows graceful fallback — if one provider fails (rate limit, unsupported format, timeout), the runner moves to the next in the chain.
Link Content Extraction
The link understanding module extracts URLs from user messages and runs them through configured CLI tools, typically web_fetch with Mozilla's Readability algorithm.
Link detection strips markdown syntax first ([text](url) → url), then matches HTTP(S) URLs with deduplication. A maximum of 3 links per message is enforced by default. SSRF protection rejects localhost and loopback addresses via isAllowedUrl().
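Those detection rules can be sketched as follows. The markdown-stripping, dedup, and 3-link cap are from the post; the regexes and function name are my assumptions.

```typescript
// Hypothetical sketch of link extraction: strip [text](url) down to url,
// match HTTP(S) URLs, dedupe, cap the count.
function extractLinks(body: string, maxLinks = 3): string[] {
  // [text](url) → url
  const stripped = body.replace(/\[([^\]]*)\]\((https?:\/\/[^)\s]+)\)/g, "$2");
  const matches = stripped.match(/https?:\/\/[^\s<>")]+/g) ?? [];
  return [...new Set(matches)].slice(0, maxLinks);
}
```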
Results are injected back into the message context:
export async function applyLinkUnderstanding(params) {
  const { cfg, ctx } = params;
  const result = await runLinkUnderstanding({ cfg, ctx });
if (result.outputs.length > 0) {
ctx.LinkUnderstanding = [...(ctx.LinkUnderstanding ?? []), ...result.outputs];
ctx.Body = formatLinkUnderstandingBody({
body: ctx.Body,
outputs: result.outputs,
});
}
}
The extracted content appears alongside the user's message as additional context, so the agent can answer questions about linked pages without opening a browser.
SSRF Prevention
Both media fetching and link extraction use hostname pinning via resolvePinnedHostname(). This resolves DNS once and pins the IP, preventing DNS rebinding attacks where a hostname resolves to an internal IP after the initial check. Combined with the isAllowedUrl() filter that blocks private ranges, this prevents the agent from being tricked into accessing internal services.
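A minimal sketch of the pattern (the real `resolvePinnedHostname()`/`isAllowedUrl()` may differ; these names and range checks are my reconstruction):

```typescript
import { lookup } from "node:dns/promises";
import * as net from "node:net";

// Hypothetical private-range filter covering loopback, RFC 1918, and
// link-local addresses; simplified relative to a production blocklist.
function isPrivateIp(ip: string): boolean {
  if (net.isIPv6(ip)) {
    return ip === "::1" || ip.startsWith("fc") || ip.startsWith("fd") || ip.startsWith("fe80");
  }
  const [a, b] = ip.split(".").map(Number);
  return a === 127 || a === 10 || (a === 172 && b >= 16 && b <= 31)
    || (a === 192 && b === 168) || (a === 169 && b === 254);
}

async function pinHostname(hostname: string): Promise<string> {
  const { address } = await lookup(hostname);
  if (isPrivateIp(address)) throw new Error(`blocked private address for ${hostname}`);
  // Subsequent requests go to `address` directly (with a Host header), so a
  // later DNS rebind to an internal IP cannot redirect them.
  return address;
}
```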
Canvas System
The canvas system serves a local web application that acts as an interactive display surface, primarily used on mobile nodes (phones and tablets running the Clawdbot companion app).
Canvas Host Server
The server watches a local directory (~/clawd/canvas/) and serves its contents as a web application:
const rootDir = resolveUserPath(
opts.rootDir ?? path.join(os.homedir(), "clawd", "canvas")
);
A chokidar file watcher monitors the directory for changes and triggers live reload via WebSocket — edits to HTML, CSS, or JavaScript files are reflected on the device within ~75ms (debounce threshold). The server binds to 0.0.0.0 rather than localhost so mobile nodes on the same network can access it.
Every served HTML page gets a live reload snippet injected before </body>:
const proto = location.protocol === "https:" ? "wss" : "ws";
const ws = new WebSocket(proto + "://" + location.host + "/__clawdbot/ws");
ws.onmessage = (ev) => {
if (ev.data === "reload") location.reload();
};
A2UI Bridge
The A2UI (App-to-UI) system bridges native mobile apps and the canvas web content. A JavaScript API is injected into every page:
globalThis.Clawdbot = {
postMessage: function(payload) {
// iOS: window.webkit.messageHandlers
// .clawdbotCanvasA2UIAction.postMessage(raw)
// Android: window.clawdbotCanvasA2UIAction.postMessage(raw)
},
sendUserAction: function(userAction) {
const id = userAction.id || crypto.randomUUID();
return this.postMessage({ userAction: { ...userAction, id } });
},
};
This means canvas pages can send structured actions back to the native app, and from there to the Clawdbot gateway. The agent can render a custom UI on a phone screen — a form, a dashboard, a game — and receive user interactions back through the A2UI bridge.
The default index.html is a diagnostic page with buttons that test whether the native bridge is available on iOS, Android, or neither.
Text-to-Speech
The TTS module (~1100 lines in a single file) supports three providers with automatic fallback:
| Provider | API Endpoint | Default Voice | Output |
|---|---|---|---|
| OpenAI | POST /audio/speech | alloy (model: gpt-4o-mini-tts) | MP3 or Opus |
| ElevenLabs | POST /v1/text-to-speech/:voiceId | pMsXgVXv3BLzUgSXRplE (model: eleven_multilingual_v2) | MP3 or Opus |
| Edge | node-edge-tts library | en-US-MichelleNeural | MP3 |
Channel-Aware Output
The output format adapts to the messaging platform. Telegram gets Opus (smaller, supports voice notes natively). Everything else gets MP3:
const TELEGRAM_OUTPUT = {
openai: "opus",
elevenlabs: "opus_48000_64",
extension: ".opus",
voiceCompatible: true,
};
const DEFAULT_OUTPUT = {
openai: "mp3",
elevenlabs: "mp3_44100_128",
extension: ".mp3",
voiceCompatible: false,
};
Auto Modes and Directives
Four auto modes control when TTS activates:
- off — disabled entirely
- always — every reply gets audio
- inbound — only when the user sends audio/voice
- tagged — only when the agent includes [[tts]] directives
The LLM can control TTS parameters via inline directives in its response:
[[tts:provider=elevenlabs voice_id=pMsXgVXv3BLzUgSXRplE stability=0.7 speed=1.2]]
[[tts:text]]This specific text will be spoken.[[/tts:text]]
Directives support provider, voice, model, stability, similarity boost, style, speed, speaker boost, normalization, language, and seed — each individually gatable by the modelOverrides policy to prevent unauthorized model switching.
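A parser for this syntax might look like the following sketch; the directive format matches the examples above, but the real parser surely differs in details.

```typescript
// Hypothetical parser for [[tts:key=value ...]] and [[tts:text]]...[[/tts:text]].
type TtsDirective = { params: Record<string, string>; text?: string };

function parseTtsDirectives(reply: string): TtsDirective[] {
  const out: TtsDirective[] = [];
  // Parameter directives: [[tts:provider=elevenlabs speed=1.2]]
  for (const m of reply.matchAll(/\[\[tts:([^\]]+)\]\]/g)) {
    if (m[1] === "text") continue; // the text form is handled below
    const params: Record<string, string> = {};
    for (const kv of m[1].trim().split(/\s+/)) {
      const [k, v] = kv.split("=");
      if (k && v) params[k] = v;
    }
    out.push({ params });
  }
  // Explicit speech text: [[tts:text]]...[[/tts:text]]
  for (const m of reply.matchAll(/\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g)) {
    out.push({ params: {}, text: m[1] });
  }
  return out;
}
```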
Summarization
Long text is automatically summarized before synthesis. When text exceeds maxLength (default 1500 characters), the system calls the LLM to produce a condensed version. If summarization is disabled, it truncates with an ellipsis. This prevents excessive API costs and keeps audio clips at a reasonable length.
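The length guard reduces to a small decision, sketched here with the LLM call as a stand-in callback (names and exact truncation behavior are my assumptions; the 1500-character default is from the post):

```typescript
// Hypothetical pre-synthesis length guard: summarize when a summarizer is
// available, otherwise truncate with an ellipsis.
async function prepareTtsText(
  text: string,
  maxLength = 1500,
  summarize?: (t: string) => Promise<string>,
): Promise<string> {
  if (text.length <= maxLength) return text;
  if (summarize) return summarize(text); // LLM condenses the reply
  return text.slice(0, maxLength - 1) + "…";
}
```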
Custom Endpoints
OpenAI TTS supports custom base URLs via OPENAI_TTS_BASE_URL, enabling local TTS servers like Kokoro or LocalAI. When a custom endpoint is detected, model and voice validation is relaxed to allow arbitrary values.
User preferences (provider, auto mode, max length, summarization toggle) are persisted in ~/.config/clawdbot/settings/tts.json with atomic writes via temp-file-plus-rename to prevent corruption.
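The temp-file-plus-rename pattern looks roughly like this (helper name hypothetical): because `rename()` is atomic on the same filesystem, a reader either sees the old file or the complete new one, never a half-written JSON.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";
import { randomUUID } from "node:crypto";

// Hypothetical sketch of an atomic JSON write: write to a unique temp file in
// the same directory, then rename over the destination.
async function writeJsonAtomic(file: string, value: unknown): Promise<void> {
  const tmp = path.join(
    path.dirname(file),
    `.${path.basename(file)}.${randomUUID()}.tmp`,
  );
  await fs.writeFile(tmp, JSON.stringify(value, null, 2), { mode: 0o600 });
  await fs.rename(tmp, file); // atomic on the same filesystem
}
```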
Cross-Cutting Concerns
Error Handling
Each subsystem has its own error strategy tuned to its failure modes:
- Browser — toAIFriendlyError() converts Playwright stack traces to messages the agent can reason about and retry
- Media — MediaFetchError with typed codes (max_bytes, http_error, fetch_failed)
- Media Understanding — MediaUnderstandingSkipError enables graceful provider fallback
- TTS — provider chain tries each backend in order on failure
Security
Security is layered across all modules:
- SSRF protection — resolvePinnedHostname() for DNS pinning, isAllowedUrl() blocks private ranges
- Path traversal — openFileWithinRoot() with real-path validation on every file serve
- Media TTL — automatic 2-minute cleanup prevents disk accumulation
- Extension relay — loopback-only binding, rejects non-loopback WebSocket upgrades
- Single-use media — files are deleted after being served once
- Credential safety — atomic writes with backup for persistent auth state
Configuration
All systems share sections in the central ClawdbotConfig:
{
browser: {
// profiles, CDP settings, headless/sandbox modes
},
tools: {
media: {
// per-capability model configs, scope, timeouts
},
links: {
// CLI tools, scope, maxLinks
},
},
messages: {
tts: {
// provider, voice, auto mode, summarization
},
},
}
Series
- Core Architecture & Gateway
- Memory System
- Agent System & AI Providers
- Channel & Messaging
- Sessions & Multi-Agent
- CLI, Commands & TUI
- Browser, Media & Canvas (this post)
- Infrastructure & Security