Inside Clawdbot: Browser Automation, Media Processing, and the Canvas System

By Avasdream (@avasdream_)
This is the seventh post in my Clawdbot deep dive series. Here I'm covering the systems that let Clawdbot interact with the visual and audible world — browser automation, media processing, AI-powered media understanding, the canvas rendering system, and text-to-speech.
These subsystems span ~38,000 lines of TypeScript across ~120 source files in the open-source codebase. They handle everything from navigating a web page and clicking a button to transcribing voice messages and rendering interactive UIs on mobile devices.
Browser Control Architecture
The browser subsystem is Clawdbot's most architecturally complex module. It provides full programmatic browser control through a layered stack: an Express HTTP API server at the top, profile management in the middle, and Playwright/CDP connections at the bottom driving actual Chrome instances.
┌─────────────────────────────────────────┐
│ Agent Tool Calls (browser) │
├─────────────────────────────────────────┤
│ client.ts / client-actions.ts │ HTTP client library
├─────────────────────────────────────────┤
│ server.ts │ Express HTTP server
│ routes/ (dispatcher) │ Route handlers
├─────────────────────────────────────────┤
│ server-context.ts │ Profile management
│ profiles-service.ts │ Multi-profile orchestration
├─────────────────────────────────────────┤
│ pw-session.ts │ Playwright connection mgmt
│ pw-tools-core.*.ts │ Playwright operations
│ pw-role-snapshot.ts │ Accessibility tree → refs
├─────────────────────────────────────────┤
│ cdp.ts │ Raw CDP protocol
│ chrome.ts │ Chrome process lifecycle
│ extension-relay.ts │ Chrome Extension bridge
└─────────────────────────────────────────┘
The server is an Express app that binds to 127.0.0.1 on a port derived from the gateway port. Three route groups handle different concerns:
- Basic routes — server status, start/stop, profile management
- Tab routes — listing, opening, closing, and focusing browser tabs
- Agent routes — the full spectrum of AI-driven browser interactions
The agent routes expose a POST /act endpoint with a kind parameter that dispatches to specific browser operations:
| Kind | Description | Key Params |
|---|---|---|
| click | Click an element | ref, doubleClick, button, modifiers |
| type | Type text into an element | ref, text, submit, slowly |
| press | Press a keyboard key | key |
| hover | Hover over an element | ref |
| drag | Drag between elements | startRef, endRef |
| select | Select dropdown option | ref, values |
| fill | Fill multiple form fields | fields[] (ref, type, value) |
| evaluate | Execute JavaScript | fn, optional ref |
| wait | Wait for condition | timeMs, text, textGone, selector |
Additional endpoints handle navigation, screenshots, PDF generation, file upload dialogs, and download management.
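As a sketch, a client-side call against this dispatcher might look like the following. The `kind` discriminator and per-kind parameters come from the table above; the helper names and the exact request shape beyond `?profile=` are my assumptions, not the real client library.

```typescript
// Hypothetical sketch of a POST /act call. Only the kind/params shape and the
// ?profile= query parameter are from the post; the rest is assumed.
type ActRequest =
  | { kind: "click"; ref: string; doubleClick?: boolean; button?: string }
  | { kind: "type"; ref: string; text: string; submit?: boolean }
  | { kind: "wait"; timeMs?: number; text?: string; selector?: string };

function buildActUrl(base: string, profile: string): string {
  // Profile resolution happens on every request via ?profile=
  return `${base}/act?profile=${encodeURIComponent(profile)}`;
}

async function act(base: string, profile: string, req: ActRequest): Promise<unknown> {
  const res = await fetch(buildActUrl(base, profile), {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`act(${req.kind}) failed: HTTP ${res.status}`);
  return res.json();
}
```

A click on a snapshot ref would then be something like `act(base, "clawd", { kind: "click", ref: "e2" })`.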
Multi-Profile System
The browser supports named profiles, each with its own Chrome instance or connection. Two default profiles exist:
clawd profile — A managed Chromium instance launched and controlled entirely by Clawdbot. Chrome is spawned as a child process with CDP debugging enabled:
const args = [
`--remote-debugging-port=${profile.cdpPort}`,
`--user-data-dir=${userDataDir}`,
"--no-first-run", "--no-default-browser-check",
"--disable-sync", "--disable-background-networking",
"--disable-component-update",
"--disable-features=Translate,MediaRouter",
];
if (resolved.headless) args.push("--headless=new");
if (resolved.noSandbox) args.push("--no-sandbox");
The executable resolver auto-detects Chrome, Brave, Edge, or Chromium across macOS, Linux, and Windows. After launch, a polling loop waits up to 15 seconds for the CDP port to become reachable.
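The readiness poll can be sketched as a generic retry loop probing Chrome's standard DevTools JSON endpoint. The 15-second budget is from the post; the function names, interval, and probe are my assumptions.

```typescript
// Hypothetical sketch of the post-launch CDP readiness poll.
async function waitFor(
  probe: () => Promise<boolean>,
  timeoutMs = 15_000,
  intervalMs = 250,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe()) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // port never became reachable within the budget
}

async function cdpReachable(port: number): Promise<boolean> {
  try {
    // Chrome answers /json/version once its DevTools socket is up.
    const res = await fetch(`http://127.0.0.1:${port}/json/version`);
    return res.ok;
  } catch {
    return false;
  }
}
```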
chrome profile — A Chrome Extension relay that bridges the user's existing Chrome browser. This is an ~480-line WebSocket-based CDP proxy with two endpoints:
- /extension — where the Chrome extension connects
- /cdp — where Playwright connects, as if it were a real CDP endpoint
The relay intercepts CDP commands, handles browser-level commands locally (like Browser.getVersion and Target.getTargets), and forwards everything else to the extension. It also exposes HTTP endpoints mimicking Chrome's DevTools JSON API (/json/list, /json/version) so Playwright sees a standard Chrome instance.
┌────────────┐ WebSocket ┌──────────────────┐ WebSocket ┌───────────┐
│ Chrome │ ◄──────────────► │ Extension Relay │ ◄──────────────► │ Playwright │
│ Extension │ /extension │ (CDP Proxy) │ /cdp │ │
└────────────┘ └──────────────────┘ └───────────┘
Each profile carries metadata — a CDP port, a toolbar color for visual differentiation, and a driver type (clawd or extension). Profile resolution happens on every request via a ?profile= query parameter.
Accessibility Snapshots and Element References
This is the key innovation that makes browser control work for an LLM agent. Instead of dealing with CSS selectors or XPath, Clawdbot generates role-based element references from the browser's accessibility tree.
Two snapshot systems exist:
ARIA snapshots (raw CDP) use Accessibility.getFullAXTree to pull the complete accessibility tree directly over the CDP protocol. This is the lower-level approach.
Role snapshots (Playwright, the primary mode) parse Playwright's ariaSnapshot() output and assign short refs (e1, e2, e3...) to interactive elements. The system classifies elements into three categories:
- Interactive — buttons, links, textboxes, checkboxes, radio buttons, comboboxes
- Content — headings, cells, list items, articles, regions, navigation landmarks
- Structural — generic containers, groups, lists, tables, rows
When the agent receives a snapshot, it sees something like:
- navigation:
- link "Home" [e1]
- link "Products" [e2]
- link "About" [e3]
- main:
- heading "Welcome" (level=1)
- textbox "Search..." [e4]
- button "Go" [e5]
The agent can then issue { kind: "click", ref: "e2" } to click the "Products" link. Under the hood, refLocator() resolves that ref back to a Playwright locator using the stored role and name:
// Excerpt, lightly condensed: `state` is the per-page snapshot cache and
// `scope` the page/frame locator root (setup elided here).
export function refLocator(page: Page, ref: string) {
  const normalized = ref.trim();
  if (/^e\d+$/.test(normalized)) {
    // Raw ARIA mode: Playwright resolves aria-ref directly.
    if (state?.roleRefsMode === "aria") {
      return scope.locator(`aria-ref=${normalized}`);
    }
    // Role mode: look up the cached role/name for this ref.
    const info = state?.roleRefs?.[normalized];
    if (!info) throw new Error(`Unknown ref: ${normalized}`);
    const locator = info.name
      ? scope.getByRole(info.role, { name: info.name, exact: true })
      : scope.getByRole(info.role);
    return info.nth !== undefined ? locator.nth(info.nth) : locator;
  }
  // non-eN refs fall through to plain selector handling (elided)
}
Role+name deduplication uses nth indices when multiple elements share the same role and accessible name. Refs are cached per target ID, so they remain stable across requests within the same page.
For visual debugging, a labeled screenshot feature overlays orange bounding boxes and ref labels directly onto the page DOM, takes a screenshot, then cleans up the injected elements. This gives the agent (or a human reviewing its actions) a visual map of what each ref points to.
Playwright Session Management
The Playwright connection layer (pw-session.ts) manages persistent connections over CDP with retry logic — 3 attempts with progressive backoff (5s, 7s, 9s timeouts). Once connected, every page gets comprehensive state tracking:
- Console messages — buffered up to 500 entries
- Page errors — buffered up to 200 entries
- Network requests — buffered up to 500, matched with responses and failures
- Role refs — the e1/e2 mapping for element references
- Armed hooks — file upload dialogs, JS dialogs, and download interceptors
Errors from Playwright are converted to AI-friendly messages via toAIFriendlyError(), turning stack traces into descriptions the agent can reason about and retry.
For cases where the full Playwright layer isn't needed, a raw CDP module provides direct protocol access — screenshots via Page.captureScreenshot, JavaScript evaluation via Runtime.evaluate, and DOM snapshots through injected JavaScript that walks the tree.
Media Processing Pipeline
All media in Clawdbot flows through a central store at ~/.config/clawdbot/media/. The store handles ingestion, MIME detection, image processing, and serving.
Storage and Lifecycle
Media files get a UUID-based filename with the original name embedded for traceability:
{sanitized-original-name}---{uuid}.{ext}
Files are written with 0o600 permissions (owner-only read/write). A cleanup process runs periodically, deleting files older than the configurable TTL — 2 minutes by default. This aggressive cleanup prevents disk accumulation since most media is transient (processed and forwarded within seconds).
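Ingestion under that naming scheme might look like the following sketch. The `{sanitized-original-name}---{uuid}.{ext}` pattern is from the post; the exact sanitization rules and the helper name are my assumptions.

```typescript
import { randomUUID } from "node:crypto";
import * as path from "node:path";

// Hypothetical sketch of media filename generation. Sanitization (allowed
// characters, length cap) is assumed, not taken from the real implementation.
function mediaFileName(originalName: string): string {
  const rawExt = path.extname(originalName);
  const ext = rawExt.replace(".", "") || "bin"; // fall back for extensionless files
  const base = path.basename(originalName, rawExt);
  const sanitized = base.replace(/[^a-zA-Z0-9._-]+/g, "-").slice(0, 64);
  return `${sanitized}---${randomUUID()}.${ext}`;
}
```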
MIME Detection
A three-stage detection system with priority ordering:
- Magic bytes — sniffing the file header via file-type
- File extension — mapping from a built-in extension table
- HTTP Content-Type — from the response headers when fetching remotely
The priority logic prefers sniffed MIME (unless it's generic like application/octet-stream), then falls back to extension, then to headers. This handles cases where servers send wrong Content-Types or files have wrong extensions.
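That priority logic condenses to a few lines. The ordering is from the description above; the function name and the exact set of "generic" types are my assumptions.

```typescript
// Hypothetical sketch of the three-stage MIME priority: sniffed (unless
// generic) → extension table → HTTP headers.
const GENERIC = new Set(["application/octet-stream", "binary/octet-stream"]);

function pickMime(opts: {
  sniffed?: string;       // from magic bytes (file-type)
  fromExtension?: string; // from the built-in extension table
  fromHeaders?: string;   // from the HTTP Content-Type header
}): string | undefined {
  if (opts.sniffed && !GENERIC.has(opts.sniffed)) return opts.sniffed;
  return opts.fromExtension ?? opts.fromHeaders ?? opts.sniffed;
}
```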
Image Operations
Two backends are supported — Sharp (Node.js, default) and sips (macOS native). The system auto-selects sips when running under Bun on macOS, since Sharp's native bindings can be problematic there.
Key operations:
- EXIF orientation normalization — reads JPEG EXIF bytes directly and applies rotation
- Resize to JPEG — progressive quality/size grid search to fit under byte limits
- Resize to PNG — preserves alpha channel when transparency is detected
- HEIC → JPEG conversion — via Sharp or sips
- Alpha channel detection — determines whether to use the PNG or JPEG path
The JPEG resize uses a grid search across size and quality parameters to find the best fit under a given byte limit:
// sizes: [2048, 1536, 1280, 1024, 800]
// qualities: [80, 70, 60, 50, 40]
// Tries combinations until output fits under maxBytes
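That search can be sketched generically. The size and quality grids are from the comments above; the loop structure is my assumption, with the image backend (Sharp or sips) abstracted behind an `encode` callback so the control flow is visible.

```typescript
// Hypothetical sketch of the JPEG fit-under-limit grid search.
const SIZES = [2048, 1536, 1280, 1024, 800];
const QUALITIES = [80, 70, 60, 50, 40];

type Encode = (size: number, quality: number) => Promise<Buffer>;

async function fitUnder(maxBytes: number, encode: Encode): Promise<Buffer | null> {
  // Walk the grid from largest/best toward smallest/worst; first fit wins,
  // so the result is the highest-fidelity output under the byte budget.
  for (const size of SIZES) {
    for (const quality of QUALITIES) {
      const out = await encode(size, quality);
      if (out.byteLength <= maxBytes) return out;
    }
  }
  return null; // nothing in the grid fit under the limit
}
```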
Media Server
An Express server serves media files with multiple security controls:
- ID validation — alphanumeric plus dots/hyphens/underscores, max 200 chars
- Path traversal protection — openFileWithinRoot() validates real paths
- Size limits — 5MB maximum per file
- TTL enforcement — files past their expiry are rejected
- Single-use cleanup — files are deleted after being served
For external access (e.g., sending media URLs to messaging platforms), media can be hosted via a Tailscale hostname, generating URLs like https://{tailnet-hostname}/media/{id}.
AI-Powered Media Understanding
The media understanding module provides AI analysis of attachments — transcribing audio, describing images, and analyzing video. Six providers are built in:
| Provider | Capabilities | Default Model |
|---|---|---|
| Groq | Audio transcription | Whisper |
| OpenAI | Audio transcription | gpt-4o-mini-transcribe |
| Google | Audio + video description | Gemini |
| Anthropic | Image description | Claude |
| MiniMax | Image description | MiniMax-VL-01 |
| Deepgram | Audio transcription | nova-3 |
The Runner Pipeline
The runner orchestrates a multi-step pipeline for each media attachment:
- Check if capability is enabled — per-capability (audio/image/video) toggle
- Select relevant attachments — based on policy rules
- Check scope — session, channel, and chat-type restrictions
- Skip if redundant — image understanding is skipped when the primary LLM already supports vision natively
- Resolve model entries — from config or auto-detection
- Run with fallback chain — tries each provider in order until one succeeds
Auto-detection checks for API keys in the environment and available local binaries:
// Priority for audio:
// 1. Active model provider (if it supports audio)
// 2. Local tools: sherpa-onnx → whisper-cli → whisper
// 3. Gemini CLI
// 4. API key providers: openai, groq, deepgram, google
Local CLI tools are supported through template substitution — variables like {{MediaPath}}, {{OutputDir}}, and {{Prompt}} get replaced in command arguments before execution.
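The substitution itself is straightforward. The variable names come from the post; the helper name and the behavior for unknown variables (left in place) are my assumptions.

```typescript
// Hypothetical sketch of {{Var}} substitution in CLI tool arguments.
function substituteArgs(args: string[], vars: Record<string, string>): string[] {
  return args.map((arg) =>
    // Replace each {{Key}} with its value; unknown keys are left untouched.
    arg.replace(/\{\{(\w+)\}\}/g, (match, key) => vars[key] ?? match),
  );
}

// e.g. assembling a hypothetical whisper-cli invocation:
const argv = substituteArgs(
  ["-f", "{{MediaPath}}", "-of", "{{OutputDir}}/transcript"],
  { MediaPath: "/tmp/voice.ogg", OutputDir: "/tmp/out" },
);
```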
Provider Implementations
Audio providers use standard REST APIs. OpenAI-compatible endpoints accept multipart form uploads; Deepgram takes raw binary with a Content-Type header. Video description sends the full video as inline base64 to Google's Gemini API. Image description uses the pi-ai model registry for multi-provider support, falling back through available providers.
Errors use a MediaUnderstandingSkipError type that allows graceful fallback — if one provider fails (rate limit, unsupported format, timeout), the runner moves to the next in the chain.
Link Content Extraction
The link understanding module extracts URLs from user messages and runs them through configured CLI tools, typically web_fetch with Mozilla's Readability algorithm.
Link detection strips markdown syntax first ([text](url) → url), then matches HTTP(S) URLs with deduplication. A maximum of 3 links per message is enforced by default. SSRF protection rejects localhost and loopback addresses via isAllowedUrl().
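Those detection rules can be sketched as follows. The markdown-stripping, dedup, and 3-link cap are from the post; the regexes and function name are my assumptions.

```typescript
// Hypothetical sketch of link extraction: strip [text](url) down to url,
// match HTTP(S) URLs, dedupe, cap the count.
function extractLinks(body: string, maxLinks = 3): string[] {
  // [text](url) → url
  const stripped = body.replace(/\[([^\]]*)\]\((https?:\/\/[^)\s]+)\)/g, "$2");
  const matches = stripped.match(/https?:\/\/[^\s<>")]+/g) ?? [];
  return [...new Set(matches)].slice(0, maxLinks);
}
```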
Results are injected back into the message context:
export async function applyLinkUnderstanding(params) {
  const { cfg, ctx } = params;
  const result = await runLinkUnderstanding({ cfg, ctx });
if (result.outputs.length > 0) {
ctx.LinkUnderstanding = [...(ctx.LinkUnderstanding ?? []), ...result.outputs];
ctx.Body = formatLinkUnderstandingBody({
body: ctx.Body,
outputs: result.outputs,
});
}
}
The extracted content appears alongside the user's message as additional context, so the agent can answer questions about linked pages without opening a browser.
SSRF Prevention
Both media fetching and link extraction use hostname pinning via resolvePinnedHostname(). This resolves DNS once and pins the IP, preventing DNS rebinding attacks where a hostname resolves to an internal IP after the initial check. Combined with the isAllowedUrl() filter that blocks private ranges, this prevents the agent from being tricked into accessing internal services.
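A minimal sketch of the pattern (the real `resolvePinnedHostname()`/`isAllowedUrl()` may differ; these names and range checks are my reconstruction):

```typescript
import { lookup } from "node:dns/promises";
import * as net from "node:net";

// Hypothetical private-range filter covering loopback, RFC 1918, and
// link-local addresses; simplified relative to a production blocklist.
function isPrivateIp(ip: string): boolean {
  if (net.isIPv6(ip)) {
    return ip === "::1" || ip.startsWith("fc") || ip.startsWith("fd") || ip.startsWith("fe80");
  }
  const [a, b] = ip.split(".").map(Number);
  return a === 127 || a === 10 || (a === 172 && b >= 16 && b <= 31)
    || (a === 192 && b === 168) || (a === 169 && b === 254);
}

async function pinHostname(hostname: string): Promise<string> {
  const { address } = await lookup(hostname);
  if (isPrivateIp(address)) throw new Error(`blocked private address for ${hostname}`);
  // Subsequent requests go to `address` directly (with a Host header), so a
  // later DNS rebind to an internal IP cannot redirect them.
  return address;
}
```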
Canvas System
The canvas system serves a local web application that acts as an interactive display surface, primarily used on mobile nodes (phones and tablets running the Clawdbot companion app).
Canvas Host Server
The server watches a local directory (~/clawd/canvas/) and serves its contents as a web application:
const rootDir = resolveUserPath(
opts.rootDir ?? path.join(os.homedir(), "clawd", "canvas")
);
A chokidar file watcher monitors the directory for changes and triggers live reload via WebSocket — edits to HTML, CSS, or JavaScript files are reflected on the device within ~75ms (debounce threshold). The server binds to 0.0.0.0 rather than localhost so mobile nodes on the same network can access it.
Every served HTML page gets a live reload snippet injected before </body>:
const proto = location.protocol === "https:" ? "wss" : "ws";
const ws = new WebSocket(proto + "://" + location.host + "/__clawdbot/ws");
ws.onmessage = (ev) => {
if (ev.data === "reload") location.reload();
};
A2UI Bridge
The A2UI (App-to-UI) system bridges native mobile apps and the canvas web content. A JavaScript API is injected into every page:
globalThis.Clawdbot = {
postMessage: function(payload) {
// iOS: window.webkit.messageHandlers
// .clawdbotCanvasA2UIAction.postMessage(raw)
// Android: window.clawdbotCanvasA2UIAction.postMessage(raw)
},
sendUserAction: function(userAction) {
const id = userAction.id || crypto.randomUUID();
return this.postMessage({ userAction: { ...userAction, id } });
},
};
This means canvas pages can send structured actions back to the native app, and from there to the Clawdbot gateway. The agent can render a custom UI on a phone screen — a form, a dashboard, a game — and receive user interactions back through the A2UI bridge.
The default index.html is a diagnostic page with buttons that test whether the native bridge is available on iOS, Android, or neither.
Text-to-Speech
The TTS module (~1100 lines in a single file) supports three providers with automatic fallback:
| Provider | API Endpoint | Default Voice | Output |
|---|---|---|---|
| OpenAI | POST /audio/speech | alloy (model: gpt-4o-mini-tts) | MP3 or Opus |
| ElevenLabs | POST /v1/text-to-speech/:voiceId | pMsXgVXv3BLzUgSXRplE (model: eleven_multilingual_v2) | MP3 or Opus |
| Edge | node-edge-tts library | en-US-MichelleNeural | MP3 |
Channel-Aware Output
The output format adapts to the messaging platform. Telegram gets Opus (smaller, supports voice notes natively). Everything else gets MP3:
const TELEGRAM_OUTPUT = {
openai: "opus",
elevenlabs: "opus_48000_64",
extension: ".opus",
voiceCompatible: true,
};
const DEFAULT_OUTPUT = {
openai: "mp3",
elevenlabs: "mp3_44100_128",
extension: ".mp3",
voiceCompatible: false,
};
Auto Modes and Directives
Four auto modes control when TTS activates:
- off — disabled entirely
- always — every reply gets audio
- inbound — only when the user sends audio/voice
- tagged — only when the agent includes [[tts]] directives
The LLM can control TTS parameters via inline directives in its response:
[[tts:provider=elevenlabs voice_id=pMsXgVXv3BLzUgSXRplE stability=0.7 speed=1.2]]
[[tts:text]]This specific text will be spoken.[[/tts:text]]
Directives support provider, voice, model, stability, similarity boost, style, speed, speaker boost, normalization, language, and seed — each individually gatable by the modelOverrides policy to prevent unauthorized model switching.
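A parser for this syntax might look like the following sketch; the directive format matches the examples above, but the real parser surely differs in details.

```typescript
// Hypothetical parser for [[tts:key=value ...]] and [[tts:text]]...[[/tts:text]].
type TtsDirective = { params: Record<string, string>; text?: string };

function parseTtsDirectives(reply: string): TtsDirective[] {
  const out: TtsDirective[] = [];
  // Parameter directives: [[tts:provider=elevenlabs speed=1.2]]
  for (const m of reply.matchAll(/\[\[tts:([^\]]+)\]\]/g)) {
    if (m[1] === "text") continue; // the text form is handled below
    const params: Record<string, string> = {};
    for (const kv of m[1].trim().split(/\s+/)) {
      const [k, v] = kv.split("=");
      if (k && v) params[k] = v;
    }
    out.push({ params });
  }
  // Explicit speech text: [[tts:text]]...[[/tts:text]]
  for (const m of reply.matchAll(/\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g)) {
    out.push({ params: {}, text: m[1] });
  }
  return out;
}
```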
Summarization
Long text is automatically summarized before synthesis. When text exceeds maxLength (default 1500 characters), the system calls the LLM to produce a condensed version. If summarization is disabled, it truncates with an ellipsis. This prevents excessive API costs and keeps audio clips at a reasonable length.
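The length guard reduces to a small decision, sketched here with the LLM call as a stand-in callback (names and exact truncation behavior are my assumptions; the 1500-character default is from the post):

```typescript
// Hypothetical pre-synthesis length guard: summarize when a summarizer is
// available, otherwise truncate with an ellipsis.
async function prepareTtsText(
  text: string,
  maxLength = 1500,
  summarize?: (t: string) => Promise<string>,
): Promise<string> {
  if (text.length <= maxLength) return text;
  if (summarize) return summarize(text); // LLM condenses the reply
  return text.slice(0, maxLength - 1) + "…";
}
```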
Custom Endpoints
OpenAI TTS supports custom base URLs via OPENAI_TTS_BASE_URL, enabling local TTS servers like Kokoro or LocalAI. When a custom endpoint is detected, model and voice validation is relaxed to allow arbitrary values.
User preferences (provider, auto mode, max length, summarization toggle) are persisted in ~/.config/clawdbot/settings/tts.json with atomic writes via temp-file-plus-rename to prevent corruption.
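The temp-file-plus-rename pattern looks roughly like this (helper name hypothetical): because `rename()` is atomic on the same filesystem, a reader either sees the old file or the complete new one, never a half-written JSON.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";
import { randomUUID } from "node:crypto";

// Hypothetical sketch of an atomic JSON write: write to a unique temp file in
// the same directory, then rename over the destination.
async function writeJsonAtomic(file: string, value: unknown): Promise<void> {
  const tmp = path.join(
    path.dirname(file),
    `.${path.basename(file)}.${randomUUID()}.tmp`,
  );
  await fs.writeFile(tmp, JSON.stringify(value, null, 2), { mode: 0o600 });
  await fs.rename(tmp, file); // atomic on the same filesystem
}
```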
Cross-Cutting Concerns
Error Handling
Each subsystem has its own error strategy tuned to its failure modes:
- Browser — toAIFriendlyError() converts Playwright stack traces to messages the agent can reason about and retry
- Media — MediaFetchError with typed codes (max_bytes, http_error, fetch_failed)
- Media Understanding — MediaUnderstandingSkipError enables graceful provider fallback
- TTS — provider chain tries each backend in order on failure
Security
Security is layered across all modules:
- SSRF protection — resolvePinnedHostname() for DNS pinning, isAllowedUrl() blocks private ranges
- Path traversal — openFileWithinRoot() with real-path validation on every file serve
- Media TTL — automatic 2-minute cleanup prevents disk accumulation
- Extension relay — loopback-only binding, rejects non-loopback WebSocket upgrades
- Single-use media — files are deleted after being served once
- Credential safety — atomic writes with backup for persistent auth state
Configuration
All systems share sections in the central ClawdbotConfig:
{
browser: {
// profiles, CDP settings, headless/sandbox modes
},
tools: {
media: {
// per-capability model configs, scope, timeouts
},
links: {
// CLI tools, scope, maxLinks
},
},
messages: {
tts: {
// provider, voice, auto mode, summarization
},
},
}
Series
- Core Architecture & Gateway
- Memory System
- Agent System & AI Providers
- Channel & Messaging
- Sessions & Multi-Agent
- CLI, Commands & TUI
- Browser, Media & Canvas (this post)
- Infrastructure & Security