Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control. Use when: computer use, desktop automation agent, screen control AI, vision-based agent, GUI automation.
Add this skill:

```
npx mdskills install sickn33/computer-use-agents
```

Comprehensive computer use agent guide with production-quality patterns and security emphasis.
---
name: computer-use-agents
description: "Build AI agents that interact with computers like humans do - viewing screens, moving cursors, clicking buttons, and typing text. Covers Anthropic's Computer Use, OpenAI's Operator/CUA, and open-source alternatives. Critical focus on sandboxing, security, and handling the unique challenges of vision-based control. Use when: computer use, desktop automation agent, screen control AI, vision-based agent, GUI automation."
source: vibeship-spawner-skills (Apache 2.0)
---

# Computer Use Agents

## Patterns

### Perception-Reasoning-Action Loop

The fundamental architecture of computer use agents: observe the screen, reason about the next action, execute the action, repeat. This loop integrates vision models with action execution through an iterative pipeline.

Key components:

1. PERCEPTION: Screenshot captures current screen state
2. REASONING: Vision-language model analyzes and plans
3. ACTION: Execute mouse/keyboard operations
4. FEEDBACK: Observe result, continue or correct

Critical insight: vision agents are completely still during the "thinking" phase (1-5 seconds), creating a detectable pause pattern.

**When to use**:

- Building any computer use agent from scratch
- Integrating vision models with desktop control
- Understanding agent behavior patterns

```python
from anthropic import Anthropic
from PIL import Image
import base64
import io
import time

import pyautogui


class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50      # Prevent runaway loops
        self.action_delay = 0.5  # Seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen and return base64 encoded image."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is a good balance)
        screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def execute_action(self, action: dict) -> dict:
        """Execute mouse/keyboard action on the computer."""
        action_type = action.get("type")

        if action_type == "click":
            x, y = action["x"], action["y"]
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}

        elif action_type == "type":
            text = action["text"]
            pyautogui.typewrite(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}

        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}

        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction} by {amount}"}

        return {"success": False, "error": f"unknown action type: {action_type}"}
```

### Sandboxed Environment Pattern

Computer use agents MUST run in isolated, sandboxed environments. Never give agents direct access to your main system - the security risks are too high. Use Docker containers with virtual desktops.

Key isolation requirements:

1. NETWORK: Restrict to necessary endpoints only
2. FILESYSTEM: Read-only or scoped to temp directories
3. CREDENTIALS: No access to host credentials
4. SYSCALLS: Filter dangerous system calls
5. RESOURCES: Limit CPU, memory, time

The goal is "blast radius minimization" - if the agent goes wrong, damage is contained to the sandbox.

**When to use**:

- Deploying any computer use agent
- Testing agent behavior safely
- Running untrusted automation tasks

```dockerfile
# Dockerfile for sandboxed computer use environment
# Based on Anthropic's reference implementation pattern

FROM ubuntu:22.04

# Install desktop environment
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    xterm \
    firefox \
    python3 \
    python3-pip \
    supervisor

# Security: Create non-root user
RUN useradd -m -s /bin/bash agent && \
    mkdir -p /home/agent/.vnc

# Install Python dependencies
COPY requirements.txt /tmp/
RUN pip3 install -r /tmp/requirements.txt

# Security: Drop capabilities
RUN apt-get install -y --no-install-recommends libcap2-bin && \
    setcap -r /usr/bin/python3 || true

# Copy agent code
COPY --chown=agent:agent . /app
WORKDIR /app

# Supervisor config for virtual display + VNC
COPY supervisord.conf /etc/supervisor/conf.d/

# Expose VNC port only (not desktop directly)
EXPOSE 5900

# Run as non-root
USER agent

CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/conf.d/supervisord.conf"]
```

```yaml
# docker-compose.yml with security constraints
version: '3.8'

services:
  computer-use-agent:
    build: .
    ports:
      - "5900:5900"  # VNC for observation
      - "8080:8080"  # API for control

    # Security constraints
    security_opt:
      - no-new-privileges:true
      - seccomp:seccomp-profile.json

    # Resource limits
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
        reservations:
          cpus: '0.5'
          memory: 1G

    # Network isolation
    networks:
      - agent-network

    # No access to host filesystem
    volumes:
      - agent-tmp:/tmp

    # Read-only root filesystem
    read_only: true
    tmpfs:
      - /run
      - /var/run

    # Environment
    environment:
      - DISPLAY=:99
      - NO_PROXY=localhost

networks:
  agent-network:
    driver: bridge
    internal: true  # No internet by default

volumes:
  agent-tmp:
```

```python
# Python wrapper with additional runtime sandboxing
# Minimal sketch of a runtime guard around `docker run`
import subprocess
from dataclasses import dataclass


@dataclass
class SandboxLimits:
    """Runtime caps enforced via docker run flags."""
    timeout_s: int = 300
    memory_mb: int = 4096


def run_sandboxed(image: str, command: list, limits: SandboxLimits) -> subprocess.CompletedProcess:
    """Launch the agent container with no network and a read-only root."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",
            "--memory", f"{limits.memory_mb}m",
            "--read-only",
            image, *command,
        ],
        capture_output=True,
        timeout=limits.timeout_s,
    )
```

### Anthropic Computer Use Implementation

Official implementation pattern using Claude's computer use capability. Claude 3.5 Sonnet was the first frontier model to offer computer use. Claude Opus 4.5 is now the "best model in the world for computer use."

Key capabilities:

- screenshot: Capture current screen state
- mouse: Click, move, drag operations
- keyboard: Type text, press keys
- bash: Run shell commands
- text_editor: View and edit files

Tool versions:

- computer_20251124 (Opus 4.5): Adds zoom action for detailed inspection
- computer_20250124 (all other models): Standard capabilities

Critical limitation: "Some UI elements (like dropdowns and scrollbars) might be tricky for Claude to manipulate" - Anthropic docs

**When to use**:

- Building production computer use agents
- Need highest quality vision understanding
- Full desktop control (not just browser)

```python
from anthropic import Anthropic
from anthropic.types.beta import (
    BetaToolComputerUse20241022,
    BetaToolBash20241022,
    BetaToolTextEditor20241022,
)
import subprocess
import base64
from PIL import Image
import io


class AnthropicComputerUse:
    """
    Official Anthropic Computer Use implementation.

    Requires:
    - Docker container with virtual display
    - VNC for viewing agent actions
    - Proper tool implementations
    """

    def __init__(self):
        self.client = Anthropic()
        self.model = "claude-sonnet-4-20250514"  # Best for computer use
        self.screen_size = (1280, 800)

    def get_tools(self) -> list:
        """Define computer use tools."""
        return [
            BetaToolComputerUse20241022(
                type="computer_20241022",
                name="computer",
                display_width_px=self.screen_size[0],
                display_height_px=self.screen_size[1],
            ),
            BetaToolBash20241022(
                type="bash_20241022",
                name="bash",
            ),
            BetaToolTextEditor20241022(
                type="text_editor_20241022",
                name="str_replace_editor",
            ),
        ]

    def execute_tool(self, name: str, input: dict) -> dict:
        """Execute a tool and return result."""

        if name == "computer":
            return self._handle_computer_action(input)
        elif name == "bash":
            return self._handle_bash(input)
        elif name == "str_replace_editor":
            return self._handle_editor(input)
        else:
            return {"error": f"Unknown tool: {name}"}

    def _handle_computer_action(self, input: dict) -> dict:
        """Handle computer control actions."""
        action = input.get("action")

        if action == "screenshot":
            # Capture via xdotool/scrot
            subprocess.run(["scrot", "/tmp/screenshot.png"])

            with open("/tmp/screenshot.png", "rb") as f:
                data = base64.b64encode(f.read()).decode()

            return {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": data},
            }

        # click/type/key/scroll follow the same dispatch pattern
        return {"error": f"unhandled action: {action}"}

    def _handle_bash(self, input: dict) -> dict:
        """Run a shell command inside the sandbox."""
        result = subprocess.run(
            input.get("command", ""), shell=True, capture_output=True, text=True
        )
        return {"stdout": result.stdout, "stderr": result.stderr}

    def _handle_editor(self, input: dict) -> dict:
        """View and edit files; minimal view-only stub."""
        if input.get("command") == "view":
            with open(input["path"]) as f:
                return {"content": f.read()}
        return {"error": "editor command not implemented"}
```

## ⚠️ Sharp Edges

| Severity | Guidance |
|----------|----------|
| critical | Defense in depth - no single solution works |
| medium | Add human-like variance to actions |
| high | Use keyboard alternatives when possible |
| medium | Accept the tradeoff |
| high | Implement context management |
| high | Monitor and limit costs |
| critical | ALWAYS use sandboxing |
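One detail the perception loop glosses over: screenshots are downscaled to 1280x800 before they reach the model, but `pyautogui` clicks in physical screen pixels, so coordinates the model returns must be mapped back. A minimal sketch — the helper name and the default resolutions are illustrative, not part of any API:

```python
def scale_coords(
    x: int,
    y: int,
    model_size: tuple = (1280, 800),    # size of the screenshot the model saw
    screen_size: tuple = (1920, 1200),  # physical display resolution (assumed)
) -> tuple:
    """Map a click target from screenshot space back to screen space."""
    return (
        round(x * screen_size[0] / model_size[0]),
        round(y * screen_size[1] / model_size[1]),
    )
```

Usage: `pyautogui.click(*scale_coords(x, y))`. Skipping this step makes every click drift toward the top-left on displays larger than the screenshot size.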
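The detectable pause pattern noted in the loop section (and the "add human-like variance" sharp edge) can be softened by randomizing click offsets and inter-action delays instead of using fixed values. A small sketch; the function names and ranges are illustrative:

```python
import random


def humanize_point(x: int, y: int, jitter_px: int = 3) -> tuple:
    """Offset a click target by a few pixels so repeated clicks
    never land on exactly the same coordinate."""
    return (
        x + random.randint(-jitter_px, jitter_px),
        y + random.randint(-jitter_px, jitter_px),
    )


def humanize_delay(base: float = 0.5, spread: float = 0.4) -> float:
    """Randomized inter-action delay in place of a fixed 0.5 s sleep."""
    return base + random.uniform(0.0, spread)
```

Use `time.sleep(humanize_delay())` between actions rather than a constant `action_delay`; keep the jitter small enough that clicks stay inside the target element.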
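The `max_steps` cap in the loop example guards against runaway loops; the "monitor and limit costs" edge also calls for a token budget, since each 1280x800 screenshot costs on the order of a thousand input tokens. A minimal sketch — the class name and default limits are illustrative:

```python
class RunBudget:
    """Hard caps on steps and token spend for a single agent run."""

    def __init__(self, max_steps: int = 50, max_tokens: int = 200_000):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens_used: int) -> None:
        """Record one perception-reasoning-action iteration."""
        self.steps += 1
        self.tokens += tokens_used

    def exhausted(self) -> bool:
        """True once either cap is hit; the agent loop should stop here."""
        return self.steps >= self.max_steps or self.tokens >= self.max_tokens
```

Check `budget.exhausted()` at the top of every loop iteration and `charge()` with the usage figures the API response reports.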