macos-control-mcp
Give AI agents eyes and hands on macOS.
README
macos-control-mcp
Give AI agents eyes and hands on macOS.
What is this?
An MCP server that lets AI agents see your screen, read text on it, and interact — click, type, scroll — just like a human sitting at the keyboard. Unlike blind script runners, this MCP gives agents state awareness: they screenshot the screen, OCR it to get text with pixel coordinates, then click exactly where they need to.
The See-Think-Act Loop
┌─────────────────────────────────────────────────┐
│ │
│ 1. SEE screenshot / screen_ocr │
│ ↓ "What's on the screen?" │
│ │
│ 2. THINK AI reasons about the content │
│ ↓ "I need to click the Save btn" │
│ │
│ 3. ACT click_at / type_text / press_key│
│ "Click at (425, 300)" │
│ │
│ ↻ repeat │
└─────────────────────────────────────────────────┘
This is what makes it powerful: the agent sees the result of every action and can course-correct, retry, or move on — just like you would.
Quick Start
No install needed — run directly with npx:
npx -y macos-control-mcp
On first run, a Python virtual environment is automatically created at ~/.macos-control-mcp/.venv with the required Apple Vision and Quartz frameworks. This takes ~60 seconds once and persists across updates.
Video Showcasing the MCP:
https://www.youtube.com/watch?v=aswlsElHV5o
Configure Your AI Client
All clients use the same command: npx -y macos-control-mcp
<details> <summary><strong>Claude Desktop</strong></summary>
Edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"macos-control": {
"command": "npx",
"args": ["-y", "macos-control-mcp"]
}
}
}
Restart Claude Desktop after saving. </details>
<details> <summary><strong>Claude Code</strong></summary>
claude mcp add macos-control -- npx -y macos-control-mcp
</details>
<details> <summary><strong>VS Code / GitHub Copilot</strong></summary>
Add to .vscode/mcp.json in your workspace:
{
"servers": {
"macos-control": {
"command": "npx",
"args": ["-y", "macos-control-mcp"]
}
}
}
</details>
<details> <summary><strong>Cursor</strong></summary>
Add to .cursor/mcp.json in your project:
{
"mcpServers": {
"macos-control": {
"command": "npx",
"args": ["-y", "macos-control-mcp"]
}
}
}
</details>
<details> <summary><strong>Cline</strong></summary>
Open Cline extension settings → MCP Servers → Add:
{
"macos-control": {
"command": "npx",
"args": ["-y", "macos-control-mcp"]
}
}
</details>
<details> <summary><strong>Windsurf</strong></summary>
Add to ~/.codeium/windsurf/mcp_config.json:
{
"mcpServers": {
"macos-control": {
"command": "npx",
"args": ["-y", "macos-control-mcp"]
}
}
}
</details>
Permissions
macOS requires two permissions for full functionality:
- Screen Recording — for screenshots and OCR
- Accessibility — for clicking, typing, and reading UI elements
Go to System Settings → Privacy & Security and add your terminal app (Terminal, iTerm2, VS Code, etc.) to both lists. You'll be prompted on first use.
Tools (19)
See the screen
| Tool | Description |
|---|---|
screenshot |
Capture full screen or app window as JPEG |
screen_ocr |
OCR the screen — returns text elements with pixel coordinates |
find_text_on_screen |
Find specific text and get clickable x,y coordinates |
Interact with the screen
| Tool | Description |
|---|---|
click_at |
Click at x,y coordinates (returns screenshot) |
double_click_at |
Double-click at x,y (returns screenshot) |
type_text |
Type text into the frontmost app |
press_key |
Press key combos (Cmd+S, Ctrl+C, etc.) |
scroll |
Scroll up/down/left/right |
App management
| Tool | Description |
|---|---|
launch_app |
Open or focus an application |
list_running_apps |
List visible running apps |
Accessibility tree
| Tool | Description |
|---|---|
get_ui_elements |
Get accessibility tree of an app window |
click_element |
Click a named UI element (returns screenshot) |
Browser automation
| Tool | Description |
|---|---|
execute_javascript |
Run JavaScript in the active browser tab |
get_page_text |
Get all visible text from the page (faster than OCR) |
click_web_element |
Click element by CSS selector (instant, precise) |
fill_form_field |
Fill a form field by CSS selector |
Utilities
| Tool | Description |
|---|---|
open_url |
Open URL in Safari or Chrome |
get_clipboard |
Read clipboard contents |
set_clipboard |
Write to clipboard |
Example Workflows
Fill out a web form
You: "Go to example.com/signup and fill in my details"
Agent:
1. open_url("https://example.com/signup")
2. screenshot() → sees the form
3. screen_ocr() → finds "Email" field at (300, 250)
4. click_at(300, 250) → clicks the email field
5. type_text("user@example.com")
6. find_text_on_screen("Submit") → gets button coordinates
7. click_at(350, 500) → submits the form
8. screenshot() → confirms success
Navigate an unfamiliar app
You: "Change the font size to 16 in TextEdit"
Agent:
1. launch_app("TextEdit")
2. screenshot() → sees the app
3. get_ui_elements("TextEdit") → finds menu items
4. press_key("t", ["command"]) → opens Fonts panel
5. screenshot() → sees the font panel
6. find_text_on_screen("Size") → locates the size field
7. click_at(x, y) → clicks size field
8. type_text("16")
9. press_key("return")
Extract info from an email
You: "Copy the tracking number from the shipping email in Mail"
Agent:
1. launch_app("Mail")
2. screenshot() → sees the inbox
3. find_text_on_screen("Your order has shipped") → locates the email
4. click_at(x, y) → opens the email
5. screenshot() → sees the email content
6. screen_ocr() → extracts all text including tracking number
7. set_clipboard("1Z999AA10123456784") → copies to clipboard
Take a screenshot and describe what's on screen
You: "What do I have open right now?"
Agent:
1. screenshot() → captures the full screen
2. Describes what it sees: apps, windows, content
3. list_running_apps() → ["Safari", "Slack", "VS Code", "Spotify"]
4. "You have Safari open on Twitter, Slack with 3 unread messages, and VS Code editing index.ts"
Debug a UI issue
You: "The submit button on this page isn't working, can you check what's happening?"
Agent:
1. screenshot() → sees the page
2. find_text_on_screen("Submit") → locates button at (500, 600)
3. get_ui_elements("Safari") → inspects accessibility tree
4. "The button has aria-disabled=true. There's a required field 'Phone' that's empty."
5. find_text_on_screen("Phone") → locates the field
6. click_at(x, y) → clicks the field
7. type_text("+1 555-0123")
8. click_at(500, 600) → clicks Submit again
9. screenshot() → confirms it worked
Control Spotify
You: "Pause the music and turn the volume down"
Agent:
1. launch_app("Spotify")
2. screenshot() → sees Spotify is playing
3. find_text_on_screen("Pause") → locates pause button
4. click_at(x, y) → pauses playback
5. find_text_on_screen("Volume") → locates volume slider area
6. click_at(x, y) → adjusts volume
7. screenshot() → confirms paused and volume lowered
Work with Finder
You: "Create a new folder on my Desktop called 'Project Assets' and move it to Documents"
Agent:
1. launch_app("Finder")
2. press_key("d", ["command", "shift"]) → opens Desktop
3. screenshot() → sees Desktop in Finder
4. press_key("n", ["command", "shift"]) → creates new folder
5. type_text("Project Assets")
6. press_key("return")
7. screenshot() → confirms folder created
Send a message in Slack
You: "Send 'build is green, ready to deploy' in the #engineering channel on Slack"
Agent:
1. launch_app("Slack")
2. screenshot() → sees Slack
3. press_key("k", ["command"]) → opens Quick Switcher
4. type_text("engineering")
5. press_key("return") → opens #engineering
6. screenshot() → confirms channel is open
7. click_at(x, y) → clicks message input
8. type_text("build is green, ready to deploy")
9. press_key("return") → sends message
10. screenshot() → confirms sent
Research and copy data from a website
You: "Look up the current price of AAPL on Google Finance and copy it"
Agent:
1. open_url("https://google.com/finance/quote/AAPL:NASDAQ")
2. screenshot() → sees the page loading
3. screen_ocr() → reads all text on the page
4. Finds the price: "$187.42"
5. set_clipboard("$187.42")
6. "Copied AAPL price $187.42 to your clipboard"
Multi-app workflow
You: "Take what's in my clipboard, search for it in Safari, and screenshot the results"
Agent:
1. get_clipboard() → "best mechanical keyboards 2025"
2. launch_app("Safari")
3. press_key("l", ["command"]) → focuses address bar
4. type_text("best mechanical keyboards 2025")
5. press_key("return") → searches
6. screenshot() → captures the search results
7. "Here are the search results for 'best mechanical keyboards 2025'"
Navigate System Settings
You: "Turn on Dark Mode"
Agent:
1. launch_app("System Settings")
2. screenshot() → sees System Settings
3. find_text_on_screen("Appearance") → locates the option
4. click_at(x, y) → opens Appearance settings
5. screenshot() → sees Light/Dark/Auto options
6. find_text_on_screen("Dark") → locates Dark mode option
7. click_at(x, y) → enables Dark Mode
8. screenshot() → confirms Dark Mode is on
Requirements
- macOS 13+ (Ventura or later)
- Node.js 18+
- Python 3.9+ (pre-installed on macOS — needed for OCR and mouse control)
How It Works
- Screenshots — native
screencaptureCLI - OCR — Apple Vision framework (VNRecognizeTextRequest) via Python bridge, returns text with bounding box coordinates
- Mouse — Quartz Core Graphics events via Python bridge for precise pixel-level control
- Keyboard & Apps — AppleScript via
osascriptfor key presses, app launching, and UI element interaction - Python env — auto-managed venv at
~/.macos-control-mcp/.venv/with only two packages (pyobjc-framework-Vision,pyobjc-framework-Quartz)
Troubleshooting
"Permission denied" or blank screenshots → Add your terminal to System Settings → Privacy & Security → Screen Recording
Clicks don't work → Add your terminal to System Settings → Privacy & Security → Accessibility
Python setup fails
→ Ensure python3 is in your PATH. Run python3 --version to check. Non-Python tools (keyboard, apps, clipboard) still work without it.
OCR returns empty results
→ Make sure Screen Recording permission is granted. Try a full-screen OCR first (without the app parameter).
"App not found" errors → Use the exact app name as shown in Activity Monitor (e.g., "Google Chrome" not "Chrome").
License
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.