MCP Data Fetch Server

Securely fetches web content, extracts links and metadata, and downloads files through a sandboxed MCP server without JavaScript execution. Includes prompt-injection detection and comprehensive HTML sanitization for safe web data retrieval.

MCP Data Fetch Server is a secure, sandboxed server that fetches web content and extracts data via the Model Context Protocol (MCP) without executing JavaScript.

Features
Installation & Quick Start
Command‑Line Options
Integration with LM Studio
MCP API Overview
Available Tools
Security Features

🎯 Features

Secure web page fetching – strips scripts, iframes and cookie banners; no JavaScript execution.
Rich data extraction – retrieve links, metadata, Open Graph/Twitter cards, and downloadable resources.
Safe file downloads – size limits, filename sanitisation, and path‑traversal protection within a sandboxed cache.
Built‑in caching – optional cache directory reduces repeated network calls.
Prompt‑injection detection – validates URLs and fetched content for malicious instructions.

📦 Installation & Quick Start

# Clone the repository (or copy the MCPDataFetchServer.1 folder)
git clone https://github.com/undici77/MCPDataFetchServer.git
cd MCPDataFetchServer

# Make the startup script executable
chmod +x run.sh

# Run the server, pointing to a sandboxed working directory
./run.sh -d /path/to/working/directory

📌 Three‑step overview
1️⃣ The script creates a virtual environment and installs dependencies.
2️⃣ It prepares a cache folder (.fetch_cache) inside the project root.
3️⃣ main.py launches the MCP server, listening on stdin/stdout for JSON‑RPC requests.

⚙️ Command‑Line Options

| Option | Description |
| --- | --- |
| `-d`, `--working-dir` | Path to the sandboxed working directory where all file operations are confined (default: `~/.mcp_datafetch`). |
| `-c`, `--cache-dir` | Name of the cache subdirectory relative to the working directory (default: `cache`). |
| `-h`, `--help` | Show help message and exit. |

🤝 Integration with LM Studio (or any MCP‑compatible client)

Add an entry to your mcp.json configuration so that LM Studio can launch the server automatically.

{
  "mcpServers": {
    "datafetch": {
      "command": "/absolute/path/to/MCPDataFetchServer.1/run.sh",
      "args": [
        "-d",
        "/absolute/path/to/working/directory"
      ],
      "env": { "WORKING_DIR": "." }
    }
  }
}

📌 Tip: Ensure run.sh is executable (chmod +x …) and that the virtual environment can install the required Python packages on first launch.

📡 MCP API Overview

All communication follows JSON‑RPC 2.0 over stdin/stdout.

`initialize`

Request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {}
}

Response contains the protocol version, server capabilities and basic metadata (e.g., name = mcp-datafetch-server, version = 2.1.0).
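
A minimal sketch of how a client might build and serialise the `initialize` request above. The helper names `make_request` and `encode_message` are hypothetical, and the one-JSON-object-per-line framing is an assumption about the stdin/stdout transport rather than something this README states.

```python
import json


def make_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params if params is not None else {},
    }


def encode_message(message):
    """Serialise one message as a single line of JSON.

    Assumption: the server reads newline-delimited JSON from stdin.
    """
    return json.dumps(message, separators=(",", ":")) + "\n"


# The same request as shown above, ready to be written to the server's stdin.
init_line = encode_message(make_request(1, "initialize"))
```

Under the same framing assumption, a client would write `init_line` to the server process and read one line back from its stdout to obtain the response.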

`tools/list`

Request:

{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/list",
  "params": {}
}

Response: { "tools": [ …tool definitions… ] }. Each definition includes name, description and an input schema (JSON Schema).
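
As a rough illustration, a client could index the returned definitions by name and use each schema's `required` list to catch missing arguments before sending a call. The helpers below are hypothetical. They assume the tool list sits under the JSON-RPC `result` key and that the schema field is named `inputSchema`, as in the MCP specification; this server's exact output may differ.

```python
def index_tools(list_response):
    """Map tool name -> tool definition from a tools/list response."""
    tools = list_response.get("result", {}).get("tools", [])
    return {tool["name"]: tool for tool in tools}


def missing_arguments(tool, arguments):
    """Return required argument names that are absent from `arguments`."""
    schema = tool.get("inputSchema", {})  # assumed field name
    return [name for name in schema.get("required", []) if name not in arguments]


# Hand-written sample response for illustration, not real server output.
sample_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "tools": [
            {
                "name": "fetch_webpage",
                "description": "Securely fetches a web page.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"url": {"type": "string"}},
                    "required": ["url"],
                },
            }
        ]
    },
}

tools_by_name = index_tools(sample_response)
problems = missing_arguments(tools_by_name["fetch_webpage"], {"format": "text"})
```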

`tools/call`

Generic request shape (replace <tool_name> and arguments as needed):

{
  "jsonrpc": "2.0",
  "id": 3,
  "method": "tools/call",
  "params": {
    "name": "<tool_name>",
    "arguments": { … }
  }
}

The server validates the request against the tool’s schema, executes the operation, and returns a ToolResult containing one or more content blocks.
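
The request shape above and the handling of the returned content blocks can be sketched as follows. `make_tool_call` and `collect_text` are hypothetical helper names. The block layout (`type` and `text` fields inside `result.content`) follows the MCP specification and is assumed here, not confirmed for this server.

```python
def make_tool_call(request_id, tool_name, arguments):
    """Build a tools/call request for the given tool and arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def collect_text(response):
    """Join the text of all text-type content blocks in a tools/call response.

    Raises RuntimeError if the response carries a JSON-RPC error object.
    """
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    blocks = response.get("result", {}).get("content", [])
    return "\n".join(
        block.get("text", "") for block in blocks if block.get("type") == "text"
    )


request = make_tool_call(10, "fetch_webpage", {"url": "https://example.com/article"})
```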

🛠️ Available Tools

`fetch_webpage`

Securely fetches a web page and returns clean content in the requested format.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ (no default) | URL to fetch (http/https only). |
| `format` | string | ❌ (`markdown`) | Output format – one of `markdown`, `text`, or `html`. |
| `include_links` | boolean | ❌ (`true`) | Whether to append an extracted links list. |
| `include_images` | boolean | ❌ (`false`) | Whether to list image URLs in the output. |
| `remove_banners` | boolean | ❌ (`true`) | Attempt to strip cookie banners & pop‑ups. |

Example

{
  "jsonrpc": "2.0",
  "id": 10,
  "method": "tools/call",
  "params": {
    "name": "fetch_webpage",
    "arguments": {
      "url": "https://example.com/article",
      "format": "markdown",
      "include_links": true,
      "include_images": false,
      "remove_banners": true
    }
  }
}

Note: The tool sanitises HTML, removes scripts/iframes, and checks for prompt‑injection patterns before returning content.
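
To make the sanitisation step concrete, here is a minimal sketch built on Python's standard `html.parser`. It is not the server's actual implementation: the `Sanitizer` class, the `BLOCKED_TAGS` set, and the rule of dropping every attribute whose name starts with `on` are assumptions chosen for illustration. Cookie-banner removal and prompt-injection checks are out of scope here.

```python
from html import escape
from html.parser import HTMLParser

BLOCKED_TAGS = {"script", "iframe"}  # assumed list, for illustration only


class Sanitizer(HTMLParser):
    """Re-emit HTML while dropping blocked elements and on* attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a blocked element

    def _render(self, attrs):
        # Drop inline event handlers such as onclick or onload.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(f"<{tag}{self._render(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in BLOCKED_TAGS and not self.skip_depth:
            self.out.append(f"<{tag}{self._render(attrs)} />")

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_html(html):
    """Return `html` with blocked elements and event handlers removed."""
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because nothing is ever executed and blocked elements are skipped together with their contents, script bodies never reach the output. Comments and doctype declarations are also dropped, since the sketch does not re-emit them.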

`extract_links`

Extracts and categorises all hyperlinks from a page.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ (no default) | URL of the page to analyse. |
| `filter` | string | ❌ (`all`) | Which links to return: `all`, `internal`, `external`, or `resources`. |

Example

{
  "jsonrpc": "2.0",
  "id": 11,
  "method": "tools/call",
  "params": {
    "name": "extract_links",
    "arguments": {
      "url": "https://example.com/blog",
      "filter": "internal"
    }
  }
}

Note: Links are classified as internal (same domain) or external; resource links (images, PDFs…) can be selected with the `resources` filter.
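
The classification described in the note could look roughly like the sketch below. The function names, the `RESOURCE_EXTENSIONS` tuple, and the choice to compare host names exactly are all assumptions. The server may use a different notion of "same domain" or a different extension list.

```python
from urllib.parse import urljoin, urlparse

# Assumed extension list; the server's real list is not documented here.
RESOURCE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf", ".zip")


def classify_link(page_url, href):
    """Resolve `href` against `page_url` and label it."""
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    same_host = parsed.hostname == urlparse(page_url).hostname
    return {
        "url": absolute,
        "scope": "internal" if same_host else "external",
        "is_resource": parsed.path.lower().endswith(RESOURCE_EXTENSIONS),
    }


def filter_links(page_url, hrefs, mode="all"):
    """Apply one of the filter values: all, internal, external, resources."""
    links = [classify_link(page_url, href) for href in hrefs]
    if mode == "all":
        return links
    if mode == "resources":
        return [link for link in links if link["is_resource"]]
    return [link for link in links if link["scope"] == mode]


page = "https://example.com/blog"
hrefs = ["/about", "https://other.org/x", "files/report.pdf"]
internal_links = filter_links(page, hrefs, "internal")
```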

`download_file`

Safely downloads a remote file into the sandboxed cache directory.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ (no default) | Direct URL to the file. |
| `filename` | string | ❌ (auto‑generated) | Desired filename; will be sanitised and forced into the cache directory. |

Example

{
  "jsonrpc": "2.0",
  "id": 12,
  "method": "tools/call",
  "params": {
    "name": "download_file",
    "arguments": {
      "url": "https://example.com/files/report.pdf",
      "filename": "report_latest.pdf"
    }
  }
}

Note: The server enforces a 100 MB download limit, validates the URL against blocked domains/extensions, and returns the relative path inside the working directory for cross‑agent access.
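
A minimal sketch of the three safeguards mentioned for downloads: filename sanitisation, confinement to the cache directory, and a size cap. All names are hypothetical, the character whitelist is an assumption, and reading 100 MB as `100 * 1024 * 1024` bytes is a guess at how the limit is defined.

```python
import re
from pathlib import Path

MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024  # assumed interpretation of "100 MB"


def sanitize_filename(name, fallback="download.bin"):
    """Reduce a requested filename to a safe single path component."""
    # Keep only the last component, so "../../etc/passwd" becomes "passwd".
    base = name.replace("\\", "/").split("/")[-1]
    # Replace anything outside a conservative whitelist, drop leading dots.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or fallback


def resolve_in_cache(cache_dir, filename):
    """Return the target path, refusing anything outside `cache_dir`."""
    cache_root = Path(cache_dir).resolve()
    candidate = (cache_root / sanitize_filename(filename)).resolve()
    if cache_root not in candidate.parents:
        raise ValueError(f"refusing to write outside {cache_root}")
    return candidate


def read_with_limit(chunks, limit=MAX_DOWNLOAD_BYTES):
    """Concatenate byte chunks, aborting once the total exceeds `limit`."""
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            raise ValueError(f"download exceeds {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Checking the size while streaming, instead of trusting a `Content-Length` header, means the cap still holds when the header is missing or wrong.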

`get_page_metadata`

Extracts structured metadata (title, description, Open Graph, Twitter Cards) from a web page.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ (no default) | URL of the page to inspect. |

Example

{
  "jsonrpc": "2.0",
  "id": 13,
  "method": "tools/call",
  "params": {
    "name": "get_page_metadata",
    "arguments": { "url": "https://example.com/product/42" }
  }
}

Note: The tool returns a formatted text block with title, description, keywords, Open Graph properties and Twitter Card fields.
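
For illustration, the fields listed in the note can be collected from a page's `<head>` with the standard `html.parser` module. This is a sketch under assumptions, not the server's code: the `MetadataParser` class and the output dictionary layout are hypothetical, and the real tool returns formatted text, not a dictionary.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and all <meta> name/property pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph uses "property"; most other tags use "name".
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return title, description, keywords, Open Graph and Twitter fields."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    meta = parser.meta
    return {
        "title": parser.title.strip(),
        "description": meta.get("description", ""),
        "keywords": meta.get("keywords", ""),
        "open_graph": {k: v for k, v in meta.items() if k.startswith("og:")},
        "twitter": {k: v for k, v in meta.items() if k.startswith("twitter:")},
    }
```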

`check_url`

Performs a lightweight HEAD request to report status code, headers and size without downloading the body.

| Name | Type | Required (default) | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ (no default) | URL to probe. |

Example

{
  "jsonrpc": "2.0",
  "id": 14,
  "method": "tools/call",
  "params": {
    "name": "check_url",
    "arguments": { "url": "https://example.com/resource.zip" }
  }
}

Note: The response includes the final URL after redirects, a concise status summary (✅ OK or ⚠️ Error), and selected HTTP headers such as Content‑Type and Content‑Length.
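
A HEAD probe of this kind can be sketched with the standard library alone. The function names and the returned dictionary are hypothetical, and treating any status below 400 as OK is an assumption about how the summary is derived. Redirect handling is left to `urllib`'s defaults.

```python
import urllib.error
import urllib.request


def build_head_request(url):
    """Create a request that asks for headers only."""
    return urllib.request.Request(url, method="HEAD")


def summarize(status, final_url, headers):
    """Condense a HEAD response into a small report."""
    ok = status < 400  # assumed threshold for the OK/Error summary
    return {
        "status": status,
        "summary": "✅ OK" if ok else "⚠️ Error",
        "final_url": final_url,
        "content_type": headers.get("Content-Type"),
        "content_length": headers.get("Content-Length"),
    }


def check_url(url, timeout=10):
    """Send the HEAD request and summarise the outcome."""
    request = build_head_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return summarize(response.status, response.geturl(), response.headers)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry headers.
        return summarize(err.code, url, err.headers)
```

Calling `check_url("https://example.com/resource.zip")` would perform the network request; connection failures such as DNS errors raise `urllib.error.URLError` and are not handled in this sketch.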

🔐 Security Features

Path‑traversal protection – all file operations are confined to the sandboxed working directory.
Prompt‑injection detection in URLs, fetched HTML and generated content.
Blocked domains & extensions (localhost, private IP ranges, executable/script files).
Content‑size limits – max 50 MB for page fetches, max 100 MB for file downloads.
HTML sanitisation – removes <script>, <iframe>, event handlers and other risky elements before processing.
Cookie/banner handling – optional removal of consent banners and pop‑ups during fetch.
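
As an illustration of the blocked domains and extensions rule, the sketch below rejects non-HTTP schemes, `localhost`, IP literals in private or loopback ranges, and a few executable extensions. The function name and both block lists are assumptions. The check also does not resolve DNS names, so a hostname that points at a private address would need an extra lookup step that is not shown.

```python
import ipaddress
from urllib.parse import urlparse

BLOCKED_HOSTS = {"localhost"}  # assumed
BLOCKED_EXTENSIONS = (".exe", ".bat", ".sh", ".ps1")  # assumed


def is_url_allowed(url):
    """Return True if `url` passes the basic scheme, host and extension checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTS:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # a DNS name, not an IP literal
    if address is not None and (
        address.is_private or address.is_loopback or address.is_link_local
    ):
        return False
    return not parsed.path.lower().endswith(BLOCKED_EXTENSIONS)
```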
