mcp-arabic-toolkit
MCP server exposing Arabic text utilities: normalisation, tashkeel stripping, transliteration, heuristic dialect detection, and token counting.
A small Model Context Protocol (MCP) server
exposing practical Arabic text utilities. Built with the official mcp Python
SDK (FastMCP).
Demonstrates: MCP server authoring / tool development
Every tool is fully implemented: deterministic string processing plus one
clearly labelled heuristic. The pure logic lives in
arabic_tools.py (no mcp dependency), so it can be
unit-tested independently; server.py is a thin MCP wrapper.
Tools
| Tool | Description | Example input | Example output |
|---|---|---|---|
| normalise_arabic | NFC-normalises, removes diacritics (harakat/tashkil) and tatweel, and optionally unifies letter variants (alef/yeh/teh-marbuta). | الْعَرَبِيَّةُ | العربية |
| strip_tashkeel | Removes only the diacritics (and, by default, the tatweel); leaves letters as-is. | كــــتاب | كتاب |
| transliterate | Documented, deterministic Arabic→Latin romanisation (simplified DIN 31635 / ALA-LC, ASCII digraphs). | كَتَبَ | {"transliteration": "kataba", "scheme": "din31635-simplified-ascii"} |
| detect_dialect | Heuristic dialect guess (Egyptian/Levantine/Gulf/Maghrebi/MSA) from marker words. Not a trained classifier; see limits below. | شو بدك هلق؟ | {"dialect": "levantine", "confidence": 1.0, ...} |
| count_tokens | Whitespace-token count plus character and Arabic-character statistics. | مرحبا يا عالم | {"tokens": 3, "characters": 13, ...} |
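As a rough illustration of the deterministic tools in the table, the sketch below shows one way diacritic stripping, normalisation, and token counting could be written with only the standard library. It is not the code in arabic_tools.py: the parameter names (remove_tatweel, unify_letters), the exact letter-unification mapping, and the arabic_characters key are assumptions made for this example.

```python
import re
import unicodedata

# Harakat/tashkil: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"

# Assumed unification mapping; the real toolkit may unify differently.
_UNIFY = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def strip_tashkeel(text, remove_tatweel=True):
    """Remove diacritics (and optionally tatweel); letters are left as-is."""
    text = _DIACRITICS.sub("", text)
    if remove_tatweel:
        text = text.replace(_TATWEEL, "")
    return text


def normalise_arabic(text, unify_letters=False):
    """NFC-normalise, strip diacritics and tatweel, optionally unify letters."""
    text = unicodedata.normalize("NFC", text)
    text = strip_tashkeel(text, remove_tatweel=True)
    if unify_letters:
        text = text.translate(_UNIFY)
    return text


def count_tokens(text):
    """Whitespace tokens, total code points, and code points in U+0600-U+06FF."""
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    return {
        "tokens": len(text.split()),
        "characters": len(text),
        "arabic_characters": arabic,  # hypothetical key name
    }
```

Note that the Arabic-character count here includes diacritics and Arabic punctuation, since it tests only for membership in the main Arabic Unicode block.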
About detect_dialect (read this)
detect_dialect is an honest heuristic, not a machine-learning model. It
counts hand-picked marker words/particles per dialect and returns the highest
scorer. Known limits:
- Only five coarse groups (Egyptian, Levantine, Gulf, Maghrebi, MSA).
- Unreliable on short input, mixed-dialect text, and code-switching.
- confidence is a crude ratio (winning hits / total hits), not a calibrated probability.
- Falls back to MSA with confidence: 0.0 when no markers are found.
For production-grade detection, train a supervised classifier (e.g. fastText or a fine-tuned transformer) on a labelled corpus such as MADAR or NADI.
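The marker-counting approach described above can be sketched in a few lines. The marker lists below are tiny, hypothetical samples chosen for illustration and are not the lists used by the toolkit; the tokenisation regex and the tie-breaking rule (first dialect in dictionary order) are likewise assumptions.

```python
import re

# Hypothetical marker words for illustration only.
MARKERS = {
    "egyptian": {"عايز", "ازيك", "دلوقتي"},
    "levantine": {"شو", "بدك", "هلق"},
    "gulf": {"شلونك", "وايد"},
    "maghrebi": {"بزاف", "واش"},
}

# Runs of Arabic letters; punctuation such as the Arabic question mark is excluded.
_WORD = re.compile("[\u0621-\u064A]+")


def detect_dialect(text):
    """Count marker hits per dialect and return the highest scorer."""
    tokens = _WORD.findall(text)
    hits = {
        dialect: sum(1 for tok in tokens if tok in words)
        for dialect, words in MARKERS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No markers at all: fall back to MSA with zero confidence.
        return {"dialect": "msa", "confidence": 0.0}
    best = max(hits, key=hits.get)  # ties resolve to the first dialect in MARKERS
    # Crude ratio of winning hits to all hits, not a calibrated probability.
    return {"dialect": best, "confidence": hits[best] / total}
```

With these sample lists, a sentence containing one Egyptian and one Levantine marker scores 0.5, which shows why the ratio says little on short or mixed-dialect input.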
About transliterate
The romanisation is deterministic and documented but intentionally simple:
- No vowel inference — short vowels are produced only from explicit harakat.
- No context-sensitive rules: the article ال is always al- (no sun-letter assimilation), and hamzat al-wasl is not elided.
- Shadda doubles the preceding consonant; sukun emits no vowel.
- One-way (Arabic → Latin); not round-trippable.
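To make the rules above concrete, here is a minimal sketch of harakat-driven romanisation. The letter table is a deliberately small, assumed subset (not the toolkit's full scheme), article handling is omitted, and unmapped characters pass through unchanged. Each letter is processed together with the marks that follow it, so shadda doubling works whether the shadda comes before or after the vowel mark.

```python
# Assumed partial letter table with ASCII digraphs; the real scheme covers more letters.
LETTERS = {
    "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "h", "خ": "kh",
    "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s", "ش": "sh",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "و": "w",
}

# Short vowels and tanwin are emitted only when written explicitly.
VOWELS = {
    "\u064E": "a",   # fatha
    "\u064F": "u",   # damma
    "\u0650": "i",   # kasra
    "\u064B": "an",  # fathatan
    "\u064C": "un",  # dammatan
    "\u064D": "in",  # kasratan
}
SHADDA = "\u0651"
SUKUN = "\u0652"
MARKS = set(VOWELS) | {SHADDA, SUKUN}


def transliterate(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch not in LETTERS:
            # Unmapped characters pass through; stray marks are dropped.
            if ch not in MARKS:
                out.append(ch)
            continue
        # Gather every mark attached to this letter.
        marks = []
        while i < len(text) and text[i] in MARKS:
            marks.append(text[i])
            i += 1
        base = LETTERS[ch]
        if SHADDA in marks:
            base *= 2  # shadda doubles the consonant
        # Sukun and shadda contribute no vowel of their own.
        out.append(base + "".join(VOWELS.get(m, "") for m in marks))
    return "".join(out)
```

Fully vocalised input yields vowels, while the same word without harakat yields only its consonant skeleton, which is the "no vowel inference" limit in practice.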
Install
Requires Python 3.10+.
# Clone, then install the package (editable for local development):
pip install -e .
This pulls in the mcp SDK and registers a mcp-arabic-toolkit console script.
The tests themselves need only pytest (no mcp SDK):
pip install pytest
Run
# Option A: run the module directly (stdio transport)
python server.py
# Option B: run the installed console script
mcp-arabic-toolkit
Register with an MCP client
To use it from Claude Desktop (or any MCP client), add an entry to the client's MCP server config:
{
  "mcpServers": {
    "arabic-toolkit": {
      "command": "python",
      "args": ["/absolute/path/to/mcp-arabic-toolkit/server.py"]
    }
  }
}
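If you would rather generate that entry than hand-edit paths, a small helper along the following lines may be convenient. The function name client_config_entry is hypothetical, and using sys.executable in place of the literal "python" is a design choice of this sketch (it pins the interpreter of the active virtual environment), not something the project prescribes.

```python
import json
import sys
from pathlib import Path


def client_config_entry(server_path, name="arabic-toolkit"):
    """Build an mcpServers entry that points at server.py by absolute path."""
    return {
        "mcpServers": {
            name: {
                # Assumption: launch with the current interpreter rather than "python".
                "command": sys.executable,
                "args": [str(Path(server_path).resolve())],
            }
        }
    }


if __name__ == "__main__":
    # Print a snippet to merge into the client's MCP server config.
    print(json.dumps(client_config_entry("server.py"), indent=2))
```

Run it from the repository root and merge the printed object into the client's existing config file.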
Test
python -m pytest tests/ -v
The suite (tests/test_tools.py) imports the pure logic directly and covers
every tool with concrete examples (diacritic/tatweel removal, letter
unification, transliteration with and without harakat, each dialect, and token
counting).
Quick local check
python -c "import arabic_tools; print(arabic_tools.normalise_arabic('الْعَرَبِيَّةُ'))"
# -> العربية
Publishing to the MCP registry
This package ships a server.json manifest compatible with the
official MCP registry.
Exact metadata (server.json)
{
  "$schema": "https://static.modelcontextprotocol.io/schemas/2025-07-09/server.schema.json",
  "name": "io.github.benjiscollector/mcp-arabic-toolkit",
  "description": "MCP server exposing Arabic text utilities: normalisation, tashkeel stripping, transliteration, a heuristic dialect detector, and token counting.",
  "status": "active",
  "repository": {
    "url": "https://github.com/BenjisCollector/mcp-arabic-toolkit",
    "source": "github"
  },
  "version": "0.2.0",
  "packages": [
    {
      "registryType": "pypi",
      "registryBaseUrl": "https://pypi.org",
      "identifier": "mcp-arabic-toolkit",
      "version": "0.2.0",
      "transport": { "type": "stdio" }
    }
  ]
}
The server name uses the io.github.<owner>/<repo> namespace, which the
registry verifies against GitHub ownership during publish.
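Before publishing, a quick local sanity check of the manifest can catch mismatches early. The sketch below is not the registry's own validation; the check_manifest function and the specific checks (name shape, name versus repository URL, package version versus top-level version) are assumptions about what is worth verifying locally.

```python
from urllib.parse import urlparse

# Trimmed copy of the manifest fields used by the checks below.
MANIFEST = {
    "name": "io.github.benjiscollector/mcp-arabic-toolkit",
    "repository": {
        "url": "https://github.com/BenjisCollector/mcp-arabic-toolkit",
        "source": "github",
    },
    "version": "0.2.0",
    "packages": [{"identifier": "mcp-arabic-toolkit", "version": "0.2.0"}],
}


def check_manifest(manifest):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    prefix, sep, repo = manifest["name"].partition("/")
    if not sep or not prefix.startswith("io.github."):
        problems.append("name is not of the form io.github.<owner>/<repo>")
    else:
        owner = prefix[len("io.github."):]
        repo_path = urlparse(manifest["repository"]["url"]).path.strip("/")
        # GitHub owner names are case-insensitive, so compare lowercased.
        if repo_path.lower() != f"{owner}/{repo}".lower():
            problems.append(f"name does not match repository path {repo_path!r}")
    for pkg in manifest.get("packages", []):
        if pkg.get("version") != manifest.get("version"):
            problems.append(
                f"package {pkg.get('identifier')!r} version differs from manifest version"
            )
    return problems


if __name__ == "__main__":
    print(check_manifest(MANIFEST) or "no problems found")
```

In a real workflow you would load server.json with json.load instead of using the inline dictionary.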
Steps
- Build and publish the PyPI package so the registry has something to point at:
    python -m build
    twine upload dist/*
- Install the registry publisher CLI (mcp-publisher); see the registry publishing guide.
- Authenticate with GitHub so the CLI can verify the io.github.* namespace:
    mcp-publisher login github
- Publish from the directory containing server.json:
    mcp-publisher publish
To list this server on the community modelcontextprotocol/servers README as well, see SUBMISSION.md for the exact entry text and PR steps.
License
MIT — see LICENSE.