X Archive Daemon
Archives posts from X into a local SQLite database and provides tools for semantic analysis and natural language search. It enables agents to manage low-cost data ingestion and retrieve relevant posts using local embeddings and topic labeling.
Short description: a daemon-based archiving and smart-search project that collects posts from X, writes them into a local SQLite archive, and exposes them to agents over MCP.
X Archive Daemon archives posts from X into a local SQLite database, exposes them through a daemon-first architecture, and optionally adds a local semantic analysis layer for smarter retrieval.
Install First
If you work with coding agents, the easiest setup flow is:
- Give the repository link to the agent.
- Ask the agent to install dependencies.
- Ask the agent to create the local secret file.
- Ask the agent to start the daemon and the MCP bridge.
Manual setup:
npm install
Create .secrets/x.json:
{
  "authMode": "bearer_token",
  "bearerToken": "YOUR_X_BEARER_TOKEN"
}
Start the daemon:
npm run daemon:start
Start the MCP bridge:
npm run mcp:start
What This Project Does
This project has three layers:
ingest
- fetch posts from X
- store them in SQLite
analysis
- optional
- adds local embeddings, topic labels, and educational scoring
semantic search
- searches analyzed posts locally
- gives the agent a small, relevant candidate set
Scenario:
ingest = put boxes into storage
analysis = attach labels to the boxes
semantic search = find the right boxes quickly
Architecture
daemon
- the real execution engine
- exposes:
  GET /health
  GET /tools
  POST /invoke
MCP
- thin stdio bridge for agents
- exposes the same tools to the model
SQLite
- stores posts, scopes, sync runs, billing estimates, and optional analysis results
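Every daemon tool is called through the same POST /invoke envelope. A minimal sketch, assuming the daemon listens on 127.0.0.1:3200 as in the Quick Start examples; archive.accounts.list is used here because it takes an empty input, and the trailing echo just keeps the command safe to run when the daemon is down:

```shell
# Generic /invoke shape: {"tool": "<name>", "input": {...}}
curl -sf -X POST http://127.0.0.1:3200/invoke \
  -H 'content-type: application/json' \
  -d '{"tool": "archive.accounts.list", "input": {}}' \
  || echo "daemon not reachable"
```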
Core Features
1. Smart, low-cost ingest
The system avoids paying twice for the same coverage.
Example:
- first you fetch the latest 50
- later you ask for the latest 100
- it does not re-fetch the first 50
- it fetches only the missing 50
This works for:
- latest N timeline
- latest N original posts
- exact repeated date windows
- exact repeated search queries
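The coverage bookkeeping boils down to fetching only the gap between what you ask for and what is already archived. A toy sketch of that arithmetic (illustrative only; the real tracking lives in the daemon's SQLite sync tables):

```shell
already_archived=50   # posts covered by the first latest-N run
requested=100         # latest-N target of the second run

# Only the uncovered remainder is fetched; the first 50 are never paid for twice.
missing=$(( requested > already_archived ? requested - already_archived : 0 ))
echo "will fetch only $missing new posts"
# → will fetch only 50 new posts
```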
2. Original posts by default
When a user says something generic like "fetch posts" or "archive tweets", the default tool is:
ingest.accounts.original_backfill
This excludes:
- replies
- retweets
- quote tweets
That keeps the archive cleaner and cheaper.
3. Media URLs are stored, files are not downloaded
If a post contains images:
- files are not downloaded
- only mediaUrls are stored
This keeps disk and network usage low.
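To see which media URLs were captured, the stored posts can be listed back out. A sketch, assuming archive.posts.list accepts the same username/limit style of input used by the Quick Start calls (the exact input fields are not documented here, so treat them as an assumption):

```shell
# List a few archived posts; mediaUrls appear in the stored records.
# The input fields below mirror other tools and are assumptions.
curl -sf -X POST http://127.0.0.1:3200/invoke \
  -H 'content-type: application/json' \
  -d '{
    "tool": "archive.posts.list",
    "input": { "username": "sampleauthor", "limit": 5 }
  }' || echo "daemon not reachable"
```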
4. Optional analysis layer
Analysis is off by default.
That means:
- you can archive posts without analyzing them
- you can analyze the same archive later
- old and newly ingested posts are both supported
5. Local semantic search
Once analysis exists, you can search with natural language prompts like:
- "teaching posts about coding"
- "monolith vs microservices"
- "backend and architecture advice"
The system then:
- searches the analyzed local archive
- scores candidates locally
- lets the agent work on a narrow, relevant subset
Tool Groups
sources.accounts.resolve
ingest.accounts.backfill
ingest.accounts.original_backfill
ingest.accounts.sync
ingest.search.backfill
archive.posts.list
archive.posts.search
archive.posts.semantic_search
archive.accounts.list
archive.accounts.get
archive.billing.summary
analysis.posts.run
analysis.labels.list
archive.insights.summary
MCP also exposes:
system.daemon.start
Risk Model
safe-read
- read-only
- does not change X or the local archive
operator-write
- writes into the local SQLite archive
- may consume X API credits
- does not create, delete, like, reply, or DM on X
Note:
analysis.posts.run is also operator-write
- it only writes local analysis data and uses CPU
Performance Reference
Measured system:
- Apple Silicon M4 mini, 16 GB RAM
- local embedding model on CPU
Measured analysis speed:
900 posts = about 30.69 s
100 posts = about 3.41 s
980 posts = about 33.42 s (expected total)
Important:
- this is not a large generative LLM benchmark
- this is local tagging + embedding + semantic retrieval preparation
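The measured numbers work out to roughly 29 posts per second on this machine:

```shell
# Throughput implied by the 900-post measurement above.
awk 'BEGIN { printf "%.1f posts per second\n", 900 / 30.69 }'
# → 29.3 posts per second
```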
Local Model
The current analysis layer uses one small local model:
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Its role is:
- post embeddings
- label-description embeddings
- query embeddings
- similarity-based tagging
- semantic retrieval
This layer does not generate final answers. It builds a local meaning layer on top of the archive.
How Analysis Works
When analysis runs, each post gets signals such as:
- educational score
- matched labels
- topic scores
- reply/noise/technical flags
Tagging is based on:
- rule-based signals
- embedding similarity
- a fixed label catalog with descriptions
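The embedding-similarity part can be illustrated with a toy cosine computation. A sketch in awk with made-up 3-dimensional vectors (real embeddings come from the MiniLM model and have far more dimensions):

```shell
awk 'BEGIN {
  split("0.2 0.9 0.1", post, " ")    # pretend post embedding
  split("0.1 0.8 0.2", label, " ")   # pretend label-description embedding
  dot = 0; np = 0; nl = 0
  for (i = 1; i <= 3; i++) {
    dot += post[i] * label[i]        # dot product
    np  += post[i] * post[i]         # squared norms
    nl  += label[i] * label[i]
  }
  # cosine similarity = dot / (|post| * |label|); a high value means the label matches
  printf "cosine = %.3f\n", dot / (sqrt(np) * sqrt(nl))
}'
# → cosine = 0.987
```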
Label Catalog
The repository includes a versioned label catalog with Turkish descriptions.
Examples:
software_architecture
monolith_vs_microservices
backend_api
database_sql
database_indexing
caching
distributed_systems
testing_qa
clean_code
code_review
security_appsec
authentication_authorization
ci_cd_release
ai_assisted_coding
vibe_coding
prompting_for_engineering
technical_decision_making
Quick Start
Check daemon health:
curl http://127.0.0.1:3200/health
List available tools:
curl http://127.0.0.1:3200/tools
Estimate original-post ingest:
curl -X POST http://127.0.0.1:3200/invoke \
-H 'content-type: application/json' \
-d '{
"tool": "ingest.accounts.original_backfill",
"input": {
"username": "sampleauthor",
"searchMode": "recent",
"targetCount": 100,
"estimateOnly": true
}
}'
Run local analysis:
curl -X POST http://127.0.0.1:3200/invoke \
-H 'content-type: application/json' \
-d '{
"tool": "analysis.posts.run",
"input": {
"username": "sampleauthor",
"limit": 200,
"onlyUnanalyzed": true
}
}'
Run semantic search:
curl -X POST http://127.0.0.1:3200/invoke \
-H 'content-type: application/json' \
-d '{
"tool": "archive.posts.semantic_search",
"input": {
"username": "sampleauthor",
"query": "teaching posts about coding",
"educationalOnly": true,
"limit": 10
}
}'
Model Packaging Note
The repository is public and safe to clone, but the local model files are not stored in git history because of GitHub's file size limits; model packaging is handled separately from the normal git push flow.