BioTrax
Unified genomic track, peak, and sequence retrieval tool for ENCODE, ChIP-Atlas, ReMap, GEO, and SRA/ENA with unified metadata, resolved DOI/PMID provenance, and direct FASTQ download without SRA toolkit.
The first unified genomic track, peak, and sequence retrieval tool.
One Python package and MCP server to discover, describe, and download data from ENCODE, ChIP-Atlas, ReMap, GEO, and SRA/ENA — with unified metadata, resolved DOI/PMID provenance, and ready-to-run pipeline outputs (nf-core/rnaseq samplesheets, RBP-RELI/TF-RELI library indexes).
No existing tool unifies peak/track retrieval from all five sources behind
one interface. BioTrax fills that gap and goes further: it is also a better
sra-prefetch, pulling FASTQs directly from ENA with md5 verification and
no SRA-toolkit dependency.
What problem it solves
| Before BioTrax | With BioTrax |
|---|---|
| 5 different APIs and 5 scripts | biotrax search-tracks --target QKI --assay eCLIP |
| sra-prefetch + fasterq-dump setup required | Direct ENA FASTQ download, no SRA toolkit |
| Manual nf-core samplesheet creation | Auto-generated from any SRA/GEO accession |
| No DOI/PMID linkage on downloaded files | Every leaf folder has metadata.json + README.md + resolved DOI |
| RELI peak library built by hand | make_reli_index exporter from any peak pull |
| "What did I download and from where?" | Append-only master index TSV across all downloads |
Sources
| Source | Data | Capabilities |
|---|---|---|
| ENCODE | All assays: eCLIP, ChIP-seq, ATAC-seq, DNase-seq, RNA-seq, Hi-C, ... | search, list_files, doi_resolve, design_note |
| ChIP-Atlas | Assembled TF/histone ChIP-seq + ATAC-seq peak BEDs (merged from all public experiments) | search, list_files |
| ReMap | Curated TF ChIP-seq peaks — primary TF-RELI library source (human hg38) | search, list_files |
| GEO | Supplementary track/peak files (BED, BigWig, narrowPeak) with rich series provenance | search, list_files, doi_resolve, design_note |
| SRA/ENA | Sequencing reads — any strategy (RNA-seq, ChIP-seq, ATAC-seq, ...) | search, list_runs, doi_resolve, design_note |
Install
# From source (recommended while pre-release):
pip install -e E:/Claude/BioTrax
# With uvx (runs without permanent install):
uvx --from E:/Claude/BioTrax biotrax list-sources
# Once published to PyPI:
pip install biotrax
Python >= 3.10 required. Dependencies: httpx>=0.27, fastmcp>=0.4.
MCP server (Claude / AI agents)
Add to Claude Code
claude mcp add biotrax -- E:\Claude\BioTrax\.venv\Scripts\python.exe E:\Claude\BioTrax\biotrax\server.py
Run manually (stdio transport)
E:\Claude\BioTrax\.venv\Scripts\python.exe E:\Claude\BioTrax\biotrax\server.py
Available MCP tools
| Tool | What it does |
|---|---|
| search_tracks | Search ENCODE, ChIP-Atlas, ReMap, GEO for peak/track datasets |
| list_track_files | List downloadable files for one dataset (with direct URLs) |
| get_dataset | Fetch full metadata for one dataset by accession |
| download_tracks | Download files to disk; writes manifest + metadata + README |
| list_sources | Introspect all sources, capabilities, and filter keys |
| search_runs | Search SRA/ENA for sequencing experiments |
| download_fastq | Resolve accessions, download FASTQs, emit nf-core samplesheet |
| make_rnaseq_samplesheet | Build nf-core/rnaseq samplesheet from SeqRun dicts |
| make_reli_index | Build RBP-RELI / TF-RELI library index from downloaded BED files |
CLI usage
# Search ENCODE for QKI eCLIP datasets (GRCh38, human default)
biotrax search-tracks --target QKI --assay eCLIP --sources ENCODE
# List peak files for an ENCODE experiment
biotrax list-files --source ENCODE --id ENCSR366YOG --output-type peaks
# Get full metadata (DOI, design note, file count)
biotrax get-dataset --source GEO --id GSE78509
# Search ChIP-Atlas for CTCF hg38 assembled peaks
biotrax search-tracks --target CTCF --sources ChIP-Atlas --genome GRCh38
# Search ReMap for all CTCF ChIP-seq datasets
biotrax search-tracks --target CTCF --sources ReMap
# Download FASTQs from an SRA study + write nf-core samplesheet
biotrax download-fastq --accessions SRP123456 --strandedness reverse
# Download FASTQs from a GEO series
biotrax download-fastq --accessions GSE151511
# Search SRA/ENA runs with a raw query (human, taxon 9606); quote it so the shell leaves the parentheses alone
biotrax search-runs --raw-query "tax_eq(9606)" --limit 5
# List all sources and their filter keys
biotrax list-sources
Python library usage
from biotrax.sources.encode import ENCODEAdapter
adapter = ENCODEAdapter()
# Search for QKI eCLIP datasets
datasets = adapter.search(target="QKI", assay="eCLIP", genome="GRCh38")
# Returns: [Dataset(dataset_id='ENCSR366YOG', target='QKI', biosample='K562', ...),
# Dataset(dataset_id='ENCSR570WLM', target='QKI', biosample='HepG2', ...)]
# Get full metadata including DOI
ds = adapter.get_dataset("ENCSR366YOG")
# ds.publication_doi -> "10.1038/s41586-020-2077-3"
# ds.pmid -> "32728246"
# ds.design_note -> "eCLIP, QKI, K562 experiment from ENCODE. ..."
# List peak files with direct download URLs
files = adapter.list_files(ds, output_type="peaks")
# files[0].url -> "https://www.encodeproject.org/files/ENCFF786UOW/@@download/..."
# files[0].file_format -> "bed"
# files[0].md5 -> "..."
from biotrax.sources.sra_ena import SRAENAAdapter
from biotrax.exporters.nfcore_rnaseq import write_nfcore_samplesheet
adapter = SRAENAAdapter()
# Resolve a GSE to runs with ENA FASTQ URLs (no SRA toolkit needed)
runs = adapter.list_runs("GSE151511")
# runs[0].run_accession -> "SRR12345678"
# runs[0].fastq_urls -> ["https://ftp.sra.ebi.ac.uk/...R1.fastq.gz", ...]
# runs[0].fastq_md5 -> ["abc123...", ...]
# runs[0].library_layout -> "PAIRED"
# Write nf-core/rnaseq samplesheet
write_nfcore_samplesheet(runs, "samplesheet.csv", strandedness="reverse")
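The md5 check that accompanies each ENA download can be illustrated with the standard library alone. This is a minimal sketch, not BioTrax's own code: `md5_of_file` and `verify_md5` are hypothetical helper names, and a small temporary file stands in for a downloaded FASTQ whose expected digest would normally come from `runs[0].fastq_md5`.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through md5 so large FASTQs never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True when the file's md5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Demo: a tiny temporary file stands in for a downloaded FASTQ (assumption).
payload = b"@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq"
    demo.write_bytes(payload)
    ok = verify_md5(demo, hashlib.md5(payload).hexdigest())
    bad = verify_md5(demo, "0" * 32)
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters for multi-gigabyte read files.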
from biotrax.sources.geo import GEOAdapter
adapter = GEOAdapter()
# Fetch a GEO series with provenance
datasets = adapter.search(filters={"gse": "GSE78509"})
ds = datasets[0]
# ds.pmid -> "27068461"
# ds.publication_doi -> "10.1016/j.celrep.2016.03.052"
# ds.design_note -> "This SuperSeries is composed of..."
# List supplementary track files (BED, BigWig, narrowPeak)
files = adapter.list_files(ds)
# -> 37 TrackFile objects with HTTPS download URLs
from biotrax.sources.remap import RemapAdapter
adapter = RemapAdapter()
datasets = adapter.search(target="CTCF", organism="Homo sapiens")
# datasets[0].n_files -> 629 (1 bulk BED + 628 per-experiment BEDs)
# datasets[0].publication_doi -> "10.1093/nar/gkab996"
files = adapter.list_files(datasets[0])
# files[0].url -> "http://remap.univ-amu.fr/storage/remap2022/hg38/MACS2/TF/CTCF/..."
from biotrax.exporters.reli_index import write_reli_index, write_reli_index_annotated
from pathlib import Path
# Build RBP-RELI / TF-RELI library index from downloaded peak BEDs
write_reli_index(track_files, local_paths, "CLIPseq.my_run.index")
# Output (tab-delimited, no header, two columns):
# QKI_K562_ENCODE_ENCFF786UOW /path/to/ENCFF786UOW.bed.gz
# CTCF_K562_ENCODE_ENCFF001XYZ /path/to/ENCFF001XYZ.bed.gz
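The two-column layout shown in the output above is simple enough to reproduce by hand. The sketch below assumes the label is the target, biosample, source, and file ID joined by single underscores, as the sample output suggests; `reli_label` and `write_two_column_index` are hypothetical names and may not match what `write_reli_index` does internally.

```python
import csv
import io


def reli_label(target, biosample, source, file_id):
    """Join the four fields with single underscores, as in the sample output."""
    return "_".join([target, biosample, source, file_id])


def write_two_column_index(rows, handle):
    """Write (label, path) pairs as tab-delimited text with no header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for label, path in rows:
        writer.writerow([label, path])


rows = [
    (reli_label("QKI", "K562", "ENCODE", "ENCFF786UOW"),
     "/path/to/ENCFF786UOW.bed.gz"),
]
buffer = io.StringIO()  # in-memory stand-in for the index file
write_two_column_index(rows, buffer)
index_text = buffer.getvalue()
```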
Download layout
Files land in a self-describing, deterministic hierarchy under F:\BioTrax\:
F:\BioTrax\
ENCODE\GRCh38\QKI__eCLIP__K562__ENCSR366YOG\
ENCFF786UOW.bed.gz
manifest.tsv <- file-level: filename, url, md5, size_bytes, format
metadata.json <- full Dataset: DOI, PMID, design_note, all fields
README.md <- human-readable per-dataset summary
ChIP-Atlas\GRCh38\TFs-and-others__CTCF__All-cell-types__05__hg38\
Oth.ALL.05.CTCF.AllCell.bed
manifest.tsv
metadata.json
README.md
SRA-ENA\SRP123456\SRR12345678\
SRR12345678_1.fastq.gz
SRR12345678_2.fastq.gz
manifest.tsv
metadata.json
README.md
GEO\GSE78509__IGF2BP1-H9ES-eCLIP\
GSM2071742_IGF2BP1_H9ES_Rep1_eCLIP.InputNormalizedPeaks.bed.gz
manifest.tsv
metadata.json
README.md
_index\
biotrax_downloads_index.tsv <- append-only master log of all downloads
Slug rule (deterministic, documented): spaces/slashes replaced with -;
consecutive - collapsed; leading/trailing - stripped; truncated at 80 chars.
Logical compound names use __ (double underscore) as separator so single
underscores in gene names (e.g. QKI_5) survive unchanged.
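The slug rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the package's implementation: `slugify` and `leaf_folder_name` are hypothetical names, "slashes" is read as both `/` and `\`, the steps run in the order listed above, and truncation is applied per component.

```python
import re


def slugify(text, max_len=80):
    """Apply the slug rule in the documented order."""
    slug = re.sub(r"[ /\\]", "-", text)  # spaces and slashes become '-'
    slug = re.sub(r"-{2,}", "-", slug)   # collapse runs of '-'
    slug = slug.strip("-")               # strip leading/trailing '-'
    return slug[:max_len]                # truncate at 80 characters


def leaf_folder_name(*parts):
    """Join slugged components with '__' so single underscores survive."""
    return "__".join(slugify(part) for part in parts)


folder = leaf_folder_name("QKI", "eCLIP", "K562", "ENCSR366YOG")
antigen_class = slugify("TFs and others")
gene_with_underscore = slugify("QKI_5")
```

Because the compound separator is a double underscore, a folder name can be split back into its components with `folder.split("__")` even when a gene name itself contains an underscore.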
Unified schema
Dataset(
source, # "ENCODE" | "ChIP-Atlas" | "ReMap" | "GEO" | "SRA-ENA"
dataset_id, # primary accession (ENCSR000AKW, GSE12345, remap2022_CTCF_hg38)
target, # protein/antigen (e.g. "QKI", "CTCF") — None for GEO series
assay, # assay type (e.g. "eCLIP", "ChIP-seq", "RNA-seq")
biosample, # cell line / tissue (e.g. "K562", "HepG2")
organism, # defaults "Homo sapiens"

genome, # normalized assembly; defaults to "GRCh38"
n_files, # count of downloadable files (0 = unknown)
publication_doi, # resolved DOI (None if not found — never fabricated)
pmid, # PubMed ID
design_note, # <=2-sentence human-readable experiment setup
source_url, # canonical portal URL
retrieved, # ISO date of when BioTrax fetched this record
extra, # source-specific dict (never fabricated)
ignored_filters, # filter keys this source could not apply (transparency)
)
TrackFile(
source, dataset_id, file_id,
target, assay, biosample, genome,
output_type, # "peaks" | "signal" | "IDR thresholded peaks" | ...
file_format, # "bed" | "narrowPeak" | "bigBed" | "bigWig" | ...
url, # direct download URL (HTTPS)
size, # bytes (when available)
md5, # expected md5 (when available)
parent, # back-reference to owning Dataset
)
SeqRun(
run_accession, # e.g. SRR12345678
study, # SRP / PRJNA accession
sample, # SRS / SAMN accession
library_strategy, # "RNA-Seq" | "ChIP-Seq" | "ATAC-seq" | ...
library_layout, # "PAIRED" | "SINGLE"
instrument, # e.g. "Illumina NovaSeq 6000"
read_count, # total reads
organism, # defaults "Homo sapiens"
fastq_urls, # ENA HTTPS FASTQ URLs [R1, R2] or [SE]
fastq_md5, # parallel md5 list
parent, # back-reference to owning Dataset
)
Pipeline exporters
from biotrax.exporters.nfcore_rnaseq import write_nfcore_samplesheet
from biotrax.exporters.reli_index import write_reli_index, write_reli_index_annotated
from biotrax.exporters.worksheet import write_worksheet
from biotrax.exporters.descriptor import write_dataset_readme, write_set_readme
# nf-core/rnaseq samplesheet from SRA runs
# Columns: sample, fastq_1, fastq_2, strandedness
write_nfcore_samplesheet(runs, "samplesheet.csv", strandedness="reverse")
# Minimal RELI binary-compatible index (2-column, tab-delimited, no header)
write_reli_index(track_files, local_paths, "CLIPseq.my_run.index")
# Annotated RELI index with provenance columns
write_reli_index_annotated(track_files, local_paths, "CLIPseq.my_run.index.annotated.tsv")
# Generic tidy sample worksheet (TSV or CSV)
write_worksheet(datasets, track_files=files, out_path="results.tsv")
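For readers who want to see what the four-column samplesheet looks like, here is a minimal sketch built with the `csv` module. Plain dicts stand in for SeqRun objects, the run accession is used as the sample name, and local file names are used in place of URLs; all three are assumptions, and `samplesheet_rows` is a hypothetical helper rather than part of BioTrax.

```python
import csv
import io

COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def samplesheet_rows(runs, strandedness="auto"):
    """Yield one row per run; single-end runs leave fastq_2 empty."""
    for run in runs:
        urls = run["fastq_urls"]
        yield {
            "sample": run["run_accession"],
            "fastq_1": urls[0],
            "fastq_2": urls[1] if len(urls) > 1 else "",
            "strandedness": strandedness,
        }


runs = [
    {"run_accession": "SRR12345678",
     "fastq_urls": ["SRR12345678_1.fastq.gz", "SRR12345678_2.fastq.gz"]},
    {"run_accession": "SRR00000001",  # hypothetical single-end run
     "fastq_urls": ["SRR00000001.fastq.gz"]},
]

buffer = io.StringIO()  # in-memory stand-in for samplesheet.csv
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(samplesheet_rows(runs, strandedness="reverse"))
sheet_lines = buffer.getvalue().splitlines()
```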
Flexible filtering
Every source exposes a filters dict for source-specific constraints.
Unrecognized keys are reported in Dataset.ignored_filters — never
silently dropped.
# ENCODE-specific filters
datasets = adapter.search(
target="QKI",
filters={
"assay_title": "eCLIP", # exact ENCODE assay_title
"lab": "Gene Yeo, UCSD", # lab filter
"status": "released", # experiment status
"date_released": "2020-01-01", # released on or after
}
)
# GEO-specific filters
datasets = geo_adapter.search(
filters={
"gse": "GSE78509", # exact accession lookup
"supplementary_file_type": "BW", # require BigWig supplementary
"date_from": "2018/01/01",
"min_samples": 4,
}
)
# ChIP-Atlas-specific filters
datasets = chipatlas_adapter.search(
target="CTCF",
filters={
"antigen_class": "TFs and others",
"cell_type_class": "Blood",
"threshold": "10", # MACS2 q-value threshold (05/10/20)
}
)
# SRA/ENA-specific filters
datasets = sra_adapter.search(
filters={
"library_strategy": "RNA-Seq",
"library_layout": "PAIRED",
"min_read_count": 10000000,
}
)
Use biotrax list-sources (CLI) or the list_sources() MCP tool to see all
filter keys per source.
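The idea behind ignored_filters can be shown with a small partitioning function. This is only a sketch of the concept: `split_filters` is a hypothetical name, and the adapters' real handling may differ in detail.

```python
def split_filters(filters, supported):
    """Partition a filters dict into (applied, ignored_keys)."""
    applied = {key: value for key, value in filters.items() if key in supported}
    ignored = sorted(key for key in filters if key not in supported)
    return applied, ignored


# "lab" is an ENCODE filter, so an SRA/ENA-style adapter would report it back.
applied, ignored = split_filters(
    {"library_strategy": "RNA-Seq", "lab": "Gene Yeo, UCSD"},
    supported={"library_strategy", "library_layout", "min_read_count"},
)
```

Returning the ignored keys alongside the results lets a caller, human or agent, notice that a constraint was not applied instead of assuming the result set honors it.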
Contributing: add a source adapter
BioTrax is architected so each source is an isolated, independently testable module. Adding a new source takes four steps:
1. Create biotrax/sources/myadapter.py
from biotrax.sources.base import SourceAdapter
from biotrax.core import Dataset, TrackFile
class MyAdapter(SourceAdapter):
name = "MySource"
description = "One-line description of the source."
supported_filters = ["filter_key_1", "filter_key_2"]
capabilities = {"search", "list_files", "download"}
def search(self, target=None, assay=None, biosample=None,
genome="GRCh38", organism="Homo sapiens",
filters=None, raw_query=None, limit=50):
filters = filters or {}
ignored = self._extract_ignored(filters, self.supported_filters)
# ... call your API, build Dataset objects ...
return datasets # list[Dataset]
def list_files(self, dataset_or_id, *, output_type=None,
file_format=None, genome="GRCh38"):
acc = self._dataset_id(dataset_or_id)
# ... fetch file list, build TrackFile objects ...
return track_files # list[TrackFile]
2. Set name, description, supported_filters, capabilities
The supported_filters list tells list_sources() (and AI agents) exactly
what filter keys this source understands. Any key in filters not in this
list is automatically collected into ignored_filters via
self._extract_ignored() — the transparency protocol.
3. Register in biotrax/server.py
from biotrax.sources.myadapter import MyAdapter
_PEAK_ADAPTERS["MySource"] = MyAdapter # or _ALL_ADAPTERS for seq sources
4. Add live-API tests in tests/test_myadapter.py
Use pytest -k myadapter to run only your tests. Follow the pattern in
tests/test_base_adapter.py for the ABC contract and
tests/test_exporters.py for exporter tests.
Anti-hallucination rule: verify against the live API before merging. If the API can't deliver something cleanly, document the caveat in the adapter docstring — never fabricate results.
Defaults
- Genome: GRCh38 (alias "hg38" accepted everywhere)
- Organism: Homo sapiens
- Download root: F:\BioTrax
- Output type default: peaks (for peak sources)
- Strandedness default: "auto" (for nf-core samplesheets)
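The genome default and the "hg38" alias could be handled by a small normalizer along these lines. This is a sketch under assumptions: `normalize_genome` and `GENOME_ALIASES` are hypothetical names, only the alias stated above is included, and case-insensitive matching is a choice made for this example.

```python
# Only the alias named in this README; a real table would likely hold more.
GENOME_ALIASES = {"hg38": "GRCh38"}


def normalize_genome(name=None, default="GRCh38"):
    """Map an alias to its canonical assembly name, falling back to the default."""
    if not name:
        return default
    key = name.strip()
    # Alias lookup is case-insensitive (assumption); unknown names pass through.
    return GENOME_ALIASES.get(key.lower(), key)


canonical = normalize_genome("hg38")
fallback = normalize_genome(None)
unchanged = normalize_genome("GRCh38")
```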
License
MIT — see LICENSE.
Citation
If BioTrax is useful to your work, please also cite the databases it queries:
- ENCODE: ENCODE Project Consortium (2012) Nature 489:57-74
- ChIP-Atlas: Oki et al. (2018) NAR [doi:10.1093/nar/gky488]; Oki et al. (2024) NAR [doi:10.1093/nar/gkae358]
- ReMap: Hammal et al. (2022) NAR [doi:10.1093/nar/gkab996]
- GEO: Barrett et al. (2013) NAR [doi:10.1093/nar/gks1193]
- ENA/SRA: Leinonen et al. (2011) NAR [doi:10.1093/nar/gkq967]