research-loop
Enables AI-moderated interviews that become cited, searchable transcripts, with a repository Q&A tool that returns exact transcript quotes.
research-loop is a self-contained slice of an AI research platform: an AI-moderated interview turns a discussion guide into a real conversation, each finished session is automatically distilled into a summary, chapters, highlights and tags, and every transcript folds into a semantically searchable repository where every answer cites the exact transcript moment it came from. The whole repository is exposed as an MCP server, and an eval harness keeps the AI honest with published numbers.
It is one coherent product that touches all four project areas in Great Question's internship posting: semantic search across interview content, a realtime agentic AI moderator, MCP tool structuring, and evals across the tools and the moderator.
Requirements: Node 20–24. Node 26 breaks next build (a Node 26 × Next 15.5.19 resolver incompatibility; see Known issues); npm run dev, npm run typecheck, and npm test are unaffected.
60-second reviewer quickstart
The fastest path is MCP — the demo is meant to be queried, not clicked. Point Claude at the deployed server:
claude mcp add --transport http loop https://research-loop-ten.vercel.app/api/mcp \
--header "Authorization: Bearer <token-from-application>"
The live endpoint is gated by a bearer token (shared in the application
materials). Running locally, leave MCP_BEARER_TOKEN unset and drop the
--header. Once added, suggested first question:
"Ask the repository: why does this candidate want to work at Great Question?"
Claude calls the ask_repository tool and answers with cited quotes drawn
from the candidate's own AI-moderated interview — each citation deep-links to
the exact transcript moment. The demo is the cover letter.
The five MCP tools: list_sessions, get_session, search_repository,
ask_repository, get_eval_results.
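For reviewers who would rather script against the endpoint than go through the Claude CLI, the sketch below builds the HTTP request for an MCP `tools/call`. It only constructs the request; a real MCP client typically performs the `initialize` handshake first, which is omitted here. The `question` argument name and the `buildToolCall` helper are assumptions for illustration, not taken from the repository.

```typescript
// JSON-RPC 2.0 envelope used by MCP's `tools/call` method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface HttpRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds (but does not send) the request.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
  bearerToken?: string,
): HttpRequestInit {
  const payload: ToolCallRequest = {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    // A Streamable HTTP server may answer with plain JSON or an SSE stream.
    Accept: "application/json, text/event-stream",
  };
  // Locally MCP_BEARER_TOKEN is unset, so the header is simply left out.
  if (bearerToken) headers["Authorization"] = `Bearer ${bearerToken}`;
  return { method: "POST", headers, body: JSON.stringify(payload) };
}

// ASSUMPTION: `question` is a guess at the ask_repository argument name.
const localRequest = buildToolCall(1, "ask_repository", {
  question: "Why does this candidate want to work at Great Question?",
});
```

The returned object can be passed straight to `fetch` with the `/api/mcp` URL as the first argument.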
Run it locally
git clone <repo-url> research-loop && cd research-loop
npm install
cp .env.example .env.local # then fill in keys (see below)
npm run db:init # apply the SQLite schema
npm run seed # seed guides + interview transcripts
npm run dev # http://localhost:3000
npm run db:init and the transcript-seeding stage of npm run seed run
without any API keys. The analysis stage (embeddings + summary/chapters/
highlights/tags) only runs when keys are present; after adding keys, run
npm run seed -- --analyze-only to backfill it.
Environment keys (see .env.example for the annotated list):
A single OpenRouter key powers everything — its OpenAI-compatible API serves both chat (Claude + GPT models) and embeddings.
| Var | Needed for |
|---|---|
| OPENROUTER_API_KEY | everything: analysis, Ask synthesis, embeddings, eval judge/rerank |
| ANTHROPIC_MODEL | quality model, OpenRouter id (default nex-agi/nex-n2-pro:free, which runs free) |
| ANTHROPIC_FALLBACK_MODELS | comma-separated free fallback chain, tried in order only when the prior model 429s/errors (default openai/gpt-oss-120b:free,google/gemma-4-31b-it:free) |
| ANTHROPIC_FAST_MODEL | cheap model for rerank + persona bots (default openai/gpt-4o-mini) |
| OPENAI_EMBED_MODEL | embeddings (default openai/text-embedding-3-small) |
| DATABASE_URL | libSQL file (default file:./data/research-loop.db) |
| PUBLIC_BASE_URL | base for deep links in MCP/Ask responses |
| MCP_BEARER_TOKEN | optional; gates the MCP endpoint (unset = open) |
Voice needs a direct OpenAI key. OpenRouter does not proxy OpenAI's Realtime API, so with an OpenRouter key voice mode is unavailable and the UI falls back to text mode (the dependable path anyway — see limitations).
No .env is committed and no key is required to typecheck or run the tests —
imports stay lazy with respect to the environment, so the token-free test suite
is green on a clean machine.
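One way to get that "lazy with respect to the environment" behaviour is to hand out getters instead of values, so nothing is read or validated at import time. The sketch below is an illustration under that assumption, not the contents of lib/env.ts; `lazyEnv` and `EnvSource` are hypothetical names, and the environment is passed in explicitly so the example runs anywhere.

```typescript
type EnvSource = Record<string, string | undefined>;

// Returns a getter; the variable is read and checked only when the getter runs.
function lazyEnv(source: EnvSource, name: string, fallback?: string): () => string {
  let cached: string | undefined;
  return () => {
    if (cached !== undefined) return cached;
    const value = source[name] ?? fallback;
    if (value === undefined) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    cached = value;
    return value;
  };
}

// Defining the getters never throws, even on a machine with no keys...
const emptyEnv: EnvSource = {};
const openRouterKey = lazyEnv(emptyEnv, "OPENROUTER_API_KEY");
const databaseUrl = lazyEnv(emptyEnv, "DATABASE_URL", "file:./data/research-loop.db");

// ...and only code paths that actually call openRouterKey() need the key.
const resolvedDbUrl = databaseUrl();
```

In an app, `source` would be `process.env`; tests that never call a key getter stay token-free.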
What's inside
┌──────────────────────────────────────────────────────────────┐
│  Next.js 15 app (App Router): one deployable artifact        │
│                                                              │
│  /                 landing + "start interview"               │
│  /interview/[id]   text moderator (voice: experimental)      │
│  /sessions         completed sessions list                   │
│  /sessions/[id]    transcript + chapters + highlights        │
│  /ask              repository Q&A with cited quotes          │
│  /api/mcp          MCP server (Streamable HTTP)              │
│  /api/...          session + interview + ask endpoints       │
└───────────────┬──────────────────────────────────────────────┘
                │
   ┌────────────┼─────────────────┬────────────────────┐
   ▼            ▼                 ▼                    ▼
 OpenAI      OpenRouter        OpenRouter        libSQL / SQLite
 Realtime    (chat: analysis,  (embeddings:      sessions, segments,
 (voice:     Ask synthesis,    text-embedding-   embeddings (blob),
 direct      judge, rerank)    3-small)          chapters, highlights,
 OpenAI key                                      tags, eval_runs
 only)
The single OpenRouter key serves both chat and embeddings; only the experimental voice path needs a direct OpenAI Realtime key.
Routes
| Route | Kind | Purpose |
|---|---|---|
| / | page | Landing: pitch, a card per guide, the claude mcp add line |
| /interview/[id] | page | Text-mode interview client; ?mode=voice for the experimental WebRTC voice client |
| /sessions | page | List of all sessions with status, label, guide, date |
| /sessions/[id] | page | Transcript + summary + chapters + highlights + tags; #t=<ms> scroll-highlights a moment |
| /ask | page | Natural-language Q&A with inline cited quotes |
| /api/mcp | route | MCP Streamable HTTP endpoint (GET/POST/DELETE) |
| /api/sessions | route | POST create a session |
| /api/interview/[id]/turn | route | POST one text-mode moderator exchange |
| /api/interview/[id]/segment | route | POST persist one voice transcript segment |
| /api/interview/[id]/end | route | POST finish + analyze a session |
| /api/ask | route | POST repository Q&A (cited answer) |
| /api/realtime/token | route | POST mint an ephemeral OpenAI Realtime token |
The moderator brain — the "realtime agentic AI moderator" pattern
The moderator isn't a script reader. Its behavior is the sum of three things:
instructions (lib/moderator/instructions.ts — warm/neutral persona, a
probing rule of ≤2 follow-ups on shallow answers, no leading questions, a
time-box, a consent open and an "anything I expected you to ask?" close),
explicit state (ModeratorState in lib/moderator/textLoop.ts tracks
covered topic ids and probes-per-topic so guide progress is real state, not
vibes), and structured tool outputs (each turn returns a Zod-validated
{ utterance, covered_topic_ids, probe_topic_id, phase }). The same
buildInstructions drives both the voice (Realtime) session and the text loop,
so a prompt change is felt in both modes and is exercised by the evals. That
instructions + state + structured-output loop is exactly what "realtime
agentic AI moderator" means in the posting.
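The state-plus-structured-output half of that loop can be sketched roughly as follows. The repository validates turns with Zod; this example uses a hand-written guard instead so it stays dependency-free. The phase names, the `ModeratorState` field names, and the choice to mark a topic covered once its probe budget is spent are all assumptions made for illustration.

```typescript
// ASSUMPTION: phase names are hypothetical; the README only names a `phase` field.
type Phase = "consent" | "guide" | "close";
const PHASES: readonly string[] = ["consent", "guide", "close"];

interface ModeratorTurn {
  utterance: string;
  covered_topic_ids: string[];
  probe_topic_id: string | null;
  phase: Phase;
}

// ASSUMPTION: field names are illustrative, not the real ModeratorState.
interface ModeratorState {
  coveredTopicIds: string[];
  probesPerTopic: Record<string, number>;
}

const MAX_PROBES_PER_TOPIC = 2; // the "≤2 follow-ups" probing rule

// Stand-in for the Zod schema: reject any model output that is off-shape.
function parseTurn(raw: unknown): ModeratorTurn {
  if (typeof raw !== "object" || raw === null) throw new Error("turn must be an object");
  const { utterance, covered_topic_ids, probe_topic_id, phase } = raw as Record<string, unknown>;
  if (typeof utterance !== "string" || utterance.length === 0) {
    throw new Error("utterance must be a non-empty string");
  }
  if (!Array.isArray(covered_topic_ids) || !covered_topic_ids.every((id) => typeof id === "string")) {
    throw new Error("covered_topic_ids must be an array of strings");
  }
  if (probe_topic_id !== null && typeof probe_topic_id !== "string") {
    throw new Error("probe_topic_id must be a string or null");
  }
  if (typeof phase !== "string" || !PHASES.includes(phase)) {
    throw new Error("phase is not a known value");
  }
  return {
    utterance,
    covered_topic_ids: covered_topic_ids as string[],
    probe_topic_id: probe_topic_id as string | null,
    phase: phase as Phase,
  };
}

// Pure state transition: returns a new state, never mutates the old one.
function applyTurn(state: ModeratorState, turn: ModeratorTurn): ModeratorState {
  const covered = new Set([...state.coveredTopicIds, ...turn.covered_topic_ids]);
  const probes = { ...state.probesPerTopic };
  const probeId = turn.probe_topic_id;
  if (probeId !== null) {
    const used = probes[probeId] ?? 0;
    if (used >= MAX_PROBES_PER_TOPIC) {
      covered.add(probeId); // budget spent: stop probing and move on
    } else {
      probes[probeId] = used + 1;
    }
  }
  return { coveredTopicIds: [...covered], probesPerTopic: probes };
}
```

Keeping the state a plain JSON-serializable object means it could be persisted per session without changing the transition logic.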
Eval results
Numbers below are from an npm run evals -- --quick run on the default models (quality/judge nex-agi/nex-n2-pro:free, the free quality model the demo actually ships; rerank openai/gpt-4o-mini; embeddings openai/text-embedding-3-small). --quick is a 3-persona / 10-retrieval / 6-Ask subset; the full suite (npm run evals) covers all 12 personas, ~28 retrieval pairs and ~20 Ask questions. Regenerate any time; evals/REPORT.md is overwritten on each run. The harness fails soft without a key.
The three suites (see evals/ and the moderator/retrieval/faithfulness design):
Moderator quality — scripted participant personas (terse, rambly, off-topic, hostile, over-sharer, …) run automated text interviews against the moderator; an LLM judge (rubric 1–5 with rationale and two calibration examples) scores each dimension; mean over seeds.
| Dimension | Score (1–5) |
|---|---|
| Coverage | 4.00 |
| Probing | 3.67 |
| Neutrality | 5.00 |
| Flow | 4.67 |
Retrieval quality — hand-written question → gold-segment pairs over the seeded sessions, embedding-only vs. embedding + rerank.
| Pipeline | recall@5 | recall@10 | MRR |
|---|---|---|---|
| Embedding-only | 90% | 100% | 0.814 |
| Embedding + rerank | 100% | 100% | 0.950 |
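For readers unfamiliar with the two metrics in the table, the sketch below shows how recall@k and MRR are conventionally computed when each question has exactly one gold segment. The type and function names are hypothetical and the harness in evals/ may differ in detail.

```typescript
interface RetrievalCase {
  goldSegmentId: string;
  rankedSegmentIds: string[]; // pipeline output, best match first
}

// Fraction of questions whose gold segment appears in the top k results.
function recallAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.rankedSegmentIds.slice(0, k).includes(c.goldSegmentId),
  ).length;
  return hits / cases.length;
}

// Mean of 1/rank of the gold segment; a miss contributes 0.
function meanReciprocalRank(cases: RetrievalCase[]): number {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const c of cases) {
    const index = c.rankedSegmentIds.indexOf(c.goldSegmentId);
    if (index >= 0) total += 1 / (index + 1);
  }
  return total / cases.length;
}

// Toy data: gold found at rank 1, rank 2, and rank 6.
const toyCases: RetrievalCase[] = [
  { goldSegmentId: "s1", rankedSegmentIds: ["s1", "s2", "s3"] },
  { goldSegmentId: "s5", rankedSegmentIds: ["s4", "s5", "s6"] },
  { goldSegmentId: "s9", rankedSegmentIds: ["a", "b", "c", "d", "e", "s9"] },
];
const toyRecallAt5 = recallAtK(toyCases, 5); // 2 of 3 cases hit
const toyMrr = meanReciprocalRank(toyCases); // (1 + 1/2 + 1/6) / 3
```

With the 10-pair --quick subset, 90% recall@5 corresponds to 9 of the 10 gold segments landing in the top five.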
Citation faithfulness — ask_repository answers checked by a verifier:
does every cited quote exist and actually support the claim it's attached to?
| Metric | Value |
|---|---|
| % claims cited | 100% |
| % citations faithful | 100% |
| % answers w/ all quotes verbatim | 100% |
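The "all quotes verbatim" metric is the one part of this suite that can be checked without a model. A minimal sketch, assuming each citation carries a segment id plus the quoted text and that whitespace differences are ignored; whether a quote supports its claim still needs the verifier. All names here are hypothetical.

```typescript
interface Citation {
  segmentId: string;
  quote: string;
}

// Collapse runs of whitespace so line wraps do not cause false mismatches.
function normalizeWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// True when the quote appears character-for-character in the cited segment.
function isVerbatim(citation: Citation, segments: Record<string, string>): boolean {
  const source = segments[citation.segmentId];
  if (source === undefined) return false; // cited segment does not exist
  const quote = normalizeWhitespace(citation.quote);
  return quote.length > 0 && normalizeWhitespace(source).includes(quote);
}

// ASSUMPTION: an answer with no citations counts as a failure, not a vacuous pass.
function allQuotesVerbatim(citations: Citation[], segments: Record<string, string>): boolean {
  return citations.length > 0 && citations.every((c) => isVerbatim(c, segments));
}
```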
Evals are how I knew when to stop prompt-tuning — prompt changes were kept or reverted based on these numbers. The rerank's lift (MRR 0.81 → 0.95) is exactly the kind of signal the harness exists to surface.
Decisions & tradeoffs
- Realtime API, not a hand-rolled STT→LLM→TTS pipeline. OpenAI's Realtime API gives natural low-latency voice with built-in turn-taking; the moderator "brain" lives in the session instructions + tool calls. A hand-rolled pipeline would be more controllable but cost days of latency-tuning the demo doesn't need. Voice is shipped as experimental; text mode is the dependable path (see limitations).
- No vector DB, on purpose. Embeddings are stored as float32 blobs and scored with brute-force cosine over a few hundred segments: microseconds, no infra to babysit. At Great Question scale (tens of thousands of interview hours) this flips: you'd want an ANN index (HNSW/IVF), smarter chunking than one-row-per-turn, and any retrieval change gated behind the retrieval eval before it ships. Knowing when not to reach for a vector DB is the point.
- Small MCP surface, on purpose. Five tools, not twenty-five. Each description states when to use it and when not to (e.g. search_repository says "use ask_repository instead when you want a synthesized answer"), and documents its exact response shape. lib/mcp/tools.ts is the single source of truth, consumed by both the route handler and the tests.
- In-memory moderator state. Covered-topic / probe tracking lives in a per-process map keyed by session id. A restart mid-interview resets that tracking, but the transcript is durable in the DB and is replayed into every moderatorStep call, so the model re-derives context: graceful degradation, not breakage. A multi-process deployment would persist it.
- PII scrub before any third-party call. lib/pii.ts redacts emails / phones / long digit runs, and analysis scrubs once and reuses the scrubbed text for both the model call and embeddings, so raw text never leaves the process, mirroring Great Question's PII-masking approach, cheaply.
- Text-mode fallback de-risks voice. The same guide and the same brain run over a plain text loop, so the demo works with no mic and a live-demo mic failure can't sink it.
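The "no vector DB" decision amounts to very little code. The sketch below decodes a float32 blob and ranks segments by brute-force cosine similarity; it assumes the blob holds raw little-endian float32 values (the platform default almost everywhere), and the `StoredSegment` shape and function names are hypothetical rather than the ones in lib/search/.

```typescript
interface StoredSegment {
  id: string;
  embedding: Uint8Array; // raw bytes as read from the DB blob column
}

// Reinterpret the blob's bytes as float32 values.
function blobToVector(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) throw new Error("blob length is not a multiple of 4");
  // Copy first so the view starts at a 4-byte-aligned offset.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}

function cosine(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every segment against the query and keep the k best.
function topK(query: Float32Array, segments: StoredSegment[], k: number): { id: string; score: number }[] {
  return segments
    .map((s) => ({ id: s.id, score: cosine(query, blobToVector(s.embedding)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Helper for the example: encode a vector the way it would be stored.
function vectorToBlob(values: number[]): Uint8Array {
  return new Uint8Array(new Float32Array(values).buffer);
}
```

The scan is linear in the number of segments, which is why it stops being a good idea once the corpus grows by a few orders of magnitude.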
Security
The deployed app is public, so it's hardened to a level a reviewer can trust at a glance:
- Secrets never reach the client. The OpenRouter key and Turso token are server-only env vars; no NEXT_PUBLIC_ exposure, nothing in the browser bundle. Verified against every "use client" component.
- Every API route validates input with Zod (typed bodies, length caps), all DB access is parameterized (no SQL injection), and PII is scrubbed before any third-party call.
- The MCP endpoint requires a bearer token (MCP_BEARER_TOKEN) in production; unauthorized requests get a 401.
- Per-IP rate limiting on the cost-bearing and write endpoints, and generic error responses: internal errors are logged server-side, never returned to the caller (no stack traces, paths, or DB/model details leak).
- Security headers on every response: HSTS, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy, Permissions-Policy, and a CSP locking down frame-ancestors / object-src / base-uri.
- Free by construction: no paid model in the quality path. The quality model (analysis, Ask synthesis, eval judge) defaults to a free OpenRouter reasoning model (nex-agi/nex-n2-pro:free), and when it hits its daily cap, complete() in lib/llm/anthropic.ts transparently walks a chain of other free models (openai/gpt-oss-120b:free → google/gemma-4-31b-it:free, deliberately different providers) until one answers. So the live demo can't flake when one free model is rate-limited, and the quality pipeline still costs $0; there's no card-on-file exposure to run up at all. (Only the rerank step and embeddings touch a paid model, and at pennies; see below.)
- Cost containment. The cost-bearing endpoints (ask_repository and the interview turns) sit behind the MCP bearer token / rate limiting, and OpenRouter enforces a hard credit cap as a final backstop, so even with the pennies-level rerank + embedding spend, nothing can run up a meaningful bill.
- npm audit is clean of runtime risk. The remaining advisories are all in dev/build tooling (vitest/vite test runner, postcss used only at build time); none ship to production, and the only "fix" downgrades Next.js to v9, so it's intentionally not applied.
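The fallback behaviour described in the list above (try the default model, then walk the chain in order on a 429 or error) can be sketched as below. This is an illustration of the pattern, not the actual complete() in lib/llm/anthropic.ts; the `CompleteFn` signature and both function names are assumptions, and the model call is injected so the example runs without a key.

```typescript
// Hypothetical signature for "call one model with one prompt".
type CompleteFn = (model: string, prompt: string) => Promise<string>;

// Parse a comma-separated chain such as ANTHROPIC_FALLBACK_MODELS.
function parseModelChain(primary: string, fallbacks: string): string[] {
  const rest = fallbacks
    .split(",")
    .map((m) => m.trim())
    .filter((m) => m.length > 0);
  return [primary, ...rest];
}

// Try each model in order; the first one that answers wins.
async function completeWithFallback(
  models: string[],
  prompt: string,
  call: CompleteFn,
): Promise<{ model: string; text: string }> {
  const failures: string[] = [];
  for (const model of models) {
    try {
      const text = await call(model, prompt);
      return { model, text };
    } catch (err) {
      // Record why this model failed, then fall through to the next one.
      failures.push(`${model}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All models failed (${failures.join("; ")})`);
}

const defaultChain = parseModelChain(
  "nex-agi/nex-n2-pro:free",
  "openai/gpt-oss-120b:free,google/gemma-4-31b-it:free",
);
```

If every model in the chain fails, the error surfaces to the caller, so the chain reduces the chance of a rate-limit outage rather than eliminating it.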
Honest caveats: rate limiting is in-memory per serverless instance (best-effort against a single-source flood, not a hard global cap — a production deploy would use a shared store), and the interview endpoints are intentionally open so reviewers can run a live interview.
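To make the "in-memory per serverless instance" caveat concrete, here is a minimal fixed-window limiter of the kind described. It is a sketch under assumed parameters (the limit, the window, and the function name are hypothetical), not the repository's limiter. Because the counters live in a closure, each instance keeps its own view, which is exactly why this is best-effort rather than a global cap.

```typescript
interface WindowBucket {
  windowStart: number; // ms timestamp when the current window opened
  count: number;       // requests seen in the current window
}

// Returns an `allow(key, now)` function backed by a per-process map.
function createFixedWindowLimiter(limit: number, windowMs: number) {
  const buckets = new Map<string, WindowBucket>();
  return function allow(key: string, now: number = Date.now()): boolean {
    const bucket = buckets.get(key);
    if (bucket === undefined || now - bucket.windowStart >= windowMs) {
      // First request from this key, or the previous window has expired.
      buckets.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (bucket.count >= limit) return false; // caller should answer 429
    bucket.count += 1;
    return true;
  };
}

// Example: at most 2 requests per second per client IP.
const allowRequest = createFixedWindowLimiter(2, 1000);
```

A shared store (for example a key-value service reachable from every instance) would replace the `Map` in a production deployment.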
Known issues
next build fails on Node 26. Node 26 changed fs.readlinkSync on regular files (EINVAL → EISDIR), which trips Next 15.5.19's own module resolver with EISDIR: illegal operation on a directory, readlink …. It is not an app-code bug: npm run dev, npm run typecheck, and npm test all pass on Node 26. Use Node 20–24 to build (.nvmrc pins 22, and CI builds on Node 22). Don't "fix" it by upgrading deps.
Honest limitations
- Voice mode is experimental and disabled on the OpenRouter key. The WebRTC + Realtime path is wired end to end, but OpenRouter doesn't proxy OpenAI's Realtime API, so the live demo runs in text mode (the dependable, fully-exercised path). Supply a direct OPENAI_API_KEY to enable voice.
- Research participants are synthetic. Three of the four seeded sessions (Maya, Tomáš, Priya) are clearly labeled (synthetic); only the candidate self-interview is a real person. Synthetic data is labeled everywhere it appears, including in the seed guard that requires the (synthetic) label.
- Single-process state. Moderator progress state is in-memory (above).
- No auth on interview links beyond unguessable ids; the MCP endpoint is open unless MCP_BEARER_TOKEN is set. The demo data is non-sensitive by design.
- Eval n is small. ~12 personas, ~25 retrieval pairs, ~20 ask questions: enough to catch regressions and guide prompt tuning, not a statistical claim.
- Timestamps are estimated, not measured. Seeded transcripts have no real audio; turn durations are derived from word count (~150 wpm) so the timeline is deterministic and the gold set is stable.
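The word-count timing in the last item works out to 400 ms per word at 150 wpm. The sketch below shows one way to derive a deterministic timeline from that and turn a start time into the #t=<ms> deep link used on /sessions/[id]. It assumes turns run back to back with no pauses; the function names are hypothetical and the seed script may differ.

```typescript
const WORDS_PER_MINUTE = 150;
const MS_PER_WORD = 60000 / WORDS_PER_MINUTE; // 400 ms

interface TimedTurn {
  text: string;
  startMs: number;
  endMs: number;
}

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

// ASSUMPTION: turns follow each other with no gap between them.
function estimateTimeline(turns: string[]): TimedTurn[] {
  let cursor = 0;
  return turns.map((text) => {
    const startMs = cursor;
    cursor += Math.round(countWords(text) * MS_PER_WORD);
    return { text, startMs, endMs: cursor };
  });
}

// Builds a link such as <base>/sessions/<id>#t=<ms>.
function deepLink(baseUrl: string, sessionId: string, startMs: number): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/sessions/${encodeURIComponent(sessionId)}#t=${startMs}`;
}

const demoTimeline = estimateTimeline(["one two three", "a b c d e"]);
const demoLink = deepLink("http://localhost:3000/", "abc", demoTimeline[1].startMs);
```

The same input always yields the same timestamps, which is what keeps gold segments stable across reseeds.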
Repo map
research-loop/
├── app/ Next.js routes (pages + /api, incl. /api/mcp)
├── lib/
│ ├── moderator/ guide, instructions builder, text-loop brain
│ ├── analysis/ summary/chapters/highlights/tags pipeline
│ ├── search/ cosine, retrieve, ask-with-citations
│ ├── mcp/ the 5 tool defs (single source of truth)
│ ├── llm/ OpenRouter client (chat + embeddings) + wrappers
│ ├── db/ schema, libSQL client, typed queries
│ ├── ui/ small markdown + time helpers
│ ├── env.ts lazy typed env access
│ └── pii.ts PII scrub
├── evals/ eval harness (npm run evals) → evals/REPORT.md
├── scripts/ init-db, seed, seed-data fixtures
└── tests/ vitest, token-free (CI-safe)
Scripts
| Script | What it does | Keys? |
|---|---|---|
| npm run dev | Next.js dev server | no |
| npm run build | production build (Node 20–24 only, see Known issues) | no |
| npm run typecheck | tsc --noEmit | no |
| npm test | vitest, token-free | no |
| npm run db:init | apply the SQLite schema | no |
| npm run seed | seed guides + transcripts; analysis stage runs only with keys | partial |
| npm run evals | run the eval suites → evals/REPORT.md (spends tokens) | yes |
Working on this repo with an agent? See AGENTS.md for the map, the load-bearing contracts, and how to verify a change.
License
MIT — see LICENSE.