research-loop

Enables AI-moderated interviews that become cited, searchable transcripts, with a repository Q&A tool that returns exact transcript quotes.


research-loop is a self-contained slice of an AI research platform: an AI-moderated interview turns a discussion guide into a real conversation, each finished session is automatically distilled into a summary, chapters, highlights and tags, and every transcript folds into a semantically searchable repository where every answer cites the exact transcript moment it came from. The whole repository is exposed as an MCP server, and an eval harness keeps the AI honest with published numbers.

It is one coherent product that touches all four project areas in Great Question's internship posting: semantic search across interview content, a realtime agentic AI moderator, MCP tool structuring, and evals across the tools and the moderator.

Requirements: Node 20–24. Node 26 breaks next build (a Node 26 × Next 15.5.19 resolver incompatibility — see Known issues); npm run dev, npm run typecheck, and npm test are unaffected.


60-second reviewer quickstart

The fastest path is MCP — the demo is meant to be queried, not clicked. Point Claude at the deployed server:

claude mcp add --transport http loop https://research-loop-ten.vercel.app/api/mcp \
  --header "Authorization: Bearer <token-from-application>"

The live endpoint is gated by a bearer token (shared in the application materials). Running locally, leave MCP_BEARER_TOKEN unset and drop the --header. Once added, suggested first question:

"Ask the repository: why does this candidate want to work at Great Question?"

Claude calls the ask_repository tool and answers with cited quotes drawn from the candidate's own AI-moderated interview — each citation deep-links to the exact transcript moment. The demo is the cover letter.

The five MCP tools: list_sessions, get_session, search_repository, ask_repository, get_eval_results.
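
As a rough illustration of what the tool call in the quickstart looks like on the wire: MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that request object. The helper name `buildAskRequest` and the `question` argument name are assumptions for illustration; the actual input schema is whatever lib/mcp/tools.ts declares.

```typescript
// Shape of a JSON-RPC 2.0 request for MCP's tools/call method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: builds the request an MCP client might send.
// The "question" argument name is an assumption, not the confirmed schema.
function buildAskRequest(id: number, question: string): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "ask_repository", arguments: { question } },
  };
}

const exampleRequest = buildAskRequest(
  1,
  "Why does this candidate want to work at Great Question?",
);
console.log(JSON.stringify(exampleRequest, null, 2));
```

In practice Claude (or any MCP client) builds and sends this for you; the sketch is only meant to show which tool name ends up in the request.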

Run it locally

git clone <repo-url> research-loop && cd research-loop
npm install
cp .env.example .env.local        # then fill in keys (see below)
npm run db:init                   # apply the SQLite schema
npm run seed                      # seed guides + interview transcripts
npm run dev                       # http://localhost:3000

npm run db:init and the transcript-seeding stage of npm run seed run without any API keys. The analysis stage (embeddings + summary/chapters/highlights/tags) only runs when keys are present; after adding keys, run npm run seed -- --analyze-only to backfill it.

Environment keys (see .env.example for the annotated list):

A single OpenRouter key powers everything — its OpenAI-compatible API serves both chat (Claude + GPT models) and embeddings.

| Var | Needed for |
| --- | --- |
| OPENROUTER_API_KEY | everything: analysis, Ask synthesis, embeddings, eval judge/rerank |
| ANTHROPIC_MODEL | quality model, OpenRouter id (default nex-agi/nex-n2-pro:free, which runs free) |
| ANTHROPIC_FALLBACK_MODELS | comma-separated free fallback chain, tried in order only when the prior model 429s/errors (default openai/gpt-oss-120b:free,google/gemma-4-31b-it:free) |
| ANTHROPIC_FAST_MODEL | cheap model for rerank + persona bots (default openai/gpt-4o-mini) |
| OPENAI_EMBED_MODEL | embeddings (default openai/text-embedding-3-small) |
| DATABASE_URL | libSQL file (default file:./data/research-loop.db) |
| PUBLIC_BASE_URL | base for deep links in MCP/Ask responses |
| MCP_BEARER_TOKEN | optional; gates the MCP endpoint (unset = open) |

Voice needs a direct OpenAI key. OpenRouter does not proxy OpenAI's Realtime API, so with an OpenRouter key voice mode is unavailable and the UI falls back to text mode (the dependable path anyway — see limitations).

No .env is committed and no key is required to typecheck or run the tests — imports stay lazy with respect to the environment, so the token-free test suite is green on a clean machine.
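
One way to get that "lazy with respect to the environment" behavior is to defer every read until a value is actually used. The following is a minimal sketch of the idea, not the contents of lib/env.ts: `makeEnv` and its property names are hypothetical, and the env source is passed in as a plain record so the example runs anywhere. The defaults are the ones listed in the environment table.

```typescript
type EnvSource = Record<string, string | undefined>;

// Hypothetical accessor: nothing is read or validated until a getter is called,
// so importing a module that uses it never throws on a machine with no keys.
function makeEnv(source: EnvSource) {
  return {
    // Optional values fall back to the documented defaults.
    get embedModel(): string {
      return source.OPENAI_EMBED_MODEL ?? "openai/text-embedding-3-small";
    },
    get databaseUrl(): string {
      return source.DATABASE_URL ?? "file:./data/research-loop.db";
    },
    // Required values throw only at the point of use.
    get openRouterKey(): string {
      const key = source.OPENROUTER_API_KEY;
      if (!key) throw new Error("OPENROUTER_API_KEY is not set");
      return key;
    },
  };
}

const emptyEnv = makeEnv({});
console.log(emptyEnv.databaseUrl); // default applies, no key needed
```

A test can then import anything that depends on this accessor and only fails if it actually touches a key-backed feature.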


What's inside

┌──────────────────────────────────────────────────────────────┐
│ Next.js 15 app (App Router) — one deployable artifact         │
│                                                               │
│  /                       landing + "start interview"          │
│  /interview/[id]         text moderator (voice: experimental) │
│  /sessions               completed sessions list              │
│  /sessions/[id]          transcript + chapters + highlights   │
│  /ask                    repository Q&A with cited quotes     │
│  /api/mcp                MCP server (Streamable HTTP)          │
│  /api/...                session + interview + ask endpoints   │
└───────────────┬───────────────────────────────────────────────┘
                │
   ┌────────────┼─────────────────┬────────────────────┐
   ▼            ▼                 ▼                    ▼
 OpenAI      OpenRouter        OpenRouter           libSQL / SQLite
 Realtime    (chat: analysis,  (embeddings:         sessions, segments,
 (voice —    Ask synthesis,    text-embedding-      embeddings (blob),
 direct      judge, rerank)    3-small)             chapters, highlights,
 OpenAI key                                         tags, eval_runs
 only)

The single OpenRouter key serves both chat and embeddings; only the experimental voice path needs a direct OpenAI Realtime key.

Routes

| Route | Kind | Purpose |
| --- | --- | --- |
| / | page | Landing: pitch, a card per guide, the claude mcp add line |
| /interview/[id] | page | Text-mode interview client; ?mode=voice for the experimental WebRTC voice client |
| /sessions | page | List of all sessions with status, label, guide, date |
| /sessions/[id] | page | Transcript + summary + chapters + highlights + tags; #t=<ms> scroll-highlights a moment |
| /ask | page | Natural-language Q&A with inline cited quotes |
| /api/mcp | route | MCP Streamable HTTP endpoint (GET/POST/DELETE) |
| /api/sessions | route | POST create a session |
| /api/interview/[id]/turn | route | POST one text-mode moderator exchange |
| /api/interview/[id]/segment | route | POST persist one voice transcript segment |
| /api/interview/[id]/end | route | POST finish + analyze a session |
| /api/ask | route | POST repository Q&A (cited answer) |
| /api/realtime/token | route | POST mint an ephemeral OpenAI Realtime token |
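
The `#t=<ms>` fragment on /sessions/[id] is what makes citations deep-linkable. A minimal sketch of building and parsing such a link follows, assuming the link is simply the base URL (PUBLIC_BASE_URL) plus the session route plus the fragment. The helper names `momentLink` and `parseMoment` are hypothetical and not taken from the repo.

```typescript
// Hypothetical helper: builds a link to a transcript moment.
function momentLink(baseUrl: string, sessionId: string, offsetMs: number): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const ms = Math.max(0, Math.round(offsetMs)); // whole, non-negative milliseconds
  return `${base}/sessions/${encodeURIComponent(sessionId)}#t=${ms}`;
}

// Parses a "#t=<ms>" fragment back into milliseconds, or null if absent/invalid.
function parseMoment(hash: string): number | null {
  const match = /^#t=(\d+)$/.exec(hash);
  return match ? Number(match[1]) : null;
}

console.log(momentLink("http://localhost:3000/", "abc", 12345));
```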

The moderator brain — the "realtime agentic AI moderator" pattern

The moderator isn't a script reader. Its behavior is the sum of three things: instructions (lib/moderator/instructions.ts — warm/neutral persona, a probing rule of ≤2 follow-ups on shallow answers, no leading questions, a time-box, a consent open and an "anything I expected you to ask?" close), explicit state (ModeratorState in lib/moderator/textLoop.ts tracks covered topic ids and probes-per-topic so guide progress is real state, not vibes), and structured tool outputs (each turn returns a Zod-validated { utterance, covered_topic_ids, probe_topic_id, phase }). The same buildInstructions drives both the voice (Realtime) session and the text loop, so a prompt change is felt in both modes and is exercised by the evals. That instructions + state + structured-output loop is exactly what "realtime agentic AI moderator" means in the posting.
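
To make the "explicit state" part concrete, here is a small sketch of how covered topics and probes-per-topic could be folded from each structured turn output, with the two-follow-up cap enforced in code rather than left to the prompt. The field names on `TurnOutput` come from the description above; their exact types, the shape of `ModeratorState`, and the helpers `applyTurn` and `canProbe` are assumptions for illustration, not the code in lib/moderator/textLoop.ts.

```typescript
// Structured output of one moderator turn (field types are assumed).
interface TurnOutput {
  utterance: string;
  covered_topic_ids: string[];
  probe_topic_id: string | null;
  phase: string;
}

// Hypothetical state shape: which topics are covered, and probes used per topic.
interface ModeratorState {
  covered: Set<string>;
  probes: Map<string, number>;
}

const MAX_PROBES_PER_TOPIC = 2; // the "at most 2 follow-ups" probing rule

// Returns a new state with this turn's coverage and probe folded in.
function applyTurn(state: ModeratorState, turn: TurnOutput): ModeratorState {
  const covered = new Set<string>(state.covered);
  turn.covered_topic_ids.forEach((id) => covered.add(id));
  const probes = new Map<string, number>(state.probes);
  if (turn.probe_topic_id !== null) {
    probes.set(turn.probe_topic_id, (probes.get(turn.probe_topic_id) ?? 0) + 1);
  }
  return { covered, probes };
}

// True while the topic still has follow-up budget left.
function canProbe(state: ModeratorState, topicId: string): boolean {
  return (state.probes.get(topicId) ?? 0) < MAX_PROBES_PER_TOPIC;
}
```

Keeping the cap in state means a model that wants a third follow-up can be overruled deterministically, which is also what makes the probing dimension testable.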


Eval results

Numbers below are from an npm run evals -- --quick run on the default models (quality/judge nex-agi/nex-n2-pro:free, the free quality model the demo actually ships; rerank openai/gpt-4o-mini; embeddings openai/text-embedding-3-small). --quick is a 3-persona / 10-retrieval / 6-Ask subset; the full suite (npm run evals) covers all 12 personas, ~28 retrieval pairs and ~20 Ask questions. Regenerate any time: evals/REPORT.md is overwritten on each run. The harness fails soft without a key.

The three suites (see evals/ and the moderator/retrieval/faithfulness design):

Moderator quality — scripted participant personas (terse, rambly, off-topic, hostile, over-sharer, …) run automated text interviews against the moderator; an LLM judge (rubric 1–5 with rationale and two calibration examples) scores each dimension; mean over seeds.

| Dimension | Score (1–5) |
| --- | --- |
| Coverage | 4.00 |
| Probing | 3.67 |
| Neutrality | 5.00 |
| Flow | 4.67 |

Retrieval quality — hand-written question → gold-segment pairs over the seeded sessions, embedding-only vs. embedding + rerank.

| Pipeline | recall@5 | recall@10 | MRR |
| --- | --- | --- | --- |
| Embedding-only | 90% | 100% | 0.814 |
| Embedding + rerank | 100% | 100% | 0.950 |
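
For readers unfamiliar with the two metrics, the sketch below shows how recall@k and MRR are conventionally computed from a ranked result list per question. It assumes exactly one gold segment per question; the `RankedQuery` type and function names are illustrative, not the harness's own.

```typescript
interface RankedQuery {
  goldId: string;       // the hand-labelled gold segment for this question
  rankedIds: string[];  // retrieved segment ids, best first
}

// Fraction of questions whose gold segment appears in the top k results.
function recallAtK(queries: RankedQuery[], k: number): number {
  if (queries.length === 0) return 0;
  const hits = queries.filter((q) => q.rankedIds.slice(0, k).indexOf(q.goldId) !== -1).length;
  return hits / queries.length;
}

// Mean of 1/rank of the gold segment; a miss contributes 0.
function meanReciprocalRank(queries: RankedQuery[]): number {
  if (queries.length === 0) return 0;
  let total = 0;
  queries.forEach((q) => {
    const index = q.rankedIds.indexOf(q.goldId);
    if (index >= 0) total += 1 / (index + 1);
  });
  return total / queries.length;
}
```

Recall@k only asks whether the gold segment made the cut, while MRR rewards putting it near the top, which is why a rerank step can leave recall@10 unchanged and still raise MRR.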

Citation faithfulness: ask_repository answers are checked by a verifier. Does every cited quote exist and actually support the claim it's attached to?

| Metric | Value |
| --- | --- |
| % claims cited | 100% |
| % citations faithful | 100% |
| % answers w/ all quotes verbatim | 100% |
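
The "all quotes verbatim" metric is the one part of this suite that can be checked without a model. A minimal sketch, assuming each citation carries a segment id and a quote string and that whitespace differences should not count as a mismatch; the `Citation` shape and function names are hypothetical. Whether a quote actually supports its claim is a separate judgment this sketch does not attempt.

```typescript
interface Citation {
  segmentId: string;
  quote: string;
}

// Collapse whitespace so line-wrapping differences don't cause false negatives.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, " ").trim();
}

// True if the quote is a non-empty substring of the cited segment's text.
function isVerbatim(citation: Citation, segments: Map<string, string>): boolean {
  const text = segments.get(citation.segmentId);
  if (text === undefined) return false; // cited segment does not exist
  const quote = normalizeText(citation.quote);
  return quote.length > 0 && normalizeText(text).indexOf(quote) !== -1;
}

// True if every citation in an answer passes the verbatim check.
function allQuotesVerbatim(citations: Citation[], segments: Map<string, string>): boolean {
  return citations.every((c) => isVerbatim(c, segments));
}
```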

Evals are how I knew when to stop prompt-tuning — prompt changes were kept or reverted based on these numbers. The rerank's lift (MRR 0.81 → 0.95) is exactly the kind of signal the harness exists to surface.


Decisions & tradeoffs

  • Realtime API, not a hand-rolled STT→LLM→TTS pipeline. OpenAI's Realtime API gives natural low-latency voice with built-in turn-taking; the moderator "brain" lives in the session instructions + tool calls. A hand-rolled pipeline would be more controllable but cost days of latency-tuning the demo doesn't need. Voice is shipped as experimental; text mode is the dependable path (see limitations).
  • No vector DB, on purpose. Embeddings are stored as float32 blobs and scored with brute-force cosine over a few hundred segments — microseconds, no infra to babysit. At Great Question scale (tens of thousands of interview hours) this flips: you'd want an ANN index (HNSW/IVF), smarter chunking than one-row-per-turn, and any retrieval change gated behind the retrieval eval before it ships. Knowing when not to reach for a vector DB is the point.
  • Small MCP surface, on purpose. Five tools, not twenty-five. Each description states when to use it and when not to (e.g. search_repository says "use ask_repository instead when you want a synthesized answer"), and documents its exact response shape. lib/mcp/tools.ts is the single source of truth, consumed by both the route handler and the tests.
  • In-memory moderator state. Covered-topic / probe tracking lives in a per-process map keyed by session id. A restart mid-interview resets that tracking, but the transcript is durable in the DB and is replayed into every moderatorStep call, so the model re-derives context — graceful degradation, not breakage. A multi-process deployment would persist it.
  • PII scrub before any third-party call. lib/pii.ts redacts emails / phones / long digit runs, and analysis scrubs once and reuses the scrubbed text for both the model call and embeddings, so raw text never leaves the process — mirroring Great Question's PII-masking approach, cheaply.
  • Text-mode fallback de-risks voice. The same guide and the same brain run over a plain text loop, so the demo works with no mic and a live-demo mic failure can't sink it.
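
The "no vector DB" decision is easy to picture in code. The sketch below decodes a float32 blob, scores it against a query vector with cosine similarity, and returns the top k by a full scan. It is an illustration under assumptions: the `StoredSegment` shape and function names are hypothetical, and blobs are assumed to be raw float32 values in the platform's native byte order.

```typescript
interface StoredSegment {
  id: string;
  embedding: Uint8Array; // raw float32 bytes, as stored in a blob column
}

// Decode a blob into a Float32Array. The copy guarantees 4-byte alignment.
function blobToVector(blob: Uint8Array): Float32Array {
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, Math.floor(copy.byteLength / 4));
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute force: score every segment, sort descending, keep the best k.
function topK(query: Float32Array, segments: StoredSegment[], k: number) {
  return segments
    .map((s) => ({ id: s.id, score: cosine(query, blobToVector(s.embedding)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The scan is linear in the number of segments, which is fine for a few hundred rows and is exactly the cost that an ANN index would trade away at larger scale.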

Security

The deployed app is public, so it's hardened to a level a reviewer can trust at a glance:

  • Secrets never reach the client. The OpenRouter key and Turso token are server-only env vars; no NEXT_PUBLIC_ exposure, nothing in the browser bundle. Verified against every "use client" component.
  • Every API route validates input with Zod (typed bodies, length caps), all DB access is parameterized (no SQL injection), and PII is scrubbed before any third-party call.
  • The MCP endpoint requires a bearer token (MCP_BEARER_TOKEN) in production; unauthorized requests get a 401.
  • Per-IP rate limiting on the cost-bearing and write endpoints, and generic error responses — internal errors are logged server-side, never returned to the caller (no stack traces, paths, or DB/model details leak).
  • Security headers on every response: HSTS, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, Referrer-Policy, Permissions-Policy, and a CSP locking down frame-ancestors/object-src/base-uri.
  • Free by construction: no paid model in the quality path. The quality model (analysis, Ask synthesis, eval judge) defaults to a free OpenRouter reasoning model (nex-agi/nex-n2-pro:free). When it hits its daily cap, complete() in lib/llm/anthropic.ts transparently walks a chain of other free models (openai/gpt-oss-120b:free, then google/gemma-4-31b-it:free, deliberately different providers) until one answers. So the live demo can't flake when one free model is rate-limited, and the quality pipeline still costs $0, with no card-on-file exposure to run up at all. (Only the rerank step and embeddings touch a paid model, and at pennies; see below.)
  • Cost containment. The cost-bearing endpoints (ask_repository and the interview turns) sit behind the MCP bearer token / rate limiting, and OpenRouter enforces a hard credit cap as a final backstop — so even with the pennies-level rerank + embedding spend, nothing can run up a meaningful bill.
  • npm audit is clean of runtime risk. The remaining advisories are all in dev/build tooling (vitest/vite test runner, postcss used only at build time) — none ship to production, and the only "fix" downgrades Next.js to v9, so it's intentionally not applied.
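
The free-model fallback described in the list above boils down to "try each model in order, return the first success". The following is a generic sketch of that pattern, not the actual complete() in lib/llm/anthropic.ts: `completeWithFallback` and the `Complete` callback type are hypothetical, and the model call is injected so the example needs no network or API key.

```typescript
// Hypothetical signature for a single-model completion call.
type Complete = (model: string, prompt: string) => Promise<string>;

// Try each model in order; return the first answer, or rethrow the last error.
async function completeWithFallback(
  call: Complete,
  models: string[],
  prompt: string,
): Promise<{ model: string; text: string }> {
  let lastError: unknown = new Error("no models configured");
  for (let i = 0; i < models.length; i++) {
    const model = models[i];
    try {
      return { model, text: await call(model, prompt) };
    } catch (err) {
      lastError = err; // e.g. a 429 from a rate-limited free model; move on
    }
  }
  throw lastError;
}
```

With the defaults from the environment table, `models` would be the primary model followed by the entries of ANTHROPIC_FALLBACK_MODELS. Returning the model that answered makes it possible to log which link of the chain was actually used.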

Honest caveats: rate limiting is in-memory per serverless instance (best-effort against a single-source flood, not a hard global cap — a production deploy would use a shared store), and the interview endpoints are intentionally open so reviewers can run a live interview.
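
To show why in-memory rate limiting is only best-effort on serverless, here is a minimal fixed-window limiter keyed by IP. The algorithm choice (fixed window) and the class name are assumptions for illustration; the README does not say which scheme the app uses. The counter lives in a per-process Map, so each serverless instance keeps its own independent count.

```typescript
interface RateWindow {
  start: number; // window start, in ms
  count: number; // requests seen in this window
}

// Hypothetical fixed-window limiter; state is per process, not shared.
class FixedWindowLimiter {
  private windows = new Map<string, RateWindow>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed. `now` is injectable for testing.
  allow(ip: string, now: number = Date.now()): boolean {
    const current = this.windows.get(ip);
    if (!current || now - current.start >= this.windowMs) {
      this.windows.set(ip, { start: now, count: 1 }); // start a fresh window
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

const limiter = new FixedWindowLimiter(5, 60000);
console.log(limiter.allow("203.0.113.7")); // first request in a window is allowed
```

A shared store would replace the Map with an atomic counter that all instances see, which is the production change the caveat refers to.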

Known issues

  • next build fails on Node 26. Node 26 changed fs.readlinkSync on regular files (EINVAL → EISDIR), which trips Next 15.5.19's own module resolver with EISDIR: illegal operation on a directory, readlink …. It is not an app-code bug — npm run dev, npm run typecheck, and npm test all pass on Node 26. Use Node 20–24 to build (.nvmrc pins 22, and CI builds on Node 22). Don't "fix" it by upgrading deps.

Honest limitations

  • Voice mode is experimental and disabled on the OpenRouter key. The WebRTC + Realtime path is wired end to end, but OpenRouter doesn't proxy OpenAI's Realtime API, so the live demo runs in text mode (the dependable, fully-exercised path). Supply a direct OPENAI_API_KEY to enable voice.
  • Research participants are synthetic. Three of the four seeded sessions (Maya, Tomáš, Priya) are clearly labeled (synthetic); only the candidate self-interview is a real person. Synthetic data is labeled everywhere it appears, including in the seed guard that requires the (synthetic) label.
  • Single-process state. Moderator progress state is in-memory (above).
  • No auth on interview links beyond unguessable ids; the MCP endpoint is open unless MCP_BEARER_TOKEN is set. The demo data is non-sensitive by design.
  • Eval n is small. ~12 personas, ~25 retrieval pairs, ~20 ask questions — enough to catch regressions and guide prompt tuning, not a statistical claim.
  • Timestamps are estimated, not measured. Seeded transcripts have no real audio; turn durations are derived from word count (~150 wpm) so the timeline is deterministic and the gold set is stable.
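
The timestamp estimate is simple arithmetic: at 150 words per minute, each word takes 60000 / 150 = 400 ms, so a 30-word turn lasts about 12 seconds. A minimal sketch of laying turns out on a timeline that way follows; the function and type names are hypothetical, and it assumes turns run back to back with no pauses.

```typescript
const WORDS_PER_MINUTE = 150;
const MS_PER_WORD = 60000 / WORDS_PER_MINUTE; // 400 ms per word

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

interface TimedTurn {
  text: string;
  startMs: number;
  endMs: number;
}

// Assign each turn a start/end offset from word count alone, so the same
// transcript always produces the same timeline.
function estimateTimeline(turns: string[]): TimedTurn[] {
  let cursor = 0;
  return turns.map((text) => {
    const startMs = cursor;
    cursor += Math.round(countWords(text) * MS_PER_WORD);
    return { text, startMs, endMs: cursor };
  });
}
```

Because the offsets depend only on the text, gold-segment references in the retrieval eval stay valid across reseeds.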

Repo map

research-loop/
├── app/                    Next.js routes (pages + /api, incl. /api/mcp)
├── lib/
│   ├── moderator/          guide, instructions builder, text-loop brain
│   ├── analysis/           summary/chapters/highlights/tags pipeline
│   ├── search/             cosine, retrieve, ask-with-citations
│   ├── mcp/                the 5 tool defs (single source of truth)
│   ├── llm/                OpenRouter client (chat + embeddings) + wrappers
│   ├── db/                 schema, libSQL client, typed queries
│   ├── ui/                 small markdown + time helpers
│   ├── env.ts              lazy typed env access
│   └── pii.ts              PII scrub
├── evals/                  eval harness (npm run evals) → evals/REPORT.md
├── scripts/                init-db, seed, seed-data fixtures
└── tests/                  vitest, token-free (CI-safe)

Scripts

| Script | What it does | Keys? |
| --- | --- | --- |
| npm run dev | Next.js dev server | no |
| npm run build | production build (Node 20–24 only, see Known issues) | no |
| npm run typecheck | tsc --noEmit | no |
| npm test | vitest, token-free | no |
| npm run db:init | apply the SQLite schema | no |
| npm run seed | seed guides + transcripts; analysis stage runs only with keys | partial |
| npm run evals | run the eval suites → evals/REPORT.md (spends tokens) | yes |

Working on this repo with an agent? See AGENTS.md for the map, the load-bearing contracts, and how to verify a change.

License

MIT — see LICENSE.
