clbench-fireworks-rft
Enables reinforcement fine-tuning of Qwen3-8B on the CLBench poker task using the eval-protocol MCP-Gym framework on Fireworks infrastructure, with structured tool calls via MCP.
Reinforcement Fine-Tuning of Qwen3-8B on the CLBench exploitable_poker task, running on
Fireworks infrastructure via eval-protocol MCP-Gym.
This is a port of the sr-networks/clbench-verifiers
GRPO setup (Will Brown's verifiers framework + PrimeIntellect hosted training) onto Fireworks RFT.
The CLBench poker simulator, action parsing, and reward shaping carry over unchanged; only the
RL-framework glue is rewritten.
Why a port and not a copy
verifiers and Fireworks RFT are different abstractions. What transfers vs. what changes:
| verifiers / Prime (upstream repo) | Fireworks RFT (this repo) | status |
|---|---|---|
| CLBenchEnv(vf.MultiTurnEnv) (env.py) | poker_adapter.py + poker_mcp.py MCP gym | ported |
| task.reset()/.step()/.get_instance_outcomes() | PokerEnv wrapper + poker_act tool | ported |
| rubric.py reward fns | reward.py → test_poker_rft.py evaluator | ported |
| parsing.py + guided_json (vLLM) | MCP tool params are the PokerAction schema | obsolete (tool-calling enforces structure) |
| cl-benchmark poker task | imported unchanged | no change |
| vf.RLTrainer GRPO + local vLLM | firectl create reinforcement-fine-tuning-job | platform |
| TOML configs + prime train | RFT job flags (see Launch) | re-expressed |
Key win: because the poker_act tool's parameters are exactly the PokerAction fields
(action / thinking / amount), the model emits structured tool calls and malformed-JSON
parse failures become impossible — the old guided_json + parse_failure_penalty machinery is no
longer needed.
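To make the idea concrete, here is a minimal stdlib-only sketch of a tool whose parameters mirror the three PokerAction fields. The action vocabulary, the validation rules, and the returned dataclass are assumptions for illustration, not the repo's poker_act implementation; the plain int = -1 sentinel for amount follows the Gotchas section.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary; the real CLBench task defines its own.
LEGAL_ACTIONS = {"fold", "check", "call", "raise"}
NO_AMOUNT = -1  # plain-int sentinel used instead of Optional[int]


@dataclass(frozen=True)
class PokerAction:
    action: str
    thinking: str = ""
    amount: int = NO_AMOUNT


def poker_act(action: str, thinking: str = "", amount: int = NO_AMOUNT) -> PokerAction:
    """Validate tool-call arguments and build a PokerAction (illustrative only)."""
    if action not in LEGAL_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "raise" and amount <= 0:
        raise ValueError("raise requires a positive amount")
    if action != "raise":
        amount = NO_AMOUNT  # ignore amounts on actions that do not take one
    return PokerAction(action=action, thinking=thinking, amount=amount)
```

Because the arguments arrive as typed tool parameters, there is no JSON string to parse. An action can still be semantically illegal, which is what the illegal-action penalty in the evaluator covers.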
Files
| file | role |
|---|---|
| poker_adapter.py | PokerEnv + PokerAdapter: wraps the CLBench task (the env.py port) |
| poker_mcp.py | PokerMcp(McpGym): registers the poker_act tool, control-plane reward/termination |
| server.py | MCP-Gym server launcher (python server.py --port N) |
| reward.py | evaluator scoring (the rubric.py port): mean instance reward + illegal-action penalty |
| test_poker_rft.py | @evaluation_test binding dataset + gym + model + reward |
| make_dataset.py | generates poker_dataset.jsonl (one EvaluationRow per seed) |
| poker_dataset.jsonl | 64-seed training dataset (regenerate with make_dataset.py) |
| requirements.txt | deps Fireworks installs into the rollout container |
| validate_connection.py | optional connectivity/structured-output smoke test (needs a served model) |
| setup.sh | installs deps + creates the bin/python 3.11 shim |
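The scoring described for reward.py (mean instance reward plus an illegal-action penalty) could look roughly like the sketch below. The function name, the penalty value, and the choice to charge the penalty once per illegal action are all assumptions; the actual reward.py may differ.

```python
from statistics import fmean
from typing import Sequence

ILLEGAL_ACTION_PENALTY = 0.1  # hypothetical value, not taken from reward.py


def score_rollout(
    instance_rewards: Sequence[float],
    illegal_actions: int,
    penalty: float = ILLEGAL_ACTION_PENALTY,
) -> float:
    """Mean instance reward minus a per-illegal-action penalty (assumed form)."""
    mean_reward = fmean(instance_rewards) if instance_rewards else 0.0
    return mean_reward - penalty * illegal_actions
```

For example, two instances with rewards 1.0 and 0.0 and one illegal action would score 0.5 minus the penalty under these assumptions.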
Setup
./setup.sh # deps + bin/python shim
export FIREWORKS_API_KEY="fw_..." # https://fireworks.ai/account/api-keys
firectl set-api-key "$FIREWORKS_API_KEY"
export PATH="$PWD/bin:$PATH" # python3.11 shim first (gym spawns `python server.py`)
firectl (Go binary) install:
brew tap fw-ai/firectl && brew trust fw-ai/firectl && brew install fw-ai/firectl/firectl
Launch an RFT job
Two paths. The direct firectl path is what we actually used (it avoids a CLI bug — see Gotchas).
A) Upload the evaluator, then launch with firectl ← used
# 1. upload the evaluator (env + reward) so Fireworks builds the rollout container
eval-protocol create rft \
--evaluator test_poker_rft.py::test_poker_rft \
--dataset poker_dataset.jsonl --mcp-server server.py \
--training-config-base-model accounts/fireworks/models/qwen3-8b \
--dry-run --skip-validation -y # uploads evaluator; ignore the poller timeout
# 2. confirm evaluator is ACTIVE, upload dataset, create the job
firectl create dataset clbench-poker-qwen3-8b-data poker_dataset.jsonl
firectl create reinforcement-fine-tuning-job \
--base-model accounts/fireworks/models/qwen3-8b \
--dataset clbench-poker-qwen3-8b-data \
--evaluator accounts/<ACCOUNT>/evaluators/test-poker-rftpytest-poker-rft \
--output-model clbench-poker-qwen3-8b \
--epochs 2 --learning-rate 1e-6 --temperature 1.0 \
--max-output-tokens 1024 --response-candidates-count 8
B) Pure eval-protocol (once the poller bug is fixed upstream)
eval-protocol create rft --evaluator test_poker_rft.py::test_poker_rft \
--dataset poker_dataset.jsonl --mcp-server server.py \
--training-config-base-model accounts/fireworks/models/qwen3-8b \
--training-config-output-model clbench-poker-qwen3-8b \
--training-config-epochs 2 --training-config-learning-rate 1e-6 \
--inference-parameters-temperature 1.0 --inference-parameters-max-output-tokens 1024 \
--inference-parameters-response-candidates-count 8
Config mapping from the Prime TOML
rollouts_per_example=8 → --response-candidates-count 8 (GRPO group size) · temperature=1.0 ·
max_tokens=1024 → --max-output-tokens 1024 · enable_thinking=false baked into the gym prompt ·
guided_json → tool-call schema (free).
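The flag-level part of this mapping can be written down as a small lookup. The sketch below is illustrative only: the helper name and the input dict are hypothetical, while the TOML keys and firectl flags are the ones listed above. Settings without a direct flag (enable_thinking, guided_json) are deliberately left out.

```python
# Prime TOML key -> firectl flag, as listed in the mapping above.
TOML_TO_FLAG = {
    "rollouts_per_example": "--response-candidates-count",
    "temperature": "--temperature",
    "max_tokens": "--max-output-tokens",
}


def to_firectl_flags(sampling: dict) -> list[str]:
    """Translate the sampling keys that have a direct flag equivalent."""
    flags: list[str] = []
    for key, flag in TOML_TO_FLAG.items():
        if key in sampling:
            flags += [flag, str(sampling[key])]
    return flags


prime_sampling = {"rollouts_per_example": 8, "temperature": 1.0, "max_tokens": 1024}
flags = to_firectl_flags(prime_sampling)
```

The resulting list can be appended to the firectl create reinforcement-fine-tuning-job invocation shown in path A.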
Training runs
See RUNS.md for the full log. Summary:
| run | job id | base | output model | epochs | candidates | status |
|---|---|---|---|---|---|---|
| 1 | hj1u6nxa | qwen3-8b (free) | clbench-poker-qwen3-8b | 2 | 8 | launched 2026-06-25, RUNNING |
Monitor: firectl get reinforcement-fine-tuning-job <job-id> · dashboard: https://app.fireworks.ai/dashboard
Gotchas (hard-won)
- from __future__ import annotations breaks eval-protocol. It stringifies annotations, so FastMCP tool registration (issubclass("str", Context)) and the @evaluation_test signature validator both fail. Do not use it in poker_mcp.py or test_poker_rft.py.
- FastMCP (this version) crashes on Optional[int] tool params while locating the Context arg. poker_act uses a plain int = -1 sentinel instead.
- firectl needs firectl set-api-key; it does not read FIREWORKS_API_KEY automatically. (firectl whoami additionally needs OIDC signin; ignore it.)
- eval-protocol create rft has a poller bug: it polls …/evaluators/<file>.py::<func>, and the .py:: makes the URL malformed → HTTP 400 → false 10-minute "evaluator not ready" timeout. The evaluator is actually ACTIVE; launch via firectl.
- macOS python is often 2.7. The gym spawns python server.py, so bin/python must shim to the 3.11 interpreter that has the deps and be first on PATH.
- Rollouts run on Fireworks, in a container built from requirements.txt, so a local serverless deployment of the base model is not required for training (only for local pytest rollouts).
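The first gotcha can be reproduced with the standard library alone. The sketch below does not use FastMCP or eval-protocol; it uses int as a stand-in for the Context class and a hypothetical one-line tool function, and it execs two tiny source strings so the future import stays confined to the second one.

```python
import inspect

PLAIN = "def tool(amount: int = -1): ...\n"
DEFERRED = "from __future__ import annotations\n" + PLAIN


def amount_annotation(source: str):
    """Exec a tiny module and return the annotation recorded for `amount`."""
    namespace: dict = {}
    exec(source, namespace)
    return inspect.signature(namespace["tool"]).parameters["amount"].annotation


def is_int_subclass(annotation) -> bool:
    """Mimic a registration-time issubclass() check on a parameter annotation."""
    try:
        return issubclass(annotation, int)
    except TypeError:  # raised when the annotation is a string, not a class
        return False


plain = amount_annotation(PLAIN)        # the class int
deferred = amount_annotation(DEFERRED)  # the string "int"
```

With the future import, the annotation is the string "int" rather than the class, so any issubclass() check that receives it raw raises TypeError. This is the failure mode described above for tool registration and the signature validator.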