Calibration Runbook: fitting and re-adjusting item difficulty
This is the operational guide for running the offline Rasch calibration: how to check you
have enough data, run the fit, read the result, and re-adjust difficulties as more responses
accumulate. It is the hands-on companion to the concept page and the code under
services/calibration/.
Who runs this: an engineer or data lead with DB access. It is offline, never part of the app request path. A run reads the response pool, fits each item’s difficulty, and writes a new versioned snapshot. It never edits a published version in place.
The one-paragraph model
Students and questions sit on one shared ability/difficulty ruler (1PL Rasch). The app scores a
child’s ability (θ) live in TypeScript; this job measures each question’s difficulty (δ)
offline from real answers. Until a question has enough real answers it keeps an estimated
difficulty (delta_prior, “Track B / provisional”); once it clears the threshold its measured
difficulty is trusted (“Track A”). This runbook is how a question graduates from Track B to
Track A, and how its measured difficulty is re-adjusted later.
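As a minimal sketch of the shared ruler, the standard 1PL Rasch response function is shown below. The function name `p_correct` is illustrative only and is not taken from the codebase.

```python
import math


def p_correct(theta: float, delta: float) -> float:
    """Standard 1PL Rasch model: probability that a student with ability
    theta answers an item of difficulty delta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Ability equal to difficulty gives an even chance; a harder item lowers it.
print(p_correct(0.0, 0.0))
print(p_correct(0.0, 1.0))
```

Because only the difference θ − δ enters the formula, shifting every ability and difficulty by the same constant leaves all probabilities unchanged.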
When to run it
- First real calibration: after a pilot cohort has generated enough responses (see the readiness check below). Before that, every item is provisional and the app works fine on estimated difficulties.
- Re-calibration: at the end of each cohort / testing window, or whenever a meaningful batch of new responses has landed. Re-running re-measures every eligible item and mints a new version; running sessions keep reading the version they started on, so nothing shifts mid-session (V-6).
- You do not run it continuously or on the live server. It is a deliberate batch step.
The trust threshold
An item’s measured difficulty is only trusted once it has ≥ 300 real responses
(CALIBRATION_TRUST_THRESHOLD, defined in both services/calibration/calibration/config.py and the
server’s modules/engine/engine/two-track.ts; the two values must match). Items below 300 are left
untouched on delta_prior. This is the “two-track” gate.
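The gate can be sketched as a small pure function. This is an illustration under assumptions, not the contents of config.py or two-track.ts: the `Item` shape and the name `served_delta` are hypothetical, and only the threshold value comes from this runbook.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

CALIBRATION_TRUST_THRESHOLD = 300  # value stated in this runbook


@dataclass
class Item:
    """Hypothetical item shape, for illustration only."""
    delta_prior: float                 # estimated difficulty (Track B)
    delta_calibrated: Optional[float]  # measured difficulty, if any
    calibration_n: int                 # real responses behind the measurement


def served_delta(item: Item) -> Tuple[float, bool]:
    """Return (difficulty to use, is_provisional)."""
    trusted = (
        item.delta_calibrated is not None
        and item.calibration_n >= CALIBRATION_TRUST_THRESHOLD
    )
    if trusted:
        return item.delta_calibrated, False  # Track A
    return item.delta_prior, True            # Track B


print(served_delta(Item(0.4, None, 120)))
print(served_delta(Item(0.4, -0.2, 300)))
```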
Step 0: Setup (once per machine)
cd services/calibration
python3 -m venv .venv && source .venv/bin/activate # .venv is gitignored
pip install -r requirements.txt # pinned girth/numpy/scipy, V-6
export DATABASE_URL="$(grep '^DATABASE_URL=' ../../apps/server/.env | cut -d= -f2- | tr -d '\"')"
DATABASE_URL must point at the database that holds the real responses (the pilot/prod DB). The
calibration tables (calibration_run_log, calibration_version_snapshot, item_calibration_snapshot)
are owned by the server’s Prisma migrations, so they already exist there. If you are on a fresh DB,
apply pnpm --filter server prisma migrate deploy first.
Step 1: Validate the calibration code works
Before trusting a real run, confirm girth recovers known difficulties:
python -m pytest tests/ -q
This simulates answers from items whose difficulty is known, fits them, and asserts the recovery (correlation ≥ 0.95, MAE < 0.15) and determinism (two runs identical). If this fails, do not run a real calibration: the environment or a dependency version is off.
The same command also runs an end-to-end test that builds a throwaway database, seeds synthetic responses, and drives the full load, fit, and write-back loop. It asserts the two-track gate (an item below 300 responses stays provisional) and a byte-identical re-run. That test needs Postgres and skips itself automatically when none is reachable, so the recovery check above still runs everywhere.
Step 2: Check you have the data you need
A calibration is only worth running if items have crossed 300 responses. Run this against the same
DATABASE_URL. It counts every (student, item) response across both sources the loader uses,
practice_response (practice) and diagnostic_session.inputResponses (diagnostics), and reports
how many items are calibration-ready per sub-skill:
WITH responses AS (
  SELECT pr."itemId" AS item_id, pr."subSkillId" AS sub_skill_id
  FROM practice_response pr
  JOIN practice_queue pq ON pq.id = pr."queueId"
  UNION ALL
  SELECT resp->>'itemId' AS item_id, ds."subSkillId" AS sub_skill_id
  FROM diagnostic_session ds
  CROSS JOIN LATERAL jsonb_array_elements(ds."inputResponses") AS resp
),
per_item AS (
  SELECT item_id, sub_skill_id, count(*) AS n
  FROM responses
  GROUP BY item_id, sub_skill_id
)
SELECT sub_skill_id,
       count(*) FILTER (WHERE n >= 300) AS items_ready,  -- will be calibrated (Track A)
       count(*) AS items_seen,                           -- have any responses
       max(n) AS most_responses_on_one_item
FROM per_item
GROUP BY sub_skill_id
ORDER BY items_ready DESC;
Read it like this:
- items_ready = 0 everywhere → not enough data yet; a real run would write an empty log and calibrate nothing. Wait for more responses.
- A sub-skill needs at least 2 ready items (MIN_ITEMS_PER_CELL) for its cell to fit; a lone ready item is skipped.
- Healthy → several sub-skills with a handful of items_ready each.
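The two rules above (≥ 300 responses per item, ≥ 2 ready items per sub-skill) can be sketched in a few lines. The input shape and the name `ready_cells` are assumptions for illustration; the real loader may structure this differently.

```python
from collections import defaultdict

CALIBRATION_TRUST_THRESHOLD = 300  # from this runbook
MIN_ITEMS_PER_CELL = 2             # from this runbook


def ready_cells(counts):
    """counts: iterable of (sub_skill_id, item_id, n_responses).

    Returns {sub_skill_id: [item_id, ...]} for cells that would be fitted.
    """
    ready = defaultdict(list)
    for sub_skill_id, item_id, n in counts:
        if n >= CALIBRATION_TRUST_THRESHOLD:
            ready[sub_skill_id].append(item_id)
    # A lone ready item cannot be fitted on its own, so drop small cells.
    return {s: items for s, items in ready.items()
            if len(items) >= MIN_ITEMS_PER_CELL}


counts = [
    ("add", "i1", 320), ("add", "i2", 305), ("add", "i3", 40),
    ("sub", "i4", 900),  # lone ready item: its cell is skipped
]
print(ready_cells(counts))
```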
Step 3: Dry run (fit, but write nothing)
See exactly what would be calibrated and the difficulties it would assign, without touching the DB:
python -m calibration.calibrate --dry-run
It prints, per eligible item: δ (the fitted difficulty, mean-centred), se (its standard
error), n (responses used), and infit (a fit statistic, see Step 5). Sanity-check that the δ
ordering matches intuition (questions you expect to be hard come out higher). If it reports
0 eligible items, Step 2 already told you why.
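For readers unsure what "mean-centred" implies when reading the dry-run output, the following sketch shows the operation on made-up numbers. The function name and the item IDs are illustrative, not taken from the calibration code.

```python
from statistics import fmean


def mean_centre(deltas):
    """Shift difficulties so they average to zero within one fit."""
    m = fmean(list(deltas.values()))
    return {item_id: d - m for item_id, d in deltas.items()}


raw = {"i1": 1.0, "i2": 2.0, "i3": 3.0}
shifted = {item_id: d + 5.0 for item_id, d in raw.items()}

# Both inputs centre to the same values: only relative position is identified.
print(mean_centre(raw))
print(mean_centre(shifted))
```

This is why the sanity check is about ordering and spacing within one fit rather than the absolute numbers.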
Step 4: Run the real calibration
python -m calibration.calibrate --triggered-by cohort_end
In one transaction this:
- inserts a calibration_run_log row (the audit header),
- mints the next calibration_version_snapshot (a new monotonic version number),
- writes one item_calibration_snapshot row per calibrated item (frozen, never updated),
- updates each calibrated item’s item_bank row: deltaCalibrated, calibrationN, calibrationVersion = <new>, and calibrationProvisional = false.
Items below 300 are not touched; they stay on delta_prior / Track B. A run with no eligible
items writes a clean status = completed_empty log and stops (it never mints a noise version).
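The outcome rule described above can be sketched as a pure decision function. This is a hypothetical illustration: `plan_run` is not a function in the codebase, and "next version = current maximum + 1" is an assumption about how the monotonic version number is minted.

```python
from typing import Optional, Sequence


def plan_run(eligible_item_ids: Sequence[str],
             current_max_version: int) -> dict:
    """Decide the run status and whether a new version is minted."""
    if not eligible_item_ids:
        # Nothing to calibrate: log the run, but do not mint a version.
        return {"status": "completed_empty", "new_version": None}
    new_version: Optional[int] = current_max_version + 1  # assumed rule
    return {"status": "completed", "new_version": new_version}


print(plan_run([], 3))
print(plan_run(["i1", "i2"], 3))
```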
Step 5: Review the result
Inspect the run and its fit quality:
-- the run header
SELECT id, calibration_version, status, items_calibrated, response_count,
       fitting_method, config_version, started_at, completed_at
FROM calibration_run_log
ORDER BY id DESC
LIMIT 5;
-- per-item difficulties + fit statistics for the version you just minted
SELECT item_id, sub_skill_id, delta_calibrated, delta_se, calibration_n,
       infit_mnsq, outfit_mnsq
FROM item_calibration_snapshot
WHERE calibration_version = (SELECT max(calibration_version) FROM calibration_version_snapshot)
ORDER BY sub_skill_id, delta_calibrated;
What to check:
- status = completed and items_calibrated matches your readiness count.
- infit_mnsq / outfit_mnsq are item-fit statistics. ~1.0 is ideal; the usual productive range is 0.5–1.5. A value well above 1.5 means the item behaves erratically (people who should pass it fail and vice versa); flag it for content review, don’t silently ship it. Below ~0.5 means it’s almost too predictable (often a near-duplicate or a giveaway).
- delta_se large on an item → its difficulty is still shaky; it cleared 300 but the responses were lopsided. Watch it next round.
- delta_calibrated is mean-centred per fit (Rasch difficulty is only defined up to a constant), so don’t compare absolute values across sub-skills; compare ordering within a sub-skill.
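For intuition about the fit statistics, the sketch below uses the textbook Rasch definitions: outfit is the mean of squared standardised residuals, and infit is the information-weighted version. Whether the calibration job computes them in exactly this way is an assumption; the function names and the labels returned by `classify` are illustrative.

```python
import math


def p_correct(theta, delta):
    """1PL Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_fit(responses, thetas, delta):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    responses: 0/1 answers; thetas: the matching student abilities.
    """
    sq_residuals, variances = [], []
    for x, theta in zip(responses, thetas):
        p = p_correct(theta, delta)
        sq_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    infit = sum(sq_residuals) / sum(variances)
    outfit = sum(r / v for r, v in zip(sq_residuals, variances)) / len(sq_residuals)
    return infit, outfit


def classify(mnsq, low=0.5, high=1.5):
    """Apply the productive range quoted in this runbook."""
    if mnsq > high:
        return "misfit"    # erratic: send to content review
    if mnsq < low:
        return "overfit"   # too predictable
    return "productive"


print(item_fit([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], 0.0))
print(classify(1.8), classify(1.0), classify(0.3))
```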
Step 6: Confirm it took effect in the app
Nothing else needs deploying. The server reads the latest published version automatically
(getCurrentCalibrationVersion() → MAX(calibration_version)), so the next diagnostic/practice
session pins your new version and serves the measured δ for the calibrated items (Track A);
everything else stays on delta_prior (Track B). A served item’s audit row carries
calibrationProvisional = false once it’s on a measured δ. Sessions already in flight keep their
pinned version; they don’t jump mid-stream.
Re-adjusting difficulty later (the recurring loop)
When more responses arrive, just run it again (Step 2 → Step 4). Key facts:
- A re-run re-measures every eligible item from the full response pool and writes a new version (e.g. 2, 3, …). It never overwrites version 1’s snapshot rows; they stay for audit and replay.
- An item already on Track A gets its δ re-adjusted to the new measurement; an item that has newly crossed 300 graduates from Track B to Track A for the first time.
- item_bank always reflects the latest version’s δ; the per-version history lives in item_calibration_snapshot.
- Recommended cadence: once per cohort / testing window. Avoid re-running on tiny increments; let a meaningful batch accumulate so the re-measurement actually moves.
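When reviewing a re-run, it can help to diff two versions' difficulties. The sketch below assumes you have already loaded each version's item_calibration_snapshot rows into a plain `{item_id: delta}` mapping; the function name and the sample values are made up.

```python
def compare_versions(old, new):
    """Return (shifts, graduated).

    shifts: {item_id: new_delta - old_delta} for items in both versions.
    graduated: items measured in `new` but absent from `old`.
    """
    shifts = {item: new[item] - old[item] for item in new if item in old}
    graduated = sorted(item for item in new if item not in old)
    return shifts, graduated


v1 = {"i1": 0.5, "i2": -0.5}
v2 = {"i1": 0.75, "i2": -0.5, "i3": 0.0}
shifts, graduated = compare_versions(v1, v2)
print(shifts)
print(graduated)
```

Because δ is mean-centred per fit, a newly graduated item can move the centre of its cell, so small shifts on the other items may partly reflect re-centring rather than a change in the items themselves.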
Rolling back / pinning a version
Versions are append-only; you never delete or edit one. Reverting the app to an earlier
calibration does not require undoing anything: a session can be pinned to a specific
calibrationVersion (the two-track gate only trusts a measured δ when
calibrationVersion ≤ the session's pinned version). If a bad calibration shipped, the safe move
is to mint a corrected newer version from cleaned data; the old rows remain for the audit trail.
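The pinning rule in the parenthesis above can be sketched as follows. The `ItemRow` shape and the function name are hypothetical; only the comparison calibrationVersion ≤ pinned version comes from this runbook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemRow:
    """Hypothetical view of an item_bank row, for illustration only."""
    delta_prior: float
    delta_calibrated: Optional[float]
    calibration_version: Optional[int]


def delta_for_session(item: ItemRow, pinned_version: int) -> float:
    """Trust the measured difficulty only if it was minted at or before
    the version this session is pinned to; otherwise use the prior."""
    if (
        item.delta_calibrated is not None
        and item.calibration_version is not None
        and item.calibration_version <= pinned_version
    ):
        return item.delta_calibrated
    return item.delta_prior


item = ItemRow(delta_prior=0.4, delta_calibrated=-0.2, calibration_version=3)
print(delta_for_session(item, pinned_version=3))
print(delta_for_session(item, pinned_version=2))
```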
Determinism (V-6)
The same response matrix + the same pinned dependency versions (requirements.txt) + the same
CONFIG_VERSION produce byte-identical difficulties, because girth’s MML fit uses no randomness. If
you change the convergence config or bump girth/numpy, bump CONFIG_VERSION in config.py (it is
stamped on every run-log row) so a difficulty change is traceable to a config change, not a data
change.
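One way to check byte-identical output between two runs is to fingerprint the fitted difficulties together with the config version. This is a sketch, not the project's test: the function name, the payload layout, and the sample config version strings are assumptions.

```python
import hashlib
import json


def fingerprint(deltas, config_version):
    """SHA-256 over a canonical JSON encoding of the fitted difficulties.

    sort_keys makes the digest independent of dict insertion order.
    """
    payload = json.dumps(
        {"config_version": config_version, "deltas": deltas},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = {"i1": -0.25, "i2": 0.25}
run_b = {"i2": 0.25, "i1": -0.25}  # same values, different insertion order

print(fingerprint(run_a, "cfg-1") == fingerprint(run_b, "cfg-1"))
print(fingerprint(run_a, "cfg-1") == fingerprint(run_a, "cfg-2"))
```

Including the config version in the digest means a config bump shows up as a different fingerprint even when the difficulties happen to match.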
Troubleshooting
| Symptom | Cause / fix |
|---|---|
| 0 eligible items | No item has ≥ 300 responses yet (Step 2), or eligible items are lone in their sub-skill (need ≥ 2 per cell). Wait for more data. |
| pytest recovery fails | A dependency version drifted; reinstall from the pinned requirements.txt and do not run a real calibration until it passes. |
| relation ... does not exist | The calibration tables aren’t migrated into this DB; run pnpm --filter server prisma migrate deploy first. |
| infit_mnsq ≫ 1.5 on an item | The item misfits the model; send it to content review and consider excluding it before the next run rather than serving a misleading δ. |
| A re-run didn’t change an item’s δ | Expected if no new responses landed for it since the last version; the fit is deterministic. |