Calibration Runbook: fitting and re-adjusting item difficulty

This is the operational guide for running the offline Rasch calibration: how to check you have enough data, run the fit, read the result, and re-adjust difficulties as more responses accumulate. It is the hands-on companion to the concept page and the code under services/calibration/.

Who runs this: an engineer or data lead with DB access. It is offline, never part of the app request path. A run reads the response pool, fits each item’s difficulty, and writes a new versioned snapshot. It never edits a published version in place.

The one-paragraph model

Students and questions sit on one shared ability/difficulty ruler (1PL Rasch). The app scores a child’s ability (θ) live in TypeScript; this job measures each question’s difficulty (δ) offline from real answers. Until a question has enough real answers it keeps an estimated difficulty (delta_prior, “Track B / provisional”); once it clears the threshold its measured difficulty is trusted (“Track A”). This runbook is how a question graduates from Track B to Track A, and how its measured difficulty is re-adjusted later.
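For reference, the shared ruler can be written down in a few lines. This is a minimal sketch of the standard 1PL Rasch response function; the function name p_correct is hypothetical and is not taken from services/calibration/ or the TypeScript scorer.

```python
import math


def p_correct(theta: float, delta: float) -> float:
    """1PL Rasch: probability that a student of ability theta
    answers an item of difficulty delta correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# Ability equal to difficulty gives even odds.
print(p_correct(0.0, 0.0))  # 0.5
# The same student is more likely to pass an easier item than a harder one.
print(p_correct(1.0, 0.0) > p_correct(1.0, 2.0))  # True
```

Only the difference θ − δ matters, which is why student ability and item difficulty can share one scale.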

When to run it

First real calibration: run it after a pilot cohort has generated enough responses (see the readiness check below). Before that, every item is provisional and the app works fine on estimated difficulties.
Re-calibration: run it at the end of each cohort / testing window, or whenever a meaningful batch of new responses has landed. Re-running re-measures every eligible item and mints a new version; running sessions keep reading the version they started on, so nothing shifts mid-session (V-6).
You do not run it continuously or on the live server. It is a deliberate batch step.

The trust threshold

An item’s measured difficulty is only trusted once it has ≥ 300 real responses. The threshold is CALIBRATION_TRUST_THRESHOLD, defined in both services/calibration/calibration/config.py and the server’s modules/engine/engine/two-track.ts; the two values must match. Items below 300 are left untouched on delta_prior. This is the “two-track” gate.
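The gate can be pictured as a single decision per item. The sketch below is an illustration only: the Item class, its field names, and effective_delta are hypothetical stand-ins, not the actual code in config.py or two-track.ts.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

CALIBRATION_TRUST_THRESHOLD = 300  # must agree between config.py and two-track.ts


@dataclass
class Item:
    # Hypothetical shape; the real item_bank row has more columns.
    delta_prior: float
    delta_calibrated: Optional[float] = None
    calibration_n: int = 0


def effective_delta(item: Item) -> Tuple[float, str]:
    """Return (difficulty to serve, track label)."""
    measured = item.delta_calibrated is not None
    if measured and item.calibration_n >= CALIBRATION_TRUST_THRESHOLD:
        return item.delta_calibrated, "A"  # measured difficulty is trusted
    return item.delta_prior, "B"           # provisional: keep the estimate


print(effective_delta(Item(delta_prior=0.4, delta_calibrated=0.9, calibration_n=300)))  # (0.9, 'A')
print(effective_delta(Item(delta_prior=0.4, delta_calibrated=0.9, calibration_n=299)))  # (0.4, 'B')
```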

Step 0: Setup (once per machine)


cd services/calibration
python3 -m venv .venv && source .venv/bin/activate      # .venv is gitignored
pip install -r requirements.txt                         # pinned girth/numpy/scipy, V-6
export DATABASE_URL="$(grep '^DATABASE_URL=' ../../apps/server/.env | cut -d= -f2- | tr -d '\"')"

DATABASE_URL must point at the database that holds the real responses (the pilot/prod DB). The calibration tables (calibration_run_log, calibration_version_snapshot, item_calibration_snapshot) are owned by the server’s Prisma migrations, so they already exist there. If you are on a fresh DB, apply pnpm --filter server prisma migrate deploy first.

Step 1: Validate the calibration code works

Before trusting a real run, confirm girth recovers known difficulties:


python -m pytest tests/ -q

This simulates answers from items whose difficulty is known, fits them, and asserts the recovery (correlation ≥ 0.95, MAE < 0.15) and determinism (two runs identical). If this fails, do not run a real calibration: the environment or a dependency version is off.

The same command also runs an end-to-end test that builds a throwaway database, seeds synthetic responses, and drives the full load, fit, and write-back loop. It asserts the two-track gate (an item below 300 responses stays provisional) and a byte-identical re-run. That test needs Postgres and skips itself automatically when none is reachable, so the recovery check above still runs everywhere.
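To make the recovery assertions concrete, here is a minimal sketch of the two checks (correlation and mean absolute error) plus a seeded response simulator. It does not call girth and is not the code under tests/; all function names are hypothetical, and the "fitted" values in the example are made up for illustration.

```python
import math
import random
from typing import List


def simulate_responses(thetas: List[float], deltas: List[float], seed: int) -> List[List[int]]:
    """Draw a students x items 0/1 matrix from the Rasch model with a fixed seed."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < 1.0 / (1.0 + math.exp(-(t - d))) else 0 for d in deltas]
        for t in thetas
    ]


def pearson(xs: List[float], ys: List[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mae(xs: List[float], ys: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def recovery_ok(true: List[float], fitted: List[float],
                min_corr: float = 0.95, max_mae: float = 0.15) -> bool:
    """Thresholds mirror the ones quoted in this runbook."""
    return pearson(true, fitted) >= min_corr and mae(true, fitted) < max_mae


true_deltas = [-1.0, 0.0, 1.0]
print(recovery_ok(true_deltas, [-0.9, 0.05, 1.0]))  # True: close in order and in value
print(recovery_ok(true_deltas, [-0.5, 0.0, 0.5]))   # False: right order, but MAE too large
```

The second example shows why both checks are needed: a fit can rank items perfectly and still be off in scale.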

Step 2: Check you have the data you need

A calibration is only worth running if items have crossed 300 responses. Run this against the same DATABASE_URL. It counts every (student, item) response across both sources the loader uses, practice_response (practice) and diagnostic_session.inputResponses (diagnostics), and reports how many items are calibration-ready per sub-skill:


WITH responses AS (
  SELECT pr."itemId" AS item_id, pr."subSkillId" AS sub_skill_id
  FROM practice_response pr
  JOIN practice_queue pq ON pq.id = pr."queueId"
  UNION ALL
  SELECT resp->>'itemId' AS item_id, ds."subSkillId" AS sub_skill_id
  FROM diagnostic_session ds
  CROSS JOIN LATERAL jsonb_array_elements(ds."inputResponses") AS resp
),
per_item AS (
  SELECT item_id, sub_skill_id, count(*) AS n
  FROM responses
  GROUP BY item_id, sub_skill_id
)
SELECT sub_skill_id,
       count(*) FILTER (WHERE n >= 300) AS items_ready,   -- will be calibrated (Track A)
       count(*)                          AS items_seen,    -- have any responses
       max(n)                            AS most_responses_on_one_item
FROM per_item
GROUP BY sub_skill_id
ORDER BY items_ready DESC;

Read it like this:

items_ready = 0 everywhere → not enough data yet; a real run would write an empty log and calibrate nothing. Wait for more responses.
A sub-skill needs at least 2 ready items (MIN_ITEMS_PER_CELL) for its cell to fit; a lone ready item is skipped.
Healthy → several sub-skills with a handful of items_ready each.
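The reading rules above amount to two filters: the per-item threshold, then the per-cell minimum. A minimal sketch, assuming the per-item counts have already been pulled into a nested dict (the function name and input shape are hypothetical):

```python
from typing import Dict, List

CALIBRATION_TRUST_THRESHOLD = 300
MIN_ITEMS_PER_CELL = 2


def fittable_cells(counts: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """Map sub-skill -> item ids that would actually be fitted.

    counts maps sub_skill_id -> {item_id: number of responses}.
    """
    cells: Dict[str, List[str]] = {}
    for sub_skill, items in counts.items():
        ready = sorted(i for i, n in items.items() if n >= CALIBRATION_TRUST_THRESHOLD)
        if len(ready) >= MIN_ITEMS_PER_CELL:  # a lone ready item is skipped
            cells[sub_skill] = ready
    return cells


counts = {
    "fractions": {"q1": 320, "q2": 301, "q3": 40},
    "decimals": {"q4": 500},  # ready, but alone in its cell
}
print(fittable_cells(counts))  # {'fractions': ['q1', 'q2']}
```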

Step 3: Dry run (fit, but write nothing)

See exactly what would be calibrated and the difficulties it would assign, without touching the DB:


python -m calibration.calibrate --dry-run

It prints, per eligible item: δ (the fitted difficulty, mean-centred), se (its standard error), n (responses used), and infit (a fit statistic, see Step 5). Sanity-check the δ ordering matches intuition (questions you expect to be hard come out higher). If it reports 0 eligible items, Step 2 already told you why.
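Mean-centring is why the printed δ values sum to roughly zero within a fit. A small sketch of the idea, not the actual calibrate code (mean_centre is a hypothetical name):

```python
from typing import Dict


def mean_centre(deltas: Dict[str, float]) -> Dict[str, float]:
    """Shift difficulties so their mean is zero; ordering and gaps are unchanged."""
    mean = sum(deltas.values()) / len(deltas)
    return {item: d - mean for item, d in deltas.items()}


raw = {"q1": 1.0, "q2": 2.0, "q3": 3.0}
shifted = {item: d + 5.0 for item, d in raw.items()}  # same items, different origin

print(mean_centre(raw))                          # {'q1': -1.0, 'q2': 0.0, 'q3': 1.0}
print(mean_centre(raw) == mean_centre(shifted))  # True
```

Two fits that differ only by a constant offset centre to the same values, which is the sense in which the origin of the scale is arbitrary.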

Step 4: Run the real calibration


python -m calibration.calibrate --triggered-by cohort_end

In one transaction this:

inserts a calibration_run_log row (the audit header),
mints the next calibration_version_snapshot (a new monotonic version number),
writes one item_calibration_snapshot row per calibrated item (frozen, never updated),
updates each calibrated item’s item_bank row: deltaCalibrated, calibrationN, calibrationVersion = <new>, and calibrationProvisional = false.

Items below 300 are not touched; they stay on delta_prior / Track B. A run with no eligible items writes a clean status = completed_empty log and stops (it never mints a noise version).
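The write-back can be sketched with an in-memory SQLite database. This is an illustration under simplifying assumptions: the real job targets Postgres, and the table and column names below are shortened, hypothetical versions of the real ones. What it tries to show is the all-or-nothing transaction, the monotonic version, and the empty-run case.

```python
import sqlite3
from typing import Dict, Optional, Tuple

SCHEMA = """
CREATE TABLE run_log (id INTEGER PRIMARY KEY, status TEXT, version INTEGER);
CREATE TABLE version_snapshot (version INTEGER PRIMARY KEY);
CREATE TABLE item_snapshot (version INTEGER, item_id TEXT, delta REAL, n INTEGER,
                            PRIMARY KEY (version, item_id));
CREATE TABLE item_bank (item_id TEXT PRIMARY KEY, delta_prior REAL,
                        delta_calibrated REAL, calibration_n INTEGER,
                        calibration_version INTEGER, provisional INTEGER DEFAULT 1);
"""


def write_back(conn: sqlite3.Connection,
               fitted: Dict[str, Tuple[float, int]]) -> Optional[int]:
    """fitted maps item_id -> (delta, n). Returns the new version, or None if empty."""
    with conn:  # one transaction: commit on success, roll back on any exception
        if not fitted:
            conn.execute("INSERT INTO run_log (status, version) VALUES ('completed_empty', NULL)")
            return None
        (prev,) = conn.execute("SELECT COALESCE(MAX(version), 0) FROM version_snapshot").fetchone()
        version = prev + 1
        conn.execute("INSERT INTO run_log (status, version) VALUES ('completed', ?)", (version,))
        conn.execute("INSERT INTO version_snapshot (version) VALUES (?)", (version,))
        for item_id, (delta, n) in fitted.items():
            # Snapshot rows are only ever inserted, never updated.
            conn.execute("INSERT INTO item_snapshot VALUES (?, ?, ?, ?)",
                         (version, item_id, delta, n))
            conn.execute(
                "UPDATE item_bank SET delta_calibrated = ?, calibration_n = ?, "
                "calibration_version = ?, provisional = 0 WHERE item_id = ?",
                (delta, n, version, item_id),
            )
        return version


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO item_bank (item_id, delta_prior) VALUES (?, ?)",
                 [("q1", 0.0), ("q2", 0.5), ("q3", 1.0)])
conn.commit()

print(write_back(conn, {}))                                     # None: empty run, no version minted
print(write_back(conn, {"q1": (-0.4, 320), "q2": (0.4, 305)}))  # 1
```

In this sketch q3 is never written to, so it stays provisional on its prior.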

Step 5: Review the result

Inspect the run and its fit quality:


-- the run header
SELECT id, calibration_version, status, items_calibrated, response_count,
       fitting_method, config_version, started_at, completed_at
FROM calibration_run_log ORDER BY id DESC LIMIT 5;
 
-- per-item difficulties + fit statistics for the version you just minted
SELECT item_id, sub_skill_id, delta_calibrated, delta_se, calibration_n,
       infit_mnsq, outfit_mnsq
FROM item_calibration_snapshot
WHERE calibration_version = (SELECT max(calibration_version) FROM calibration_version_snapshot)
ORDER BY sub_skill_id, delta_calibrated;

What to check:

status = completed and items_calibrated matches your readiness count.
infit_mnsq / outfit_mnsq are item-fit statistics. ~1.0 is ideal; the usual productive range is 0.5-1.5. A value well above 1.5 means the item behaves erratically (people who should pass it fail, and vice versa). Flag it for content review; don’t silently ship it. Below ~0.5 means it’s almost too predictable (often a near-duplicate or a giveaway).
delta_se large on an item → its difficulty is still shaky; it cleared 300 but the responses were lopsided. Watch it next round.
delta_calibrated is mean-centred per fit (Rasch difficulty is only defined up to a constant), so don’t compare absolute values across sub-skills; compare ordering within a sub-skill.
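The fit-statistic check can be turned into a quick triage pass over the snapshot rows. A minimal sketch; the function name and labels are hypothetical, and the cut-offs are the approximate 0.5-1.5 range quoted above, not hard limits enforced by the calibration code.

```python
def classify_fit(mnsq: float, low: float = 0.5, high: float = 1.5) -> str:
    """Bucket an infit/outfit mean-square value for review."""
    if mnsq > high:
        return "misfit"   # erratic responses: send to content review
    if mnsq < low:
        return "overfit"  # too predictable: possible near-duplicate or giveaway
    return "ok"


rows = {"q1": 1.02, "q2": 1.84, "q3": 0.31}
flagged = {item: classify_fit(v) for item, v in rows.items() if classify_fit(v) != "ok"}
print(flagged)  # {'q2': 'misfit', 'q3': 'overfit'}
```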

Step 6: Confirm it took effect in the app

Nothing else needs deploying. The server reads the latest published version automatically (getCurrentCalibrationVersion() → MAX(calibration_version)), so the next diagnostic/practice session pins your new version and serves the measured δ for the calibrated items (Track A); everything else stays on delta_prior (Track B). A served item’s audit row carries calibrationProvisional = false once it’s on a measured δ. Sessions already in flight keep their pinned version; they don’t jump mid-stream.

Re-adjusting difficulty later (the recurring loop)

When more responses arrive, just run it again (Step 2 → Step 4). Key facts:

A re-run re-measures every eligible item from the full response pool and writes a new version (e.g. 2, 3, …). It never overwrites version 1’s snapshot rows; they stay for audit and replay.
An item already on Track A gets its δ re-adjusted to the new measurement; an item that has newly crossed 300 graduates from Track B to Track A for the first time.
item_bank always reflects the latest version’s δ; the per-version history lives in item_calibration_snapshot.
Recommended cadence: once per cohort / testing window. Avoid re-running on tiny increments; let a meaningful batch accumulate so the re-measurement actually moves.
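After a re-run it can help to see what actually moved. The sketch below compares two versions' difficulties, assuming each has been loaded from item_calibration_snapshot into a dict of item id → δ; the function name is hypothetical. Because each fit is mean-centred, small shifts on every item can come from the centring alone, so read the output as a rough guide.

```python
from typing import Dict, List, Tuple


def compare_versions(prev: Dict[str, float],
                     new: Dict[str, float]) -> Tuple[List[str], Dict[str, float]]:
    """Return (items newly on Track A, change in δ for items present in both versions)."""
    graduated = sorted(set(new) - set(prev))
    shifts = {item: new[item] - prev[item] for item in new if item in prev}
    return graduated, shifts


v1 = {"q1": 0.5, "q2": -0.5}
v2 = {"q1": 0.75, "q2": -0.5, "q3": 0.25}
print(compare_versions(v1, v2))  # (['q3'], {'q1': 0.25, 'q2': 0.0})
```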

Rolling back / pinning a version

Versions are append-only; you never delete or edit one. To revert the app to an earlier calibration, you do not undo anything. Instead, a session can be pinned to a specific calibrationVersion (the two-track gate only trusts a measured δ when calibrationVersion ≤ the session's pinned version). If a bad calibration shipped, the safe move is to mint a corrected newer version from cleaned data; the old rows remain for the audit trail.
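The pinning rule stated above can be sketched as one comparison. This is an illustration, not the two-track.ts implementation: the function and parameter names are hypothetical, and falling back to delta_prior when the item's version is newer than the pin is an assumption about what "not trusted" means here.

```python
from typing import Optional, Tuple


def delta_for_session(delta_prior: float,
                      delta_calibrated: Optional[float],
                      calibration_version: Optional[int],
                      pinned_version: int) -> Tuple[float, bool]:
    """Return (difficulty to serve, provisional flag) for a pinned session."""
    trusted = (
        delta_calibrated is not None
        and calibration_version is not None
        and calibration_version <= pinned_version  # the rule quoted in this section
    )
    if trusted:
        return delta_calibrated, False
    return delta_prior, True  # assumption: untrusted items fall back to the prior


# Item calibrated in version 2; one session pinned to 2, another still pinned to 1.
print(delta_for_session(0.4, 0.9, 2, pinned_version=2))  # (0.9, False)
print(delta_for_session(0.4, 0.9, 2, pinned_version=1))  # (0.4, True)
```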

Determinism (V-6)

The same response matrix + the same pinned dependency versions (requirements.txt) + the same CONFIG_VERSION produce byte-identical difficulties, because girth’s MML fit uses no randomness. If you change the convergence config or bump girth/numpy, bump CONFIG_VERSION in config.py (it is stamped on every run-log row) so a difficulty change is traceable to a config change, not a data change.
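One way to check byte-identity between two runs by hand is to hash a canonical serialisation of the results. This is a minimal sketch and not part of the calibration package; fingerprint is a hypothetical helper and the config version strings are made up.

```python
import hashlib
import json
from typing import Dict


def fingerprint(deltas: Dict[str, float], config_version: str) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, fixed separators)."""
    payload = json.dumps(
        {"config_version": config_version, "deltas": deltas},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


run_a = {"q1": -0.4, "q2": 0.4}
run_b = {"q2": 0.4, "q1": -0.4}  # same values, different insertion order

print(fingerprint(run_a, "cfg-1") == fingerprint(run_b, "cfg-1"))  # True
print(fingerprint(run_a, "cfg-1") == fingerprint(run_a, "cfg-2"))  # False
```

Including the config version in the hash means a changed fingerprint with unchanged data points at the configuration.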

Troubleshooting

Symptom	Cause / fix
`0 eligible items`	No item has ≥ 300 responses yet (Step 2), or eligible items are lone in their sub-skill (need ≥ 2 per cell). Wait for more data.
`pytest` recovery fails	A dependency version drifted. Reinstall from the pinned `requirements.txt`, and do not run a real calibration until it passes.
`relation ... does not exist`	The calibration tables aren’t migrated into this DB. Run `pnpm --filter server prisma migrate deploy` first.
`infit_mnsq` ≫ 1.5 on an item	The item misfits the model. Send it to content review, and consider excluding it before the next run rather than serving a misleading δ.
A re-run didn’t change an item’s δ	Expected if no new responses landed for it since the last version; the fit is deterministic.