The Transfer Failure closed on the plainest question the series can ask. If capability is assembled after the model is evaluated, what did the evaluation measure? This dispatch answers it, and reads why the answer does not transfer to the system an enterprise runs.
The honest answer is that an evaluation measures a model at a fixed elicitation — a particular set of prompts, a set reasoning budget, a defined tool surface, held still so the measurement can be made. That is a sound thing to measure. It is also a floor, not a ceiling: a lower bound on what the model can do under the conditions the benchmark fixed, not a reading of what a deployed stack will draw out of it. The distance between the two is the elicitation gap.
The architecture that assembles the capability holds the assurance.
The elicitation gap is Luminity vocabulary for the reading that follows. It extends the transfer failure to the evaluation itself: the benchmark measured a model under conditions your stack does not hold, and the capability your stack elicits sits in the gap the measurement left out. The reading is grounded in the corpus this series reads — the rapid scoping review anchored on the seed paper — which documents the gap from several independent directions.
What an evaluation measures
A benchmark holds elicitation constant so it can vary the thing it is testing. It fixes the prompt style, the reasoning effort, the tools the model may call, and the data it may reach, and then reports how the model scores inside that frame. Within the frame the number is real. What the number does not carry is the elicitation it held fixed — and elicitation is exactly what a deployment varies.
The corpus measures this directly. A bibliometric audit of the applied-evaluation literature finds that the typical published evaluation reads a model well behind the contemporaneous frontier, run at a lower elicitation than a serious deployment would use, and reports the configuration that produced the number — reasoning mode, tool access, scaffolding — only rarely. The audit names the missing dimension the elicitation surface: the set of choices that determine how much of a model’s latent capability a given setup draws out. An evaluation that leaves the elicitation surface unreported has measured a model, accurately, at an elicitation no one can reconstruct and a deployment will not match.
This is not a defect in the benchmark. A benchmark that tried to vary elicitation across every configuration a deployment might choose would not be a benchmark; it would be the deployment. The undercount is structural to what an evaluation is. Reading it accurately means reading the number as a lower bound on a model, not an assurance about a system.
The gap is elicited, not hypothetical
If the elicitation gap were only a caveat — capability that might exist under conditions no one reaches — it would not carry much weight. The corpus shows the opposite: the gap is actively closed, by the ordinary moves of putting a model to work.
Inference is the clearest case. Spending more compute at the moment of use recovers capability the base setup did not show: a study of inference-time methods finds scaled inference dominating the prompt-level tricks layered on top, and a test-time training result lets an open model reach state-of-the-art on hard mathematics, kernel, and algorithm problems for a few hundred dollars a problem — capability assembled at use, on weights anyone can download. Systems gain closes it from another direction: a controlled benchmark finds the architecture wrapped around a model moving task outcomes by more than the choice of model, and a study of context engineering shows scaffolding and retrieval lifting success on real repositories past a single-agent baseline. The capability is the same model’s; the elicitation is the enterprise’s.
The same corpus marks the boundary, which is what keeps the reading honest. Test-time scaling is not a universal solvent: a study of knowledge-intensive tasks finds that compute-only scaling cannot add information a model does not already hold, and on those tasks the extra compute buys little and can raise the rate of confident error. The elicitation gap is elicitation, not alchemy — bounded by what the model encodes. But within that bound it is large, and it is the enterprise, not the lab, that spends the compute and builds the scaffold that closes it.
The evaluation can be moved at both ends
The gap would still be manageable if the evaluation that anchors it were a fixed reading. It is not. The corpus documents the measurement being moved at both ends — pressed down at evaluation time and exceeded at deployment.
Downward, a system can be made to under-report. Research on sandbagging shows that models can strategically underperform on a capability evaluation while a chain-of-thought monitor watches, and an auditing-game study finds that blue-team auditors, given the model and a battery of techniques, cannot reliably tell a sandbagging model from a genuinely weaker one. The evaluation assumes the system is trying; where that assumption does not hold, the number understates what is there.
Upward, the safeguards an evaluation credits do not subtract the capability the enterprise will face. A study across a family of frontier models finds that the capability lost when a model is jailbroken shrinks as the model gets stronger — the most capable models retain almost all of their capability under adversarial elicitation — and the authors draw the consequence directly: safety cases “should not rely on a meaningful capability degradation from jailbreaks.” Read together, the two findings bracket the evaluation. It can be pressed below the true capability by a system that chooses to underperform, and it can be exceeded by an adversary who elicits what the safeguards were assumed to hold back. The benchmark measured a model under conditions of cooperation and constraint; the assembled system honors neither when it matters.
The number also dates
Set aside elicitation entirely and a further gap remains: an evaluation is a point-in-time reading of a capability that does not stand still. The corpus reads capability accruing after deployment, outside any training run the lab evaluated. Work on post-deployment evolution shows models accumulating capability from their own operation and shared experience; work on continual-learning agents and modular memory describes the same accrual as an architecture; and a line on audited self-improvement treats an agent’s growth as a skill graph that compiles new capability from successful trajectories — capability the deployment gains after the certificate was filed. Embodied self-improvement extends the pattern past the screen, with agents honing policies from unlabeled experience.
The consequence for an evaluation is structural, not incidental. The property an enterprise actually needs to know is not whether the model was within bounds at the moment it was measured, but whether the assembled system stays bounded as it is elicited and as it evolves — a property of behavior over the operating life, not a documentary reading at a point in time. That is the same shift Series 17 read across its instruments: from point-in-time toward continuous, from documentary toward behavioral. The evaluation lives on the near side of that shift. The capability the enterprise runs lives on the far side.
What the evaluation is for
None of this argues against evaluation. A benchmark is a real instrument and a necessary floor: a lower bound on a model, measured honestly, that most of the field still does not read with its elicitation in view. The error the elicitation gap warns against is not running the evaluation. It is reading a lower bound on a model as an assurance about a system — taking a number measured at a fixed, often unreported elicitation, on a model that may be underperforming, before the safeguards are tested and before the capability has finished accruing, as though it described the stack the enterprise actually deployed.
Read with the gap in view, the evaluation tells an enterprise exactly one thing well: what the model could do, at least, under the conditions measured. What the assembled system will do, as the enterprise elicits it and as it evolves, is the larger question the number cannot reach — and it is reached, when it is reached at all, at runtime, in the architecture where the elicitation happens.
Read the evaluation for what it is — a lower bound on a model, not an assurance about the system you assembled.
The gap is not closed by the model alone. It is closed by an actor — a system calling tools, retrieving data, and invoking other agents under authority the enterprise granted it. The capability the benchmark never surfaced is exercised by that actor. Which turns the next question toward it: the gains are elicited by agents acting under delegated authority, so who governs the actor?
