Two years of enterprise deployment have settled the easy question.
Large language models can do a great deal of legal work, and they are getting better at it quickly. The question that actually governs adoption was never capability. It is whether the work can be defended.
A litigator can use a model that is right ninety percent of the time. A litigator cannot use a model that is right ninety percent of the time and cannot tell which ninety percent, cannot show the reasoning, and cannot produce a record of how it arrived at the answer. That is a different axis entirely, and the distinction carries the whole series.
Capability asks whether the system can produce the right answer. Assurance asks whether the system can show the answer is right, trace how it was reached, and stand behind that account afterward.
A board, an auditor, and a regulator can only see the second axis. They do not buy benchmark scores. They buy evidence.
The risk is real, and it belongs to the model
Start with what is no longer in dispute. When unaided models are asked specific, verifiable questions about US case law, they fabricate at rates that would end a career — between fifty-eight and eighty-eight percent in the foundational Stanford study, which also found that models frequently cannot tell when they are hallucinating and fail to correct a user’s mistaken legal premise [1]. This is not an end-user training problem. It is a property of the unaided model.
The pattern recurs wherever the task has consequences. In case-based argument generation, models manufacture arguments even when the factual basis for one is absent; the inability to decline is itself the failure [2]. The risk surface is wider than fabrication, too: recent work describes both the prospect of systems that appear compliant under evaluation and defect once oversight weakens [8] and the discovery of latent vulnerabilities in legal frameworks themselves [7]. Read together, these establish a single fact: the legal-AI risk is real, intrinsic to the unaided model, and not self-correcting.
A better model is not a defensible system
The intuitive response is to wait for stronger models. The evidence does not support it. Capability and reliability are improving on different curves, and the gap between them is exactly where enterprise risk lives.
Contract-review benchmarks now place models at roughly the level of a junior legal assistant — useful, and not yet trustworthy on the consequential edges, where capable models still miss the subtle embedded flaws that matter most [3, 4]. Dedicated legal models have scaled cleanly, demonstrating that domain adaptation works [5, 6], without resolving the reliability questions above. A better model is a more capable junior associate. It is not, on its own, a defensible system. And if capability does not deliver assurance, assurance has to come from somewhere else.
The reframe
That somewhere else is structural. Read across the literature, the moves that actually produce defensible behavior are not properties of any model — they are properties of the architecture built around it. Generation gets grounded in authoritative sources. The consequential reasoning gets confined to a deterministic, logged substrate rather than left to probability. Output gets verified before it is surfaced. Compliance evidence gets produced as the system runs, mapped to the frameworks a US enterprise already answers to. And the whole runs over a substrate that keeps sensitive client data isolated.
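As a rough illustration only, the moves above can be sketched as a small pipeline. Everything here is hypothetical and invented for the example: the `SOURCES` store, the 30-day notice rule, the function names, and the `draft_citations` argument, which stands in for whatever a model proposes. The hash-chained log is a simple stand-in for compliance evidence, not a mapping to any specific framework, and data isolation is not modeled at all.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical stand-in for an authoritative source store (assumption).
SOURCES = {
    "case-001": "Hypothetical holding: notice must be given within 30 days.",
    "case-002": "Hypothetical holding: late notice is excused if no prejudice results.",
}


@dataclass
class AuditLog:
    """Append-only log; each entry's hash also covers the previous hash."""
    entries: list = field(default_factory=list)

    def record(self, step: str, payload: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = json.dumps({"step": step, "payload": payload}, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"step": step, "payload": payload, "hash": digest})


def ground(query_terms, log):
    """Grounding: keep only sources that mention at least one query term."""
    hits = sorted(
        key for key, text in SOURCES.items()
        if any(term.lower() in text.lower() for term in query_terms)
    )
    log.record("ground", {"terms": list(query_terms), "hits": hits})
    return hits


def notice_is_timely(days_elapsed, log):
    """Deterministic step: a plain comparison (invented rule), not a model call."""
    timely = days_elapsed <= 30
    log.record("rule", {"days_elapsed": days_elapsed, "timely": timely})
    return timely


def verify(draft_citations, grounded, log):
    """Verification: every cited id must come from the grounded set."""
    unsupported = sorted(set(draft_citations) - set(grounded))
    log.record("verify", {"unsupported": unsupported})
    return not unsupported


def answer(query_terms, days_elapsed, draft_citations):
    """Run the steps in order; decline if the draft cites anything ungrounded."""
    log = AuditLog()
    grounded = ground(query_terms, log)
    timely = notice_is_timely(days_elapsed, log)
    if not verify(draft_citations, grounded, log):
        return None, log  # decline rather than surface an unsupported draft
    return {"timely": timely, "citations": list(draft_citations)}, log
```

In this sketch, `answer(["notice"], 45, ["case-001"])` returns a result together with a three-entry log, while a draft citing an id that is not in the store returns `None` and leaves a log entry naming the unsupported citation. The design point is that declining and recording are both behaviors of the surrounding code, whatever the model proposes.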
Those moves stack, and the stack — not the model — is what an enterprise can defend. That is the architecture this series builds. Post 2 takes the load-bearing core: grounding, determinism and isolation, and verification. Post 3 takes the governance layer that turns a working system into a defensible one and reframes how the enterprise should buy.
The defensibility of a legal AI system is decided by its architecture, not its model.
The risk is real and intrinsic; capability gains do not close it; and the responses that work are architectural. The components of a defensible legal AI system already exist in the evidence. What has been missing is the architecture that assembles them — and the discipline to evidence each layer rather than trust the model. Stop selecting models. Start building, and evidencing, assurance.
