Talking Is Not Coordinating

This is Post 2 of 6 in Coordination by Construction, Series 19. The Coordination Gap Is an Architecture Problem established that the gap is structural; this post locates it at integration rather than communication. The posts that follow build the response: Coordination by Construction gives the architectural answers, Observable, Repairable Cooperation adds governance, The Human Is a Design Element places human judgment, and Can Training Fix Teamwork? tests whether better models close the gap. The series runs alongside Series 17, Assurance, which frames assurance as a property built into the architecture; coordination by construction is that same discipline applied to how agents work together.

The intuitive fix for agents that fail to coordinate is to let them talk more — and that instinct is misdirected.

The benchmarks already give agents a channel; what they reveal is that talk reshapes where agents work without changing whether their work fits together.

Agents talk; it does not make them coordinate

The cleanest demonstration is built into CooperBench, which gives two coding agents an open natural-language channel and then measures whether the channel changes outcomes. The agents use it — spending as much as 20% of their action budget on communication — and that communication measurably reduces low-level merge conflicts. Yet the difference in task success between agents that can talk and agents whose channel is disabled is not statistically significant (Khatua et al., 2026, CooperBench: Why Coding Agents Cannot be Your Teammates Yet, arXiv:2601.13295v2, preprint). Talk reshapes where agents edit without changing whether their solutions are compatible.

CalBench reaches the same dissociation from a different task — decentralized calendar scheduling across private calendars, scored against an optimal solver — and states it flatly: “raw message count is therefore insufficient as a proxy for coordination quality” (Zou et al., 2026, CalBench: Evaluating Coordination–Privacy Trade-offs in Multi-Agent LLMs, arXiv:2605.09823v2, preprint). In its varied-cost condition the least verbose model is among the weakest, the verbose ones are not reliably better, and there is no monotone relationship between messages sent and regret incurred. Two benchmarks, two domains, one finding: communication volume is not a coordination metric.
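One way to test the "no monotone relationship" claim on your own logs is a rank correlation between messages sent and regret. The sketch below computes Spearman's rho with the standard library only. The per-model numbers are invented for illustration and are not CalBench data; the helper names `ranks` and `spearman` are likewise assumptions of this sketch.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model numbers (NOT from the benchmark).
messages_sent = [12, 30, 45, 60, 85]
regret = [0.35, 0.10, 0.40, 0.15, 0.30]

rho = spearman(messages_sent, regret)
print(f"Spearman rho between messages and regret: {rho:.2f}")
```

A rho near zero, as with these made-up values, is what "message count is not a proxy" looks like in telemetry: sorting agents by how much they talk tells you almost nothing about how well they scheduled.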

The failure has a location: integration

If talk is not the bottleneck, what is? Silo-Bench answers with precision. It runs agents through algorithmic tasks where information is distributed across the group, and separates two stages: exchanging information and integrating it into a correct answer. It names the result the Communication-Reasoning Gap: agents form sensible topologies and pass information competently, then fail at the integration stage (Zhang et al., 2026, Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems, arXiv:2603.01045v2, ACL 2026, preprint). On aggregation tasks the strongest model holds 88.0% partial correctness, meaning it has nearly all the information, but reaches only 62.0% success: a 26-point gap that reflects a failure to integrate rather than a failure to obtain the information. Past fifty agents on the hardest tasks, success falls to zero while partial correctness stays positive. The information is present in the system and never assembled.
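The arithmetic behind the gap is simple enough to put in a monitoring check. This is a minimal sketch that takes the two aggregate figures quoted above as inputs; the function names are hypothetical, and it makes no claim about how Silo-Bench itself computes partial correctness.

```python
def integration_gap(partial_correctness_pct: float, success_pct: float) -> float:
    """Percentage points of information held but not turned into correct answers."""
    return partial_correctness_pct - success_pct


def assembly_failed(partial_correctness_pct: float, success_pct: float) -> bool:
    """True when information is present in the system yet no run succeeds."""
    return partial_correctness_pct > 0 and success_pct == 0


# Figures reported for the strongest model on aggregation tasks.
gap = integration_gap(88.0, 62.0)
print(f"Integration gap: {gap:.1f} points")
```

The second helper captures the large-group pattern described above: success at zero while partial correctness stays positive.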

CooperBench supplies the same anatomy in the coding domain through a distinction worth carrying for the rest of this series: spatial versus semantic coordination. Spatial coordination is agreeing on who edits which lines; semantic coordination is agreeing on what the edits should mean. Communication solves the first and not the second — agents partition the file successfully and still ship incompatible logic because they never reconciled the design (Khatua et al., 2026). A merge can be conflict-free and still embed contradictory assumptions: the integration gap, restated for code.
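A toy illustration of the spatial-versus-semantic distinction, assuming two agents that edited different functions of the same module. Every name here is hypothetical and the example is not drawn from CooperBench. The two edits touch disjoint lines, so they merge without conflict, yet they disagree about units.

```python
def parse_price(text: str) -> int:
    """Agent A's edit: returns the price in integer cents."""
    dollars, _, cents = text.partition(".")
    # Pad or trim the fractional part to exactly two digits (toy parser).
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])


def format_total(prices) -> str:
    """Agent B's edit: assumes each price is already in dollars."""
    return f"${sum(prices):.2f}"


# Spatial coordination succeeded: both functions coexist in one file.
total = format_total([parse_price("19.99"), parse_price("5.01")])

# Semantic coordination failed: the intended total is $25.00.
semantically_compatible = total == "$25.00"
print(total, semantically_compatible)
```

No merge tool flags this, because nothing overlaps textually. Only a check that exercises both edits together exposes the contradictory assumption.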

Why talk does not reach integration: partner-modeling

Integration depends on each agent holding an accurate model of the other’s state and intentions — and that model is where current systems break. CooperBench attributes 42% of its cooperative failures to exactly this: an agent is told what its partner is doing and proceeds as if that work will not exist (Khatua et al., 2026). The information was available; it was not integrated.

Recent work on theory of mind sharpens why this happens and points at the fix. A-ToM, accepted to AAAI 2026, shows the problem is often not too little partner-modeling but partner-modeling at the wrong depth: when two agents reason about each other at mismatched levels of recursion, coordination breaks through either insufficient or excessive reasoning — two agents modeling each other at the same shallow depth can deadlock as reliably as agents not modeling each other at all (Mu et al., 2026, Adaptive Theory of Mind for LLM-based Multi-Agent Coordination, arXiv:2603.16264v1, AAAI 2026, preprint). The answer is instructive for how this series reads the whole problem: rather than make the model reason harder, they wrap it in a lightweight online-learning loop that infers the partner’s reasoning depth from observed behavior and matches it. Against fixed partners the adaptive agent reaches the ceiling of a correctly-aligned pair and out-coordinates the stronger fixed-depth agents. The improvement comes from structure around the model, not a more capable model.
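To make "a lightweight online-learning loop" concrete, here is a generic Hedge-style (multiplicative weights) estimator over candidate partner depths. It is a sketch of the general idea only and is not the algorithm from the A-ToM paper. The toy level-k model in `predict_action`, the class name, and the observation sequence are all assumptions made for illustration.

```python
import math


def predict_action(depth: int) -> int:
    """Toy level-k model (an assumption): a depth-d reasoner plays action d."""
    return depth


class DepthEstimator:
    """Maintains normalized weights over hypotheses about the partner's depth."""

    def __init__(self, depths, eta: float = 1.0):
        self.eta = eta
        self.weights = {d: 1.0 / len(depths) for d in depths}

    def observe(self, partner_action: int) -> None:
        # Down-weight each hypothesis whose prediction missed the observed action.
        for d in self.weights:
            loss = 0.0 if predict_action(d) == partner_action else 1.0
            self.weights[d] *= math.exp(-self.eta * loss)
        total = sum(self.weights.values())
        for d in self.weights:
            self.weights[d] /= total

    def best_depth(self) -> int:
        """Depth hypothesis with the highest current weight."""
        return max(self.weights, key=self.weights.get)


estimator = DepthEstimator(depths=[0, 1, 2])
for action in [1, 1, 2, 1, 1]:  # hypothetical observations, one noisy round
    estimator.observe(action)

matched_depth = estimator.best_depth()
print("Estimated partner depth:", matched_depth)
```

The loop is a few multiplications per round and sits entirely outside the model, which is the structural point: alignment comes from the wrapper, and a single noisy observation does not dislodge the estimate.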

Coordination can be induced — by design, not by scale

If integration is the deficit and partner-modeling is the mechanism, the actionable question is whether coordination can be deliberately produced. It can, and the lever is structural. Riedl builds an information-theoretic instrument that separates genuine cross-agent synergy from coincidental temporal coupling, and uses it to test what turns a group of agents from an aggregate into a collective (Riedl, 2026, Emergent Coordination in Multi-Agent Language Models, arXiv:2510.05174v4, preprint). Assigning each agent a distinct persona produces stable differentiation but no group-level integration. Adding one instruction, to think about what the other agents might do, produces a measurable phase change into stable, goal-directed complementarity, with the synergy metric rising sharply. Neither model capability nor agent count changed; a prompt-level structural choice made the difference.

The caveat matters and the series keeps it visible: the induced coordination structure did not raise the raw task success rate, and the synergy-to-performance link reached only marginal significance. The lever is demonstrably real, but its conversion into top-line outcomes is not yet established on this benchmark (Riedl, 2026). The study also flags a deployment hazard: under ambiguous group feedback, a reasoning model became locked in runaway mutual-modeling loops. Partner-modeling without bound is its own failure mode.

The production view treats integration as a built artifact

The leading practitioner account already operates as if integration, not conversation, is the thing to engineer: Anthropic’s Research system has its subagents exchange references to a shared store rather than narrate everything through the channel, treating shared state as an artifact the system maintains rather than a belief each agent reconstructs from a transcript (Hadfield et al., 2025, How we built our multi-agent research system, Anthropic Engineering). Post 3 takes up the architectures that do exactly that.
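The pattern of passing references instead of narrating content can be sketched in a few lines. This is an illustrative in-memory store under assumed names (`SharedStore`, the message fields, the sample text); it does not reproduce Anthropic's implementation, whose details the source does not give.

```python
import hashlib


class SharedStore:
    """In-memory, content-addressed store that agents read and write by reference."""

    def __init__(self):
        self._items = {}

    def put(self, content: str) -> str:
        # Derive a short reference from the content itself.
        ref = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        self._items[ref] = content
        return ref

    def get(self, ref: str) -> str:
        return self._items[ref]


store = SharedStore()

# A subagent writes its full findings once (hypothetical text).
findings = "Section 3 of the report: latency regressed after the cache change."
message = {"from": "subagent-1", "ref": store.put(findings)}

# The lead agent resolves the reference and reads the artifact verbatim,
# instead of reconstructing it from a paraphrase in the transcript.
resolved = store.get(message["ref"])
print(message["ref"], resolved == findings)
```

Because the reference is derived from the content, two agents holding the same reference are provably looking at the same artifact, which is one simple way to make shared state verifiable.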

What this means for an architecture

The instrumentation implication is direct. Message traffic is the wrong telemetry; a chatty agent fabric can be coordinating poorly and a quiet one well. What an architecture should measure and enforce is integration — whether agents share a verifiable model of each other’s commitments, whether the merged artifact embeds compatible assumptions, whether the system can detect that distributed information failed to assemble. The corpus offers three levers that act on integration rather than volume: make shared state an explicit artifact, align partner-modeling through a cheap adaptive loop rather than a bigger model (A-ToM), and induce complementarity through deliberate identity-and-perspective design while watching for runaway modeling (Riedl). Each is a construction, not a hope — the through-line into the next post, which turns from why communication is insufficient to the architectures that make integration verifiable by design.
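As one possible shape for integration telemetry, the sketch below compares what one agent has declared about its own work against what its partner believes about that work. The commitment keys and values are hypothetical, and a real system would need a richer schema; the point is only that the check is mechanical and independent of message volume.

```python
def commitment_mismatches(declared: dict, believed: dict) -> dict:
    """Map each commitment the partner gets wrong (or lacks) to (declared, believed)."""
    return {
        key: (declared[key], believed.get(key))
        for key in declared
        if believed.get(key) != declared[key]
    }


# Agent A's declared commitments and Agent B's model of them (hypothetical).
declared_by_a = {
    "parse_price.returns": "int cents",
    "parse_price.raises": "ValueError",
}
believed_by_b = {
    "parse_price.returns": "float dollars",
}

mismatches = commitment_mismatches(declared_by_a, believed_by_b)
for key, (declared, believed) in mismatches.items():
    print(f"{key}: declared={declared!r}, partner believes={believed!r}")
```

An empty result would indicate aligned partner models for the declared commitments; a non-empty one is a detectable assembly failure before any code is merged.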

The Hard Claim

Coordination fails at integration, not at communication — and message volume measures neither. Agents already talk; talk solves spatial coordination and leaves semantic integration unsolved.

Stop instrumenting conversation as if it were coordination and start instrumenting integration: shared verifiable state, aligned partner models, detectable assembly failure. The mechanism research shows these are buildable; what remains open is how reliably constructed coordination converts into task outcomes.


The claim: Coordination fails at integration, not communication, and message volume measures neither.
CooperBench: An open channel reduces merge conflicts but does not significantly change task success; 42% of failures are partner-modeling failures.
CalBench: No monotone relationship between messages sent and regret; raw message count is insufficient as a proxy for coordination quality.
Silo-Bench (ACL 2026): A 26-point gap between information held (88.0%) and answers reached (62.0%), the Communication-Reasoning Gap.
The mechanism: A-ToM (AAAI 2026) finds that coordination breaks when partner-modeling depth is misaligned, fixed by an adaptive loop rather than a bigger model. Riedl shows that one perspective-taking instruction induces coordination structure.
The implication: Instrument and enforce integration, not conversation. Make shared state an artifact; align partner models by construction.

