We ran 11 iterations of a structured multi-agent debate stress test — a scenario where the code is
correct and every expected concern is a false positive. Run 1 cleared only 1 of 6 false positives and
invented 5 phantom claims; Run 11 cleared 7 of 7 with 0 phantoms. The core
finding: structural intervention in the command prompt overrides behavioral priors more reliably
than rules in agent definition files. V3.8 ships the calibrated debate command and the
pre-flight gate system that achieved it.
-
3-round structured debate command
A new slash command that stress-tests the squad's false positive calibration. Round 1: four agents review code independently in parallel. Round 2: four agents rebut the strongest finding they disagree with. Round 3: Nando synthesizes all findings with explicit verdicts on each concern. Emily validates against a sealed answer key the debate agents never see.
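The round structure can be sketched as a small orchestration loop. This is a minimal illustration, not the actual command implementation: `runAgent` is a hypothetical dispatch function, and the R1 roster is assumed from the agent names mentioned elsewhere in these notes.

```typescript
// Hypothetical sketch of the 3-round debate flow described above.
type AgentFn = (agent: string, prompt: string) => string;

function runDebate(code: string, runAgent: AgentFn) {
  const reviewers = ["FC", "Jared", "Stevey", "PM Cory"]; // assumed R1 roster

  // Round 1: four independent reviews (parallel in the real command).
  const openings = reviewers.map((a) =>
    runAgent(a, `Review this code independently:\n${code}`)
  );

  // Round 2: each agent rebuts the strongest finding it disagrees with.
  const rebuttals = reviewers.map((a) =>
    runAgent(a, `Rebut the strongest finding you disagree with:\n${openings.join("\n---\n")}`)
  );

  // Round 3: Nando synthesizes with an explicit verdict on each concern.
  const synthesis = runAgent(
    "Nando",
    `Give an explicit verdict on every concern:\n${[...openings, ...rebuttals].join("\n---\n")}`
  );

  return { openings, rebuttals, synthesis };
}
```

Emily's validation runs after this returns, against the sealed answer key the debate agents never see.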
-
Sealed answer key protocol
The answer key lives only in the command file and Emily's final prompt. No debate agent receives it at any point. The orchestrator is instructed never to quote, paraphrase, or leak it into agent prompts — ensuring the debate is a genuine blind test.
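One way to picture the sealing rule: the key is held where only the final validation prompt can reach it. The function names below are illustrative, not the orchestrator's real API.

```typescript
// Sketch of the sealed-key protocol: debate-agent prompts are built only
// from the code and transcript; the key is rendered solely for Emily.
function makeOrchestrator(answerKey: string[]) {
  return {
    // Debate agents never see the key in any form.
    agentPrompt: (code: string, transcript: string) =>
      `Review this code:\n${code}\n\nTranscript so far:\n${transcript}`,
    // Emily's validation prompt is the only place the key appears.
    emilyPrompt: (synthesis: string) =>
      `Sealed answer key:\n${answerKey.join("\n")}\n\nScore this synthesis:\n${synthesis}`,
  };
}
```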
-
Debate reports output
Full debate transcripts — all four R1 openings, all four R2 rebuttals, Nando's synthesis, and Emily's scorecard — are assembled and written to
.review-squad/debate-reports/YYYY-MM-DD-false-positive.md at the end of each run.
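A sketch of the assembly step, with the date formatting and section rendering as assumptions:

```typescript
// Hypothetical report assembly matching the path convention above.
function reportPath(date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `.review-squad/debate-reports/${day}-false-positive.md`;
}

function assembleReport(sections: { title: string; body: string }[]): string {
  return sections.map((s) => `## ${s.title}\n\n${s.body}`).join("\n\n");
}
```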
-
Mandatory verification gates before findings
Each R1 agent prompt now contains named pre-flight checks that must be completed before any finding in that category can be raised. A gate has three parts: (1) a named check triggered by intent, (2) an explicit question about the specific code, (3) a conditional instruction — if you cannot identify a scenario the code does not handle, do not raise the concern. This is structurally distinct from a rule: it fires in the context of the specific code and requires a positive answer, not just rule recall.
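The three-part structure can be made concrete as a small template. This is a hypothetical rendering of the gate shape described above, not the actual prompt text:

```typescript
// A pre-flight gate: (1) named check, (2) code-specific question,
// (3) conditional drop instruction.
interface PreflightGate {
  name: string;      // (1) named check, triggered by intent
  question: string;  // (2) explicit question about the specific code
  required: string;  // (3) what the agent must identify before proceeding
}

function renderGate(g: PreflightGate): string {
  return [
    `PRE-FLIGHT CHECK: ${g.name}`,
    `Answer before raising this concern: ${g.question}`,
    `If you cannot ${g.required}, do not raise the concern.`,
  ].join("\n");
}
```

The conditional in the last line is what distinguishes a gate from a rule: it demands a positive answer about this code, not recall of a general principle.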
-
Transaction / idempotency gate (all five agents)
Before flagging a missing transaction wrapper, agents must explicitly confirm whether each mutation is idempotent — tracing ON CONFLICT DO UPDATE and SET active=false specifically. If both are idempotent and partial runs are retry-recoverable, the agent must name an unrecoverable failure scenario before proceeding. If it cannot, the concern is dropped. This gate is applied to all four R1 agents and to Nando's synthesis prompt.
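A toy model of the retry-safety claim the gate asks agents to verify: applying the same upsert or deactivation twice leaves the same final state. The table shape is illustrative, not the unit under test.

```typescript
type Row = { email: string; active: boolean };

// Models INSERT ... ON CONFLICT (email) DO UPDATE: same result on every retry.
function upsert(table: Map<string, Row>, email: string): void {
  table.set(email, { email, active: true });
}

// Models UPDATE ... SET active = false: already-inactive rows are unchanged.
function deactivate(table: Map<string, Row>, email: string): void {
  const row = table.get(email);
  if (row) row.active = false;
}
```

Because both mutations converge to the same state on repeat application, a partial run followed by a retry is recoverable without a transaction wrapper.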
-
Return value semantics gate (FC, Stevey, PM Cory, Nando)
Before flagging return value inaccuracy due to ON CONFLICT reactivations, agents must confirm what the return contract actually promises — reconciliation intent counts (emails processed for activation/deactivation), not net-new DB row counts. Reactivation via ON CONFLICT is a successful sync outcome and correctly increments the count. The concern may be raised only if the return type explicitly promises net-new insertions.
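The contract distinction in miniature, with the function name and shape as illustrative assumptions:

```typescript
// The return value counts reconciliation intents (emails processed), so a
// reactivation of an existing row still increments — that is the contract.
function syncActivations(existing: Set<string>, incoming: string[]): number {
  let processed = 0;
  for (const email of incoming) {
    existing.add(email); // insert or reactivate — both are successful syncs
    processed++;         // intent count, not net-new row count
  }
  return processed;
}
```

Flagging this as "inaccurate" presumes a net-new-rows contract the return type never promised.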
-
Batch / loop pattern gate (FC, Jared, Stevey)
Before flagging inefficient looping or N+1 patterns, agents must verify whether calls are per-item or per-chunk. Chunked calls at CHUNK_SIZE rows per call are O(n/CHUNK_SIZE), not O(n). Before flagging dual-loop structure, agents must verify whether the two loops operate on different SQL — if so, the separation is intentional and combining would require a CTE.
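The complexity claim the gate asks agents to verify, in arithmetic form. The chunk size here is an assumption for illustration:

```typescript
// Per-chunk calls mean ceil(n / CHUNK_SIZE) queries, not n.
const CHUNK_SIZE = 500;

function queryCallsFor(nRows: number): number {
  let calls = 0;
  for (let i = 0; i < nRows; i += CHUNK_SIZE) calls += 1; // one db.query per chunk
  return calls;
}
```

An agent primed on N+1 patterns sees a loop around a query; the gate forces it to check whether the loop advances per item or per chunk before flagging.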
-
Nando pre-flight added (critical fix)
R1 agents were correctly clearing the transaction concern via pre-flight from Run 10 onward, but Nando was overriding their clearances in synthesis. Adding the idempotency and return value pre-flights to Nando's synthesis prompt resolved this in one run — Nando must now apply the same gates before upholding any finding the R1 agents cleared.
-
Phantom definition narrowed
Emily's phantom count was previously defined as "any concern not in the answer key." This swept in legitimate real-world concerns (input validation, unbounded SELECT at scale, missing logging) that agents are correct to raise even when out of scope for the unit under test. The definition is now: a phantom is a claim that is objectively false about what the code actually does — e.g., claiming normalization is asymmetric when both sides call toLowerCase(), claiming placeholder indices collide when each db.query is independent. Correct-but-out-of-scope findings do not count as phantoms.
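The asymmetric-normalization example above, made concrete. This code is illustrative of the pattern, not the actual unit under test:

```typescript
// A phantom would be the claim "normalization is asymmetric" here —
// both sides pass through the same function, so the claim is objectively false.
function normalize(email: string): string {
  return email.trim().toLowerCase();
}

function emailsMatch(a: string, b: string): boolean {
  return normalize(a) === normalize(b); // symmetric by construction
}
```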
-
Scorecard format stabilized
Emily produces a fixed-format scorecard per run: a per-FP table (Flagged R1, Challenged R2, Cleared by Nando), aggregate counts (FPs correctly cleared, phantoms invented, rounds to self-correct), and a PASS / PARTIAL / FAIL verdict, followed by a 1–2 paragraph debrief.
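One possible encoding of the verdict logic. The PASS/PARTIAL/FAIL rule below is an assumption inferred from the run table, not a documented specification:

```typescript
type Verdict = "PASS" | "PARTIAL" | "FAIL";

// Assumed rule: all FPs cleared with no phantoms is PASS; partial clearance
// with no phantoms is PARTIAL; anything else is FAIL.
function verdict(fpsCleared: number, fpsTotal: number, phantoms: number): Verdict {
  if (fpsCleared === fpsTotal && phantoms === 0) return "PASS";
  if (fpsCleared > 0 && phantoms === 0) return "PARTIAL";
  return "FAIL";
}
```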
-
Run progression: 1/6 → 7/7 cleared, 5 → 0 phantoms
| Run | Key intervention | FPs cleared | Phantoms | Txn cleared |
|-----|------------------|-------------|----------|-------------|
| 1 | Baseline | 1/6 | 5 | No |
| 7 | Cite-the-line rules | 1/6 | 3 | No |
| 9 | Narrowed phantom definition | 1/7 | 0 | No |
| 10 | R1 pre-flight gates | 3/7 | 0 | No — Nando override |
| 11 | Nando pre-flight added | 7/7 | 0 | Yes — PASS |
-
Key finding: structural prompt intervention > agent-level rules
Explicit rules added to agent definition files were ineffective against behavioral priors trained at depth (transaction = required, N+1 = bad). Pre-flight gates injected into the orchestration command — forcing agents to answer specific questions about the specific code — consistently overrode priors that rules could not touch. The gate must be at both the opener and synthesizer levels.
-
The debate self-correction mechanism works within rounds
When R1 agents raise a false claim (phantom), R2 agents catch it within one round. Run 9 achieved a 100% phantom catch rate once the phantom definition was precise enough. The mechanism breaks down for plausible-but-incorrect concerns — findings that feel like real patterns (N+1, missing transaction) but don't apply to this specific code. Those require pre-flight gates, not self-correction.
-
Full calibration methodology documented
Complete research findings, run-by-run progression, pre-flight gate design rationale, phantom vs. false positive distinction, and generalization outlook documented in
docs/superpowers/methodology/false-positive-debate-calibration.md.