A Category-Level Failure Mode Not Captured by Distribution Metrics

An extended treatment of the structural conditions under which ML evaluation infrastructure loses category transparency

Externality conditions for substrate-independent audit, derived from the 2008 Gaussian copula collapse.
Author

Ghjuvan Ortulanu

Published

April 2026

Abstract

The failure mode this note describes is not model collapse, not benchmark contamination, not distribution drift. It is a functional shift of the evaluation categories themselves — boundaries persist while what they separate becomes progressively opaque.

The argument proceeds in three movements. First, it specifies three failure conditions under which any category undergoes such a shift — self-reference, anchor drift, and proxy displacement — and demonstrates them on the 2008 collapse of the Gaussian copula in credit markets. Second, it shows that the same conditions are now assembling in ML training and evaluation infrastructure along three reinforcing trajectories, each a direct instantiation of one failure condition. Third, it derives from these the minimum shape of an audit that would block the failure — three externality conditions (provenance independence, external anchoring sustained by incentive misalignment, speed parity), each the structural inverse of a failure condition — and shows why no approach internal to the ML substrate, and no current non-ML automation, can take this shape. The note closes by examining why the dominant “safety as a technical problem” frame has structural difficulty recognizing this failure mode, with the source of the limited visibility traced to the signal-to-noise structure of self-referential evaluation. What the externality conditions specify is not technical work in the sense the ML research community currently uses the term; what would satisfy them is left open, with several partial channels — physical-feedback loops, hardware attestation, cryptographically certified human-origin data — sketched in section 9.

1 Introduction

The question motivating this note is narrow. Assume current safety infrastructure successfully addresses distribution-level failures — contamination of benchmarks, collapse under recursive training, drift across deployment. What fails silently beside it, in a way the same infrastructure cannot see? The hypothesis advanced here is that category-level failure is that thing: a functional shift of the evaluation categories themselves, not of their distributional contents.

The hypothesis is sharp in one respect and modest in another. Sharp: the failure mode is structurally inaccessible from within the current “safety as a technical problem” frame — no work internal to the ML substrate can resolve it, just as no improvement in credit risk modeling could have resolved the Gaussian copula collapse in 2008. Modest: what the external substrate would be is not specified here. The note describes the shape of the required externality, sketches several candidate partial channels, and stops there.

The argument is structured so that each section strengthens the one before and the one after. Section 2 defines the failure conditions — three conditions under which any category loses transparency. Section 3 demonstrates them on the Gaussian copula case, focusing on the mechanism by which the category collapsed. Section 4 shows that the same failure conditions are assembling in ML evaluation, with each of three paths instantiating a specific condition. Section 5 derives the externality conditions — three conditions an audit must satisfy to block category-level failure — as the structural inverses of the failure conditions, and traces the structural tension among them, including why current non-ML automation does not relieve it. Section 6 maps four current frontier approaches against the externality conditions, reading each as a specific trade-off rather than as an uncategorized failure. Section 7 examines why the dominant frame cannot recognize this failure mode, with the analysis grounded in the signal-to-noise structure of self-referential evaluation. Sections 8 and 9 state the limitations of the analysis and the questions it leaves open, with section 9 sketching candidate partial external substrates in concrete terms.

The note is not prescriptive. It does not propose a complete external substrate. It describes what the shape of any such substrate would have to be, sketches the partial channels that approximate it, and explains why the dominant framing of AI safety is unlikely, by its own self-constituting structure, to produce the complete construction.

2 The Failure Conditions

Categories do two things. They identify — they sort an object as A or B, admitting external reference to their boundaries. And they conceal — they treat internal heterogeneity as homogeneous, which is precisely how they enable identification in the first place. A category that resolved every internal difference would collapse into a set of singletons; a category that concealed everything would provide no external handle. Under normal operation the two functions sit in trade-off: some concealment is the price paid for identification resolution.

The trade-off holds only while the category’s boundary is anchored to a reference external to the objects placed inside it. Remove the external anchor, and the trade-off inverts: the category conceals internal heterogeneity and simultaneously loses the external reference against which identification was possible.

Three conditions together induce this inversion. Call them the failure conditions.

Self-reference. The objects placed inside the category influence the category’s definition. The boundary is no longer drawn from outside.

Anchor drift. No non-endogenous reference exists to which the category’s definition can be anchored. The definition drifts in tandem with the objects it contains.

Proxy displacement. Questions about the properties of individual objects inside the category migrate into questions about the boundary itself. “How X is this object?” becomes “Does this object belong to category A?” The original question disappears, because its answer is presumed inherited from the boundary’s meaning.

Once these three hold jointly, no signal internal to the substrate reliably separates “functioning” from “miscalibrated.” Internal coherence becomes indistinguishable from accuracy. Distribution collapse (Shumailov et al., 2024) is one pathway to this state but not the only one. Categories can lose transparency while distributions continue to appear rich. Distribution metrics, designed to detect shifts within categories, do not detect shifts of categories. This is the failure mode the remainder of the note describes.
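A deliberately minimal toy makes the inversion visible. The one-dimensional setup and every parameter below are illustrative inventions, not a model of any system discussed in this note: a threshold re-estimated only from the objects it previously admitted drifts generation over generation, while each admitted set still looks like an ordinary, well-populated distribution that a within-category metric would not flag.

```python
import numpy as np

def boundary_drift(generations=20, seed=0):
    """Toy: a category boundary re-estimated only from the objects it
    previously admitted (self-reference, no external anchor).
    The boundary drifts, yet every admitted set still looks like an
    ordinary, well-populated distribution."""
    rng = np.random.default_rng(seed)
    threshold = 0.0                          # boundary initially fixed from outside
    for g in range(generations):
        pool = rng.normal(0.0, 1.0, 10_000)  # candidate objects, same source every generation
        admitted = pool[pool > threshold]    # objects sorted into the category
        # the boundary is redefined from the admitted objects themselves
        threshold = admitted.mean() - admitted.std()
        if g % 5 == 0:
            print(f"gen {g:2d}: boundary = {threshold:+.3f}, admitted n = {admitted.size}")

boundary_drift()
```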

A point on the unit of analysis. The failure conditions are defined at the level of a single category, but the phenomenon this note describes does not appear at that level in isolation. The object of analysis is a coupled evaluation stack — training data, evaluation benchmarks, and feedback channels operating as interdependent elements whose outputs flow into one another’s inputs. The three failure conditions can appear distributed across different elements of the stack: one category may exhibit self-reference while another exhibits anchor drift. The failure mode materializes at the stack level, when the conditions hold jointly across the coupled system even if no single category exhibits all three in isolation. Section 4 traces this distribution; section 5 specifies what an audit of such a stack must satisfy.

A second point on terminology. The term substrate, which this note uses with some frequency, refers to the joint system of training-data provenance, feedback lineage, model-family optimization history, and evaluator-community epistemic overlap. The claim that two systems “share substrate” does not require identity along every dimension; it requires sufficient overlap along at least one dimension such that co-produced judgments become probable. Section 9 flags the resolution at which this overlap begins to undermine independence as an open empirical question.

3 The Canonical Case: Mechanism of the Gaussian Copula Collapse

The failure conditions have a precedent in credit categorization before 2008. The precedent is structural correspondence, not identity of case. This section focuses on the mechanism of failure — the conditions assembling and then producing collapse. Section 7.3 returns to the same case under a different question: who recognized the failure, and when.

The Gaussian copula model, developed by David Li and adopted widely after 2000, derived correlations among individual mortgage defaults from observed CDS spreads. Its mathematical legitimacy rested on the assumption that the correlation structure was exogenous to the mortgage market — a stable statistical parameter estimated from observed prices and used to price derivative instruments built on those prices. As long as the exogeneity assumption held, recombining BBB-rated subprime tranches into collateralized debt obligations and assigning the senior tranches AAA ratings was defensible. The category AAA was doing work: it sorted structured instruments according to a measurable property.

The model was not mathematically wrong. It failed under self-reference. By 2006, global CDO issuance had reached approximately $520 billion, and at that scale the mortgage market itself began to be shaped by CDO demand. Subprime origination at Countrywide, New Century, and similar lenders expanded to match the absorption capacity of the CDO pipeline. Lending standards relaxed in response to the pipeline’s need for product. The correlation structure used to price CDOs was no longer exogenous; it was being co-produced by the instruments whose pricing depended on it. MacKenzie & Spears (2014) document this co-production.

All three failure conditions held. Self-reference: CDO demand shaped mortgage origination, which shaped the inputs to CDO pricing. Anchor drift: AAA’s meaning was defined by model outputs calibrated against market data the model was simultaneously producing. Proxy displacement: “How risky is this mortgage?” migrated into “Is it in an AAA tranche?” When nationwide housing prices fell in late 2006, defaults previously assumed to be independent synchronized immediately, and the AAA category ceased to carry meaning. The category did not shrink, split, or migrate its distribution — it lost the capacity to separate what it had been designed to separate.
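The synchronization mechanism is easy to exhibit numerically. The sketch below is a one-factor Gaussian copula Monte Carlo with illustrative parameters (a 5% per-loan default probability, a 15% senior attachment point); it is not calibrated to the 2008 instruments. The senior tranche is nearly immune while asset correlation is low and is hit with material probability once correlation rises.

```python
import numpy as np
from scipy.stats import norm

def senior_tranche_hit_prob(rho, p_default=0.05, n_loans=125,
                            attachment=0.15, n_sims=50_000, seed=0):
    """One-factor Gaussian copula: latent value X_i = sqrt(rho)*M + sqrt(1-rho)*Z_i,
    loan i defaults when X_i falls below the barrier Phi^{-1}(p_default).
    Returns the estimated probability that pool losses exceed the senior
    tranche attachment point."""
    rng = np.random.default_rng(seed)
    barrier = norm.ppf(p_default)
    m = rng.standard_normal((n_sims, 1))        # common (systemic) factor
    z = rng.standard_normal((n_sims, n_loans))  # idiosyncratic factors
    x = np.sqrt(rho) * m + np.sqrt(1 - rho) * z
    loss_fraction = (x < barrier).mean(axis=1)  # pool loss per scenario
    return (loss_fraction > attachment).mean()

for rho in (0.05, 0.30, 0.60):
    print(f"rho = {rho:.2f}  P(senior tranche hit) ≈ {senior_tranche_hit_prob(rho):.4f}")
```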

The post-crisis resolution took the form of macroprudential regulation — instruments operating categorically external to the modeling layer. The failure could not be resolved by improvement within the layer that failed. This is the general pattern: once the failure conditions hold, the resolution cannot be generated by refinement within the layer that failed. The argument of section 5 derives the same conclusion in inverse — specifying what an instrument would need in order to operate as that external resolution.

4 The Current Instance: Three Paths, Three Conditions

In current large-scale training and evaluation pipelines, two categories carry the burden of safety tracking. “Training data” — the corpus a model is trained on, implicitly assumed to consist of human-generated content and observations of the real world. “Evaluation benchmarks” — tests of model capability and behavior, implicitly assumed to be independent of and external to training. Both function as risk trackers only under a single structural assumption: that training distribution and evaluation distribution are anchored to independent external signals that do not themselves depend on the systems being evaluated. This assumption occupies the same structural position as the exogeneity-of-correlation assumption in the Gaussian copula.

The assumption is being eroded along three paths. Each path is a direct instantiation of one failure condition. Together, they produce all three jointly.

4.1 Recursive Synthetic Injection — Reifying Self-Reference

Curation filters evaluate content — text quality, factual consistency, stylistic features. They do not evaluate provenance lineage. When Sutskever announced at NeurIPS 2024 that “pre-training as we know it will end” and positioned synthetic data and agentic AI as the principal paths forward, this was a response to data scarcity. Within-substrate synthetic generation, however, cannot satisfy provenance independence. Its output remains inside the lineage it is meant to diversify; extending that lineage is its function, not a side effect. Generation internal to a substrate cannot produce signal external to that substrate, however sophisticated the procedure.

This path reifies self-reference. The category “training data” is increasingly composed of outputs from the systems it is meant to train; the boundary is being drawn from within the substrate. Agentic AI pipelines compound the mechanism — every output inherits the provenance of the models that produced it and contributes to the next training corpus in whose curation those same models participate.
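A toy recurrence, with rates that are illustrative placeholders rather than estimates, shows how quickly the composition of the category shifts once published model output is scraped back in alongside direct synthetic injection.

```python
def model_origin_share(generations=10, publish_rate=0.2, direct_synth=0.3):
    """Toy lineage model. Each generation, a fraction `publish_rate` of the
    scrapeable pool turns over into newly published model-generated content,
    and curation injects a fraction `direct_synth` of within-substrate
    synthetic data directly into the corpus. All rates are illustrative."""
    web_model_share = 0.0
    for g in range(generations):
        corpus_model_share = direct_synth + (1 - direct_synth) * web_model_share
        print(f"gen {g}: model-origin share of training corpus = {corpus_model_share:.2f}")
        # published model output accumulates in the pool scraped next generation
        web_model_share = (1 - publish_rate) * web_model_share + publish_rate
    return corpus_model_share

model_origin_share()
```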

4.2 Shared Evaluation Lineage — Reifying Anchor Drift

The apparatus that currently evaluates frontier models is not drawn from a distributionally independent signal source. LLM-as-judge evaluators, model-based red-teaming, auto-generated test cases, and AI-assisted annotation procedures all share training lineage with the models they evaluate (Zheng et al., 2023). None of MMLU, HumanEval, MT-Bench, or Chatbot Arena constitutes a distributionally independent signal: the work products of their curators are already present in the pre-training corpora from which the evaluated models emerge, and the human raters on Arena are drawn from populations whose preferences have been shaped by prior exposure to similar systems.

Legal independence is not distributional independence. Two organizations that are legally distinct can draw from the same pre-training substrate and converge on overlapping judgments. This path reifies anchor drift: the category “evaluation benchmark” has no reference anchor that is not itself co-produced by the systems under evaluation. The dynamics generalize what Perdomo et al. (2020) formalize as performative prediction, but at the category rather than distribution level — performative prediction theory assumes the category is given; category-level opacity is the condition under which that assumption fails.

4.3 Proxy Signal Amplification — Reifying Proxy Displacement

Human feedback is a structural bottleneck. The standard response is to train a reward model on human feedback (Ouyang et al., 2022) and use that reward model as the dominant feedback signal in subsequent training. Constitutional AI and RLAIF push the substitution further, replacing human ratings with model-generated critique (Bai et al., 2022). These methods are presented as engineering progress: they scale, they reduce annotation cost, they let alignment signal keep pace with capability signal. Structurally, they shrink the fraction of “human feedback” that is in fact human, and the limits of this substitution at scale are increasingly visible (Casper et al., 2023).
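The substitution is visible in the standard pairwise preference objective used to fit the reward model (cf. Ouyang et al., 2022), restated here in the usual notation:

$$\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\Bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\Bigr]$$

where $y_w$ and $y_l$ are the human-preferred and dispreferred completions for prompt $x$. Human judgment enters once, as the finite comparison set $D$; from then on $r_\theta$, not a human, supplies the feedback signal at scale.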

This path reifies proxy displacement. “Is the model aligned with human judgment?” has migrated into “Does the model’s output receive favorable reward model scores?” The original question has disappeared as a question, because its answer is presumed inherited from the proxy’s categorization. The category “aligned output” retains its name; the property being measured is no longer the property the name refers to.

4.4 Combined Effect

The three paths reinforce each other. Synthetic injection contaminates the training distribution; shared-lineage evaluation fails to detect the contamination because the evaluation apparatus has itself been trained on it; proxy-signal amplification propagates the contamination into alignment judgments. All three failure conditions hold jointly across the categories “training data,” “evaluation benchmark,” and “human feedback.”

Bommasani et al. (2021) identify substrate homogenization as a systemic risk at the model level — the defects of a foundation model are inherited by all adapted models downstream. The category-level opacity described here is orthogonal: it persists even when distributional diversity among foundation models is preserved, because the convergence is at the level of evaluation and feedback substrate rather than the model level. The analysis extends further, and that extension is the subject of section 7: interpretability tools, RLHF methods, and constitutional approaches alike share substrate with the systems they claim to align, and the failure mode is not one problem among several but the condition under which current alignment work is conducted.

5 The Externality Conditions

The failure conditions specified in section 2 describe when categories lose transparency. The externality conditions introduced here describe what any audit instrument must look like to block that loss. Each is the system-level negation of one failure condition. The two triads are therefore structurally linked — Triad B (externality) is the form an audit must take in order to make Triad A (failure) unable to hold.

5.1 Against Self-Reference — Provenance Independence

The audit instrument must originate from a training or construction lineage distributionally distinct from the audited system’s. This is not legal or organizational separation; those are weaker conditions. Two audit instruments drawn from the same pre-training lineage converge, under self-referential dynamics, on an internal consistency check. They agree because they descend from the same underlying substrate, not because they are accurate. Internal consistency is not accuracy — an internally consistent system can be consistently miscalibrated with respect to the external world. The history of epistemic closure in science (Ptolemaic astronomy, humoral medicine) shows what internally coherent but externally blind systems look like. The condition leaves open what counts as sufficient distributional distance — the resolution required is currently an open research question.
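The distinction between agreement and accuracy is simple to simulate. The sketch below is a schematic illustration with invented rates (a 30% error rate and 90% error-source sharing), not a measurement of any evaluator pair: two auditors drawing errors from a shared source agree far more often than either is correct.

```python
import numpy as np

def agreement_vs_accuracy(error_rate=0.30, shared_error=0.90, n=100_000, seed=0):
    """Two auditors each mislabel a fraction `error_rate` of cases.
    With probability `shared_error` a given case draws both auditors'
    errors from one shared source (common substrate); otherwise their
    errors are independent."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(0, 2, n)
    common = rng.random(n) < error_rate      # shared error draw
    own_a = rng.random(n) < error_rate       # independent error draws
    own_b = rng.random(n) < error_rate
    use_common = rng.random(n) < shared_error
    err_a = np.where(use_common, common, own_a)
    err_b = np.where(use_common, common, own_b)
    a, b = truth ^ err_a, truth ^ err_b
    print(f"accuracy of A : {(a == truth).mean():.3f}")
    print(f"accuracy of B : {(b == truth).mean():.3f}")
    print(f"A-B agreement : {(a == b).mean():.3f}")

agreement_vs_accuracy()
```

With these rates, each auditor is right about 70% of the time while the pair agrees on roughly 96% of cases: the agreement carries almost no information about accuracy.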

5.2 Against Anchor Drift — External Anchoring, Sustained by Incentive Misalignment

The direct structural inverse of anchor drift is external anchoring — the audit boundary must be fixed by a reference that is not co-produced by the objects it sorts. External anchoring is not secured automatically by organizational form; it requires a mechanism that prevents the anchor itself from being captured by the auditee’s incentive structure. That mechanism is incentive misalignment: the auditor’s success must be independent of, or anti-correlated with, the auditee’s success. Without this governance condition, an ostensibly external anchor drifts back into co-production through reputational capture, even when no formal channel for capture exists.

The current benchmark ecosystem illustrates the failure of external anchoring under incentive alignment. A benchmark’s standing — its citation count, its role in publication decisions, its influence in model selection — rises with the performance of the models optimized against it. A benchmark on which no state-of-the-art model performs well is, in current practice, presumed to be poorly designed rather than presumed to be detecting a real capability gap. What looked like an external reference becomes, through this feedback, an internal one. The structural parallel to the pre-2008 issuer-pays model in credit rating is direct: rating agencies were paid by the issuers of the instruments they rated, the agencies’ revenue rose with issuance volume, favorable ratings facilitated issuance. Current benchmark governance is not formally issuer-paid, but informal incentive alignment through reputational capture produces the same category-level result — the external anchor drifts.

5.3 Against Proxy Displacement — Speed Parity

The audit cycle must keep pace with the system’s rate of error accumulation. Human post-hoc audit has already failed this condition at current scales. When audit cannot keep pace with deployment, the original question — “How well is this model performing on this specific instance?” — is no longer askable in time to matter, and it migrates into the pre-deployment category judgment — “Has this model been approved?” The category judgment stands in for the running evaluation it has replaced.
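A back-of-envelope calculation, with every number an illustrative assumption rather than a measurement of any deployment, indicates the magnitude of the gap the condition points at.

```python
# Illustrative assumptions only; none of these figures describe a real system.
requests_per_day = 50_000_000        # assumed deployment volume
per_request_error_rate = 0.001       # assumed rate of the behavior being audited
audit_cycle_days = 60                # assumed post-hoc review cycle

errors_during_one_cycle = requests_per_day * per_request_error_rate * audit_cycle_days
print(f"instances accumulated before one audit cycle completes: {errors_during_one_cycle:,.0f}")
# ≈ 3,000,000: each already resolved, in practice, by the prior category judgment
```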

Automated audit addresses the speed condition directly — an audit instrument that is itself a computational process can, in principle, match the audited system’s rate. But if the automated auditor shares substrate with the auditee, the first externality condition collapses. Why does non-ML automation not relieve this tension? Three candidate non-ML automation tracks have been proposed — formal verification, static analysis of learned weights, and symbolic AI — each of which would be substrate-external to the systems being audited. Each runs into a structural obstacle that prevents it from achieving speed parity with frontier-model deployment.

Formal verification requires a specification against which the system is to be verified. For systems whose desirable behavior is itself emergent and not formally specifiable in advance, the verification target does not exist as a stable object. State-space exploration over the input distribution of a frontier language model is not bounded in any tractable way. Static analysis of learned weights fails because weights are not source code: the analysis would need to extract semantic invariants from a fixed-point representation produced by gradient descent on a large corpus, a problem for which no efficient algorithms are known and which is in general computationally intractable. Symbolic AI offers the speed advantage of rule-based inference but lacks the expressivity to capture the contextual, continuous judgments at issue — a symbolic rule system sufficient to evaluate frontier-model outputs would itself need to be of comparable representational complexity to the model, at which point its provenance independence becomes the question. The point is not that non-ML automation is in principle impossible; it is that no current non-ML automation track meets both the expressivity and speed requirements simultaneously. This is an empirical claim about the present state of computational methods, not a metaphysical one. Within current computational reach, automation at the required speed runs through the ML substrate.

5.4 The Structural Tension Among the Three Conditions

The three conditions stand in pairwise tension. Provenance independence (5.1) and speed parity (5.3) trade off against each other through the substrate question: substrate-external systems exist (legal, regulatory, physical) but do not run at frontier-model speeds; substrate-matched systems run at frontier-model speeds but cannot satisfy provenance independence. Provenance independence and external anchoring (5.2) trade off through reputational capture: an audit system organizationally external but drawing from the same intellectual tradition as the producers may avoid formal incentive alignment but inherit informal reputational alignment, with its judgments evaluated by the same community that evaluates the producers — the anchor it provides is nominally external but operationally internal. External anchoring and speed parity trade off through scale: audits whose anchors are sustained by strong incentive misalignment (regulatory, legal) operate on timescales appropriate to their authority structures, which are typically slower than market-driven audits with weaker anchoring.

The pairwise tensions imply the structural conclusion: no audit system internal to the ML substrate satisfies all three conditions simultaneously, and no current non-ML automation route closes the gap left by substrate sharing on the speed dimension. The three conditions are offered as a diagnostic partition, not a formally independent axiom set; their force is comparative. Each current approach can be examined for which condition it sacrifices in order to achieve the others. This is the object of section 6.

6 The Trade-Off Profile of Current Approaches

Current frontier safety work spans a wide landscape of approaches (Hendrycks et al., 2021), within which Anderljung et al. (2023) propose one regulatory framework emphasizing external scrutiny of frontier models, registration and reporting requirements, and compliance mechanisms. Their framework leaves underspecified what counts as adequately external — a gap that, for the purposes of the analysis here, is a productive entry point. Four current approaches can be read against the three externality conditions, and the productive question is not whether each fails but what structural trade-off each one makes.

6.1 Evaluation Institutes (UK/US AISI)

The UK and US AI Safety Institutes, established in 2023–2024, represent the most institutionalized attempt at external evaluation. Their staffing and governance are organizationally separated from frontier labs; they draw on public rather than private research substrates for their methods. Their institutional success is tied to regulatory mandate rather than to the performance of the models they evaluate. They secure partial provenance independence and, through that incentive structure, reasonably strong external anchoring.

What they sacrifice is speed parity. Their evaluation methods depend heavily on human expert judgment, structured red-teaming, and pre-deployment assessment cycles measured in weeks or months. A frontier model is deployed at a scale where the error accumulation rate exceeds the audit cycle rate by orders of magnitude. The mismatch is structural, not staffing-based: AISI achieves substrate externality by paying the cost of speed.

6.2 Interpretability Research

Interpretability tools — activation probing, feature visualization, mechanistic analysis of circuits — satisfy speed parity by construction. They are computational processes operating at rates comparable to the systems they analyze.

What interpretability sacrifices is provenance independence. The tools themselves originate within the same pre-training substrates as the models they examine. A sparse autoencoder trained on the activations of a frontier language model shares architecture, scale, and procedural lineage with its target. Its ability to detect failures is bounded by the blind spots of the substrate it emerges from. The substrate sharing is not incidental; it is the condition under which the research enterprise exists. Interpretability achieves speed by paying the cost of substrate externality.

6.3 Open Weights Release

Open-weight release strategies attempt to achieve provenance independence by expanding the auditor pool — permitting any party with sufficient resources to examine the weights. Auditors are drawn from a wider distribution of backgrounds than under closed-weight development.

What open weights does not secure is that the enlarged auditor pool is distributionally independent in training lineage. Thousands of auditors trained in the same graduate programs, drawing on the same textbooks and the same body of research methods, do not constitute substrate separation. Open weights secures legal and organizational externality at the cost of leaving substrate-level externality unachieved.

6.4 Cryptographic Attestation (zkML)

Zero-knowledge proof systems for machine learning provide guarantees about what was computed and under what conditions. They secure immutability of record — a cryptographic warrant that a given claim about model behavior corresponds to an actual computation. The attestation layer is, in a formal sense, substrate-external to the ML system being attested.

What zkML does not secure is externality of judgment. The attestation layer guarantees integrity of record without guaranteeing correctness of content. A model that is consistently miscalibrated, attested cryptographically, is consistently miscalibrated with a cryptographic warrant. zkML achieves perfect externality on one narrow dimension (record) while leaving the dimension that matters (judgment) entirely within the audited substrate.

6.5 The Shape of the Open Space

Each approach secures some externality conditions by sacrificing others. The trade-off pattern is structurally visible: AISI pays speed for substrate; interpretability pays substrate for speed; open weights pays substrate-level externality for partial auditor diversity; zkML pays judgment-externality for record-externality.

The pattern is not accidental. Externality in other domains where audit works — legal review of executive action, regulatory review of industry, scientific replication across laboratory boundaries — is secured through pre-existing structural separation or through deliberate construction at a level different from the audited system’s. These structures do not emerge spontaneously through refinement within the audited layer; they are built, with effort, from epistemologies that predate and are formally independent of the audited system. The ML case has no such pre-existing external substrate and has not yet been the site of deliberate construction at a different level. The shape of the open space is the shape of that missing construction.

7 Why the Frame Does Not See the Problem

The observation that no current approach satisfies all three externality conditions could, in principle, motivate the construction of one that does. The argument of this note is that within the dominant frame, that motivation does not form — because the frame sorts what counts as a “safety problem” in a way that places category-level externality outside the space of recognized problems before any attempt at construction begins.

The “safety as a technical problem” framing — broadly shared across frontier labs and safety-focused organizations — is not a neutral framing. It is a category assignment at the meta-level. It sorts the problems of safety into “things the technical substrate can address” and treats everything else as either not a safety problem or as the province of other fields. Category-level opacity is difficult to see within the frame, not because it has been examined and dismissed, but because the frame does not yet have a place for it. The limited visibility has three sources — institutional, structural-informational, and historical.

7.1 Institutional: Career Structures Select Against External Work

The institutions that support ML safety research — publications, tenure decisions, corporate R&D budgets, grant funding — reward work internal to the substrate. Interpretability researchers build tools on ML architectures. Evaluation researchers publish benchmarks run on ML systems. Alignment researchers propose training methods evaluated by other ML systems. Work that would satisfy the three externality conditions — work categorically external to the ML substrate — would be difficult to place in ML venues, would fit awkwardly within current definitions of “alignment research,” and would not readily advance an ML researcher’s career.

This institutional selection is the visible layer of a deeper process: the community of researchers who self-identify as working on AI safety is constituted by prior work on ML systems. The set of questions recognized as safety questions is the set visible from within that identity. Work that would satisfy the externality conditions does not only strain the promotion criteria; it strains the identity criteria. It may not be recognized as the work of an AI safety researcher, even by its author.

7.2 Structural-Informational: The Signal-to-Noise Floor of Self-Referential Evaluation

The institutional structure rests on a structural-informational fact, and stating this fact in technical rather than philosophical terms clarifies the argument. When an evaluation system shares substrate with the system being evaluated, the evaluation signal and the noise inherent to that substrate are not separable. The auditor’s outputs and the auditee’s outputs are correlated through their common ancestry, which means errors in the auditor’s judgments are not independent of errors in the auditee’s behavior. They are drawn from the same underlying error distribution.

This is a signal-to-noise floor on substrate-internal evaluation. The effective SNR of an audit system is bounded above by the degree of substrate independence it has from its target. As substrate sharing approaches one — as auditor and auditee become drawn from the same pre-training corpora, evaluated by the same methods, judged by the same community — the effective SNR approaches zero. Engineering refinement of the auditor that operates entirely within the shared substrate has difficulty raising the SNR floor, because the refinement is itself drawn from the substrate that defines the floor. This is structurally analogous to the Shannon limit on channel capacity (cf. Shannon, 1948): no refinement of the encoder can push the achievable rate above the capacity the channel itself provides, and no amount of internal encoding sophistication breaks the bound.
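Stated in the analogy's own terms: Shannon's capacity bound

$$C \;=\; B \log_2\!\left(1 + \frac{S}{N}\right)$$

caps what any refinement of the encoder can extract from a fixed channel. The corresponding claim here, written only qualitatively (the symbol $\sigma$ for the degree of substrate sharing is notation introduced for this restatement, not a quantity the note derives), is

$$\mathrm{SNR}_{\mathrm{eff}}(\mathrm{audit}) \;\le\; f(1 - \sigma), \qquad f \text{ increasing},\; f(0) = 0$$

so that as $\sigma \to 1$ the effective SNR of substrate-internal evaluation approaches zero, regardless of how sophisticated the auditor itself becomes.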

The “safety as technical problem” framing operates within this SNR regime. The framing’s claim is that better technical work — better interpretability tools, better evaluation benchmarks, better training methods — will improve safety. This is true within the substrate-shared SNR regime, up to the floor that regime imposes. It is false above the floor, because raising the floor requires work that is by definition not within the substrate. This is not a claim that the framing is philosophically wrong; it is an observation about the operating regime in which the framing produces gains. Above the SNR floor, further gains from technical sophistication alone become difficult to obtain, because the sophistication is itself drawn from the substrate that defines the floor.

The framing also functions, at the meta-level, as a category assignment with the same structure described in section 2 at the object level. It identifies “safety problems” by sorting them into the category “things technical work addresses,” and conceals — through that identification — the problems whose resolution requires substrate-external work. The framing can be read, structurally, as exhibiting features of the failure mode this note describes. This is not offered as a hypocrisy charge; it is a structural observation about why the failure mode is difficult to identify from inside the frame.

7.3 Historical: The Asymmetry of Recognition in the 2008 Case

The 2008 financial case is informative not only as a precedent for the failure mechanism (section 3) but also as a precedent for the structure of recognition. Category-level failure in credit markets was recognized — when it was recognized — by observers external to the practitioner frame: Basel Committee work, academic economists outside the rating and structuring community, heterodox analysts (Michael Burry and others who sought to short the structure before its collapse), and journalists working from outside the financial industry's self-description. Practitioners inside the frame continued to treat credit risk as a modeling problem until external signals — mortgage defaults, then crisis — became too large to dismiss.

The asymmetry is the point. The frame did not produce its own recognition. What produced recognition was the accumulation of external signals (defaults, foreclosures, ratings downgrades initiated by external pressure) over a period that extended across years before the crisis. The signals were available before the crisis; what was absent was the in-frame capacity to interpret them as evidence of category-level failure rather than as outliers within a still-functioning model. The lag between external signal availability and in-frame recognition was, in the 2008 case, years; recognition arrived only when the external signals scaled to a magnitude the frame could no longer absorb.

The same asymmetry is structurally available now. Recognition of category-level opacity in ML evaluation, if it comes, is likely to come from outside the frame the “safety as technical problem” framing constructs — from regulatory action, legal proceedings, public harm events, perhaps from observers in adjacent disciplines (epistemology, STS, financial systemic-risk research) — and potentially only after failures produce external signals that are difficult to resolve within the frame. The timing of such recognition is conditional on the timing of the external signals, which the frame neither produces nor predicts. The lag may again be years; the question of how much category-level damage accumulates during the lag is empirical and unresolved.

The paths identified at NeurIPS 2024 — synthetic data, agentic AI — are within-substrate work. They respond to capabilities problems, not to category-level failure, because category-level failure is not readily legible as a problem within the frame. This is not an accusation of individual blindness. It is a structural description of why a problem, if it is real, is unlikely to be solved by work that continues entirely within the frame that has difficulty seeing it. The observation applies to the author of this note as well: the note is written from outside the frame, which is the position from which the shape of the blind spot is visible — and also a position that cannot specify what would replace the frame.

8 Limitations

The analysis has several limitations that bear explicit statement.

The three externality conditions are offered as a diagnostic partition, not a formally independent axiom set. Their treatment as the structural inverse of the failure conditions is informative, not formal — it shows how they map to the failure conditions, not that they are the unique or minimal inverse. A more rigorous treatment would attempt to prove mutual non-reducibility or to identify cases where satisfying two entails the third. The current presentation takes them as heuristically useful because they locate the failure points of current approaches and illuminate the trade-off structure among them.

The resolution at which provenance independence is required remains unspecified. The claim that substrate sharing undermines audit does not specify how different the substrates must be, or along what dimensions. Answering this question requires empirical investigation of how substrate dependencies propagate through downstream judgment, together with theoretical work on how to measure distributional ancestry of learned systems.

The claim in section 5.3 that no current non-ML automation track achieves the required combination of speed and expressivity is empirical. It is in principle possible that future formal verification methods, neuro-symbolic architectures, or other hybrid approaches will close the gap. The argument here is about the present state of computational methods, not a metaphysical impossibility. If a non-ML automation track does achieve the required combination, condition 5.3 becomes satisfiable without the substrate-sharing tension, and the argument of section 5.4 weakens accordingly.

The historical precedent establishes that category-level failure can occur under the three conditions and that the conditions can assemble over a period of years before the category fails. It does not establish that the AI case must follow the same trajectory. Correspondence is structural; outcome is not entailed. What the precedent supports is examining whether the conditions are assembling, which is what sections 4 and 6 attempt. If the assembly is not in fact occurring — if, for example, the shared-lineage analysis in section 4 overstates the degree of substrate convergence — the analogy loses force proportionally.

The note does not address whether any complete external substrate is in principle available for the AI case. Section 9 sketches several candidate partial channels — physical-feedback loops, hardware attestation, cryptographically certified human-origin data — but does not defend any of these as a complete solution. Whether a complete external substrate can be constructed, and what it would be made of, is a question section 9 leaves open.

The claim in section 7.2 that the “safety as a technical problem” framing operates within an SNR regime whose floor cannot be raised by within-substrate refinement rests on an analogy to channel capacity. The analogy is informative but not formal — the SNR floor in self-referential evaluation has not been derived as a precise quantity in this note. A formal treatment would require an information-theoretic model of substrate-shared evaluation that goes beyond the present argument.

9 Open Questions

Several questions emerge as priorities for further work.

Metric Orthogonality. What measurements distinguish distribution-level from category-level failure empirically? The claim that the two are orthogonal should be operationalizable. A measurement protocol that detected category-level opacity under preserved distributional diversity would strengthen the analysis; failure to find such a protocol would weaken it. Candidate directions include metrics of provenance ancestry in training corpora, measures of audit-subject substrate overlap, and tests of category stability under self-referential dynamics.

Substrate Ancestry. At what resolution does substrate sharing undermine audit independence? The current literature on homogenization treats shared pre-training as a source of correlated failure but does not quantify the degree of substrate divergence sufficient to restore independence. A framework for measuring substrate ancestry — perhaps via training-data overlap analysis, gradient-similarity metrics, or downstream judgment correlation — and for identifying the threshold at which independence is restored, would move the analysis from diagnostic to actionable.
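One crude starting point, offered as a placeholder rather than a validated measure, is lexical overlap between corpora via hashed shingles; real ancestry metrics would need to reach representational and judgment-level overlap as well.

```python
import hashlib

def corpus_overlap(corpus_a, corpus_b, shingle_len=8):
    """Jaccard similarity over hashed word shingles of two corpora.
    A coarse proxy for shared training-data ancestry: it sees verbatim
    and near-verbatim overlap only, not shared representational lineage."""
    def shingles(docs):
        out = set()
        for doc in docs:
            tokens = doc.split()
            for i in range(max(len(tokens) - shingle_len + 1, 1)):
                window = " ".join(tokens[i:i + shingle_len])
                out.add(hashlib.sha1(window.encode()).hexdigest()[:16])
        return out
    sa, sb = shingles(corpus_a), shingles(corpus_b)
    return len(sa & sb) / max(len(sa | sb), 1)

# Usage: corpus_overlap(docs_from_pipeline_a, docs_from_pipeline_b) -> value in [0, 1]
```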

The Macroprudential Analog: Partial External Channels. The note argues that the required resolution is categorically external to the ML substrate, analogous to post-2008 financial regulation. It does not specify a complete external instrument, but several partial channels can be sketched concretely. Each addresses some externality conditions while leaving others open; none constitute a complete substitute for a missing external substrate.

The first is physical-feedback loops: circuits in which model outputs are validated against real-world physical measurements that are not themselves produced by ML systems. Medical diagnostic outputs validated against clinical outcome data; autonomous driving judgments validated against measured collision and intervention rates; economic forecasts validated against actual transaction volumes. The physical world is distributionally independent of the ML substrate in a way that satisfies condition 5.1. The cost is speed: physical feedback operates on timescales (clinical follow-up, traffic incidents, market settlement) that are domain-dependent and typically slower than ML-internal automated audit. Where the timescales align — high-frequency physical measurement available at low latency — physical feedback can serve as a substrate-external audit channel.

The second is hardware-level attestation: TPMs, secure enclaves, physical unclonable functions, and similar mechanisms that provide computational integrity guarantees grounded in physical hardware properties rather than in software. zkML provides cryptographic externality of record but leaves judgment within the audited substrate; hardware attestation can extend this by providing externality of computational origin — guarantees that a given output was produced by a specifically identified hardware-software stack. This addresses provenance at the artifact level, though judgment externality remains unaddressed.

The third is cryptographically certified human authorship: biometric, proof-of-personhood, or hardware-signed verification that a given piece of content originated from a human author rather than from an ML system. This is the most direct candidate for satisfying condition 5.1 in the data domain, because it provides a verifiable boundary between human-origin and model-origin training material. The cost is implementation: building such infrastructure at the scale of internet-scale training corpora is a substantial engineering and institutional project, and the cryptographic and biometric foundations have privacy and adoption trade-offs that are not yet resolved.
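The cryptographic primitive itself is the easy part. The sketch below shows only the sign-and-verify step, using an Ed25519 keypair as one possible choice; key issuance, identity binding, and revocation (the institutionally hard parts) are out of scope.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Sketch of the certification step only: a registered human author signs a
# document, and a curation pipeline verifies the signature before admitting
# the document as human-origin. How keys are issued, bound to persons, and
# revoked is the unsolved institutional part.
author_key = Ed25519PrivateKey.generate()
author_pub = author_key.public_key()

document = b"text asserted to originate from a human author"
signature = author_key.sign(document)

try:
    author_pub.verify(signature, document)
    print("admit as certified human-origin")
except InvalidSignature:
    print("reject: no valid human-origin certificate")
```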

These three channels are partial externality channels, not complete substitutes. They suggest that the required externality is physically available in principle, even where it is not yet built. The deeper question — how to combine partial channels into a system that satisfies all three externality conditions — remains open.

Frame Adaptability. Where does this sub-problem sit within the current frontier-safety frame? The argument in section 7 suggests that the “safety as a technical problem” framing has structural difficulty recognizing category-level opacity as a distinct problem. The claim is structural rather than accusatory. Whether the frame can be modified to accommodate the sub-problem, or whether recognition requires a different frame altogether, is a question this note cannot answer from outside. It sits inside the work of the labs and institutions operating within that frame. If the note is useful at all, it is useful by making that question legible to whoever is in a position to answer it.

The note closes without offering a constructive alternative. The observation that current approaches are structurally blocked does not, on its own, indicate what an unblocked approach would look like. The shape of the required externality has been specified; partial channels have been sketched; a complete construction has not. What satisfies the shape is what the note leaves open.

References

Anderljung, M., Barnhart, J., Korinek, A., Leung, J., O’Keefe, C., Whittlestone, J., Avin, S., Brundage, M., Bullock, J., Cass-Beggs, D., Chang, B., Collins, T., Fist, T., Hadfield, G., Hayes, A., Ho, L., Hooker, S., Horvitz, E., Kolt, N., … Wolf, K. (2023). Frontier AI Regulation: Managing Emerging Risks to Public Safety. arXiv:2307.03718.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2021). On the Opportunities and Risks of Foundation Models. Stanford Center for Research on Foundation Models. arXiv:2108.07258.

Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., … Hadfield-Menell, D. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217.

Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2021). Unsolved Problems in ML Safety. arXiv:2109.13916.

MacKenzie, D., & Spears, T. (2014). “The formula that killed Wall Street”: The Gaussian copula and modelling practices in investment banking. Social Studies of Science, 44(3), 393–417.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.

Perdomo, J. C., Zrnic, T., Mendler-Dünner, C., & Hardt, M. (2020). Performative Prediction. In Proceedings of the 37th International Conference on Machine Learning (ICML), 7599–7609.

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.

Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Zhuang, Z., Lin, Z., … Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.