What Theory Does Deep Learning Need? A Measurement and Deployment Perspective

A dynamics-and-measurement agenda (responding to a question: what would a consensus mathematical theory of deep learning look like in 100 years?)

If you’ve ever stared at two training runs that “should” have been the same — same codebase, same model size, same optimizer, same dataset — yet one converges cleanly and the other drifts into something brittle, you already know the gap this question points at.

We have lots of partial theories. But what practitioners often call “training magic” still shows up as: phase-like jumps, sudden “unlocking” of generalization, fragile dependence on data mixtures or schedules, and the uncomfortable feeling that we can describe what happened after the fact better than we can predict it beforehand.

I don’t think a future consensus theory will look like Euclidean geometry: a small set of axioms that deduce everything about learning. I suspect it will look more like statistical physics or dynamical systems: a theory of how a constrained stochastic system evolves in time, including when it switches regimes.

And the part I care most about is the least glamorous one: measurement. Not “what story can we tell”, but “what can we measure in a way that survives audit and falsification”.


The shape of a plausible “100-year theory”

Here’s a compact guess for what such a theory would need to treat as first-class objects.

1) The object is not just parameters; it’s a system state

Parameter-only lenses are too narrow. A training run is more like an evolving system state:

x(t) = (theta(t), s(t), D(t), Pi(t), B(t), C(t))

Where:

  • theta(t) is parameters,

  • s(t) is optimizer + training-internal state (moments, noise scale proxies, etc.),

  • D(t) is the data process (mixtures, curricula, filtering, licensing boundaries),

  • Pi(t) is a representation-geometry proxy (effective rank, spectra, sparsity, etc.),

  • B(t) is boundary conditions (training budget, gating cadence, rollout boundary),

  • C(t) is constraint structure (how “what’s reachable” gets shaped over time).

Once you write the object this way, a lot of “mystery” becomes: we were holding the wrong state variables fixed.
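
To make this concrete, here is a minimal sketch of what logging such a state tuple per checkpoint could look like. All field names are illustrative assumptions, not a prescribed schema; the only point is that the logged state is wider than theta.

```python
# Minimal sketch of a per-checkpoint "system state" record (field names are illustrative).
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SystemState:
    step: int                                # training step t
    theta_norm: float                        # cheap proxy for theta(t), e.g. a global parameter norm
    optimizer_state: Dict[str, float]        # s(t): moment norms, gradient-noise-scale proxies, ...
    data_process: Dict[str, float]           # D(t): mixture weights, filter versions, curriculum phase
    repr_geometry: Dict[str, float]          # Pi(t): effective rank, top-k spectral mass, sparsity
    boundary: Dict[str, str] = field(default_factory=dict)     # B(t): budget, gating cadence, rollout scope
    constraints: Dict[str, str] = field(default_factory=dict)  # C(t): what is currently reachable/forbidden

# Appending one SystemState per evaluation checkpoint lets a later analysis ask
# "which state variables were actually held fixed between these two runs?"
```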

2) Training is a resultant force plus structured noise

Instead of pretending training is pure gradient descent on the loss (purely following -grad L), the theory needs room for a resultant field and a noise structure:

dx = f(x; B, C) dt + Sigma(x; B, C) dW_t

Many “hyperparameter tricks” are interventions on the shape of f (resultant forces) or the spectrum of Sigma (noise). That framing is not the whole story, but it’s a cleaner starting point than “SGD magic”.
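
As a toy illustration (not a model of any real training run), here is an Euler-Maruyama integration of an SDE of this form, with a hand-made drift f that is not a pure gradient and an anisotropic Sigma. Every constant is an arbitrary assumption chosen to make the shape of the framing visible.

```python
# Toy Euler-Maruyama integration of dx = f(x) dt + Sigma(x) dW_t.
# The drift mixes a gradient term with a non-gradient rotation; the noise is anisotropic.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    grad = x                                  # gradient of the toy loss 0.5 * ||x||^2
    rotation = np.array([-x[1], x[0]])        # a "resultant" component that is not a gradient
    return -grad + 0.3 * rotation

def sigma(x):
    return np.diag([0.5, 0.05])               # structured noise: most variance on coordinate 0

dt, steps = 0.01, 2000
x = np.array([2.0, -1.0])
traj = [x.copy()]
for _ in range(steps):
    dW = rng.normal(scale=np.sqrt(dt), size=2)
    x = x + f(x) * dt + sigma(x) @ dW
    traj.append(x.copy())

traj = np.array(traj)
print("final state:", traj[-1])
print("per-coordinate std over the last half:", traj[steps // 2:].std(axis=0))
```

In this cartoon, "hyperparameter tricks" correspond to reshaping f (e.g. adding or removing the rotation term) or rescaling the diagonal of Sigma, which is exactly the kind of intervention the framing is meant to name.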

3) Constraints and boundaries define reachability

This is the part I keep coming back to, and I’m not fully confident I have it right yet.

One of the most underrated ideas — because it sounds obvious until you actually do experiments — is that boundaries are not “implementation details”. They are part of the constraint structure.

Architecture constraints, data constraints, compute constraints, and governance constraints together define what regimes are reachable in the first place. Change the boundary, change the reachable endpoints.

This matters both for science (what you can claim from experiments) and for deployment (what outcomes your process makes inevitable).

4) Time is not a background parameter: regimes and tempo matter

Many deep learning phenomena look like regime switching:

  • early: rapid fitting,

  • middle: representation reorganization,

  • late: stabilization / rigidity / attractor-like behavior.

A consensus theory won’t be “one smooth curve explains all”. It will detect regime changes, explain why they happen, and predict how phase boundaries move when constraints change.
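
A minimal, hedged sketch of what "detect regime changes" could mean operationally: compare local loss slopes at a fixed lag and flag large changes. The synthetic curve and all thresholds below are assumptions for illustration only.

```python
# Minimal regime-boundary detector: flag points where the local slope of the loss curve
# changes sharply. Synthetic data and thresholds are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
steps = np.arange(1000)

# Synthetic loss with three regimes: fast fitting, slow reorganization, a late drop.
loss = np.concatenate([
    3.0 - 0.0020 * steps,      # early
    1.0 - 0.0001 * steps,      # middle
    0.9 - 0.0008 * steps,      # late
]) + 0.01 * rng.standard_normal(3000)

def local_slopes(y, window=100):
    """Least-squares slope of y over a sliding window."""
    t = np.arange(window) - (window - 1) / 2.0
    return np.array([np.dot(t, y[i:i + window] - y[i:i + window].mean()) / np.dot(t, t)
                     for i in range(len(y) - window)])

slopes = local_slopes(loss)
lag = 100
delta = np.abs(slopes[lag:] - slopes[:-lag])
flagged = np.where(delta > 10 * np.median(delta))[0] + lag
# Crude clustering for display: round flagged indices to the nearest 1000 steps.
print("flagged regime boundaries cluster around:", sorted(set(np.round(flagged, -3).astype(int))))
```

Real detectors would be fancier (Bayesian change-point methods, spectral statistics on Pi(t)), but even a crude version turns "regime switching" from a story into a logged event.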


The missing layer: measurement, estimators, and audit gates

This is the point where many “big perspectives” fail: they remain descriptive because they don’t specify what it means to measure “force”, “constraint”, or “information” in a way that survives adversarial scrutiny.

A workable scientific posture is to bind statements to an explicit estimator tuple:

E = (S_t, B, {F_hat, C_hat, I_hat}, W)

Where:

  • S_t is what you call “state” in your logs/telemetry,

  • B declares the boundary conditions,

  • {F_hat, C_hat, I_hat} are operational estimators (proxies),

  • W is the measurement window/hyperparameters.

Then add one minimal audit requirement: if you’re making a claim that depends on C_hat, show that multiple reasonable C_hat estimators agree on the structure your claim needs (rank order, threshold alignment, event structure). Otherwise you may be measuring an artifact.
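
Here is a minimal sketch of such a coherence gate, assuming you already have two C_hat proxy series sampled at the same checkpoints; the series names and the 0.8 threshold are illustrative, not prescribed. It checks rank-order agreement only; threshold alignment and event structure would be separate checks.

```python
# Minimal coherence gate: do two constraint estimators agree on rank order across checkpoints?
# Estimator names and the 0.8 threshold are illustrative assumptions, not prescribed values.
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation with numpy only (no tie-handling refinements)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(np.dot(ra, rb) / np.sqrt(np.dot(ra, ra) * np.dot(rb, rb)))

def coherence_gate(c_hat_1, c_hat_2, threshold=0.8):
    """Pass only if the two C_hat proxies agree on the structure (here: rank order) the claim needs."""
    rho = spearman_rho(np.asarray(c_hat_1, dtype=float), np.asarray(c_hat_2, dtype=float))
    return {"rho": rho, "passes": rho >= threshold}

# Made-up checkpoint series:
c_hat_1 = [0.10, 0.12, 0.20, 0.45, 0.80]   # e.g. an effective-rank-based constraint proxy
c_hat_2 = [0.30, 0.28, 0.41, 0.66, 0.90]   # e.g. a spectrum-based constraint proxy
print(coherence_gate(c_hat_1, c_hat_2))
```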

This “measurement layer” is not optional. It’s the difference between a narrative and a falsifiable claim.


Tier-2 validation can’t be relabeling: the A/B/C novelty filter

If you want a real-world (Tier-2) case study to count as validation rather than a demo, it must pass a novelty gate:

  • A: How does the field describe this risk today (without your framework)?

  • B: How does your framework translate that into state/force/constraint/tempo terms?

  • C: What new, quantitative, falsifiable prediction follows that is not already contained in (A)?

If you can’t answer (C), you may still have a useful lens, but you haven’t earned scientific credit yet.
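
If it helps, the gate can be made mechanical by forcing every Tier-2 case study to fill a small record like the sketch below; the field names are my own illustrative choice, and an empty C field means the study is a demo, not validation.

```python
# A/B/C novelty-gate record. Field names are illustrative; the substantive judgment
# ("is C genuinely not derivable from A?") still needs a human reviewer.
from dataclasses import dataclass

@dataclass
class NoveltyGate:
    a_field_description: str      # A: how the field already describes the risk
    b_framework_translation: str  # B: the same risk in state/force/constraint/tempo terms
    c_new_prediction: str         # C: a quantitative, falsifiable prediction not contained in A

    def passes_minimum(self) -> bool:
        # Mechanical minimum only: C must at least exist and be stated.
        return bool(self.c_new_prediction.strip())
```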


Three falsifiable Tier-2 predictions (examples)

These are intentionally “shape-of-curve” predictions: they can be wrong, and being wrong is informative.

  1. A tempo threshold where measurement coherence collapses.
    There exists a critical tau_c such that coherence between reasonable estimators (e.g., rho(C_hat_1, C_hat_2)) drops sharply once the update interval tau_u crosses tau_c. This would act as an early warning that governance and measurement can no longer keep up.

  2. IO accumulation changes shape under sustained tempo mismatch.
    When the governance/evaluation cycle tau_g lags the update cycle tau_u beyond some threshold ratio r, the cumulative count of “irreversible operations” (IOs) shifts from roughly linear growth to accelerating growth (closer to exponential than linear); a crude shape test is sketched after this list. This predicts “false stability”: surface metrics look fine while correction capacity collapses.

  3. Boundary changes decide endpoints.
    Holding the “core algorithm” fixed, changing boundary conditions (data boundaries, rollout boundaries, gating cadence) changes the reachable regime structure and thus likely endpoints. This is testable with controlled comparisons.


A two-week pilot (for people who actually ship systems)

If you’re leading training or deployment, you’re probably thinking: “This sounds reasonable, but what would I do on Monday?”

Here’s a low-friction two-week pilot. It doesn’t require buying into a new framework. The goal is simply to surface whether tempo mismatch and irreversibility are already creeping into your workflow — using a few auditable signals.

Three dashboard metrics (cheap, but surprisingly revealing; a computation sketch follows the list)

  • Validation Lag (VL): time from a change becoming effective to evaluation closure

  • Rollback Drill Pass Rate (RDPR): fraction of rollback rehearsals that succeed within a defined RTO/RPO

  • Gate Bypass Rate (GBR): rate of bypassing required gates for high-impact changes
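
Here is a sketch of how the three metrics could be computed from a flat change/drill log. The field names (effective_at, eval_closed_at, gate_bypassed, passed_within_rto_rpo) are assumptions about what change-management tooling could export, not a standard schema.

```python
# Compute VL, RDPR, GBR from a flat export of high-impact changes and rollback drills.
# All field names are assumed/illustrative; adapt to whatever your tooling actually emits.
from datetime import datetime
from statistics import median

changes = [
    {"effective_at": datetime(2025, 1, 6),  "eval_closed_at": datetime(2025, 1, 9),  "gate_bypassed": False},
    {"effective_at": datetime(2025, 1, 13), "eval_closed_at": datetime(2025, 1, 20), "gate_bypassed": True},
]
rollback_drills = [{"passed_within_rto_rpo": True}, {"passed_within_rto_rpo": False}]

vl_days = median((c["eval_closed_at"] - c["effective_at"]).days for c in changes)       # Validation Lag
rdpr = sum(d["passed_within_rto_rpo"] for d in rollback_drills) / len(rollback_drills)  # Rollback Drill Pass Rate
gbr = sum(c["gate_bypassed"] for c in changes) / len(changes)                            # Gate Bypass Rate

print(f"VL (median days): {vl_days}, RDPR: {rdpr:.2f}, GBR: {gbr:.2f}")
```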

A minimal IO register (classification, not blame)

Treat an “Irreversible Operation” (IO) as any change that materially reduces feasible future correction paths under bounded cost/time. The point of the register is not to slow everything down — it’s to flag the handful of changes that quietly make rollback, oversight, or auditing infeasible later.

  • Data IO (non-reproducible sources, irreversible filtering, major mixture shifts)

  • Evaluation IO (shortening/removing gates, allowing fast paths)

  • Alignment IO (policy/execution changes that reduce auditability or override capacity)

  • Deployment IO (expanding exposure scope, removing staging boundaries)

  • Supply-chain IO (non-auditable components in critical paths)

  • Optionality IO (removing redundancy, single-point dependency lock-in)

Pilot deliverable: label the last 30 to 90 days of high-impact changes and backtest these against incidents, rollbacks, and near misses. The pilot succeeds even if it finds problems — especially if it finds problems.
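
For the register itself, a minimal entry could look like the sketch below. The six categories mirror the list above; everything else (field names, the example entry) is illustrative.

```python
# Minimal IO register entry. Categories mirror the taxonomy above; all other names are illustrative.
from dataclasses import dataclass
from datetime import date
from enum import Enum

class IOCategory(Enum):
    DATA = "data"
    EVALUATION = "evaluation"
    ALIGNMENT = "alignment"
    DEPLOYMENT = "deployment"
    SUPPLY_CHAIN = "supply_chain"
    OPTIONALITY = "optionality"

@dataclass
class IORecord:
    when: date
    category: IOCategory
    description: str
    correction_paths_removed: str   # which rollback/oversight/audit options this change forecloses
    linked_event: str = ""          # filled in during the backtest against incidents and near misses

register = [
    IORecord(date(2025, 1, 15), IOCategory.DATA,
             "dropped raw crawl snapshots after filtering",
             "cannot re-run filtering with different thresholds"),
]
print(register[0].category.value, "-", register[0].description)
```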


Where this connects to FIT

The lens behind the structure above is the FIT framework (Force–Information–Time) plus an explicit estimator selection layer (EST). You don’t need to buy the “universal framework” claim to use the engineering outputs: estimator tuples, coherence gates, novelty filters, and IO-focused tempo governance.

If any of this fails under honest testing, that failure is an asset, not a PR disaster.

The prediction I most want someone to falsify: the tempo threshold (tau_c) for coherence collapse. If you have logs showing that measurement coherence stayed stable even under rapid update cycles, I’d love to see them.


Links

I don’t expect a single post to settle this. So I try to publish the parts that are actually auditable and runnable.

If you spot a place where this framing doesn’t hold up, please say so - comments are fine, and repo issues are even better.
