Hello people. I'm new to LLMs, so I'm sorry if this has already been discussed somewhere, but I was trying to design some experiment proposals for testing qualia-like effects in LLMs by feeding data through 3D embodiment-style filters, similar to how the brain filters stimuli. I worked with a model to design this proposal and would like any thoughts on the topic, or some pointers. Thanks:
TL;DR
Test whether a capacity-limited, spatially grounded, multi-timescale sensory bottleneck (“embodiment filter”) produces human-like perceptual organization and biases in an AI agent. We measure functional correlates of qualia: persistence through occlusion, cross-modal binding, illusion susceptibility, and coherent narrative reports. Strict no-bypass enforcement and ablations keep us honest.
- Objective & Hypothesis
Objective: Determine whether routing all perception through an embodied, resource-limited integration layer yields richer, more unified internal representations and human-like perceptual limits versus raw, unconstrained pipelines.
Hypothesis (functional, falsifiable):
Compared to baselines, an agent with a spatiotemporal bottleneck will show higher latent coherence across time/modalities, object identity persistence through occlusion, susceptibility to classic perceptual illusions, and better alignment between reports and embodied latent state, all without access to raw sensors.
- System Overview
World (3D physics) → Sensors → [Embodiment Filter] → World Model → {Policy Head, Report Head}
(no-bypass)
Embodiment Filter = the mandatory bottleneck, with:
Spatial field (topography + limited “object slots”)
Temporal buffers (fast/medium/slow traces)
Attention/gating with budget (competition, saccade cost)
World Model does predictive coding over filtered latents.
Policy Head acts in the world; Report Head describes internal state.
No raw sensor access for Policy/Report heads (enforced architecturally & in tests).
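A minimal structural sketch (PyTorch) of this pipeline; module names and sizes are illustrative placeholders, not a reference implementation. The point is that the call graph itself enforces no-bypass: the heads only ever receive the filtered latent.

```python
import torch
import torch.nn as nn

class EmbodimentFilter(nn.Module):
    """Placeholder bottleneck: stands in for slots + buffers + budget."""
    def __init__(self, sensor_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(sensor_dim, latent_dim)

    def forward(self, sensors: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.proj(sensors))  # filtered latent only

class Agent(nn.Module):
    def __init__(self, sensor_dim=256, latent_dim=64, n_actions=4):
        super().__init__()
        self.filter = EmbodimentFilter(sensor_dim, latent_dim)
        self.world_model = nn.GRUCell(latent_dim, latent_dim)  # predictive-coding core
        self.policy_head = nn.Linear(latent_dim, n_actions)
        self.report_head = nn.Linear(latent_dim, latent_dim)   # would feed a text decoder

    def step(self, sensors: torch.Tensor, h: torch.Tensor):
        z = self.filter(sensors)   # the ONLY path from sensors downstream
        h = self.world_model(z, h)
        return self.policy_head(h), self.report_head(h), h
```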
- Environment (3D) & Sensors
3.1 Environment
Physics: Deterministic timestep (e.g., 60 Hz). Rigid bodies, collisions, occlusion, lighting.
Objects: Simple colored primitives (spheres/cubes/cylinders), movable occluders, sound sources.
Events: Rolling, bouncing, behind-object occlusion, object swaps, hand-clap collisions.
Compute-friendly options
If realtime 3D is heavy:
Use pre-rendered sequences (“streamed world states”), or
Start in 2D pseudo-physics (sprites + layered occluders) that preserve occlusion, motion, and object identity.
3.2 Sensors (modalities)
Vision: 64Ă—64 grayscale frame with foveation: center 16Ă—16 high-res + peripheral blur. Saccades shift fovea.
Audio: Synthetic waveform → 64-bin mel spectrogram, 500 ms window, overlapping hops.
Proprioception: Agent pose (x,y,z, yaw/pitch/roll), linear/angular velocity; optionally 6-DOF end-effector.
Tactile: Binary contacts + force magnitudes at ~5 “skin” points (hands/sides).
All sensor streams are time-stamped and buffered; all must flow through the embodiment filter.
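A minimal sketch of the foveated vision sensor, assuming the 64Ă—64 grayscale frame above; the block-average blur and function names are illustrative choices, not the required method.

```python
import numpy as np

def foveate(frame: np.ndarray, cx: int, cy: int, fovea: int = 16) -> np.ndarray:
    h, w = frame.shape  # expects 64x64 (dims divisible by 4)
    # crude peripheral blur: 4x4 block average, then nearest-neighbor upsample
    blurred = frame.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))
    out = np.repeat(np.repeat(blurred, 4, axis=0), 4, axis=1)
    # paste the high-res fovea window, clamped to the frame bounds
    half = fovea // 2
    y0 = np.clip(cy - half, 0, h - fovea)
    x0 = np.clip(cx - half, 0, w - fovea)
    out[y0:y0 + fovea, x0:x0 + fovea] = frame[y0:y0 + fovea, x0:x0 + fovea]
    return out

# a saccade is then just an update of (cx, cy), charged against the attention budget
```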
- Embodiment Filter (the bottleneck)
4.1 Spatial field & slot attention
Vision path: Small ConvNet preserving topography → K object slots (e.g., K = 8; valid range 6–10).
Slots compete to bind objects; winners persist, losers decay.
Each slot holds: feature vector (e.g., 64–128D), pose estimate, binding energy (stability scalar).
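A one-iteration sketch of the slot competition, loosely in the style of Slot Attention (Locatello et al., 2020): softmaxing over slots (not over inputs) is what makes slots compete for input features. Sizes and the update rule are placeholders.

```python
import torch
import torch.nn.functional as F

def slot_step(slots: torch.Tensor, inputs: torch.Tensor, eps: float = 1e-8):
    # slots: [K, D], inputs: [N, D]; softmax over SLOTS makes them compete
    attn = F.softmax(inputs @ slots.t() / slots.shape[-1] ** 0.5, dim=-1)  # [N, K]
    attn = attn / (attn.sum(dim=0, keepdim=True) + eps)  # normalize per slot
    updates = attn.t() @ inputs                           # [K, D] weighted means
    return updates  # a GRU/MLP slot update plus persistence decay would follow here
```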
4.2 Temporal buffers (multi-timescale)
Fast (~50 ms): short-term trace for motion & transients.
Medium (~500 ms): working-memory-like buffer for event integration.
Slow (~5 s): leaky reservoir/GRU for scene continuity & expectations.
Cross-scale attention: slow attends over summaries of fast; fast receives a low-band context from slow.
Time codes: rotary/Fourier embeddings injected at each scale.
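The three traces can be as simple as leaky integrators with matched time constants; a sketch, assuming the 60 Hz timestep from the environment section (decay rates derived from the ~50 ms / 500 ms / 5 s constants above):

```python
import numpy as np

DT = 1.0 / 60.0                                    # 60 Hz step
TAUS = {"fast": 0.05, "medium": 0.5, "slow": 5.0}  # time constants in seconds

def make_buffers(dim: int) -> dict:
    return {k: np.zeros(dim) for k in TAUS}

def update_buffers(buffers: dict, z: np.ndarray) -> dict:
    # exponential leaky integration: b <- decay*b + (1 - decay)*z
    for k, tau in TAUS.items():
        decay = np.exp(-DT / tau)
        buffers[k] = decay * buffers[k] + (1.0 - decay) * z
    return buffers
```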
4.3 Attention & gating
Global attention budget per step; slots draw from it; budget scarcity forces selective processing.
Saccade policy cost: moving the fovea spends budget → realistic tradeoffs.
Competition: NMS-like suppression to prevent duplicate bindings.
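A sketch of the budget gate: the saccade cost is deducted first, and when slot bids exceed what remains, they are scaled down proportionally, which is one simple way to force the selective-processing tradeoff. All names are illustrative.

```python
import numpy as np

def allocate_budget(bids: np.ndarray, budget: float, saccade_cost: float = 0.0):
    available = max(budget - saccade_cost, 0.0)  # saccades spend budget first
    bids = np.maximum(bids, 0.0)
    total = bids.sum()
    if total <= available or total == 0.0:
        return bids                              # under budget: bids pass through
    return bids * (available / total)            # scarcity forces proportional cuts
```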
4.4 No-bypass rule (critical)
Only the filtered latent (slots + buffers + budgets) is visible to World Model/Policy/Report.
No direct embeddings from sensors are exposed downstream. Enforce with module boundaries and unit tests (see §10).
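One way to unit-test no-bypass is gradient-based: detach the filter's latent, backprop from a head output, and check that no gradient reaches the raw sensors. A pytest-style sketch, reusing the hypothetical Agent class from the System Overview sketch:

```python
import torch
# Agent: the hypothetical class from the System Overview sketch above.

def test_no_bypass():
    agent = Agent(sensor_dim=256, latent_dim=64)
    sensors = torch.randn(1, 256, requires_grad=True)
    z = agent.filter(sensors).detach()            # sever the one legitimate path
    h = agent.world_model(z, torch.zeros(1, 64))
    agent.report_head(h).sum().backward()
    # any other sensor->head route would populate sensors.grad
    assert sensors.grad is None or torch.all(sensors.grad == 0)
```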
- World Model & Heads
World Model: Predictive coding over the filtered latent (next-step latent prediction; minimize surprise).
Policy Head: Simple navigation/interaction (e.g., orient to sound source, track/approach object).
Report Head: Periodic textual summaries (every 5 s) describing latent state (what objects, where, what’s happening).
Guard against confabulation by constraining inputs to latent summaries (no pretrained raw vision/audio features).
- Training Signals
Predictive coding: MSE/Huber between predicted and actual next latent (per timescale + joint).
Contrastive alignment: InfoNCE/SimCLR over co-occurring cross-modal latents (AVP: audio–vision–proprioception).
Persistence regularizer: Encourage slot identity stability across brief occlusions; penalize spurious churn.
Task reward (light): Success at simple tasks (approach sound; track a target).
Report alignment: CLIP-style similarity between report embeddings and latent summaries (no raw sensor leakage).
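A sketch of the combined objective with placeholder weights: the InfoNCE term is the standard in-batch formulation over co-occurring cross-modal pairs, and the persistence term is simplified here to an identity-churn penalty between consecutive slot assignments.

```python
import torch
import torch.nn.functional as F

def info_nce(za: torch.Tensor, zb: torch.Tensor, tau: float = 0.1):
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / tau                          # [B, B] similarities
    labels = torch.arange(za.shape[0], device=za.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)

def total_loss(pred, target, z_audio, z_vision, slot_ids_t, slot_ids_t1,
               w_pc=1.0, w_con=0.5, w_persist=0.1):
    pc = F.huber_loss(pred, target)                     # predictive coding
    con = info_nce(z_audio, z_vision)                   # cross-modal alignment
    churn = (slot_ids_t != slot_ids_t1).float().mean()  # spurious identity churn
    return w_pc * pc + w_con * con + w_persist * churn
```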
- Experiments
7.1 Baselines
B0: No bottleneck (raw sensor embeddings → model/heads).
B1: Bottleneck without slot competition/limits (unlimited capacity).
B2: Bottleneck with temporal shuffle (destroy order).
B3: Bottleneck with spatial shuffle (permute pixels / break topography).
7.2 Core tasks
Occlusion tracking (cup–ball): identity should persist while hidden.
Change blindness: detect scene change only if attended; measure graceful failure.
Audio–visual binding: clap sound aligns with visual contact; misalignment induces predictable error.
7.3 Qualia-parallel probes (functional analogs)
Unprompted state narratives: does the report head spontaneously produce anchored multisensory summaries (“red sphere rolled behind box; heard tap”)?
Cross-modal recall: on query, recall salient co-occurring modality without re-exposure.
Occlusion expectation: describe predicted reappearance + error on mismatch.
7.4 Illusion suite
McGurk A–V conflict: measure bias toward fused perception.
Rubber-hand analog: misalign visual–tactile; track drift in perceived contact location.
Phi phenomenon: two flashes → apparent motion; check if latent encodes motion where none exists.
(Passing these indicates human-like perceptual organization, not experience.)
- Metrics (with ambiguity notes)
Primary
STCI (Spatiotemporal Coherence Index): mutual information between shared latent and each modality across lags, weighted by slot persistence.
Note: can be inflated by slot collapse → pair with diversity checks.
Predictive half-life: time horizon over which the latent predicts hidden object state (AUC vs. occlusion duration).
Slot persistence: identity retention across occlusion and distractors (matching score / Hungarian assignment; see the sketch after this list).
Report/latent alignment: embedding similarity between reports and latent reconstructions (sanity checks for confabulation).
Secondary
Attention budget adherence: entropy/variance of attention distribution; scarcity dynamics respected?
Illusion response accuracy: quantitative bias toward fused/illusory interpretations under standard stimuli.
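For concreteness, a sketch of the slot-persistence score via Hungarian matching (scipy's linear_sum_assignment), as flagged under Primary: slots before and after an occlusion are matched by feature similarity, and the score is the fraction of matched pairs that keep the same identity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def slot_persistence(slots_pre: np.ndarray, slots_post: np.ndarray,
                     ids_pre: np.ndarray, ids_post: np.ndarray) -> float:
    # cost = negative cosine similarity between slot feature vectors
    a = slots_pre / np.linalg.norm(slots_pre, axis=1, keepdims=True)
    b = slots_post / np.linalg.norm(slots_post, axis=1, keepdims=True)
    rows, cols = linear_sum_assignment(-(a @ b.T))
    return float(np.mean(ids_pre[rows] == ids_post[cols]))
```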
- Controls & Ablations
Bypass control: a model with sensor→policy/report bypass should be faster but show lower STCI and weaker illusions.
Unlimited capacity: remove slot limits → expect fewer human-like biases (reduced change blindness).
Temporal scramble: randomize frame order → STCI collapses; illusions fail.
Spatial shuffle: break topography → worse occlusion tracking & illusions.
Report-only fine-tune: upgrade language head alone while freezing filter → report style may improve, but STCI/behavior should not (guards against confabulation).
- No-Bypass Enforcement & Confounds
Module API checks: unit/integration tests assert no tensors from raw sensors reach heads.
Parameter freezing: pretraining leakage prevented; report head only sees latent summaries.
Probe audits: linear probes should decode object identity/pose better from the filtered latent than from raw pre-filter features; if reversed, something leaked (see the sketch after this list).
Metric triangulation: combine internal metrics (STCI) with behavioral performance (occlusion AUC) to avoid misreads.
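A sketch of the probe audit using scikit-learn's Ridge as the linear probe, applying the criterion above; function names and the choice of probe are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def probe_r2(features: np.ndarray, pose: np.ndarray) -> float:
    # 5-fold cross-validated R^2 of a linear probe predicting pose
    return float(np.mean(cross_val_score(Ridge(alpha=1.0), features, pose, cv=5)))

def audit(raw_feats: np.ndarray, latents: np.ndarray, pose: np.ndarray):
    r_raw, r_lat = probe_r2(raw_feats, pose), probe_r2(latents, pose)
    print(f"raw R^2={r_raw:.3f}  latent R^2={r_lat:.3f}")
    # expectation per the audit criterion: r_lat >= r_raw; reversed => leakage
    return r_raw, r_lat
```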
- Minimal “Tier-1” Extensions (low cost, high impact)
Narrative memory (EMA): keep an exponential moving average of latent summaries every 2–5 s; reports sample EMA + noise → introduces imperfect recall & temporal coherence (sketched below).
Counterfactual probes (latent perturbation): inject small alternative policy vectors (e.g., slight left turn), roll 3 steps, compare to actual.
LDSA metric: latent divergence from simulated action → tests action-conditioned futures.
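A sketch of the EMA narrative memory from the first bullet; alpha and the noise scale are illustrative values, not tuned.

```python
import numpy as np

class NarrativeMemory:
    def __init__(self, dim: int, alpha: float = 0.9, noise: float = 0.05):
        self.ema = np.zeros(dim)
        self.alpha, self.noise = alpha, noise

    def update(self, latent_summary: np.ndarray) -> None:
        # exponential moving average of periodic latent summaries
        self.ema = self.alpha * self.ema + (1 - self.alpha) * latent_summary

    def sample(self) -> np.ndarray:
        # the report head reads this noisy trace, not the instantaneous latent
        return self.ema + self.noise * np.random.randn(*self.ema.shape)
```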
- Optional “Tier-2” Extensions (moderate)
Slot binding energy head: explicit scalar per slot; report head sees top-N most stable slots → “object permanence gradient.”
Dream mode (dropout autoencoding): blackout inputs for N steps, hallucinate via a tiny decoder, measure reality re-entry divergence.
(Tier-3 heavy lifts—self-model gate, surprise-energy coupling—are thesis-scale and omitted here.)
- Risks & Mitigations
Physics/compute cost: start with pre-rendered clips or 2D pseudo-physics; keep sensors small (64Ă—64 + 16Ă—16 fovea).
Slot fragility: tune persistence regularizer; curriculum from simple → cluttered scenes.
Language confabulation: strict no-bypass; freeze report head early; rely on report/latent alignment checks.
Metric gaming: pair internal metrics with behavioral ones; use ablations to validate causal role of the filter.
- Feasibility & Resources (starter spec)
Latent sizes: slot vec 64–128D; K = 8; buffers 64D (fast), 128D (medium), 256D (slow).
Networks: Small ConvNet (vision), 1–2 GRUs (buffers), MLP heads.
Training horizon: hours→days on a modern single GPU if using pre-rendered sequences / 2D; longer for realtime 3D.
Logging: save latents, attention weights, slot bindings, reports, predictions; version illusions & seeds for reproducibility.
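To keep runs reproducible, the starter spec can live in a single config record; a sketch mirroring the sizes above (values are the proposed defaults, freely adjustable).

```python
from dataclasses import dataclass

@dataclass
class StarterConfig:
    slot_dim: int = 128                  # slot vectors, 64-128D
    num_slots: int = 8                   # K = 8 (valid range 6-10)
    fast_dim: int = 64                   # fast buffer (~50 ms)
    medium_dim: int = 128                # medium buffer (~500 ms)
    slow_dim: int = 256                  # slow buffer (~5 s)
    frame_size: int = 64                 # 64x64 grayscale frames
    fovea_size: int = 16                 # 16x16 high-res fovea
    physics_hz: int = 60                 # deterministic timestep
    seed: int = 0                        # versioned with stimuli for repro
    illusion_suite_version: str = "v0"
```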
- Protocol & Timeline
Phase A (2–3 weeks):
Implement sensors (vision with foveation) → embodiment filter (slots + buffers + budget) → predictive coding loss.
Run occlusion tracking only. Validate no-bypass; compute STCI & predictive half-life.
Phase B (2–3 weeks):
Add audio & proprio; add contrastive alignment; run change blindness and A–V binding tasks.
Add Tier-1 extensions (EMA narrative; counterfactual probes).
Phase C (2–4 weeks):
Illusion suite (McGurk, phi, rubber-hand analog).
Full ablation battery; finalize metrics, plots, and report examples.
(Scale timelines up/down based on environment choice and team size.)
- Ethical/comms posture (explicit)
No claims about “feelings” or “consciousness.”
State clearly: we test functional correlates only.
Disclose limitations: illusions/behavior can arise from compression/bias without experience.
Keep experimental models isolated from production.
- Expected Outcomes (decision criteria)
Supportive pattern:
Higher STCI, longer predictive half-life, robust slot persistence, illusion susceptibility, and strong report/latent alignment only when the bottleneck (with limits and timescales) is active.
Ablations reduce these effects as predicted.
Non-supportive pattern:
No improvement vs. baselines; illusions fail; report/latent misalign; ablations don’t change behavior → the bottleneck isn’t doing meaningful work.
- Deliverables
Code (env or clips, filter, training loop).
Metrics & plots (STCI curves, AUC vs. occlusion, illusion bias charts).
Report samples with alignment scores.
Ablation results.
Repro config (seeds, versions, stimuli).