
With this project, we explored whether LLM agents can compress what they know into a single paragraph and pass it down across generations.
An LLM agent exploring a partially observable world faces a tradeoff. Its context window is finite, the world is noisy, and most of what it sees is designed to mislead. Under a strict 750-token memory budget, the agent must decide what to keep and what to forget. It's a form of bounded rationality in the sense of Simon (1955). The interesting question is what happens when that compression becomes productive, i.e., when an agent distills its experience into a paragraph and hands it to a successor who has never seen the environment before. We built a system where LLM agents navigate a POMDP over random geometric graphs, reproduce by compressing their knowledge into natural-language priors (max 150 words), and pass those priors to their children. The prior is the genome.
I. The Environment
The task is a POMDP $(\mathcal{S}, \mathcal{A}, T, R, \Omega, O)$ over a random geometric graph $G = (V, E)$. We scatter $n$ nodes uniformly in $[0,1]^2$ and connect any pair within radius $r$:

$$E = \{\, (u, v) : \lVert x_u - x_v \rVert_2 \le r \,\}.$$
A subset $D \subset V$ of nodes are designated as doors, themed with colors and shapes (red arched door, blue narrow door, ...) and chosen to have near-average degree so they are neither trivial nor dead ends. One door is the goal $g \in D$, placed at least $h_{\min}$ hops from the agent's start, verified via BFS.
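A minimal sketch of the environment build under these definitions (the function name and defaults like `make_world`, `n_doors`, and `min_goal_hops` are illustrative, not the project's API):

```python
import numpy as np
from collections import deque

def make_world(n=40, radius=0.25, n_doors=6, min_goal_hops=5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.random((n, 2))                                # nodes uniform in [0,1]^2
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    adj = (dists <= radius) & ~np.eye(n, dtype=bool)        # connect pairs within r
    nbrs = [np.flatnonzero(adj[v]) for v in range(n)]

    def bfs_hops(src):                                      # hop counts from src
        hops = np.full(n, -1)
        hops[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if hops[v] < 0:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        return hops

    # doors: nodes whose degree is closest to the mean (neither hubs nor dead ends)
    deg = adj.sum(axis=1)
    doors = np.argsort(np.abs(deg - deg.mean()))[:n_doors]

    start = int(rng.integers(n))
    hops = bfs_hops(start)
    eligible = [int(d) for d in doors if hops[d] >= min_goal_hops]
    goal = int(rng.choice(eligible)) if eligible else int(doors[np.argmax(hops[doors])])
    return pos, adj, doors, start, goal
```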
Each door emits hints and distractors drawn from five signal types: spatial, color, relational, narrative, and pattern. Hints are internally consistent: they agree on the goal's region and color. Distractors name wrong regions and wrong colors, and sometimes contradict each other. The agent receives signals without labels. The only way to distinguish truth from noise is to notice that agreement implies reliability.
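An illustrative toy generator in that spirit (templates and vocabulary are invented for this sketch; the real system draws on all five signal types):

```python
import random

REGIONS = ["upper-left", "upper-right", "lower-left", "lower-right"]
COLORS = ["red", "blue", "yellow", "green"]

def emit_signals(goal_color, goal_region, n_hints=2, n_distractors=3):
    # hints agree with each other on the goal's attributes
    hints = [
        f"The goal is a {goal_color} door.",
        f"Seek the {goal_color} door in the {goal_region} region.",
    ][:n_hints]
    # distractors name wrong colors/regions and may contradict one another
    distractors = [
        f"The {random.choice([c for c in COLORS if c != goal_color])} door "
        f"in the {random.choice([r for r in REGIONS if r != goal_region])} region "
        "is what you want."
        for _ in range(n_distractors)
    ]
    signals = hints + distractors
    random.shuffle(signals)      # delivered to the agent without labels
    return signals
```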
The agent observes its current node, its 1-hop neighbors, and a random subset of unlabeled signals from nearby doors. Actions are free-text strings parsed against the neighbor list. The reward is sparse:

$$R(s, a) = \begin{cases} R_{\text{goal}} & \text{if the move reaches the goal door}, \\ -c_{\text{step}} & \text{otherwise}. \end{cases}$$
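A step-function sketch under the definitions above (`goal_reward`, `step_cost`, and the substring action parser are assumptions of the sketch):

```python
import random

def step(node, action_text, nbrs, goal, door_signals,
         goal_reward=1.0, step_cost=0.01, k_signals=3):
    # parse the free-text action against the current neighbor list
    target = next((int(v) for v in nbrs[node] if str(v) in action_text), node)
    done = (target == goal)
    reward = goal_reward if done else -step_cost
    # observation: current node, 1-hop neighbors, unlabeled signals from nearby doors
    nearby = [s for d in nbrs[target] for s in door_signals.get(int(d), [])]
    obs = {
        "node": target,
        "neighbors": [int(v) for v in nbrs[target]],
        "signals": random.sample(nearby, k=min(k_signals, len(nearby))),
    }
    return obs, reward, done
```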
Active Cloaking
An optional adversarial layer (inspired by my work with Professor Mucha and by DeGiovanni & Guevara Vasquez (2025)) mathematically suppresses signals near the goal. Given goal node $g$, we define an inner cloaking region $\Omega_1 = \{v : d(v, g) \le r_1\}$ and a boundary annulus $\Omega_2 = \{v : r_1 < d(v, g) \le r_2\}$. The uncloaked signal potential $u$ solves the Dirichlet problem on the graph Laplacian $L$:

$$(L u)(v) = 0 \quad \text{for } v \notin \partial\Omega, \qquad u|_{\partial\Omega} = f,$$
where $\partial\Omega$ is the domain boundary. To cloak the goal, we build a modified Laplacian $\tilde{L}$ using the Dirichlet-to-Neumann (Schur complement) operator, which disconnects $\Omega_1$ from the exterior:

$$\tilde{L} = L_{\bar{\Omega}_1 \bar{\Omega}_1} - L_{\bar{\Omega}_1 \Omega_1}\, L_{\Omega_1 \Omega_1}^{-1}\, L_{\Omega_1 \bar{\Omega}_1}, \qquad \bar{\Omega}_1 = V \setminus \Omega_1.$$
Per-node signal visibility becomes the attenuated ratio

$$\mathrm{vis}(v) = \frac{\tilde{u}(v)}{u(v)},$$

where $\tilde{u}$ is the potential under $\tilde{L}$ (zero inside the disconnected region). At runtime, hints at low-visibility nodes are probabilistically flipped into random distractors. Agents far from the goal are effectively blinded until they penetrate the cloak.
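A sketch of the computation with NumPy and SciPy sparse. For simplicity it emulates the Schur-complement disconnection by cutting the edges that cross into $\Omega_1$ and pinning the goal potential to zero, rather than forming the DtN operator explicitly; function and argument names are illustrative:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def dirichlet_solve(L, bnd_idx, bnd_val):
    """Solve (L u)(v) = 0 on interior nodes with u fixed to bnd_val on bnd_idx."""
    n = L.shape[0]
    interior = np.setdiff1d(np.arange(n), bnd_idx)
    u = np.zeros(n)
    u[bnd_idx] = bnd_val
    L_II = sp.csr_matrix(L[np.ix_(interior, interior)])
    L_IB = L[np.ix_(interior, bnd_idx)]
    u[interior] = spsolve(L_II, -L_IB @ bnd_val)
    return u

def visibility(adj, goal, hops_to_goal, bnd_idx, bnd_val, r1):
    adj = np.asarray(adj, dtype=float)
    L = np.diag(adj.sum(axis=1)) - adj                  # graph Laplacian
    u = dirichlet_solve(L, bnd_idx, bnd_val)            # uncloaked potential
    inner = (hops_to_goal >= 0) & (hops_to_goal <= r1)  # cloaking region Omega_1
    adj_c = adj.copy()                                  # cut edges crossing into Omega_1
    adj_c[np.ix_(inner, ~inner)] = 0
    adj_c[np.ix_(~inner, inner)] = 0
    L_c = np.diag(adj_c.sum(axis=1)) - adj_c
    # pin the goal to zero so the disconnected inner component stays well-posed
    u_c = dirichlet_solve(L_c, np.append(bnd_idx, goal),
                          np.append(bnd_val, 0.0))      # cloaked potential
    return np.clip(np.abs(u_c) / (np.abs(u) + 1e-12), 0.0, 1.0)
```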
II. Agent Architecture
Each agent wraps a two-tier LLM setup: a stronger reasoning model (GPT-4.1-mini) for decisions, prior compression, and convention proposals, and a cheap utility model (Gemini 2.0 Flash) for context summarization, evidence extraction, and question formulation. Every call goes through retry logic with exponential backoff.
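The retry wrapper is roughly this (a generic sketch; the actual helper and its parameters may differ):

```python
import random
import time

def call_with_backoff(fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Retry an LLM call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise                       # give up after the last attempt
            # delays of ~1s, 2s, 4s, ... plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```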
Context Management
Agents carry a rolling buffer of context entries. The token budget is estimated with the standard heuristic of roughly 4 characters per token, $\text{tokens} \approx \text{chars}/4$. When utilization exceeds 80% of the budget, the oldest half is compressed by the utility model into a 2–3 sentence summary emphasizing transferable heuristics. This creates a recency gradient: recent experience is detailed, older experience progressively abstracted.
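In code, the trigger looks roughly like this (a sketch; `summarize` stands in for the utility-model call and the names are illustrative):

```python
def maybe_compress(buffer, summarize, budget_tokens=750):
    """Compress the oldest half of the context when utilization passes 80%."""
    est_tokens = sum(len(entry) for entry in buffer) / 4   # ~4 chars per token
    if est_tokens > 0.8 * budget_tokens:
        half = len(buffer) // 2
        summary = summarize(buffer[:half])   # utility model: 2-3 sentence abstract
        buffer = [summary] + buffer[half:]   # old = abstracted, recent = verbatim
    return buffer
```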
Bayesian Belief Tracking
When this is enabled, the agent maintains a categorical posterior $b_t(d)$ over door identities $d \in D$. At each step, the utility model extracts evidence tuples (door, likelihood) from observed signals. The multiplicative update rule is

$$b_{t+1}(d) \propto b_t(d)\, \ell_t(d),$$

followed by renormalization so that $\sum_{d \in D} b_{t+1}(d) = 1$. Entropy, MAP door, and belief on the true goal are tracked as per-step trajectories and injected into the agent's system prompt.
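The update in code (a sketch; the evidence-extraction step that produces the per-door likelihoods is an LLM call not shown here):

```python
import numpy as np

def update_belief(belief, likelihoods):
    """One multiplicative Bayes step over door identities.

    belief, likelihoods: arrays indexed by door id; likelihoods come from
    the utility model's evidence extraction.
    """
    post = belief * likelihoods                      # multiplicative update
    post = post / post.sum()                         # renormalize to sum to 1
    entropy = float(-(post * np.log(post + 1e-12)).sum())
    map_door = int(post.argmax())
    return post, entropy, map_door
```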
III. Reproduction & Fertility
At reproduction, the reasoning model compresses the agent's last 12 context entries into a prior of at most 150 words covering signal reliability, navigation strategy, and doors to avoid. The child inherits this prior via system prompt and spawns at the same start node with an empty buffer, so performance differences come from inherited knowledge alone.
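Schematically (the prompt wording and the dict-based agent representation are illustrative):

```python
PRIOR_PROMPT = (
    "Compress the experience log below into at most 150 words covering: "
    "(1) which signals proved reliable, (2) navigation strategy, "
    "(3) doors to avoid.\n\n{log}"
)

def reproduce(agent, reasoning_model):
    log = "\n".join(agent["buffer"][-12:])          # last 12 context entries
    prior = reasoning_model(PRIOR_PROMPT.format(log=log))
    # child: same start node, empty buffer, prior injected via system prompt
    return {"start": agent["start"], "prior": prior, "buffer": []}
```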
Three reproduction triggers are available: periodic (every $k$ interactions), on-success (upon finding the goal), and novelty-based (when the agent's own experience has changed enough to be worth distilling). The novelty trigger computes the Jaccard distance between the older and recent halves of the context buffer:

$$d_J = 1 - \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|},$$

where $W_1$ and $W_2$ are the word sets from the first and second halves of the context. Reproduction fires when $d_J > \theta$ (default $\theta = 0.7$). The intuition: high novelty means the agent has accumulated genuinely new information worth passing on.
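Directly in code, following the formula above:

```python
def should_reproduce(buffer, theta=0.7):
    """Fire when the Jaccard distance between buffer halves exceeds theta."""
    half = len(buffer) // 2
    w1 = set(" ".join(buffer[:half]).split())   # word set, older half
    w2 = set(" ".join(buffer[half:]).split())   # word set, recent half
    d_j = 1.0 - len(w1 & w2) / max(len(w1 | w2), 1)
    return d_j > theta
```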
We define fertility as the mean number of reproductive events per agent under a given strategy. The central question of Experiment I is whether content-aware triggers (novelty) outperform fixed schedules (reproduce every $k$ interactions), especially when transferring to unseen environments.
IV. Results
Main Evaluation (250 trials)
The grandchild phenomenon. On the hardest graph instance (shortest path = 9 hops), the no-prior baseline failed to find the goal within the 300-step budget. The oracle with full state information took 98 steps, cycling between spatially close but topologically distant nodes. A third-generation agent solved it in 5 steps because its inherited prior encoded experiential route knowledge ("Nodes 6, 20, 26, 32 form a reliable left-side corridor") that shortcut the topology in ways raw coordinates couldn't. The result replicated across 5 additional high-difficulty seeds.
| Condition | Success | Median Steps | Mean Steps |
|---|---|---|---|
| Prior inheritance | 93% | 12.0 | 31.9 |
| Oracle (full state) | 97% | 8.0 | 16.2 |
| No prior (blank) | 78% | 34.5 | 45.1 |
| Random prior | 84% | 28.0 | 41.3 |
| Random walk | 81% | 71.0 | 81.6 |
Fertility Ablation
Fixed-interval strategies overfit: every-30 achieves the lowest step count on the training graph (18.7) but degrades by 180% on the harder graph, while novelty θ=0.7 degrades by only 37%. The novelty-triggered prior is environment-robust because the trigger fires on information content rather than on a clock.
| Condition | Steps (train) | Births | Steps (hard) | Δ% |
|---|---|---|---|---|
| every 3 | 43.0 | 11.3 | — | — |
| every 7 | 22.7 | 5.0 | — | — |
| every 15 | 21.3 | 4.0 | 48.7 | +129% |
| every 30 | 18.7 | 4.3 | 52.3 | +180% |
| success only | 24.0 | 4.0 | — | — |
| novelty θ=0.7 | 28.0 | 4.0 | 38.4 | +37% |
Emergent Conventions
Agents spontaneously developed stable naming conventions across generations. In one lineage, gen-0 wrote full sentences, gen-1 compressed to imperative rules, gen-2 produced terse shorthand ("Red arched door, lower-left. Trust red, ignore yellow."). Linguistic drift increased with environment complexity as parent-child Jaccard similarity dropped from 0.338 in small graphs to 0.246 in large. This is consistent with cultural transmission theory (Henrich 2015). In one trial, a gen-2 child autonomously overrode its parent's incorrect prior ("URGENT: inherited target is WRONG. DISCARD."), showing iterated learning dynamics where compression bottlenecks naturally filter inaccurate information (Kirby et al. 2008).
V. Experiment Suite
| Exp | Question | Conditions | Trials |
|---|---|---|---|
| A | Do priors help at all? | inherited vs. no-prior | 250 |
| C | Do agents invent stable shorthand? | small / medium / large graphs | 250 |
| E | Does a shared skill library beat individual inheritance? | no-lib, prior-only, prior+library | 250 |
| I | What is the optimal reproduction frequency? | fixed intervals, success-only, novelty thresholds | 250 |
| H | Can agents beat mathematically-hidden goals? | cloaked / uncloaked / cross-transfer | future |
VI. Stack
| Component | Tool |
|---|---|
| LLM access | langchain-dartmouth |
| Reasoning model | GPT-4.1-mini |
| Utility model | Gemini 2.0 Flash |
| Graph math | NumPy, SciPy sparse |
| Cloaking | graph Laplacian, Schur complement |
| Dependency management | uv |
| Output | JSON + text transcripts |