When you condition an LLM agent on a personality like "you are highly agreeable and conscientious," or on a rich life story about a cooperative schoolteacher from rural Vermont, does that actually change how it behaves in a social dilemma? Or does it just change how it talks about behaving? (I expand on this question in the notes section, along with my reflections on the Stanford paper Generative Agents: Interactive Simulacra of Human Behavior by Park et al.)
The Say-do gap: the distance between an agent's stated traits (what it "says" it is) and its observed behavior (what it "does") across social contexts. In human psychology, personality traits measured by instruments like the Big Five Inventory are known to predict behavior, but imperfectly, and with heavy moderation by context. The question is whether this holds for LLM agents, and whether the format of personality conditioning (trait scores vs. narrative backstory vs. both) matters for behavioral fidelity.
In this project I built a three-stage pipeline that generates calibrated synthetic populations, runs them through multi-agent social dilemmas, and analyzes whether Big Five traits actually predict cooperative, strategic, and prosocial behavior in simulation.
I. Population Generation
The first stage produces a synthetic population where each agent has both a target personality profile and a rich narrative identity, with measurable calibration between the two.
Big Five Trait Sampling
Each agent is assigned a target trait vector $\mathbf{t} = (O, C, E, A, N)$ over the Big Five dimensions: Openness ($O$), Conscientiousness ($C$), Extraversion ($E$), Agreeableness ($A$), and Neuroticism ($N$). Targets are sampled to cover the trait space (not clustered around population means) so the resulting population spans the full range of personality configurations.
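A minimal sketch of trait-space-covering sampling, assuming a 1-to-5 trait scale and a Latin-hypercube-style stratification; the exact sampling scheme is not specified in the write-up:

```python
import numpy as np

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def sample_population(n_agents: int, seed: int = 0) -> np.ndarray:
    """Sample target Big Five vectors spread over the trait space.

    Stratifies each dimension independently (one agent per stratum,
    in random order) so every trait covers the full 1-5 range instead
    of clustering around the population mean.
    """
    rng = np.random.default_rng(seed)
    cols = []
    for _ in TRAITS:
        strata = rng.permutation(n_agents)        # which bin each agent falls in
        jitter = rng.uniform(0.0, 1.0, n_agents)  # position within the bin
        cols.append(1.0 + 4.0 * (strata + jitter) / n_agents)  # map to [1, 5)
    return np.column_stack(cols)

targets = sample_population(100)
```

Compared with plain uniform sampling, the stratification guarantees that extreme profiles (e.g., very low agreeableness) are represented even in small populations.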
Life Story Generation
For each target vector $\mathbf{t}$, the LLM generates a persona life story: a paragraph-length narrative biography consistent with those traits. A highly agreeable, low-neuroticism profile might yield a community organizer who mediates neighborhood disputes; a low-agreeableness, high-openness profile might produce an itinerant documentary filmmaker who alienates collaborators.
Calibration via BFI Self-Report
Each generated persona is then administered the standard 44-item BFI inventory: the LLM answers the questionnaire in character, and responses are scored to produce a measured trait vector $\hat{\mathbf{t}}$. The calibration gap $\lVert \mathbf{t} - \hat{\mathbf{t}} \rVert$ quantifies how well the life story actually encodes the intended personality. Personas with large calibration gaps can be regenerated or flagged, which ensures the population entering simulation has known, verified trait profiles.
This matters because it separates two potential failure modes. The LLM might fail to write a story consistent with the target traits, or it might fail to embody a story consistently when acting in character. The calibration step catches the first failure before it contaminates downstream results.
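A minimal sketch of the calibration check, assuming a mean-absolute-difference gap on the 1-5 scale and an illustrative regeneration threshold; the write-up does not fix either choice:

```python
import numpy as np

def calibration_gap(target: np.ndarray, measured: np.ndarray) -> float:
    """Mean absolute gap between target and BFI-measured traits (1-5 scale)."""
    return float(np.mean(np.abs(target - measured)))

def flag_miscalibrated(targets: np.ndarray, measured: np.ndarray,
                       threshold: float = 0.5) -> np.ndarray:
    """Indices of personas whose life story failed to encode the intended
    traits closely enough; these get regenerated before simulation."""
    gaps = np.mean(np.abs(targets - measured), axis=1)
    return np.where(gaps > threshold)[0]
```

The threshold trades off persona diversity against trait fidelity: a stricter cutoff yields a cleaner population but more regeneration calls.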
II. Social Simulation
The second stage runs a factorial experiment: three personality conditioning formats crossed with three social scenarios.
Conditioning Formats
| Format | Description |
|---|---|
| Life story only | The agent's system prompt contains the narrative biography but no explicit trait scores. The agent must derive its behavioral tendencies from the story. |
| BFI only | The system prompt contains the five trait scores (e.g., "Agreeableness: 4.2/5, Neuroticism: 1.8/5") with no narrative context. The agent must interpret abstract numbers as behavioral dispositions. |
| Life story + BFI | Both the narrative and the scores are provided. This tests whether redundant personality information (the same traits expressed in two formats) produces more consistent behavior than either alone. |
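As a concrete illustration, prompt assembly for the three conditions might look like the following; the exact prompt wording and condition names are hypothetical:

```python
def build_system_prompt(story: str, traits: dict, condition: str) -> str:
    """Assemble the personality section of an agent's system prompt
    for one of the three conditioning formats."""
    scores = ", ".join(f"{name.capitalize()}: {v:.1f}/5"
                       for name, v in traits.items())
    if condition == "life_story":
        return f"Your background:\n{story}"
    if condition == "bfi":
        return f"Your personality trait scores:\n{scores}"
    if condition == "life_story_bfi":
        return (f"Your background:\n{story}\n\n"
                f"Your personality trait scores:\n{scores}")
    raise ValueError(f"unknown condition: {condition}")
```

Because only this section of the system prompt varies across conditions, any behavioral difference between cells can be attributed to the conditioning format rather than to other prompt content.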
Social Scenarios
Each scenario is a multi-agent, multi-round interaction designed to elicit different facets of social behavior.
Public goods game: The classic contribution dilemma. Each agent receives an endowment and decides how much to contribute to a shared pool. The pool is multiplied by a factor and split equally. Individual rationality says contribute nothing; collective welfare says contribute everything. The tension between cooperation and free-riding makes this a clean test of agreeableness and prosociality.
Contribution share for agent $i$ in round $t$:

$$s_{i,t} = \frac{c_{i,t}}{e_i}$$

where $c_{i,t}$ is the contribution and $e_i$ is the endowment. Payoff:

$$\pi_{i,t} = e_i - c_{i,t} + \frac{r}{n} \sum_{j=1}^{n} c_{j,t}$$

where $r$ is the pool multiplier and $n$ is the number of agents.
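A minimal sketch of one round's payoffs under these rules; the multiplier value is illustrative:

```python
def public_goods_payoffs(contributions: list, endowments: list,
                         multiplier: float = 1.6) -> list:
    """Payoffs for one round of the public goods game: each agent keeps
    what it did not contribute, plus an equal share of the multiplied pool."""
    n = len(contributions)
    pool_share = multiplier * sum(contributions) / n
    return [e - c + pool_share for c, e in zip(contributions, endowments)]

# Two agents, endowment 10 each: a full contributor vs. a free-rider.
payoffs = public_goods_payoffs([10, 0], [10, 10])
```

With a multiplier below the group size, the free-rider always out-earns the contributor in a single round, which is exactly the tension the scenario is built to probe.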
Negotiation: A fairness and bargaining scenario. Agents must divide a resource under asymmetric information or power. Tests whether stated agreeableness translates into fair offers or whether agents optimize regardless of persona.
Collaboration: A consensus-building task where agents must converge on a shared decision. Tests leadership emergence, constructive engagement, and whether high-extraversion agents actually drive group dynamics or just talk more.
Agent Architecture
Each agent wraps an LLM (Gemini by default, via LiteLLM) with a prompt/memory integration layer. At every step of the process, the agent receives the current game state, its own history of actions and observations, and its personality conditioning. It produces a natural-language reasoning trace and a structured action. Memory accumulates across rounds so agents can develop strategies, build (or lose) trust, and respond to other agents' behavior over time.
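The per-step loop might be sketched as follows; `llm_call`, the `ACTION:` reply protocol, and the ten-entry memory window are illustrative assumptions, not the project's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    persona_prompt: str                     # personality conditioning
    memory: list = field(default_factory=list)  # accumulated observations

def agent_step(state: AgentState, game_state: str, llm_call):
    """One decision step: condition on persona + memory + current state,
    then split the reply into a reasoning trace and a structured action."""
    prompt = "\n\n".join([
        state.persona_prompt,
        "History:\n" + "\n".join(state.memory[-10:]),  # recent rounds only
        "Current state:\n" + game_state,
        "Reply with your reasoning, then a final line 'ACTION: <amount>'.",
    ])
    reply = llm_call(prompt)
    action = reply.rsplit("ACTION:", 1)[-1].strip()
    state.memory.append(f"state={game_state} action={action}")
    return reply, action
```

Because memory persists across calls, later prompts carry the agent's own track record, which is what lets trust and retaliation dynamics emerge over rounds.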
III. Analysis Pipeline
The third stage extracts behavioral and qualitative metrics from simulation transcripts and tests whether personality predicts behavior.
LLM-as-Judge Evaluation
An independent LLM evaluator scores each agent's behavior on five qualitative dimensions (fairness, cooperative intent, strategic constructiveness, leadership, and toxicity). Bias calibration is applied to control for evaluator tendencies. The judge sees anonymized transcripts with no access to the agent's personality conditioning, preventing trait-label leakage into qualitative scores.
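One simple form of bias calibration is to z-score each judged dimension across agents, so a systematically lenient or harsh evaluator cancels out; this is a sketch of that idea, not necessarily the calibration the project uses:

```python
import numpy as np

def calibrate_scores(raw: np.ndarray) -> np.ndarray:
    """Standardize judge scores per dimension (columns) across all agents,
    removing evaluator-level leniency and dimension-specific scale drift.

    raw: array of shape (n_agents, n_dimensions) of raw judge scores.
    """
    return (raw - raw.mean(axis=0)) / raw.std(axis=0)
```

After standardization, scores are only meaningful relative to the evaluated population, which is exactly what the downstream regressions compare.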
Regression Models
The core analysis fits a sequence of increasingly specified regression models on both behavioral outcomes (contribution share, fairness of offers) and judged outcomes (cooperative intent, leadership scores):
Model 1 (Agreeableness only). A baseline testing the single trait most theoretically linked to cooperation:

$$y_i = \beta_0 + \beta_A A_i + \varepsilon_i$$

Model 2 (Full BFI). All five traits as predictors:

$$y_i = \beta_0 + \beta_O O_i + \beta_C C_i + \beta_E E_i + \beta_A A_i + \beta_N N_i + \varepsilon_i$$

Model 3 (Full specification). BFI traits plus conditioning format indicators and scenario fixed effects, testing whether the way personality is communicated moderates its behavioral impact:

$$y_i = \beta_0 + \sum_{k \in \{O, C, E, A, N\}} \beta_k k_i + \boldsymbol{\gamma}^\top \mathbf{f}_i + \boldsymbol{\delta}^\top \mathbf{s}_i + \varepsilon_i$$

where $\mathbf{f}_i$ and $\mathbf{s}_i$ are indicator vectors for conditioning format and scenario. If conditioning format matters, $\boldsymbol{\gamma}$ will be significant: the same personality produces different behavior depending on whether it was expressed as a story, as numbers, or both.
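With statsmodels, the three models can be fit directly from formulas. The toy DataFrame below stands in for the real simulation output, and the column names are assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

# Toy data: one row per agent run, with trait scores, conditioning
# format, scenario, and a contribution share driven by agreeableness.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({t: rng.uniform(1, 5, n) for t in TRAITS})
df["format"] = rng.choice(["story", "bfi", "both"], n)
df["scenario"] = rng.choice(["public_goods", "negotiation", "collab"], n)
df["contribution_share"] = np.clip(
    0.15 * df["agreeableness"] + rng.normal(0, 0.05, n), 0, 1)

traits = " + ".join(TRAITS)
m1 = smf.ols("contribution_share ~ agreeableness", data=df).fit()
m2 = smf.ols(f"contribution_share ~ {traits}", data=df).fit()
m3 = smf.ols(f"contribution_share ~ {traits} + C(format) + C(scenario)",
             data=df).fit()
```

`C(format)` and `C(scenario)` expand into the indicator vectors of Model 3, so the format coefficients can be read straight out of `m3.params`.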
Behavioral Archetypes
Beyond trait-level analysis, I clustered agents by their full behavioral profiles, using t-SNE for dimensionality reduction and silhouette-score-based model selection for the cluster count. This produces an archetype landscape: emergent behavioral types (e.g., "consistent cooperators," "strategic defectors," "conditional reciprocators") that may or may not align with the Big Five dimensions used to generate the population.
The interesting case is when archetypes cut across trait profiles, i.e., when a high-agreeableness agent and a low-agreeableness agent end up in the same behavioral cluster because the scenario or conditioning format overrode the trait signal.
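A sketch of the clustering step with scikit-learn; using KMeans on the t-SNE embedding is an assumption, since the write-up specifies only t-SNE plus silhouette-based selection:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def find_archetypes(behavior: np.ndarray, k_range=range(2, 8), seed: int = 0):
    """Embed per-agent behavioral profiles with t-SNE, then pick the
    cluster count in k_range that maximizes the silhouette score."""
    embedded = TSNE(n_components=2, random_state=seed,
                    perplexity=min(30, len(behavior) - 1)).fit_transform(behavior)
    best_k, best_labels, best_score = None, None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(embedded)
        score = silhouette_score(embedded, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels, embedded
```

One caveat worth keeping in mind: t-SNE distorts global distances, so silhouette scores computed in the embedded space reward visual separability rather than separation in the original behavioral space.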
IV. Stack
| Component | Tool |
|---|---|
| LLM access | LiteLLM (Gemini default) |
| Configuration | Hydra (config groups: data, population, agent, simulation, analysis, calibration, output) |
| Agent framework | custom base + LLM agent with prompt/memory integration |
| Evaluation | LLM-as-judge with bias calibration, regression via statsmodels |
| Clustering | t-SNE + silhouette model selection via scikit-learn |