
Louvain community detection on induced subgraphs of Caltech (left) and Dartmouth (right)

Every university claims a unique culture, and during my freshman year of college I wondered how to measure it. The social genome is the idea that the coefficients of an Exponential-Family Random Graph Model (ERGM), fitted to a campus friendship network, constitute a quantitative signature: a set of "social genes" encoding why friendships form at one institution versus another. An ERGM is a statistical model that estimates the probability of observing a given network as a function of structural and attribute-based terms. Using the Facebook100 dataset (Traud, Mucha & Porter, 2012), a digital fossil of complete Facebook networks from 100 U.S. universities circa September 2005, I fitted a consistent ERGM specification across Caltech (762 nodes), Dartmouth (7,677 nodes), and Cornell (18,621 nodes). The resulting coefficient vectors distinguish the three institutions and, in Dartmouth's case, refute the obvious hypothesis about what drives its social structure.

I. Data

The Facebook100 dataset provides one undirected, unweighted graph $G_u = (V_u, E_u)$ per university $u$, where nodes are students and edges are reciprocal Facebook friendships confirmed within that university's `.edu` domain. Each node carries anonymized, self-reported attributes: gender, class year, major, second major, residence (dorm/house), and high school. The `.edu` gate makes each $G_u$ a self-contained social system: no cross-university edges exist, so structural differences across universities can be attributed to institutional characteristics instead of inter-network dependencies.

The original `.mat` files were retrieved via the Internet Archive (the dataset has been removed from the public internet for privacy reasons), loaded with `scipy.io.loadmat`, and exported to CSV. For each university, the sparse adjacency matrix $\mathbf{A}$ was extracted and its upper triangle taken via $\text{triu}(\mathbf{A}, k=1)$ to avoid double-counting undirected edges.

	Caltech	Dartmouth	Cornell
$\lvert V \rvert$	762	7,677	18,621
$\lvert E \rvert$	16,651	304,065	790,753
Density	0.057	0.010	0.005

Density is computed as $\frac{2\lvert E \rvert}{\lvert V \rvert(\lvert V \rvert - 1)}$ .
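The density row can be reproduced directly from the node and edge counts. A minimal sketch, using only the numbers in the table above (the function name `density` is my own):

```python
def density(num_nodes: int, num_edges: int) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# (|V|, |E|) pairs copied from the table above.
networks = {
    "Caltech": (762, 16_651),
    "Dartmouth": (7_677, 304_065),
    "Cornell": (18_621, 790_753),
}

for name, (n, m) in networks.items():
    print(f"{name}: density = {density(n, m):.3f}")
```

Rounded to three decimals, this gives 0.057, 0.010, and 0.005, matching the table.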

II. The ERGM Framework

An Exponential-Family Random Graph Model defines a probability distribution over the space of all graphs on $n$ nodes. For an observed graph $\mathbf{y}$ on node set $V$ with covariate matrix $\mathbf{X}$ :

$P_\theta(\mathbf{Y} = \mathbf{y}) = \frac{\exp\!\bigl(\boldsymbol{\theta}^\top \mathbf{g}(\mathbf{y}, \mathbf{X})\bigr)}{\kappa(\boldsymbol{\theta})}$

where $\mathbf{g}(\mathbf{y}, \mathbf{X})$ is a vector of sufficient statistics computed on the graph (edge counts, homophily counts, structural terms), $\boldsymbol{\theta}$ is the parameter vector we estimate, and $\kappa(\boldsymbol{\theta}) = \sum_{\mathbf{y}' \in \mathcal{Y}} \exp\!\bigl(\boldsymbol{\theta}^\top \mathbf{g}(\mathbf{y}', \mathbf{X})\bigr)$ is the normalizing constant summed over all $2^{\binom{n}{2}}$ possible graphs. This constant is intractable for any nontrivial $n$ , which is why estimation requires simulation.
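To make the intractability concrete, here is a toy sketch that computes $\kappa(\boldsymbol{\theta})$ by brute force for an edges-only model on four nodes, where the sum has only $2^6 = 64$ terms. The edges-only model is chosen purely for illustration because its normalizing constant has the closed form $(1 + e^{\theta})^{\binom{n}{2}}$, which gives something to check against. The function name is hypothetical.

```python
from itertools import combinations, product
from math import comb, exp, log10


def kappa_edges_only(n: int, theta: float) -> float:
    """Brute-force normalizing constant for an edges-only ERGM on n nodes."""
    dyads = list(combinations(range(n), 2))
    total = 0.0
    # Enumerate every graph as a 0/1 assignment over the C(n, 2) dyads.
    for y in product((0, 1), repeat=len(dyads)):
        total += exp(theta * sum(y))
    return total


n, theta = 4, -1.0
brute_force = kappa_edges_only(n, theta)
closed_form = (1 + exp(theta)) ** comb(n, 2)
print(brute_force, closed_form)

# For Caltech (n = 762) the number of graphs is 2^C(762, 2); print its
# order of magnitude instead of attempting the sum.
print(f"about 10^{comb(762, 2) * log10(2):.0f} graphs")
```

Enumeration is only feasible here because $n = 4$; each added node doubles the number of graphs once per new dyad.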

Model Specification

The specification is held constant across all three universities:

$\begin{aligned} G_u \sim\; & \texttt{edges} + \texttt{nodematch}(\text{year}) + \texttt{nodematch}(\text{residence}) \\ & + \texttt{nodematch}(\text{major}) + \texttt{nodematch}(\text{high\_school}) \end{aligned}$

In terms of the sufficient statistics vector $\mathbf{g}$ , these are:

$g_1(\mathbf{y}) = \sum_{i < j} y_{ij} \qquad \text{(total edge count)}$

$\begin{aligned} g_k(\mathbf{y}, \mathbf{X}) &= \sum_{i < j} y_{ij} \cdot \mathbf{1}[x_i^{(k)} = x_j^{(k)}] \\ & \quad k \in \{\text{year, residence, major, high\_school}\} \end{aligned}$

Each $\theta_k$ has a conditional log-odds interpretation: holding all else equal, if two nodes share attribute $k$, the log-odds of a tie between them shifts by $\theta_k$. The edges parameter $\theta_1$ captures baseline tie propensity (analogous to an intercept). A negative $\theta_1$ means ties are "costly": most dyads are non-edges, as expected in any sparse social network.
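The log-odds reading follows from the model definition in one step. A brief sketch, where $\mathbf{y}^{+}_{ij}$ and $\mathbf{y}^{-}_{ij}$ (notation introduced here for illustration only) denote the graph $\mathbf{y}$ with the tie $ij$ forced present and forced absent. Because $\kappa(\boldsymbol{\theta})$ appears in both probabilities, it cancels in the ratio:

$\log \frac{P_\theta(Y_{ij} = 1 \mid \text{rest of } \mathbf{y})}{P_\theta(Y_{ij} = 0 \mid \text{rest of } \mathbf{y})} = \boldsymbol{\theta}^\top \bigl[\mathbf{g}(\mathbf{y}^{+}_{ij}, \mathbf{X}) - \mathbf{g}(\mathbf{y}^{-}_{ij}, \mathbf{X})\bigr]$

For the specification above, adding the tie $ij$ raises the edge count by one and raises each homophily count by one only when $i$ and $j$ match on that attribute, so the right-hand side reduces to:

$\theta_1 + \sum_{k} \theta_k \cdot \mathbf{1}[x_i^{(k)} = x_j^{(k)}], \qquad k \in \{\text{year, residence, major, high\_school}\}$

No term depends on the rest of the graph, so under this specification each dyad behaves like an independent logistic-regression observation.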

Estimation

The normalizing constant $\kappa(\boldsymbol{\theta})$ makes direct MLE infeasible. Estimation proceeds via Monte Carlo Maximum Likelihood Estimation (MCMLE), which approximates the ratio $\kappa(\boldsymbol{\theta}) / \kappa(\boldsymbol{\theta}_0)$ by simulating graphs from $P_{\theta_0}$ and iteratively updating $\boldsymbol{\theta}$ . The MCMC sampler proposes tie toggles (flipping $y_{ij} \in \{0,1\}$ ) and accepts via Metropolis-Hastings. Convergence is assessed by monitoring the difference between observed and simulated sufficient statistics.
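The tie-toggle step can be illustrated with a toy sampler. This is a minimal sketch for a model with an edges term and a single nodematch term, assuming a uniformly random dyad proposal; it is not the `ergm` package's implementation, and the function name and toy attribute list are hypothetical.

```python
import math
import random


def simulate_edges_nodematch(attrs, theta_edges, theta_match, steps, seed=0):
    """Metropolis-Hastings tie-toggle sampler for edges + one nodematch term.

    attrs: one attribute value per node (toy data, not the real dataset).
    Returns the set of edges (i, j), i < j, after `steps` proposals,
    starting from the empty graph.
    """
    rng = random.Random(seed)
    n = len(attrs)
    edges = set()
    for _ in range(steps):
        i, j = sorted(rng.sample(range(n), 2))  # propose a uniformly random dyad
        # Change in theta^T g if the tie i-j is added.
        delta = theta_edges + theta_match * (attrs[i] == attrs[j])
        # Removing an existing tie changes theta^T g by the opposite amount.
        log_ratio = -delta if (i, j) in edges else delta
        # The proposal is symmetric, so the acceptance ratio is exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            edges ^= {(i, j)}  # toggle the tie
    return edges


toy_attrs = [0, 0, 1, 1, 1]
sample = simulate_edges_nodematch(toy_attrs, theta_edges=-1.0, theta_match=1.5, steps=5000)
print(len(sample), "edges in the sampled graph")
```

In MCMLE, many such simulated graphs are drawn at the current parameter guess and their sufficient statistics are compared with the observed ones to update $\boldsymbol{\theta}$.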

All models were fit using the `ergm` package in R with `control.ergm(parallel = num_cores, parallel.type = "PSOCK", MCMLE.maxit = 100)`. A deeper Caltech model augmented with a `gwesp` (geometrically-weighted edgewise shared partners) term to capture triadic closure ran for over two days!

The `gwesp` statistic takes the form:

$g_{\text{gwesp}}(\mathbf{y}) = e^\alpha \sum_{k=1}^{n-2} \Bigl\{1 - (1 - e^{-\alpha})^k\Bigr\} \cdot \text{EP}_k(\mathbf{y})$

where $\text{EP}_k(\mathbf{y})$ counts the number of edges with exactly $k$ shared partners, and $\alpha$ is a decay parameter controlling how quickly additional shared partners contribute diminishing returns to tie probability. This term captures the "friend of a friend" effect (triadic closure) as an endogenous structural force distinct from attribute-based homophily.
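A small sketch of how this statistic could be computed from an edge list, following the formula above term by term. The function name and the toy graphs are illustrative assumptions; the input is assumed to be a list of unique undirected edges.

```python
from collections import Counter
from math import exp


def gwesp(edges, alpha):
    """GWESP statistic for a list of unique undirected edges, decay alpha."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    # EP_k: number of edges whose two endpoints share exactly k partners.
    ep = Counter(len(neighbors[i] & neighbors[j]) for i, j in edges)
    return exp(alpha) * sum(
        (1 - (1 - exp(-alpha)) ** k) * count
        for k, count in ep.items()
        if k >= 1
    )


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
print(gwesp(triangle, alpha=0.5))  # each edge has one shared partner
print(gwesp(path, alpha=0.5))      # no edge has a shared partner
```

For $k = 1$ the weight is $e^{\alpha} \cdot e^{-\alpha} = 1$, so a triangle contributes 3 regardless of $\alpha$. The decay only matters once edges have two or more shared partners.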

III. Exploratory Network Analysis

Before fitting ERGMs, I computed vital signs on the largest connected component of each $G_u$ . The adjacency matrix was symmetrized as $(\mathbf{A} + \mathbf{A}^\top)/2$ and loaded into `igraph` with vertex attributes attached.

Measure	Caltech	Dartmouth	Cornell
Average degree	43.7	79.2	84.9
Global clustering	0.291	0.151	0.136
Assortativity (residence)	0.070	0.118	0.161
Assortativity (major)	0.002	0.044	0.049
Modularity (Louvain)	0.399	0.431	0.471
Average path length $\bar{\ell}$	2.338	2.768	2.876
Diameter	6	8	8
Mean betweenness	0.002	0.000	0.000
Degree variance	1,367.6	5,591.2	7,395.4

Average degree is $\bar{d} = 2\lvert E \rvert / \lvert V \rvert$ . Global clustering is $C = 3 \times \text{triangles} / \text{connected triples}$ .
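Both formulas are easy to check on a toy graph. A minimal pure-Python sketch (the analysis itself used `igraph`; the function name and toy edge list here are assumptions for illustration):

```python
def degree_and_clustering(edges):
    """Average degree and global clustering for a list of unique undirected edges.

    Nodes are inferred from the edge list, so isolated nodes are not counted.
    """
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    avg_degree = 2 * len(edges) / len(neighbors)
    # Connected triples centered at v: C(deg(v), 2).
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors.values())
    # Summing common neighbors over edges counts each triangle three times.
    three_times_triangles = sum(len(neighbors[i] & neighbors[j]) for i, j in edges)
    clustering = three_times_triangles / triples if triples else 0.0
    return avg_degree, clustering


# A triangle (0, 1, 2) with one pendant node attached to node 2.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(degree_and_clustering(toy_edges))
```

On this toy graph there is one triangle and five connected triples, so $C = 3/5$.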

Caltech is dense and tightly clustered ($C = 0.291$). Cornell has the highest average degree (84.9) despite the lowest density (0.005), which is likely a consequence of scale. It also has the highest modularity (0.471) and degree variance, the signature of a "city of neighborhoods" with both hyper-connected hubs and sparsely connected individuals. Dartmouth's clustering (0.151) is lower than expected for a school whose identity revolves around residential and Greek life, which hints that its ties may be more bridging across groups than bonding within them.

Chi-squared tests were used to assess whether Louvain-detected communities align with known attributes (residence, major), following the methodology of Traud et al. (2012). This tests the null $H_0$ : community membership and attribute are independent, via:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where $O_{ij}$ and $E_{ij}$ are observed and expected counts in the contingency table of community $\times$ attribute.
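As a sketch of the computation, the statistic can be evaluated directly from a community-by-attribute contingency table. The counts below are made up for illustration, and the function name is hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are detected communities, columns are residences.
toy_table = [[30, 5, 2], [4, 25, 6], [3, 4, 28]]
stat, dof = chi_squared(toy_table)
print(f"chi2 = {stat:.2f}, dof = {dof}")
```

In practice `scipy.stats.chi2_contingency` returns the same statistic together with a $p$-value; note that it applies a continuity correction to $2 \times 2$ tables by default.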

IV. The Social Genomes

The social genomes revolve around three hypotheses and three coefficient vectors. Full ERGM summaries with standard errors, $z$ -values, and $p$ -values are reported below.

Caltech: "The Focused Silo"

Term	$\hat{\theta}$	SE	$z$	$p$
edges	$-3.593$	0.0125	$-286.47$	$< 0.0001$
nodematch(year)	$+1.306$	0.0175	$74.81$	$< 0.0001$
nodematch(residence)	$+1.804$	0.0173	$104.03$	$< 0.0001$
nodematch(major)	$+0.375$	0.0280	$13.39$	$< 0.0001$
nodematch(high\_school)	$-2.073$	0.0771	$-26.88$	$< 0.0001$

The hypothesis is confirmed. Shared residence is the strongest driver ($\hat{\theta} = +1.80$, corresponding to an odds ratio of $e^{1.80} \approx 6.05$: students in the same House have roughly six times the odds of being friends, all else equal). Shared class year follows ($e^{1.31} \approx 3.71$). Shared major is positive but modest ($e^{0.37} \approx 1.45$). The surprising result is the strong negative coefficient on shared high school ($-2.07$): once campus affiliations are accounted for, pre-college ties are effectively overwritten. The House system appears to absorb most of a student's social world.
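The coefficient-to-odds arithmetic can be sketched in a few lines. The estimates are copied from the Caltech table; the helper `tie_probability` is hypothetical and relies on the fact that a model with only edges and nodematch terms has no dependence between dyads, so each tie probability is a logistic function of the summed coefficients.

```python
from math import exp

# Caltech estimates copied from the table above.
caltech = {
    "edges": -3.593,
    "year": 1.306,
    "residence": 1.804,
    "major": 0.375,
    "high_school": -2.073,
}


def tie_probability(theta, shared):
    """Tie probability for a dyad matching on the attributes listed in `shared`."""
    log_odds = theta["edges"] + sum(theta[k] for k in shared)
    return 1 / (1 + exp(-log_odds))


odds_ratios = {k: exp(v) for k, v in caltech.items() if k != "edges"}
for term, ratio in odds_ratios.items():
    print(f"{term}: odds ratio = {ratio:.2f}")

print(f"no shared attributes:    p = {tie_probability(caltech, []):.3f}")
print(f"same House and same year: p = {tie_probability(caltech, ['residence', 'year']):.3f}")
```

The printed odds ratios use the unrounded estimates, so they differ slightly in the second decimal from the values quoted in the text, which exponentiate the rounded coefficients.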

Dartmouth: "The Social Bubble"

Term	$\hat{\theta}$	SE	$z$	$p$
edges	$-4.498$	0.0022	$-2076.28$	$< 0.0001$
nodematch(year)	$+0.102$	0.0056	$18.26$	$< 0.0001$
nodematch(residence)	$-0.191$	0.0054	$-35.67$	$< 0.0001$
nodematch(major)	$-0.150$	0.0070	$-21.28$	$< 0.0001$
nodematch(high\_school)	$-0.582$	0.0078	$-74.44$	$< 0.0001$

Hypothesis refuted! The edges coefficient ($-4.50$) indicates ties are exceptionally costly. Residence ($-0.19$), major ($-0.15$), and high school ($-0.58$) all have negative coefficients: sharing any of them lowers the odds of a tie. The sole positive coefficient is nodematch(year), and its effect is small ($+0.10$, odds ratio $e^{0.10} \approx 1.11$). This genome implies that the primary drivers of friendship at Dartmouth are not the formal categories in the dataset but unobserved social foci (specific clubs, teams, and Greek houses at a finer grain than the anonymized residence codes can resolve), powerful enough that institutional assignments add little or no positive pull.

Cornell: "The Metropolis"

Term	$\hat{\theta}$	SE	$z$	$p$
edges	$-5.342$	0.0032	$-1655.51$	$< 0.0001$
nodematch(year)	$+0.089$	0.0079	$11.22$	$< 0.0001$
nodematch(residence)	$-0.229$	0.0079	$-28.97$	$< 0.0001$
nodematch(major)	$-0.041$	0.0138	$-2.93$	$0.0034$
nodematch(high\_school)	$-0.423$	0.0162	$-26.22$	$< 0.0001$

Hypothesis confirmed, again. Cornell has the most negative edges term of all three ($-5.34$, baseline odds of a tie: $e^{-5.34} \approx 0.0048$), and every attribute match except class year has a negative coefficient. The social landscape is so vast and diffuse that formal institutional structures have almost no organizing power; friendships fragment into voluntary, niche communities operating independently of the university's administrative categories.

V. Validation

Comparison with Traud et al. (2012)

ERGM Term (Caltech)	This Study	Traud et al. (2012)
edges	$-3.59$	$-4.98$
nodematch(residence)	$+1.80$	$+1.16$
nodematch(year)	$+1.31$	$+0.99$
nodematch(major)	$+0.37$	$+0.65$
nodematch(high\_school)	$-2.07$	$+2.85$

Both models agree that residence and year are strong positive drivers at Caltech, in the same order (residence > year > major). The sign flip on high school ($+2.85$ vs. $-2.07$) is attributable to model specification: their model included a `gwesp` term for triadic closure, which explicitly accounts for endogenous clustering. By absorbing the "friend of a friend" effect into a dedicated structural term, their model leaves a residual positive high-school signal. My more parsimonious model, lacking a triangle term, allows the strong residence and year coefficients to explain away and invert the high-school effect. I read this as evidence that on-campus affiliations at Caltech are powerful enough to overwrite pre-college ties. The less negative edges term in my model ($-3.59$ vs. $-4.98$) is consistent with this: without `gwesp`, some ambient clustering is absorbed into the intercept.

Dartmouth Sensitivity Analysis

To check whether Dartmouth's results were an artifact of non-undergraduate nodes (grad students, faculty, alumni), I constructed an undergraduate-only subgraph by filtering to class years 2005 to 2009 and extracting the largest connected component.

Measure	Full Network	Undergrad LCC
$\lvert V \rvert$	7,677	4,852
$\lvert E \rvert$	304,065	213,593
Density	0.010	0.018
$\bar{d}$	79.2	88.0
Clustering $C$	0.151	0.167
Assortativity (residence)	0.118	0.126
Modularity	0.431	0.445
$\bar{\ell}$	2.768	2.533
Diameter	8	6

Removing roughly 2,800 nodes (non-undergraduates, plus anything left outside the largest component) sharpens every metric. Density nearly doubles, average degree rises from 79 to 88, clustering increases, and the network becomes topologically tighter ($\bar{\ell}$ drops, diameter shrinks from 8 to 6). The "Social Bubble" signatures are more pronounced in the core undergraduate population, which suggests the full-network ERGM was absorbing noise from peripheral nodes.
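The undergraduate-only construction (filter by class year, then keep the largest connected component) can be sketched in pure Python. The function name, the toy data, and the convention that a missing class year is coded as 0 are assumptions for illustration, not a description of the exact script used.

```python
from collections import deque


def undergrad_lcc(edges, class_year, years=range(2005, 2010)):
    """Induce the subgraph on nodes with class year in `years`, then return
    the largest connected component as (node set, edge list)."""
    keep = {v for v, year in class_year.items() if year in years}
    sub_edges = [(i, j) for i, j in edges if i in keep and j in keep]
    neighbors = {v: set() for v in keep}
    for i, j in sub_edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    best, seen = set(), set()
    for start in keep:
        if start in seen:
            continue
        # Breadth-first search collects the component containing `start`.
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v] - component:
                component.add(w)
                queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component

    lcc_edges = [(i, j) for i, j in sub_edges if i in best and j in best]
    return best, lcc_edges


# Toy data: node 3 has a missing class year (assumed to be coded 0).
toy_edges = [(1, 2), (2, 3), (3, 4), (5, 6), (6, 7)]
toy_years = {1: 2006, 2: 2007, 3: 0, 4: 2008, 5: 2009, 6: 2009, 7: 2005}
nodes, kept_edges = undergrad_lcc(toy_edges, toy_years)
print(sorted(nodes), kept_edges)
```

Dropping node 3 disconnects nodes 1, 2, and 4, so the largest remaining component in the toy example is {5, 6, 7}.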

Subgroup Analysis: Class Year Dynamics

A Mann-Whitney $U$ test compared degree centrality distributions between the Class of 2008 (freshmen, $n = 1{,}079$ ) and the Class of 2005 (seniors, $n = 1{,}009$ ).

Class Year	$n$	Median Degree	Mean Degree	SD
Freshmen (2008)	1,079	59.0	77.7	76.0
Seniors (2005)	1,009	66.0	79.6	71.0

Result: $U = 525{,}388.5$, $p = 0.168$ (not significant). The test finds no evidence that social embeddedness at Dartmouth stratifies by class year. This is consistent with a close-knit environment where cross-cohort interaction is the norm and networks mature quickly: even first-years reach connectivity levels comparable to seniors.
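For readers unfamiliar with the statistic, here is a minimal sketch of how $U$ is computed from two samples using mid-ranks for ties. The function name and toy degree lists are illustrative; in practice `scipy.stats.mannwhitneyu` computes the statistic along with a $p$-value.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x against sample y, using mid-ranks for ties."""
    # Pool both samples, tagging each value with its group (0 = x, 1 = y).
    pooled = sorted(
        (value, group) for group, sample in enumerate((x, y)) for value in sample
    )
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average rank.
        mid_rank = (i + 1 + j) / 2
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2


# Hypothetical degree samples for two cohorts.
freshmen = [12, 40, 40, 75, 90]
seniors = [20, 40, 66, 81]
print(mann_whitney_u(freshmen, seniors), mann_whitney_u(seniors, freshmen))
```

The two $U$ values always sum to $n_x n_y$. With $n = 1{,}079$ and $n = 1{,}009$, that product is 1,088,711, so the expected value under the null is 544,355.5, against which the reported $U$ can be compared.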

The Social Genome Project