Story Operators: Decomposing the Original $\to$ Sequel Transformation in Embedding Space
Quick Take
This study analyzes the geometric transformation from original novels to their sequels using all-mpnet-base-v2 paragraph embeddings. A PCA-based decomposition of the displacement between each original and its sequel yields a taxonomy of three sequel types: formulaic, concentrated, and compositional. The analysis covers thirteen author pairs from Project Gutenberg; in the canonical case, Twain's Tom Sawyer $\to$ Huckleberry Finn, the dominant recovered axis is structural rather than thematic.
Key Points
- Uses all-mpnet-base-v2 paragraph embeddings drawn from a precomputed index of the PG19 corpus.
- Identifies three sequel types: formulaic, concentrated, and compositional.
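- Finds that the dominant axis for Tom Sawyer $\to$ Huckleberry Finn is structural (sheltering domesticity collapsing into a picaresque road), while vernacular voice and slavery fall on later, smaller axes.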
- Findings are reproducible with released scripts and data.
- Checks the recovered geometry against Twain's documented authorial intent in his 1875--76 letters to Howells.
Abstract: I treat a book as a point in a sentence-embedding space and a literary transformation as an operation on points. Given an original novel and its sequel, I ask what it takes, geometrically, to turn the first into the second. Using all-mpnet-base-v2 paragraph embeddings drawn from a precomputed index of the PG19 corpus, I form the displacement $d=\bar{x}_{\rm seq}-\bar{x}_{\rm orig}$ and greedily decompose it along a content basis obtained by PCA over the two books' own paragraphs. Each component is an interpretable axis anchored by real passages at its poles. Across thirteen verified author pairs from Project Gutenberg, the decomposition reveals a small taxonomy of sequels: formulaic (a tiny, low-rank change: Doyle's Holmes collections, $\|d\|=0.12$), concentrated (one dominant axis: Alcott's Little Women $\to$ Little Men, 75% on a single move), and compositional (many small axes: Twain, Burroughs's Barsoom, Nesbit). For the canonical case, Tom Sawyer $\to$ Huckleberry Finn, the dominant recovered axis is structural -- the collapse of sheltering domesticity into a picaresque road -- rather than the famous surface themes of vernacular voice or slavery, which ride later, smaller axes; and the transformation routes through adventure-journey space rather than diluting toward generic realism. I corroborate the recovered geometry against Twain's documented authorial intent (his 1875--76 letters to Howells), which names the first-person picaresque move years in advance, and I quantify, with an explicit representation caveat, how much of the realized transformation his stated intentions span. All computations are reproducible from the released scripts and data.
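The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the paper's released code, and several details are assumptions: the function name `decompose_displacement` is hypothetical, the greedy rule (take the remaining axis with the largest absolute projection of the residual) is one plausible reading of "greedily decompose", and random arrays stand in for real all-mpnet-base-v2 paragraph embeddings.

```python
import numpy as np


def decompose_displacement(orig, seq, n_components=10):
    """Split the mean shift between two books across PCA axes.

    orig, seq: arrays of shape (n_paragraphs, dim) holding paragraph
    embeddings for the original and the sequel. Returns the displacement d,
    a list of (axis_index, coefficient, share_of_squared_norm) in the order
    the axes were picked, and the residual left after all picks.
    """
    # Displacement between the two book centroids: d = mean(seq) - mean(orig).
    d = seq.mean(axis=0) - orig.mean(axis=0)
    total = float(d @ d)
    if total == 0.0:
        raise ValueError("the two books have identical centroids")

    # PCA over the pooled paragraphs of both books, via SVD of centered data.
    pooled = np.vstack([orig, seq])
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]  # rows are orthonormal principal axes

    # Assumed greedy rule: repeatedly remove the axis that explains the most
    # of the current residual.
    residual = d.copy()
    remaining = list(range(len(basis)))
    steps = []
    while remaining:
        coeffs = basis[remaining] @ residual
        best = int(np.argmax(np.abs(coeffs)))
        axis = remaining.pop(best)
        coeff = float(coeffs[best])
        residual = residual - coeff * basis[axis]
        steps.append((axis, coeff, coeff * coeff / total))
    return d, steps, residual


# Synthetic stand-in data (assumption): 32-dim vectors instead of embeddings.
rng = np.random.default_rng(0)
orig = rng.normal(size=(200, 32))
seq = rng.normal(size=(180, 32)) + rng.normal(scale=0.5, size=32)

d, steps, residual = decompose_displacement(orig, seq, n_components=10)
print("||d|| =", float(np.linalg.norm(d)))
for axis, coeff, share in steps[:3]:
    print(f"axis {axis}: coefficient {coeff:+.3f}, share {share:.1%}")
```

Because the PCA axes are orthonormal, this greedy order is the same as sorting the axes by absolute projection, and the per-axis shares add up to the fraction of $\|d\|^2$ captured by the retained components. Under this reading, a "concentrated" sequel would show one large share, and a "compositional" one many small shares.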
| Comments: | 8 pages, 3 figures |
| Subjects: | Computation and Language (cs.CL) |
| ACM classes: | I.2.7 |
| Cite as: | arXiv:2606.25379 [cs.CL] |
| (or arXiv:2606.25379v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2606.25379 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Frederick Zimmerman
[v1] Wed, 24 Jun 2026 04:21:36 UTC (793 KB)
— Originally published at arxiv.org