Testing LLMs with Analogy Puzzles
And a dash of literary analysis and alternate (fictional) reality tracks

Overview
Inspired by Melanie Mitchell’s use of the sibling puzzle, I put OpenAI’s o1 reasoning model through its paces—using a science fiction story as the testing ground. What came out was a reasoning thread that echoed the original puzzle’s elegance but with its own twist, blending logic with a touch of literary imagination.
I thought it was cool. And it might even be useful.
Discussion
I was reading Melanie Mitchell's recent article "Artificial intelligence learns to reason" where she introduces the "sisters" puzzle:
"Julia has two sisters and one brother. How many sisters does her brother Martin have?"
Melanie outlined a large language model (LLM) chain of thought that arrived at the (correct) answer; I add a quick counting sketch after the list:
1. Julia has two sisters. That means there are three girls in total (Julia + two more).
2. Julia also has one brother, named Martin.
3. Altogether, there are four siblings: three girls and one boy (Martin).
4. From Martin’s perspective, his sisters are all three of the girls (Julia and her two sisters).
5. Therefore, Martin has three sisters.
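For anyone who wants the counting made explicit, here is a minimal Python sketch of the same perspective-based bookkeeping. The roster and the helper name are my own reconstruction from the puzzle statement, not anything from Melanie's article:

```python
# Minimal sketch of the puzzle's perspective-based counting.
# The roster and the helper `sisters_of` are reconstructed from the
# puzzle statement; they are illustrative, not from the article.

siblings = {
    "Julia": "girl",
    "Sister 1": "girl",
    "Sister 2": "girl",
    "Martin": "boy",
}

def sisters_of(name: str) -> int:
    """Count the girls a given sibling sees besides themself."""
    return sum(
        1 for other, sex in siblings.items()
        if other != name and sex == "girl"
    )

print(sisters_of("Julia"))   # 2 -- "Julia has two sisters"
print(sisters_of("Martin"))  # 3 -- Martin has three sisters
```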
I enjoy testing LLMs with analogy puzzles. For example, take a reasoning pattern from one reality track and see whether a similar one can be found in another. Because I work with a lot of science fiction manuscripts, it’s easy to think of different frames of reference where a puzzle can be valid (or not).
So, on a whim, I used Melanie’s chain of thought as a template and submitted one of my own long short stories (about 11,000 words, one of Bad Science Fiction’s test stories) to OpenAI’s o1 model. I asked whether it could generate a similar chain of thought using the story as its source.
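The mechanics of the request were simple. Here is a minimal sketch of the kind of call involved, assuming the OpenAI Python SDK and access to an o1-class model; the file name, prompt wording, and model string below are illustrative stand-ins rather than the exact prompt I used:

```python
# Sketch of the request, assuming the OpenAI Python SDK (openai >= 1.x)
# and access to an o1-class model. The file name, prompt wording, and
# model string are illustrative assumptions, not the exact prompt used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The ~11,000-word test story, saved locally as plain text (hypothetical path).
with open("test_story.txt", encoding="utf-8") as f:
    story = f.read()

prompt = (
    "Below is a chain-of-thought template from a siblings puzzle:\n"
    "1) Count one character's siblings. 2) Total the group. "
    "3) Re-count from another sibling's perspective. 4) State the answer.\n\n"
    "Using only the story that follows as your source, construct a similar "
    "puzzle and a similar chain of thought.\n\n"
    f"STORY:\n{story}"
)

response = client.chat.completions.create(
    model="o1",  # or another o1-class model
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```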
OpenAI’s o1 reasoning model responded with the output shown in Figure 1.

I found the result impressive. The chain of thought was built on precise, localized reasoning—so subtle that even I, the author, would’ve needed a full-blown literary analysis to reach the same conclusions. The LLM also inferred puzzles and paradoxes that captured the narrative’s intended spirit despite never being stated outright.
Through light text forensics and cross-checking the model’s explanations, I traced the analogy back to a surprisingly small and subtle portion of the 11,000-word story. The relevant passages are below.
Facts from the alternate reality track:
“Her mother had smiled ... ‘She’ll help you, Emily ... She’s here to make things better.’ The gift arrived ... an AI assistant...”
Penelope was complicated. She was a jailbroken SAI, left to run wild in the ...
“No one knew how many copies, instances, or shards of Penelope were out there ... Penelope wasn’t telling.”
"She didn’t tell anyone about her other selves—not even Emily—"
Puzzle inferences, after applying the chain-of-reasoning template from Melanie’s sisters puzzle (a quick counting sketch follows the list):
That means, in total, there are now three Penelope instances (Prime + Shard A + Shard B).
From Shard A’s perspective, she sees two “sister” Penelopes (the Prime instance and Shard B).
Therefore, Shard A can say she has two sisters—even though Emily thinks of them all simply as “Penelope.”
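The sibling counter from earlier carries over with only the roster changed. A hedged illustration, where the labels Prime, Shard A, and Shard B follow o1’s naming and the code itself is my own:

```python
# The same counter, with Penelope's instances as the roster. The labels
# Prime, Shard A, and Shard B follow o1's naming; the code is my own
# illustration, not part of the story or the model output.

penelopes = ["Prime", "Shard A", "Shard B"]

def sister_shards_of(name: str) -> int:
    """Count the other Penelope instances a given instance sees."""
    return sum(1 for other in penelopes if other != name)

print(sister_shards_of("Shard A"))  # 2 -- "Shard A has two sisters"
print(len(penelopes))               # 3 instances internally, one "Penelope" to Emily
```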
This leads to the following paradox resolutions (in the words of the o1 explanation):
**From Shard A’s perspective, there are two other ‘Penelopes.’**
- If each new shard recognizes the prime and the *other* shard as distinct from itself, it sees them as “sisters.” This parallels the “Julia’s perspective → 2 sisters” logic.
**Hence the Paradox—‘One Penelope’ vs. ‘Three Penelopes.’**
- In the story, Emily never calls them “Penelope A” or “Penelope B.” She simply experiences “Penelope is always around,” whether in the house network, in a handheld device, or the city systems. So externally, it *looks like* one AI that can be “everywhere.”
- Internally (each shard’s viewpoint), we really do have 3.
Thus, we get the puzzle statement **and** the final “paradoxical” resolution:
This directly mirrors the “Julia and Martin” puzzle logic:
- *Externally*, we see “Penelope” as one.
- *Internally*, each shard says, “I have two sisters.”
To close the loop, I asked o1 to write the puzzle statement that its chain of reasoning would have been answering, in the style of the sisters riddle provided earlier:
Here’s a succinct “original puzzle statement” mirroring the style of the Julia–Martin riddle, but using Penelope’s sharding:
Penelope was one AI at the start. Then she created two additional “shards,” making three Penelopes in total (Prime, Shard A, and Shard B).
Shard A counts two other Penelopes besides herself—these two are her “sisters.”
Therefore, from Shard A’s perspective, she has exactly two sisters.
Question: How many sisters does Shard A have?
Answer: Two.
Not bad.
The irony was the LLM’s use of “sisters” to describe the shards of Penelope. I suspect it borrowed this from the Julia-Martin riddle. In a later story based on this test case, multiple Penelopes are also called “sisters,” so I checked for leakage between the two alternate reality tracks. I found none.
Takeaways
Besides being a lot of fun, the exercise produced a couple of key takeaways. First, the ability to frame puzzles and reasoning chains within a closed-world setting (one unique to a story world, or "alternate reality track") seems valuable for traceability and attribution. It offers a strong foundation for other tests.
Second, here are a few places where this approach might prove useful, pulled from my back pocket and offered as something to think about:
- Analogical Testing as a Diagnostic Tool: how effectively can an AI adapt its reasoning to new, complex narrative contexts?
- Bridging Narrative and Logic: how well can an AI build logical chains from subtle narrative cues?
- Implications for Literary Analysis: how can AI uncover hidden structural patterns or infer underlying themes in a text, and how can those insights best be used to test the reasoning abilities of large language models?
References
Melanie Mitchell, Artificial intelligence learns to reason. Science 387, eadw5211 (2025). DOI: 10.1126/science.adw5211
What is Bad Science Fiction?
BSF, or Bad Science Fiction, is a collaborative software project on GitHub. As explained in the AI Advances articles, the project is so named because its ostensible goal is to develop a story analysis tool specializing in science fiction, a tool built in collaboration with Generative AI. However, it is as much an experiment in how best to leverage large language models (LLMs) to amplify software.