Testing LLMs with Analogy Puzzles
And a dash of literary analysis and alternate (fictional) reality tracks

Overview
Inspired by Melanie Mitchell’s use of the sibling puzzle, I put OpenAI’s o1 reasoning model through its paces—using a science fiction story as the testing ground. What came out was a reasoning thread that echoed the original puzzle’s elegance but with its own twist, blending logic with a touch of literary imagination.
I thought it was cool. And it might even be useful.
Discussion
I was reading Melanie Mitchell's recent article "Artificial intelligence learns to reason" where she introduces the "sisters" puzzle:
"Julia has two sisters and one brother. How many sisters does her brother Martin have?"
Melanie outlined a large language model (LLM) chain of thought that arrived at the (correct) answer; I add a quick counting sketch after the list:
1. Julia has two sisters. That means there are three girls in total (Julia + two more).
2. Julia also has one brother, named Martin.
3. Altogether, there are four siblings: three girls and one boy (Martin).
4. From Martin’s perspective, his sisters are all three of the girls (Julia and her two sisters).
5. Therefore, Martin has three sisters.
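For anyone who wants the counting made explicit, here is a minimal Python sketch of the same perspective-based bookkeeping. The roster and the helper name are my own reconstruction from the puzzle statement, not anything from Melanie's article:

```python
# Minimal sketch of the puzzle's perspective-based counting.
# The roster and the helper `sisters_of` are reconstructed from the
# puzzle statement; they are illustrative, not from the article.

siblings = {
    "Julia": "girl",
    "Sister 1": "girl",
    "Sister 2": "girl",
    "Martin": "boy",
}

def sisters_of(name: str) -> int:
    """Count the girls a given sibling sees besides themself."""
    return sum(
        1 for other, sex in siblings.items()
        if other != name and sex == "girl"
    )

print(sisters_of("Julia"))   # 2 -- "Julia has two sisters"
print(sisters_of("Martin"))  # 3 -- Martin has three sisters
```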
I enjoy testing LLMs with analogy puzzles. For example, take a reasoning pattern from one reality track and see whether a similar one can be found in another. Because I work with a lot of science fiction manuscripts, it’s easy to think of different frames of reference where a puzzle can be valid (or not).
So, on a whim, I used Melanie’s chain of thought as a template and submitted one of my own long short stories (about 11,000 words, one of Bad Science Fiction’s test stories) to OpenAI’s o1 model. I asked whether it could generate a similar chain of thought using the story as its source.
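The mechanics of the request were simple. Here is a minimal sketch of the kind of call involved, assuming the OpenAI Python SDK and access to an o1-class model; the file name, prompt wording, and model string below are illustrative stand-ins rather than the exact prompt I used:

```python
# Sketch of the request, assuming the OpenAI Python SDK (openai >= 1.x)
# and access to an o1-class model. The file name, prompt wording, and
# model string are illustrative assumptions, not the exact prompt used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The ~11,000-word test story, saved locally as plain text (hypothetical path).
with open("test_story.txt", encoding="utf-8") as f:
    story = f.read()

prompt = (
    "Below is a chain-of-thought template from a siblings puzzle:\n"
    "1) Count one character's siblings. 2) Total the group. "
    "3) Re-count from another sibling's perspective. 4) State the answer.\n\n"
    "Using only the story that follows as your source, construct a similar "
    "puzzle and a similar chain of thought.\n\n"
    f"STORY:\n{story}"
)

response = client.chat.completions.create(
    model="o1",  # or another o1-class model
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```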
OpenAI’s o1 reasoning model responded with the output shown in Figure 1.

I found the result impressive. The chain of thought was built on precise, localized reasoning—so subtle that even I, the author, would’ve needed a full-blown literary analysis to reach the same conclusions. The LLM also inferred puzzles and paradoxes that captured the narrative’s intended spirit despite never being stated outright.
Through light text forensics and cross-checking the model’s explanations, I traced the analogy back to a surprisingly small and subtle portion of the 11,000-word story. The relevant passages are below.
Facts from the alternate reality track:
“Her mother had smiled ... ‘She’ll help you, Emily ... She’s here to make things better.’ The gift arrived ... an AI assistant...”
Penelope was complicated. She was a jailbroken SAI, left to run wild in the ...
“No one knew how many copies, instances, or shards of Penelope were out there ... Penelope wasn’t telling.”
"She didn’t tell anyone about her other selves—not even Emily—"
Puzzle inferences, after applying the chain-of-reasoning template from Melanie’s sisters puzzle (a quick counting sketch follows the list):
That means, in total, there are now three Penelope instances (Prime + Shard A + Shard B).
From Shard A’s perspective, she sees two “sister” Penelopes (the Prime instance and Shard B).
Therefore, Shard A can say she has two sisters—even though Emily thinks of them all simply as “Penelope.”
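The sibling counter from earlier carries over with only the roster changed. A hedged illustration, where the labels Prime, Shard A, and Shard B follow o1’s naming and the code itself is my own:

```python
# The same counter, with Penelope's instances as the roster. The labels
# Prime, Shard A, and Shard B follow o1's naming; the code is my own
# illustration, not part of the story or the model output.

penelopes = ["Prime", "Shard A", "Shard B"]

def sister_shards_of(name: str) -> int:
    """Count the other Penelope instances a given instance sees."""
    return sum(1 for other in penelopes if other != name)

print(sister_shards_of("Shard A"))  # 2 -- "Shard A has two sisters"
print(len(penelopes))               # 3 instances internally, one "Penelope" to Emily
```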
This leads to the following paradox resolutions (in the words of the o1 explanation):
**From Shard A’s perspective, there are two other ‘Penelopes.’**
- If each new shard recognizes the prime and the *other* shard as distinct from itself, it sees them as “sisters.” This parallels the “Julia’s perspective → 2 sisters” logic.
**Hence the Paradox—‘One Penelope’ vs. ‘Three Penelopes.’**
- In the story, Emily never calls them “Penelope A” or “Penelope B.” She simply experiences “Penelope is always around,” whether in the house network, in a handheld device, or the city systems. So externally, it *looks like* one AI that can be “everywhere.”
- Internally (each shard’s viewpoint), we really do have 3.
Thus, we get the puzzle statement **and** the final “paradoxical” resolution:
This directly mirrors the “Julia and Martin” puzzle logic:
- *Externally*, we see “Penelope” as one.
- *Internally*, each shard says, “I have two sisters.”
To close the loop, I asked o1 to write the puzzle statement that its chain of reasoning would have been answering, in the style of the sisters riddle provided earlier:
Here’s a succinct “original puzzle statement” mirroring the style of the Julia–Martin riddle, but using Penelope’s sharding:
Penelope was one AI at the start. Then she created two additional “shards,” making three Penelopes in total (Prime, Shard A, and Shard B).
Shard A counts two other Penelopes besides herself—these two are her “sisters.”
Therefore, from Shard A’s perspective, she has exactly two sisters.
Question: How many sisters does Shard A have?
Answer: Two.
Not bad.
The irony was the LLM’s use of “sisters” to describe the shards of Penelope. I suspect it borrowed this from the Julia-Martin riddle. In a later story based on this test case, multiple Penelopes are also called “sisters,” so I checked for leakage between the two alternate reality tracks. I found none.
Takeaways
Besides being a lot of fun, the exercise produced a couple of key takeaways. First, the ability to frame puzzles and reasoning chains within a closed-world setting (one unique to a story world, or "alternate reality track") seems valuable for traceability and attribution. It offers a strong foundation for other tests.
Second, here are a few places where this approach might prove useful, pulled from my back pocket and offered as something to think about:
- Analogical Testing as a Diagnostic Tool: how effectively can an AI adapt its reasoning to new, complex narrative contexts?
- Bridging Narrative and Logic: how well can an AI build logical chains from subtle narrative cues?
- Implications for Literary Analysis: how can AI uncover hidden structural patterns or infer underlying themes in a text, and how can those insights best be used to test the reasoning abilities of large language models?
References
Melanie Mitchell, Artificial intelligence learns to reason. Science 387, eadw5211 (2025). DOI: 10.1126/science.adw5211
What is Bad Science Fiction?
BSF, or Bad Science Fiction, is a collaborative software project on GitHub. As explained in the AI Advances articles, the project is so named because its ostensible goal is to develop a story analysis tool specializing in science fiction, a tool built in collaboration with Generative AI. However, it is as much an experiment in how best to leverage large language models (LLMs) to amplify software.