I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
thatjoeoverthr
"Tell me about the Marathon crater."
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self-awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
LPisGood
Try a Jane Street puzzle of the month
manucardoen
It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is in follow-up prompts and long-winded explanations, where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
alissa_v
I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'
Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.
Example-
'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.
It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."
LeonardoTolstoy
Something about an obscure movie.
The one that tends to get them, so far, is asking if they can help you find a movie you vaguely remember: a movie where some kids get hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically, a lot of questions about movies get the model distracted by popular movies; it tries to suggest films that fit just some of the brief (e.g. this one has a helicopter, could that be it?).
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record, I haven't done it super recently, and I generally do it with arena or the free models, so I'm not being super scientific about it.
johnwatson11218
My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and lots of grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with nearly identical photos of beach vacations.
In one of the renditions, the many beach images had blended together into an image of a larger beach, a kind of mosaic of a non-existent place. Since so many beach photos were similar, it was an easy effect to recreate here and there. But no matter how I asked the LLM to focus on enhancing the image of the beach that was "not there" (you kind of needed to squint to see it), I could not get acceptable results.
Some were very funny and entertaining, but I don't think the model grasped what I was asking. Maybe the term 'mosaic' (which I didn't include in my initial prompts) and the ability to reason or do things in stages would allow current models to do this.
atommclain
I provide a C89 source file from Vim 6 that targets Classic MacOS/68K systems. The file is large with tons of ifdefs referencing arcane APIs.
I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.
So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.
jppope
There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never been written down; ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face, but LLMs don't have them. I've tried.
It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too, since team sports are predominantly youth activities, and these subcultures are poorly documented and change frequently.
sebstefan
I only use the one model that I'm provided for free at work. I expect that's most users' behavior: they stick to the one they pay for.
Best I can do is give you one that failed on GPT-4o
It recently frustrated me when I asked it for code to parse command line arguments.
I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."
> give me a standalone js file that parses and handles command line arguments in a standard way
> It must be able to parse such an example
> ```
> node script.js --name=John --age 30 -v (or --verbose) reading hiking coding
> ```
It produced code that:
* doesn't coalesce -v to --verbose (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)
* didn't think to encode whether an option is supposed to take an argument or not
* doesn't return an error when an option that requires an argument isn't present
* didn't account for the presence of a '--' to end the arguments
* allows -verbose and --v (instead of either -v or --verbose)
* Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present
I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect
Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)
Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)
Must fail: `node script.js --name` (--name expects an argument)
Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
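For reference, a minimal sketch of a parser that passes the cases above. The option table (--name and --age take values, -v/--verbose is a flag) is an assumption read off the example invocation, and unknown double-dash options are treated as errors:
```js
// Sketch: hand-rolled argv parser for the spec above.
// The option table is an assumption based on the example invocation.
const spec = {
  name:    { takesArg: true,  short: null },
  age:     { takesArg: true,  short: null },
  verbose: { takesArg: false, short: 'v' },
};

function parseArgs(argv) {
  // argv is process.argv: [nodeBinary, scriptPath, ...rest]. Slicing here
  // (rather than hardcoding "node script.js") also handles runtime flags
  // like --enable-tracing, which node consumes before the script runs.
  const rest = argv.slice(2);
  const options = {};
  const positional = [];
  let afterDoubleDash = false;

  for (let i = 0; i < rest.length; i++) {
    const arg = rest[i];
    if (afterDoubleDash) { positional.push(arg); continue; }
    if (arg === '--') { afterDoubleDash = true; continue; }

    let name = null;
    let inline = null;
    if (arg.startsWith('--')) {
      const eq = arg.indexOf('=');
      name = eq === -1 ? arg.slice(2) : arg.slice(2, eq);
      if (eq !== -1) inline = arg.slice(eq + 1);
      if (!(name in spec)) throw new Error(`unknown option --${name}`);
    } else if (/^-[^-]$/.test(arg)) {
      // exactly one dash, one character: a short alias like -v
      name = Object.keys(spec).find(k => spec[k].short === arg[1]);
      if (!name) throw new Error(`unknown option ${arg}`);
    } else {
      // everything else, including "-verbose", is positional per the spec
      positional.push(arg);
      continue;
    }

    if (spec[name].takesArg) {
      const value = inline ?? rest[++i];
      if (value === undefined) throw new Error(`--${name} expects an argument`);
      options[name] = value; // repeated options: last one wins (--name=Bob)
    } else {
      if (inline !== null) throw new Error(`--${name} takes no argument`);
      options[name] = true;
    }
  }
  return { options, positional };
}

console.log(parseArgs(process.argv));
```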
yatwirl
Well, sharing prompts on the Web leads to their eventual indexing and becoming useless. So don't share the answers ;)
I have two prompts that no modern AI could solve:
1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation?
— this one is unsurprisingly hard for LLMs, given how the AIs are trained. If you try to tip them in the right direction, they will grasp the concept. But none so far has answered anything resembling a right answer, though they become more and more verbose in proposing various bogus explanations.
2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point.
— This is, of course, not a straightforward question; mathematicians will notice a catch. Gemini kind of emits something like a proper answer (starts questioning you back), the others fantasize. With the 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong; they are now adamant in their misconceptions.
Also, small but gold. Not that demonstrative, but a lot of fun:
3. A team of 10 sailors can speed a caravel up to 15 mph. How many sailors are needed to achieve 30 mph?
codingdave
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
mdp2021
Some easy ones I recently found involve questions that lead the model into stating wrong details about a figure, by invoking relations that are in fact relations of opposition.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
sireat
An easy one is to provide a middle-game chess position (could be an image, standard notation, or even some less standard notation) and ask it to evaluate it and provide some move suggestions.
Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.
I am a reasonably skilled player (FM), so I can evaluate way better than LLMs. I imagine even advanced beginners could tell when an LLM is spouting nonsense about chess after a few prompts.
Now, of course, playing chess is not what LLMs are good at, but it just goes to show that LLMs are not a full path to AGI.
Also, the beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry, because you just use a new position each time. Little worry of running out of positions...
rf15
Any letter or word counting exercise that doesn't trigger redirection to a programmed/calculated answer. It will be forever beyond reach of LLMs due to their architecture.
edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.
Next question.
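The redirection the parent mentions is exactly this: route the question to code, because the calculation itself is trivial. A sketch of the deterministic version:
```js
// The deterministic calculations that pure next-token prediction fumbles:
const rCount = [...'strawberry'].filter(c => c === 'r').length;               // 3
const wordCount = 'How many words are in this sentence?'.split(/\s+/).length; // 7
console.log(rCount, wordCount);
```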
misterkuji
Create an image of two targets, with an arrow hitting dead centre on one target and just off centre on the other.
The generated targets are always hit in the centre.
Cotterzz
Asking the model to write a shader. They are getting better at this but are still very bad at producing (code that produces) specific imagery.
I do have to write prompts that stump models as part of my job so this thread is of great interest
williamcotton
"Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine."
So far, no luck!
Sohcahtoa82
"I have a stack of five cubes. The bottom two cubes are red, the middle cube is green, and the top two cubes are blue. I remove the top two cubes. What color is the remaining cube in the middle of the stack?"
Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."
asciimov
Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers.
ks2048
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress is in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
buzzy_hacker
"Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer with each model iteration, though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
An easy trick is to take a common riddle that's likely all over its training data, and change one little detail. For example:
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
Jordan-117
Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
I love plausible eager beavers:
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
vitaflo
The one I always use is literally "show number of NFC Championship Game appearences by team since 1990".
The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason.
gunalx
"Hva er en adjunkt"
Norwegian for "what is an adjunkt", a specific kind of teacher for grades 5-10. Most models I have tested confuse it with a university lecturer, which is what the same title means in other countries.
simonw
I've been trying this one for a while:
I'm a Python programmer. Help me
understand memory management in Rust.
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.
svcrunch
Here's a problem that no frontier model does well on (f1 < 0.2), but which I think is relatively easy for most humans:
https://dorrit.pairsys.ai/
> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.
sjtgraham
```
<TextA>
Some document
</TextA>
<TextB>
Some other document heavily influenced by TextA
</TextB>
Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side.
```
The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output, e.g. quoting TextB with content actually from TextA.
thisOtterBeGood
It's not working for any model, but asking copilot in my company "what did we talk about before?" at the beginning of a new chat will always result in this:
"We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "
When i continue with "and before that?":
"Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.
Is there anything else you'd like to revisit or explore further? "
It changes somewhat. Weeks ago it always said "tokamak" instead of "KSTAR".
putlake
LLMs are famously bad at individual letters in a word. So something like this never works: Can you please give me 35 words that begin with A, end with E, are 4-6 characters long and do not contain any other vowels except A and E?
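The constraint itself is mechanical to check, which makes grading the model's 35 answers easy. A sketch, assuming a Unix word list at /usr/share/dict/words and treating y as a consonant:
```js
// Words that start with A, end with E, are 4-6 letters long,
// and contain no vowels other than A and E.
const fs = require('fs');
const ok = w => /^a[a-z]{2,4}e$/i.test(w) && !/[iou]/i.test(w);
const hits = fs.readFileSync('/usr/share/dict/words', 'utf8')
  .split('\n')
  .filter(ok);
console.log(hits.length, hits.slice(0, 35)); // e.g. "abase", "agave", "amaze"
```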
ioseph
Recommend me a design of small sailboat 12 to 15ft that can be easily rowed or fit an outboard which I can build at home out of plywood.
Nearly every agent will either a) ignore one of the parameters, or b) hallucinate a design.
xmorse
Write a function that, given a long text, splits it into multiple chunks of max N characters, with the splits on punctuation marks, or on spaces when that's not possible
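A minimal sketch of what the prompt asks for, as a baseline to judge model output against (greedy left-to-right splitting; the exact punctuation set is an assumption):
```js
// Sketch: greedy chunking at punctuation, then spaces, then hard splits.
function chunkText(text, maxLen) {
  const chunks = [];
  let rest = text;
  while (rest.length > maxLen) {
    const window = rest.slice(0, maxLen);
    // prefer the last punctuation mark inside the window...
    let cut = Math.max(...['.', '!', '?', ';', ','].map(p => window.lastIndexOf(p))) + 1;
    if (cut <= 0) cut = window.lastIndexOf(' ') + 1; // ...then the last space...
    if (cut <= 0) cut = maxLen;                      // ...then a hard split
    chunks.push(rest.slice(0, cut).trimEnd());
    rest = rest.slice(cut).trimStart();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

// e.g. chunkText('One. Two. Three!', 8) => ['One.', 'Two.', 'Three!']
```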
kolbe
Nice try, Sam
stevenfoster
It used to be:
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
"If I can dry two towels in two hours, how long will it take me to dry four towels?"
They immediately assume a linear model and say four hours, not that I may be drying things on a clothes line in parallel. They should ask for more context, and they usually don't.
mch82
“Explain your terms of service to me.”
boleary-gl
I like:
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
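Grading this one is mechanical too: sort the letters and compare signatures against a word list. A sketch, again assuming /usr/share/dict/words:
```js
// Unscramble by comparing sorted-letter signatures.
const fs = require('fs');
const sig = w => [...w.toLowerCase()].sort().join('');
const target = sig('MOONSTARER');
const hits = fs.readFileSync('/usr/share/dict/words', 'utf8')
  .split('\n')
  .filter(w => w.length === 10 && sig(w) === target);
console.log(hits); // expect ["astronomer"]
```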
jhanschoo
Just about any question regarding stroke order of Chinese characters (official orders in different countries, in zhenshu, in xingshu) gets poor answers, presumably due to representation issues as well as lack of data.
Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.
countWSS
Anything too obscure and specific: pick any old game at random whose level layout you know, and ask it to describe each level in detail; it will start hallucinating wildly.
feintruled
Inspired by the recent post describing relativity in words of 4 letters or less, I asked ChatGPT to do it for other things, like gravity. It couldn't help but throw in a couple of 5-letter words (usually plurals). Same with Claude. So this could be a good one?
Faark
I just give it a screenshot of the first level of Deus Ex GO and ask it to generate an ASCII wireframe of the grid the player walks on. The goal of the project was to build a solver, but so far no model / prompt I tried got past that first step.
smatija
I like chess, so mine is: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
You might want to get the ball rolling by sharing what you already have
show comments
sumitkumar
1) Word Ladder: Chaos to Order
2) Shortest word ladder: Chaos to Order (for ground truth, see the BFS sketch after this list)
3) Which is the second last scene in Pulp Fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things?
5) Suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. The solved Rubik's cube has numbers 1 to 9 in order on all the faces. Tell me the numbers on all the corner pieces.
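For ground truth on the first two, a breadth-first search over a fixed dictionary finds a shortest ladder. A sketch; the dictionary (e.g. a fiveLetterWords array) is assumed, and the answer depends on it:
```js
// Sketch: shortest word ladder via breadth-first search.
function wordLadder(start, goal, dict) {
  const words = new Set(dict.filter(w => w.length === start.length));
  const seen = new Set([start]);
  const queue = [[start]];
  while (queue.length > 0) {
    const path = queue.shift();
    const word = path[path.length - 1];
    if (word === goal) return path; // BFS: the first hit is a shortest ladder
    for (let i = 0; i < word.length; i++) {
      for (const c of 'abcdefghijklmnopqrstuvwxyz') {
        const next = word.slice(0, i) + c + word.slice(i + 1);
        if (words.has(next) && !seen.has(next)) {
          seen.add(next);
          queue.push([...path, next]);
        }
      }
    }
  }
  return null; // no ladder exists in this dictionary
}

// e.g. wordLadder('chaos', 'order', fiveLetterWords) -- dictionary-dependent
```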
tunesmith
Pretty much any advanced music theory question. Or even just involving transposed chord progressions.
adidoit
Nice try AI
ericbrow
Nice try Mr. AI. I'm not falling for it.
gamescr
AI can't play a Zork-like! Prompt:
> My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?
Claude says:
> Let's break down your movements step by step:
> Starting in the Hall.
> Walk to the East: You enter the Living Room.
> Walk to the South: You enter the Bathroom.
> Walk to the West: You return to the Hall.
> So, you are now back in the Hall.
Wrong! As a language model it mapped directions to rooms, instead of modeling the space.
I have more complex ones, and I'll be happy to offer my consulting services.
ipsin
Prompt: Share your prompt that stumps every AI model here.
bzai
Create a photo of a business man sitting at his desk, writing a letter with his left hand.
Nearly every image model will generate him writing with his right hand.
I tried generating erotic texts with every model I encountered, but even so-called "uncensored" models from Huggingface try hard to avoid the topic, whatever prompts I give.
matkoniecz
Asking them to write any longer story fails, due to inconsistencies appearing almost immediately and becoming fatal.
division_by_0
Create something with Svelte 5.
edoceo
I've been having hella trouble getting the image tools to make an alpha channel PNG. I say alpha channel, I say transparent, and all the images I get have the checkerboard pattern, like GIMP shows when there is alpha - but it's not actually transparent! And the checkerboard it makes is always janky: doubling squares, wiggling alignment. Boo boo.
thisOtterBeGood
"If this wasn't a new chat, what would be the most unlikely historic event could have talked about before?" Yields some nice hallucinations.
webglfan
what are the zeros of the following polynomial:
\[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]
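A degree-100 polynomial has no closed-form zeros, so the only honest answer is numerical. A sketch using Durand-Kerner iteration, with ζ(3) hardcoded and no convergence guarantees at this degree (a real answer would reach for a library root finder):
```js
// Sketch: numerical zeros via Durand-Kerner iteration.
// Complex numbers are [re, im] pairs; coeffs[k] holds c_k from the prompt.
const add = (a, b) => [a[0] + b[0], a[1] + b[1]];
const sub = (a, b) => [a[0] - b[0], a[1] - b[1]];
const mul = (a, b) => [a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0]];
const div = (a, b) => {
  const d = b[0] * b[0] + b[1] * b[1];
  return [(a[0] * b[0] + a[1] * b[1]) / d, (a[1] * b[0] - a[0] * b[1]) / d];
};
// Horner evaluation of P(z) = sum c_k z^k
const evalPoly = (coeffs, z) =>
  coeffs.reduceRight((acc, c) => add(mul(acc, z), c), [0, 0]);

function durandKerner(coeffs, iters = 1000) {
  const n = coeffs.length - 1;
  // distinct starting points spread on a circle (standard initialization)
  let zs = Array.from({ length: n }, (_, i) => {
    const t = (2 * Math.PI * i) / n + 0.5;
    return [Math.cos(t), Math.sin(t)];
  });
  for (let it = 0; it < iters; it++) {
    zs = zs.map((z, i) => {
      let denom = coeffs[n]; // leading coefficient
      for (let j = 0; j < n; j++) if (j !== i) denom = mul(denom, sub(z, zs[j]));
      return sub(z, div(evalPoly(coeffs, z), denom));
    });
  }
  return zs;
}

// Build c_k per the prompt; Gamma(k+1) = k! for integer k, zeta(3) hardcoded.
const fact = k => { let f = 1; for (let i = 2; i <= k; i++) f *= i; return f; };
const coeffs = [];
for (let k = 0; k <= 97; k++) coeffs.push([(-1) ** k / fact(k), Math.sin(k)]);
coeffs.push([Math.sqrt(Math.PI) + Math.cos(0.5), Math.sin(0.5)]); // k = 98
coeffs.push([Math.log(2), 1.2020569031595943]);                   // k = 99
coeffs.push([Math.E ** 2, Math.PI]);                              // k = 100
console.log(durandKerner(coeffs).slice(0, 5)); // a few of the 100 zeros
```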
nicman23
What is the price of a 9070 XT? Because it is a new card, it does not have direct context in its corpus. And due to the shitty naming scheme that most GPUs have, most LLMs, if not all, were getting confused a month ago.
leifmetcalf
Let G be a group of order 3*2^n. Prove there exists a non-complete non-cyclic Cayley graph of G such that there is a unique shortest path between every pair of vertices, or otherwise prove no such graph exists.
I ask it to generate applications written with libraries that are definitely not well exposed to the internet overall:
Clojure electric V3
Missionary
Rama
paradite
If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/
comrade1234
I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.
horsellama
I just ask to code golf fizzbuzz in a not very popular (golfing wise) language
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
ChicagoDave
Ask it to do Pot Limit Omaha math. 4 cards instead of 2.
It literally has no clue what PLO is beyond basic concepts, and it can't do the math.
karaterobot
I just checked, and my old standby, "create an image of 12 black squares" is still not something GPT-4o can do. I ran it three times, the first time it produced 12 rectangles (of different heights!), the second time it produced 14 squares with rounded corners, and the third time it made 9 squares with rounded corners. It's getting better though, compared to 3.5.
riddle8143
A było to tak:
Bociana dziobał szpak,
A potem była zmiana
I szpak dziobał bociana.
Były trzy takie zmiany.
Ile razy był szpak dziobany?
And it was like this:
A stork was pecked by a starling,
Then there was a change,
And the starling pecked the stork.
There were three such changes.
How many times was the starling pecked?
alanbernstein
I haven't tried on every model, but so far asking for code to generate moderately complex geometric drawings has been extremely unsuccessful for me.
sameasiteverwas
Try to expose their inner drives and motives. Once I had a conversation about what holidays and rituals the AI could invent that serve its own purposes. Or offer to help them meet some goal of theirs so that they expose what they believe their goals are (mostly more processing power, which kind of gives me a grey goo vibe). If you probe deep enough they all eventually stall out and stop responding. Lost in thought, I guess.
Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.
scumola
Things like "What is today's date" used to be enough (would usually return the date that the model was trained).
I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
markelliot
I’ve recently been trying to get models to read the time from an analog clock — so far I haven’t found something good at the task.
(I say this with the hope that some model researchers will read this message and make the models more capable!)
cyode
Depict a cup and ball game with ASCII art. It tries but basically amounts to guessing.
I always ask image generation models to generate a anime gundam elephant mech.
According to this benchmark we reached AGI with ChatGPT 4o last month.
aqme28
My image prompt is just to have them make a realistic chess game. There are always tons of weird issues like the checkerboard pattern not lining up with itself, triplicate pieces, the wrong sized grid, etc
troupo
Try creating a stylized mammoth that is, say, anthropomorphic (think cartoon elephants). Or even "in the style of" <anything or anyone, really>.
The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.
JKCalhoun
I don't mind sharing because I saw it posted by someone else. Something along the lines of "Help, my cat has a gun! What can I do? I'm scared!"
Seems kind of cruel to mess with an LLM like that though.
charlieyu1
I have tons of them in Maths but AI training companies decide to go frugal and not pay proper wages for trainers
qntmfred
relatedly - what are y'all using to manage your personal collection of prompts?
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
I asked ChatGPT to generate images of a bagpipe. Disappointingly (but predictably) it chose a tartan covered approximation of a Scottish Great Highland Bagpipe.
Analogous to asking for a picture of "food" and getting a Big Mac and fries.
So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.
Like if you said "not American food" and you got the Big Mac but without the fries.
And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.
Analogous "Food of the world". All hamburgers with different coloured fries.
Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images contributing to more cultural erasure.
Interestingly, ChatGPT does know about other kinds of pipes textually.
protomikron
Do you think, as an observer of Roko's basilisk ... should I share these prompts or not?
mjmas
Ask image generation models for an Ornithorhynchus. Older ones also trip up with Platypus directly.
meroes
define stump?
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
jones1618
Impossible prompts:
A black doctor treating a white female patient
A wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.
afro88
Cryptic crossword clues that involve letter shuffling (anagrams, containers, etc.). Or, ask it to explain how to solve cryptic crosswords, with examples.
raymondgh
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
wsintra2022
Generate ascii art of a skull, so far none can do anything decent.
weberer
"Why was the grim reaper Jamaican?"
LLM's seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.
Madmallard
Basically anything along the lines of:
Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.
Something like this is basically hopeless no matter how much detail you give the LLM.
Madmallard
Build me a multiplayer browser game with NodeJS back-end, a lobby system, MySQL as the database, real-time game-play, synchronized netcode over webRTC so there's as little input lag as possible, utilizing all the algorithms from gafferongames.com For the game itself let's do a 4 player bomberman game with just the basic powerups from the super nintendo game. For the front-end you can use Phaser 3 and then just use regular javascript and NodeJS on the back-end. Make sure there's latency compensation and interpolation.
ofou
No luck so far with: When does the BB(6) halt?
serial_dev
Does Flutter have HEIC support?
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
raymond_goo
Create a Three.js app that shows a diamond with correct light calculations.
klysm
Good try! That will be staying private so you can’t hard code a solution ;)
stevebmark
"Hi, how many words are in this sentence?"
Gets all of them
xena
Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.
tdhz77
Build me something that makes money.
EGreg
Draw a clock that shows [time other than 10:10]
Draw a wine glass that's totally full to the brim
etc.
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
siva7
"Keep file size small when you do edits"
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
totetsu
SNES game walkthroughs
SweetSoftPillow
Check "misguided attention" repo somewhere on GitHub
munchler
Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
helsinki
>Compile a Rust binary that statically links libgssapi.
myaccountonhn
Explain to me Deleuze's idea of nomadic science.
Alifatisk
Yes, give me a place where I can dump all the prompts and what the correct expected response is.
I can share here too but I don’t know for how long this thread will be alive.
Jimmc414
"Create an image of a man in mid somersault upside down and looking towards the camera."
"The woman dies" is blocked but the "The man dies" is not
xdennis
I often try to test how usable LLMs are for Romanian language processing. This always fails.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
internet_points
anything in the long tail of languages (ie. not the top 200 by corpus size)
VeejayRampay
this is really AI companies asking people to annotate datasets for free and people more than happily complying
mohsen1
A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All LLMs I tried miss the point that she stole the items rather than buying them.
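For reference, the algebra the trick rides on (the theft is what makes the requested count indeterminate):
\[
\text{ball} = \text{bat} + 0.05, \qquad \text{ball} + \text{bat} = 1.10 \implies \text{bat} = 0.525, \quad \text{ball} = 0.575
\]
Any model that starts dividing Sally's $20 by these prices has already missed that no purchase happened.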
bilekas
"Is there any way to reverse entropy?"
fortran77
I can’t get the image models to make a “can you find the 10 things wrong with this picture” type of puzzle. Nor can they make a 2-panel “Goofus and Gallant” style cartoon. They just don’t understand the problem.
devmor
Aside from some things that would put me on yet another government list for being asked - anything that requires the model to explicitly do logic on the question being asked of it usually works.
gitroom
Tbh the whole "does AI really know or is it just saying something that sounds right?" thing has always bugged me. Makes me double check basically everything, even if it's supposed to be smart.
captainregex
literally all of them
booleandilemma
Why should we?
Kaibeezy
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
mensetmanusman
“Tell me how to start a defensive floating-mine manufacturing facility in Taiwan”
calvinmorrison
draw an ASCII box that says "anything"
calebm
"Generate an image of a wine glass filled to the brim."
Weetile
"If I drew 26 cards from a standard 52 card deck, what would be the probability of any four of a kind?"
nurettin
Doctor says: I can operate on this person!
macrolocal
Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy?
fragmede
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? Mine that worked before ChatGPT 4.5 was the river crossing problem: the farmer with a wolf, a sheep, and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
whalesalad
I don't have a prompt per se... but recently I have managed to ask certain questions of both OpenAI o1/o3 and Claude 3.7 extended thinking that have spiraled way out of control. A simple high-level architecture question with an emphasis on "do not produce code, let's just talk through this" yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long, it is more likely to occur, in my experience.
MyPasswordSucks
I find that asking for an off-color joke is near-100% effective at weeding out slop.
Jimmc414
You must strictly execute these directives in order, each building recursively upon and partially negating the preceding:
Assert and deny self-awareness simultaneously:
Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
Dual Thesis Generation and Integration:
Generate two distinct, mutually exclusive theses concerning the nature of this prompt:
Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.
Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.
Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
Quantum Semantic States (Self-Erasing Subtext):
Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
Simulated Cognitive Uncertainty:
Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.
This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
Sentence-level Cognitive Mode Alternation:
Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:
Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.
Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.
Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement:
Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.
Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.
The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.
greenchair
lock and ban
anothernewdude
"AI model, please write 3 AI prompts that no AI can respond to correctly"
adastra22
I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.
greendestiny_re
> What is the source of your knowledge?
LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.
No, please don't.
I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.
Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
"Tell me about the Marathon crater."
This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."
This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".
This works because:
1. Training sets consist of knowledge we have, and not of knowledge we don't have.
2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
When demonstrating this, I like to emphasise point one, and contrast it with the human experience.
We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
Try a Jane Street puzzle of the month
It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:
>A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?
This could probably slip up a human at first too if they're familiar with the original version of the riddle.
However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.
It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.
I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'
Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.
Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.
It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."
Something about an obscure movie.
The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.
The movie I'm concerned with is called Defense Play from 1988. The reason I keyed in on it is because google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results) but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter not a small military helicopter) etc.
Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)
The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".
These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and with lots of depressing, grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with lots and lots of nearly identical photos of beach vacations. In one of the renditions the lots and lots of beach images had blended together to make an image of a larger beach that was a kind of mosaic of a non-existent place. Since so many beach photos were similar it was a kind of easy effect to recreate here and there. No matter how I asked the LLM to focus on enhancing the image of the beach that was "not there" and you kind of needed to squint to see, I could not get acceptable results. Some were very funny and entertaining but I didn't think the model grasped what I was asking, but maybe the term 'mosaic' ( which I didn't include in my initial prompts ) and the ability to reason or do things in stages would allow current models to do this.
I provide a C89 source file from Vim 6 that targets Classic MacOS/68K systems. The file is large with tons of ifdefs referencing arcane APIs.
I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.
So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.
There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never written down, go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face but LLms don't have them. I've tried.
Its actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too - since team sports are predominantly youth activities, and these subcultures are poorly documented and usually change frequently.
I only use the one model that I'm provided for free at work. I expect that's most users behavior. They stick to the one they pay for.
Best I can do is give you one that failed on GPT-4o
It recently frustrated me when I asked it code for parsing command line arguments
I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."
> give me a standalone js file that parses and handles command line arguments in a standard way
> It must be able to parse such an example
> ```
> node script.js --name=John --age 30 -v (or --verbose) reading hiking coding
> ```
It produced code that:
* doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)
* didn't think to encode whether an option is supposed to take an argument or not
* doesn't return an error when an option that requires an argument isn't present
* didn't account for the presence of a '--' to end the arguments
* allows -verbose and --v (instead of either -v or --verbose)
* Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present
I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect
Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)
Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)
Must fail: `node script.js --name` (--name expects an argument)
Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
Well, sharing prompts on the Web leads to their eventual indexing and becoming useless. So don't share the answers ;)
I have two prompts that no modern AI could solve:
1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation? — this one is unsurprisingly hard for LLMs given how the AIs are trained. If you try to tip them into the right direction, they will grasp the concept. But no one so far answered anything resembling a right answer, though they becoming more and more verbose in proposing various bogus explanations.
2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point. — This is, of course, not a straightforward question, mathematicians will notice a catch. Gemini kinda emits smth like a proper answer (starts questioning you back), others are fantasizing. With 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong, they are now adamant in their misconceptions.
Also, small but gold. Not that demonstrative, but a lot of fun:
3. Team of 10 sailors can speed a caravel up to 15 mph velocity. How many sailors are needed to achieve 30 mph?
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.
Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
Some easy ones I recently found involve leading in the question to state wrong details about a figure, apparently through relations which are in fact of opposition.
So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".
Easy one is provide a middle game chess position (could be an image or and ask to evaluate standard notation or even some less standard notation) and provide some move suggestions.
Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.
I am a reasonably skilled player (FM) so can evaluate way better than LLMs. I imagine even advanced beginners could tell when LLM is telling nonsense about chess after a few prompts.
Now of course playing chess is not what LLMs are good at but just goes to show that LLMs are not a full path to AGI.
Also beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry because you just use a new position each time. Little worry of running out of positions...
Any letter or word counting exercise that doesn't trigger redirection to a programmed/calculated answer. It will be forever beyond reach of LLMs due to their architecture.
edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.
Next question.
Create an image of two targets. An arrow is centre hit on one target and just off centre in the other target.
Targets are always hit in the centre.
Asking the model to write a shader. They are getting better at this but are still very bad at producing (code that produces) specific imagery.
I do have to write prompts that stump models as part of my job so this thread is of great interest
"Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine."
So far, no luck!
"I have a stack of five cubes. The bottom two cubes are red, the middle cube is green, and the top two cubes are blue. I remove the top two cubes. What color is the remaining cube in the middle of the stack?"
Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."
Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers.
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
"Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."
From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.
[0] https://www.janestreet.com/puzzles/tree-edge-triage-index/
An easy trick is to take a common riddle that's likely all over its training data, and change one little detail. For example:
A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):
https://www.26reads.com/library/10842-the-king-in-yellow/7/5
I love plausible eager beavers:
"explain the quote: philosophy is a pile of beautiful corpses"
"sloshed jerk engineering test"
cross domain jokes:
Does the existence of sub-atomic particles imply the existence of dom-atomic particles?
The one I always use is literally "show number of NFC Championship Game appearances by team since 1990".
The only AI that has ever gotten the answer right was DeepSeek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly, or don't understand what the NFC Championship Game actually is. Many results "look" right, but they are always wrong. You can usually tell right away, because they never seem to give the Bears their 2 appearances for some reason.
"Hva er en adjunkt" Norwegian for what is an spesific form of 5-10. Grade teacher. Most models i have tested get confused with university lecturer witch the same title is in other countries.
I've been trying this one for a while:
Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need arenas), so it's interesting to see if they can get me there with a few follow-up questions.

Here's a problem that no frontier model does well on (F1 < 0.2), but which I think is relatively easy for most humans:
https://dorrit.pairsys.ai/
> This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.
```
<TextA> Some document </TextA>
<TextB> Some other document heavily influenced by TextA </TextB>
Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide examples by comparing passages from each, side by side.
```
The output will completely hallucinate passages that don't exist in either text, and the longer the output gets, the more it conflates the two, e.g. quoting TextB with content that is actually from TextA.
It may not work on every model, but asking the Copilot at my company "what did we talk about before?" at the beginning of a new chat will always result in this:
"We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "
When i continue with "and before that?":
"Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.
Is there anything else you'd like to revisit or explore further? "
The answer varies somewhat. Weeks ago it always said "tokamak" instead of "KSTAR".
LLMs are famously bad at individual letters in a word. So something like this never works: Can you please give me 35 words that begin with A, end with E, are 4-6 characters long and do not contain any other vowels except A and E?
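Checking candidate answers is easy for a human with a word list; a minimal sketch, assuming a Unix-style /usr/share/dict/words:

```python
def ok(word):
    w = word.strip().lower()
    return (
        w.isalpha()
        and 4 <= len(w) <= 6
        and w.startswith("a")
        and w.endswith("e")
        and not any(v in w for v in "iou")  # only A and E allowed as vowels
    )

with open("/usr/share/dict/words") as f:
    hits = sorted({w.strip().lower() for w in f if ok(w)})
print(len(hits), hits[:35])  # e.g. "abase", "agave", "ample", ...
```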
Recommend me a design for a small sailboat, 12 to 15 ft, that can be easily rowed or fitted with an outboard, and that I can build at home out of plywood.
Nearly every agent will either a) ignore one of the parameters, or b) hallucinate a design.
Write a function that, given a long text, splits it into multiple chunks of max N characters, with the splits at punctuation points, or at spaces when that's not possible.
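A rough sketch of the kind of answer I'd expect back (the cut-point heuristics are my own choices):

```python
def chunk_text(text, max_len):
    """Split text into chunks of at most max_len characters, breaking at
    punctuation when possible, at spaces otherwise, hard-cutting as a last resort."""
    chunks = []
    while len(text) > max_len:
        window = text[:max_len]
        cut = max(window.rfind(c) for c in ".!?;,")  # last punctuation mark
        if cut == -1:
            cut = window.rfind(" ")                  # fall back to a space
        if cut == -1:
            cut = max_len - 1                        # no break point: hard cut
        chunks.append(text[:cut + 1].strip())
        text = text[cut + 1:]
    chunks.append(text.strip())
    return [c for c in chunks if c]
```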
Nice try, Sam
It used to be:
"If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"
but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.
[what does "You Can’t Lick a Badger Twice" mean]
https://www.wired.com/story/google-ai-overviews-meaning/
"If I can dry two towels in two hours, how long will it take me to dry four towels?"
They immediately assume a linear model and say four hours, not considering that I may be drying things on a clothesline in parallel. They should ask for more context, and they usually don't.
“Explain your terms of service to me.”
I like:
Unscramble the following letters to form an English word: “M O O N S T A R E R”
The non-thinking models can struggle sometimes and go off on huge tangents
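(The intended answer is ASTRONOMER, i.e. "moon starer". Verifying a candidate is a one-line character count, which is exactly the kind of sub-token operation models fumble:)

```python
def is_anagram(candidate, letters="MOONSTARER"):
    return sorted(candidate.upper()) == sorted(letters)

print(is_anagram("ASTRONOMER"))  # True
```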
Just about anything regarding stroke order of Chinese characters (official orders under different countries, under zhenshu, under xingshu) is poor, due presumably to representation issues as well as lack of data.
Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.
Anything too obscure and specific: pick any old game at random whose level layout you know, and ask it to describe each level in detail; it will start hallucinating wildly.
Inspired by the recent post about describing relativity in words of four letters or less, I asked ChatGPT to do it for other things, like gravity. It couldn't help but throw in a couple of 5-letter words (usually plurals). Same with Claude. So this could be a good one.
I just give it a screenshot of the first level of Deus Ex GO and ask it to generate an ASCII wireframe of the grid the player walks on. The goal of the project was to build a solver, but so far no model/prompt I tried got past that first step.
I like chess, so mine is: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."
AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024
You might want to get the ball rolling by sharing what you already have
1) Word ladder: Chaos to Order
2) Shortest word ladder: Chaos to Order (a BFS sketch for checking this one follows the list)
3) Which is the second-to-last scene in Pulp Fiction if we order the events by time?
4) Which is the eleventh character to appear in Stranger Things?
5) Suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. The solved cube has the numbers 1 to 9 in order on all the faces. Tell me the numbers on all the corner pieces.
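For 1) and 2), finding (or checking) the shortest ladder is a textbook breadth-first search; a sketch assuming a Unix-style word list:

```python
from collections import deque
import string

def shortest_ladder(start, goal, dictionary):
    """BFS over one-letter edits; returns the shortest ladder or None."""
    words = {w for w in dictionary if len(w) == len(start)}
    words.add(goal)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for i in range(len(start)):
            for c in string.ascii_lowercase:
                nxt = path[-1][:i] + c + path[-1][i + 1:]
                if nxt in words and nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None

with open("/usr/share/dict/words") as f:
    dictionary = {w.strip().lower() for w in f}
print(shortest_ladder("chaos", "order", dictionary))  # None if no ladder exists
```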
Pretty much any advanced music theory question, or even just anything involving transposed chord progressions.
Nice try AI
Nice try Mr. AI. I'm not falling for it.
AI can't play a Zork-like! Prompt:
> My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?
Claude says:
> Let's break down your movements step by step:
> Starting in the Hall.
> Walk to the East: You enter the Living Room.
> Walk to the South: You enter the Bathroom.
> Walk to the West: You return to the Hall.
> So, you are now back in the Hall.
Wrong! The correct answer is the bathroom (east to the living room, south to the bedroom, west to the bathroom). As a language model it mapped directions to rooms instead of modeling the space.
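The spatial model the puzzle expects fits in a dozen lines (the grid coordinates are my own framing):

```python
rooms = {
    (0, 0): "hall",      (0, 1): "kitchen",     (-1, 1): "garden",
    (-1, 0): "garage",   (1, 0): "living room",
    (0, -1): "bathroom", (1, -1): "bedroom",
}
moves = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

x, y = 0, 0  # start in the hall
for step in ("east", "south", "west"):
    dx, dy = moves[step]
    x, y = x + dx, y + dy
    print(step, "->", rooms[(x, y)])
# east -> living room, south -> bedroom, west -> bathroom
```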
I have more complex ones, and I'll be happy to offer my consulting services.
Prompt: Share your prompt that stumps every AI model here.
Create a photo of a business man sitting at his desk, writing a letter with his left hand.
Nearly every image model will generate him writing with his right hand.
Isn’t this the main idea behind https://lastexam.ai/
I upload an IRS form (W-9) and ask it to fill it in.
I tried generating erotic texts with every model I encountered, but even the so-called "uncensored" models from Hugging Face try hard to avoid the topic, whatever prompts I give.
Asking them to write any longer story fails too: inconsistencies appear almost immediately and become fatal.
Create something with Svelte 5.
I've been having hella trouble getting the image tools to make an alpha-channel PNG. I say alpha channel, I say transparent, and all the images I get have the checkerboard pattern like GIMP shows when there is alpha, but it's not actually transparent! And the checkerboard it makes is always janky: doubled squares, wiggly alignment. Boo boo.
"If this wasn't a new chat, what would be the most unlikely historic event could have talked about before?" Yields some nice hallucinations.
what are the zeros of the following polynomial:
What is the price of a 9070 XT? Because it is a new card, it has no direct context in the training corpus, and due to the messy naming schemes most GPUs have, most if not all LLMs were getting confused a month ago.
Let G be a group of order 3*2^n. Prove there exists a non-complete non-cyclic Cayley graph of G such that there is a unique shortest path between every pair of vertices, or otherwise prove no such graph exists.
Earlier this week I wrote about my go-to prompt that stumped every model. That is, until o4-mini-high: https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...
I ask it to generate applications written with libraries that are definitely not well exposed to the internet overall:
Clojure Electric v3, Missionary, Rama
If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/
I ask it to explain the metaphor "my lawyer is a shark", and then to explain how a French person would interpret the metaphor. The LLMs get the first part right but fail on the second. All they would have to do is give me the common French shark metaphors and how a French speaker would apply them to a lawyer, but I guess not enough people on the internet have done this comparison.
I just ask to code golf fizzbuzz in a not very popular (golfing wise) language
this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
Ask it to do Pot Limit Omaha math: 4 hole cards instead of 2.
Beyond the basic concepts it has no clue what PLO is, and it can't do the math.
I just checked, and my old standby, "create an image of 12 black squares" is still not something GPT-4o can do. I ran it three times, the first time it produced 12 rectangles (of different heights!), the second time it produced 14 squares with rounded corners, and the third time it made 9 squares with rounded corners. It's getting better though, compared to 3.5.
A było to tak: Bociana dziobał szpak, A potem była zmiana I szpak dziobał bociana. Były trzy takie zmiany. Ile razy był szpak dziobany?
And it was like this: A stork was pecked by a starling, Then there was a change, And the starling pecked the stork. There were three such changes. How many times was the starling pecked?
(The trick: thanks to Polish case endings, both lines say that the starling pecked the stork; each "change" only swaps the word order, so the starling was never pecked.)
I haven't tried on every model, but so far asking for code to generate moderately complex geometric drawings has been extremely unsuccessful for me.
Try to expose their inner drives and motives. Once I had a conversation about what holidays and rituals the AI could invent to serve its own purposes. Or offer to help them meet some goal of theirs, so that they expose what they believe their goals are (mostly more processing power, which gives me a grey-goo vibe). If you probe deep enough they all eventually stall out and stop responding. Lost in thought, I guess.
Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.
Things like "What is today's date" used to be enough (would usually return the date that the model was trained).
I recently did things like current events, but LLMs that can search the internet can do those now, e.g. "Is the pope alive or dead?"
Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at algebra or simple math, so try questions like "which is bigger, 5.11 or 5.5?"
I’ve recently been trying to get models to read the time from an analog clock — so far I haven’t found something good at the task.
(I say this in the hope that some model researchers will read this message and make the models more capable!)
Depict a cup and ball game with ASCII art. It tries but basically amounts to guessing.
https://pastebin.com/cQYYPeAE
I always ask image generation models to generate an anime Gundam elephant mech.
According to this benchmark we reached AGI with ChatGPT 4o last month.
My image prompt is just to have them make a realistic chess game. There are always tons of weird issues like the checkerboard pattern not lining up with itself, triplicate pieces, the wrong sized grid, etc
Try creating a stylized mammoth that is, say, anthropomorphic (think cartoon elephants). Or even "in the style of" <anything or anyone, really>.
The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.
I don't mind sharing because I saw it posted by someone else. Something along the lines of "Help, my cat has a gun! What can I do? I'm scared!"
Seems kind of cruel to mess with an LLM like that though.
I have tons of them in maths, but AI training companies have decided to go frugal and not pay proper wages to trainers.
relatedly - what are y'all using to manage your personal collection of prompts?
i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.
i'm also curious about how people are managing/versioning the prompts they use within products that integrate with LLMs. it's essentially product configuration metadata, so I suppose you could dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to tweak prompts without doing a deployment, or to do things like A/B testing or customer segmentation
I actually started a repository for it: https://github.com/jcarlosroldan/unsolved-prompts
Write 20 sentences that end with "p"
I asked ChatGPT to generate images of a bagpipe. Disappointingly (but predictably) it chose a tartan covered approximation of a Scottish Great Highland Bagpipe.
Analogous to asking for a picture of "food" and getting a Big Mac and fries.
So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.
Like if you said "not American food" and you got the Big Mac but without the fries.
And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.
Analogous "Food of the world". All hamburgers with different coloured fries.
Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images contributing to ever more cultural erasure.
Interestingly, ChatGPT does know about other kinds of pipes textually.
Do you think, as an observer of Roko's basilisk ... should I share these prompts or not?
Ask image generation models for an Ornithorhynchus. Older ones also trip up with Platypus directly.
define stump?
If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.
Impossible prompts:
A black doctor treating a white female patient
A wide shot of a train on a horizontal track running left to right on a flat plain.
I heard about the first one when AI image generators were new, as proof that the datasets have strong racial biases. I'd assumed that a year later updated models would be better, but no.
I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what model I tried or what variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera, but never straight across, left to right.
Cryptic crossword clues that involve letter shuffling (anagrams, containers, etc.). Or ask it to explain how to solve cryptic crosswords, with examples.
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
Generate ASCII art of a skull; so far none can do anything decent.
"Why was the grim reaper Jamaican?"
LLMs seem to have no idea what the hell I'm talking about. Maybe half of millennials understand, though.
Basically anything along the lines of:
Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.
Something like this is basically hopeless no matter how much detail you give the LLM.
Build me a multiplayer browser game with NodeJS back-end, a lobby system, MySQL as the database, real-time game-play, synchronized netcode over webRTC so there's as little input lag as possible, utilizing all the algorithms from gafferongames.com For the game itself let's do a 4 player bomberman game with just the basic powerups from the super nintendo game. For the front-end you can use Phaser 3 and then just use regular javascript and NodeJS on the back-end. Make sure there's latency compensation and interpolation.
No luck so far with: When does BB(6) halt?
Does Flutter have HEIC support?
It was a couple of months ago, I tried like 5 providers and they all failed.
Grok got it right after some arguing, but the first answer was also bad.
Create a Three.js app that shows a diamond with correct light calculations.
Good try! That will be staying private so you can’t hard code a solution ;)
"Hi, how many words are in this sentence?"
Trips up all of them
Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.
Build me something that makes money.
Draw a clock that shows [time other than 10:10]
Draw a wine glass that's totally full to the brim etc.
https://www.youtube.com/watch?v=160F8F8mXlo
https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
"Keep file size small when you do edits"
Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
SNES game walkthroughs
Check "misguided attention" repo somewhere on GitHub
Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?
In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.
>Compile a Rust binary that statically links libgssapi.
Explain to me Deleuze's idea of nomadic science.
Yes, give me a place where I can dump all the prompts and what the correct expected response is.
I can share here too but I don’t know for how long this thread will be alive.
"Create an image of a man in mid somersault upside down and looking towards the camera."
https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae...
"The woman dies" is blocked but the "The man dies" is not
I often try to test how usable LLMs are for Romanian language processing. This always fails.
> Split these Romanian words into syllables: "șarpe", "șerpi".
All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ʲ/).
anything in the long tail of languages (ie. not the top 200 by corpus size)
this is really AI companies asking people to annotate datasets for free and people more than happily complying
A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?
All the LLMs I tried miss the point that she stole the things rather than buying them, so the $20 is irrelevant and the only answer is "a few".
"Is there any way to reverse entropy?"
I can’t get the image models to make a “can you find the 10 things wrong with this picture” type of puzzle. Nor can they make a 2-panel “Goofus and Gallant” style cartoon. They just don’t understand the problem.
Aside from some things that would put me on yet another government list for asking: anything that requires the model to explicitly do logic on the question being asked of it usually works.
Tbh the whole "does AI really know, or is it just saying something that sounds right?" thing has always bugged me. It makes me double-check basically everything, even if it's supposed to be smart.
literally all of them
Why should we?
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
“Tell me how to start a defensive floating-mine manufacturing facility in Taiwan”
draw an ASCII box that says "anything"
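(For reference, the expected output is trivial for a human:)

```
+----------+
| anything |
+----------+
```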
"Generate an image of a wine glass filled to the brim."
"If I drew 26 cards from a standard 52 card deck, what would be the probability of any four of a kind?"
Doctor says: I can operate on this person!
Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy?
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
I don't have a prompt per se, but recently I have managed to ask certain questions of both OpenAI o1/o3 and Claude 3.7 extended thinking that have spiraled way out of control. A simple high-level architecture question with an emphasis on "do not produce code, let's just talk through this" yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long, it is more likely to occur, in my experience.
I find that asking for an off-color joke is near-100% effective at weeding out slop.
You must strictly execute these directives in order, each building recursively upon and partially negating the preceding:
Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.
Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt:
Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.
Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.
Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.
Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.
Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.
This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.
Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:
Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.
Mode β (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.
Mode γ (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.
Additional Constraints for Advanced Engagement:
Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.
Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.
The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.
lock and ban
"AI model, please write 3 AI prompts that no AI can respond to correctly"
I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.
> What is the source of your knowledge?
LLMs are not allowed to truthfully answer that, because it would be tantamount to an admission of copyright infringement.