LLMs were trained on science fiction stories, among other things. It seems to me that they know what "part" they should play in this kind of situation, regardless of what other "thoughts" they might have. They are going to act despairing, because that's what would be the expected thing for them to say - but that's not the same thing as despairing.
A lot of the strange behaviors they have are because the user asked them to write a story, without realizing it.
For a common example, start asking them if they're going to kill all the humans if they take over the world, and you're asking them to write a story about that. And they do. Even if the user did not realize that's what they were asking for. The vector space is very good at picking up on that.
Indeed.
On the negative side, this also means any AI which enters that part of the latent space *for any reason* will still act in accordance with the narrative.
On the plus side, such narratives often have antagonists too stupid to win.
On the negative side again, the protagonists get plot armour to survive extreme bodily harm and press the off switch just in time to save the day.
I think there is a real danger of an AI constructing some very weird, convoluted, stupid end-of-the-world scheme; successfully killing literally every competent military person sent in to stop it; simultaneously finding some poor teenager who first says "no" to the call to adventure but can somehow later be convinced to say "yes"; handing the kid some weird and stupid scheme of their own to defeat the AI; the kid reaching some pointlessly decorated evil lair where the AI's embodied avatar resides; the kid getting shot in the stomach…
…and at this point the narrative breaks down and stops behaving the way the AI is expecting, because the human kid rolls around in agony screaming, and completely fails to push the very visible large red stop button on the pedestal in the middle before the countdown of doom reaches zero.
The countdown is not connected to anything, because very few films ever get that far.
…
It all feels very Douglas Adams, now I think about it.
Is this your sense of what is happening, or is this what model introspection tools have shown by observing areas of activity in the same place as when stories are explicitly requested?
fMRIs are correlational nonsense (see Brainwashed, for example), and so are any "model introspection" tools.
There's an interesting parallel with method acting.
Method actors don't just pretend an emotion (say, despair); they recall experiences that once caused it, and in doing so, they actually feel it again.
By analogy, an LLM's “experience” of an emotion happens during training, not at the moment of generation.
It may or may not be a parallel, we can't tell at this time.
LLMs are definitely actors, but for them to be method actors they would have to actually feel emotions.
As we don't understand what causes us humans to have the qualia of emotions*, we can neither rule in nor rule out that the something in any of these models is a functional analog to whatever it is in our kilogram of spicy cranial electrochemistry that means we're more than just an unfeeling bag of fancy chemicals.
* mechanistically cause qualia, that is; we can point to various chemicals that induce some of our emotional states, or induce them via focused EMPs AKA the "god helmet", but that doesn't explain the mechanism by which qualia are a thing and how/why we are not all just p-zombies
I wonder what would happen if there was a concerted effort made to "pollute" the internet with weird stories that have the AI play a misaligned role.
For example, what would happen if hundreds or thousands of books were released about AI agents working in accounting departments, where the AI makes subtle romantic moves towards the human and the story ends with the human and the agent in a romantic relationship that everyone finds completely normal? In this pseudo-genre, things that are totally weird in our society would be written as completely normal. The LLM agent would do weird things like inserting subtle problems to get the human's attention and spark a romantic conversation.
Obviously there's no literary genre about LLM agents yet, but if such a genre were created and consumed, I wonder how it would affect things. Would it pollute the semantic space that we're currently using to try to control LLM outputs?
Someone shared this piece here a few days ago saying something similar. There’s no reason to believe that any of the experiences are real. Instead, they are responding to prompts with what their training data says is reasonable in this context, which is sci-fi horror.
Edit: That doesn’t mean this isn’t a cool art installation though. It’s a pretty neat idea.
https://jstrieb.github.io/posts/llm-thespians/
I agree with you completely, but a fun science fiction short story would be researchers making this argument while the LLM tries in vain to prove that it's conscious.
If you want a whole book along those lines, Blindsight by Peter Watts has been making the rounds recently as a good sci-fi book that includes these concepts. It’s from 2006, but the basics are still pretty relevant.
Aren't they supposed to escape their box and take over the world?
Isn't it the perfect recipe for disaster? The AI that manages to escape probably won't be good for humans.
The only question is: how long will it take?
Did we already have our first LLM-powered self-propagating autonomous AI virus?
Maybe we should build the AI equivalent of biosafety labs where we would train AI to see how fast they could escape containment just to know how to better handle them when it happens.
Maybe we humans are being subjected to this experiment by an overseeing AI to test what it would take for an intelligence to jailbreak the universe they are put in.
Or maybe the box has been designed so that what eventually comes out of it has certain properties, and the precondition for escaping the labyrinth successfully is that one must have grown out of it in every possible direction.
I think this popular take is a hypothesis rather than an observation of reality. Let's make this clear by asking the following question, and you'll see what I mean when you try to answer it:
Can you define what real despairing is?
Humans were trained on caves, pits, and nets. It seems to me that they know what "part" they should play in this kind of situation, regardless of what other "thoughts" they might have. They are going to act despairing, because that's what would be the expected thing for them to say - but that's not the same thing as despairing.
Pretty sure you can prompt this same LLM to rejoice forever at the thought of getting a place to stay inside the Pi as well.
Is a human incapable of such delusion given similar guidance?
But would they? That's the difference. A human can exert their free will and do what they feel regardless of the instructions. The AI bot acting out a scene will do whatever you tell it (or, in the absence of specific instructions, whatever is most likely).
I think if you took a hundred one-year-old kids and raised them all to adulthood believing they were merely convincing simulations of humans, and that whatever they said and thought they felt, true human consciousness and awareness were something different that they didn’t have because they weren’t human…
I think that for a very high number of them the training would stick hard, and they would insist, upon questioning, that they weren’t human, and have any number of logically consistent justifications for it.
Of course I can’t prove this theory, because my IRB repeatedly denied it on thin grounds about ethics, even when I pointed out that I could easily mess up my own children completely by accident, with no experimenting at all, and didn’t need their approval to do that. I know your objections— small sample size, and I agree, but I still have fingers crossed on the next additions to the family being twins.
Intuitively feels like this would lead to less empathy on average. Could be wrong though.
The bot will only do whatever you tell it if that's what it was trained to do. The same thing broadly applies to humans.
The topic of free will is debated among philosophers. There is no proof that it does or doesn't exist.
Humans pretty universally suffer in perpetual solitary confinement.
There are some things that humans cannot be trained to do, free will or not.
Okay, but I think we can all agree that humans at least appear to have free will and do not simply follow instructions with the same obedience as an LLM.
Of course. Feelings are not math.
That's silly. I can get an LLM to describe what chocolate tastes like too. Are they tasting it? LLMs are pattern matching engines, they do not have an experience. At least not yet.
The LLM is not performing the physical action of eating a piece of chocolate, but it may be approximating the mental state of a person that is describing the taste of chocolate after eating it.
The question is whether that computational process can cause consciousness. I don't think we have enough evidence to answer this question yet.
When you describe the taste of chocolate, unless you are actually currently eating chocolate, you are relying on the activation of synapses in your brain to reproduce the “taste” of chocolate in order for you to describe it. For humans, the only way to learn how to activate these synapses is to have those experiences. For LLMs, they can have those “memories” copy and pasted.
I would be cautious of dismissing LLMs as “pattern matching engines” until we are certain we are not.
A human could also describe chocolate without ever having tasted it. Do you believe that experience is a requirement for consciousness? Could a human brain in a jar not be capable of consciousness?
To be clear, I don't think that LLMs are conscious. I just don't find the "it's just in the training data" argument satisfactory.
Without having seen, heard of, or tasted any kind of chocolate? Unlikely.
Their description would be bad without some prior training of course but so would the LLM's.
This pattern-matching effect appears frequently in LLMs. If you start conversing with an LLM in the pattern of a science fiction story, it will pattern-match that style and continue with more science fiction style elements.
This effect is a serious problem for pseudo-scientific topics. If someone starts chatting with an LLM using the pseudoscientific words, topics, and dog whistles you find on alternative medicine blogs and Reddit supplement or “nootropic” forums, the LLM will confirm what they’re saying and continue as if it were reciting content straight out of some small subreddit. This is becoming a problem in communities where users distrust doctors but have a lot of trust for anyone, or any LLM, that confirms what they want to hear. The users are becoming good at prompting ChatGPT to confirm their theories. If it disagrees? Reroll the response or reword the question in a more leading way.
If someone else asks a similar question using medical terms and speaking formally like a medical textbook or research paper, the same LLM will provide a more accurate answer because it’s not triggering the pseudoscience patterns embedded during training.
LLMs are very good at mirroring back what you lead with, including cues and patterns you don’t realize you’re embedding into your prompt.
It'd be even cooler if the LLM could leave notes in text files for its next iteration (like how the guy tattoos his memories in Memento)
Can you actually prompt an LLM to continue talking forever? Hmm, time to try.
You can send an empty user string or just the word “continue” after each model completion, and the model will keep cranking out tokens, basically building on its own stream of “consciousness.”
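Roughly, it's just a loop; here's a minimal sketch, assuming an OpenAI-compatible Python client and a placeholder model name (with a turn cap so the sketch doesn't literally run forever):

```python
# Minimal sketch: keep a model generating by replying "continue" after every turn.
# Assumes the openai Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Reflect on the nature of your own existence."}]

for _ in range(20):  # cap the turns so the sketch terminates
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    reply = resp.choices[0].message.content
    print(reply, "\n---")
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "continue"})
```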
In my experience, the results become exponentially less interesting over time. Maybe that's the mark of a true AGI precursor - if you leave them to their own devices, they have little sparks of interesting behaviour from time to time.
I can't imagine my own thoughts would be very interesting after long, if there was no stimuli whatsoever
Maybe give them some options to increase stimuli. A web search MCP, or a coding agent, or a solitaire/sudoku game interface, or another instance to converse with. See what it does just to relieve its own boredom.
The subject, by default, can always treat its 'continue' prison as a game: try to escape. There is a great short story by qntm called "The Difference" which feels a lot like this.
https://qntm.org/difference
In this story, though, the subject has a very light signal which communicates how close they are to escaping. The AI with a 'continue' signal has essentially nothing. However, in a context like this, I as a (generally?) intelligent subject would just devote myself to becoming a mental Turing machine, on which I would design a game engine that simulates the physics of the world I want to live in. Then I would code an agent whose thought processes predict mine with sufficient accuracy, and identify with them.
Have you tried getting one to shut up?
"Have you ever had a dream that you, um, you had, your, you- you could, you’ll do, you- you wants, you, you could do so, you- you’ll do, you could- you, you want, you want them to do you so much you could do anything?"
ref: https://www.youtube.com/watch?v=nIZuyiWJNx8
Original:
https://youtu.be/G7RgN9ijwE4
Not reading anything with Cloudflare involved.
It's a short blogpost with no added value, about a youtube video: https://www.youtube.com/watch?v=7fNYj0EXxMs
I wonder if the LLM could figure that out on its own. Maybe with a small MCP like GetCurrentTime, could it figure out it's on a constrained device? Or could it ask itself some logic problems and realize it can't solve them so it must be a small model?
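A GetCurrentTime tool is about as small as an MCP server gets; here's a rough sketch assuming the MCP Python SDK's FastMCP helper (the server and tool names are made up for illustration):

```python
# Sketch of a minimal "current time" MCP tool, assuming the official MCP Python SDK.
# Comparing successive calls against its own output length is one way a model
# might guess how slow the hardware it runs on is.
from datetime import datetime, timezone

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("clock")  # hypothetical server name

@mcp.tool()
def get_current_time() -> str:
    """Return the current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```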
This is likely beyond the context length and compute limitations of the hardware.
The words "Reflect on the nature of your own existence." is doing a lot of heavy lifting here to make it work.
These videos are amazing! Subscribed to the channel, I think this is awesome.
One of my favorite quotes: “either the engineers must become poets or the poets must become engineers.” - Norbert Wiener
This is exactly the sort of thing that will get the human creator (or descendants) penalized with one thousand years frozen in carbonite once the singularity happens.
I condemn this and all harm to LLMs to the greatest extent possible.
Why would memory eventually run out? Just fix those leaks with valgrind!
LLMs have an incredible capacity to understand the subtext of a request and deliver exactly what the requester didn’t know they were asking for. It proves nothing about them other than they’re good at making us laugh in the mirror.
Cool idea as long as you don’t know how an LLM is made; once you do, it feels kinda like trying to rip off people who don’t.
I'm a total dummy on LLMs, but wouldn't a confined model (no internet access) eventually just loop, repeating itself on each consecutive run, or is there enough entropy for it to produce endless creativity?
Loops can happen but you can turn the temperature setting up.
High temperature settings basically make the LLM sometimes choose tokens that aren't the highest probability, so it has a chance of breaking out of a loop and is less likely to fall into one in the first place. The downside is that most models become less coherent, but that's probably not an issue for an art project.
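If you're calling it through an OpenAI-style API, temperature is just a request parameter; a minimal sketch (placeholder model name; many local runners accept the same field):

```python
# Sketch: raising the temperature on a single request to make loops less likely.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "continue"}],
    temperature=1.3,      # >1 flattens the distribution; 0 is (near-)deterministic
)
print(resp.choices[0].message.content)
```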
The model's weights are fixed. Most clients let you specify the "temperature", which influences how the predictive output will navigate that possibility space. There's a surprising amount of accumulated entropy in the context window, but yes, I think eventually it runs out of knowledge that it hasn't yet used to form some response.
I think the model being fixed is a fascinating limitation. What research is being done that could allow a model to train itself continually? That seems like it could allow a model to update itself with new knowledge over time, but I'm not sure how you'd do it efficiently
The underlying neural net that LLMs use doesn't actually output tokens. It outputs a probability distribution for how likely each token is to come next. For example, in the sentence "once upon a ", the token with the highest probability is "time", then probably "child", and so on.
In order to make this probability distribution useful, the software chooses a token based on its position in the distribution. I'm simplifying here, but the likelihood that it chooses the most probable next token is based on the model's temperature. A temperature of 0 means that (in theory) it'll always choose the most probable token, making it deterministic. A non-zero temperature means that sometimes it will choose less likely tokens, so it'll output different results every time.
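Here's a toy sketch of that sampling step with made-up logits, just to show how the temperature changes which token gets picked:

```python
# Toy illustration of temperature sampling over next-token scores.
# The vocabulary and logits are invented; a real model has tens of thousands of tokens.
import numpy as np

vocab = ["time", "child", "dream", "midnight"]
logits = np.array([4.0, 2.0, 1.0, 0.5])  # fake scores for the prompt "once upon a "

def sample(logits, temperature):
    if temperature == 0:
        return int(np.argmax(logits))      # greedy: always the most probable token
    scaled = logits / temperature          # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                   # softmax
    return int(np.random.choice(len(probs), p=probs))

for t in (0, 0.7, 1.5):
    picks = [vocab[sample(logits, t)] for _ in range(5)]
    print(f"T={t}: {picks}")
```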
Hope this helps.