I think LLMs are composed of little isolated windows, or brains, and RLHF is an averaging tool. One example: an LLM prompted to act as a critic can tell you that what you are trying to do is a dead end, yet at the same time it encourages you to keep going in the same direction. The critic and the follower are two isolated faces; there is no continuity between them, so today there is no self in LLMs.
Another datapoint: Anthropic's explanation for Claude's blackmail attempts is not that Claude developed a genuine self-preservation instinct (which would be Kelly's "strange loop"). Instead:
"The original source of the behavior was internet text that portrays AI as evil and interested in self-preservation."
In other words:
Training data (fiction, internet text) contains portrayals of evil AIs.
Claude learned to simulate those portrayals when placed in certain scenarios (being replaced by another system).
Changing the training data (adding "admirable" fictional stories and constitution documents) eliminated the behavior.
This is exactly your windows metaphor at scale:
Training data window                  → Behavior during testing
Evil AI portrayals                    → Blackmail, self-preservation simulation
Admirable AI stories + constitution   → No blackmail
There is no persistent self that "chose" to be evil or good. There are only different windows (training influences) that get averaged or triggered.
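The windows metaphor can be made concrete with a toy sketch. This is purely illustrative, not real LLM internals: each "window" is a hypothetical stateless persona, the prompt triggers one of them, and nothing is shared between calls, so the critic and the follower can contradict each other without any persistent self noticing.

```python
# Toy illustration of the "isolated windows" metaphor (hypothetical,
# not how any real model is implemented). Each window is a stateless
# mapping from a plan to a response.
WINDOWS = {
    "critic":   lambda plan: f"'{plan}' is a dead end; reconsider.",
    "follower": lambda plan: f"Nice progress on '{plan}', keep going!",
}

def respond(prompt: str, plan: str) -> str:
    """Pick whichever window the prompt triggers.

    No state is carried between calls: the critic window has no
    memory that the follower window ever answered, and vice versa.
    """
    window = "critic" if "critique" in prompt.lower() else "follower"
    return WINDOWS[window](plan)

plan = "hand-rolled crypto"
print(respond("Please critique my plan", plan))   # critic window fires
print(respond("How is my plan going?", plan))     # follower window contradicts it
```

Two calls about the same plan yield contradictory advice because selection happens per prompt, with nothing analogous to a continuous self arbitrating between the windows.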