The Future of Everything Is Lies, I Guess: Safety

(aphyr.com)

76 points | by aphyr an hour ago

30 comments

  • dgfl a minute ago

    The issue with most of these articles is that they seem to demonize the technology and systematically use demeaning language about all of its facets. This one raises a lot of important points about LLMs, but the only real conclusion it seems to draw is "LLMs are bad! We should never build them!". This is obviously unrealistic. The cat is out of the bag. We're not _actually_ talking about nuclear weapons here. This technology is useful, and coding agents are just the first example. I can easily see a near future where everyone has a Jarvis-like secretary always available; it's only a cost and harness problem. And since this vision is very clear to anyone who has spent meaningful time using the latest models, millions of people across the globe are working towards it.

    I do think that safety is important. I'm particularly concerned about vulnerable people and sycophantic behavior. But I think it's better not to be a Luddite. I will give a positively biased view because the article already presents a strongly negative stance. Two remarks:

    > Alignment is a Joke

    True, but for a different reason. Modern LLMs clearly don't have a strong sense of direction or intrinsic goals. That's perfect for what we need to do with them! But when a group of people aligns one to their own interests, they may imprint a stance which other groups may not like (which this article confusingly calls an "unaligned model", even though it's perfectly aligned with its creators' intent). People unaligned with your values have always existed and always will. This is just another tool they can use. If they're truly against you, they'll develop it whether you want it or not. I guess I'm in the camp of people who have decided that those harmful capabilities are inevitable, as the article directly addresses.

    > LLMs change the cost balance for malicious attackers, enabling new scales of sophisticated, targeted security attacks, fraud, and harassment. Models can produce text and imagery that is difficult for humans to bear; I expect an increased burden to fall on moderators.

    What about the new scales of sophisticated defenses they will enable? And for a simple way to avoid the produced text and imagery: don't go online so much? We already all sort of agree that social media is bad for society. If we make it completely unusable, I think we will all stand to gain from it. If the digital world stops having any value, perhaps we'll finally go back to valuing local communities and offline hobbies for children. What if this is our wake-up call?

  • Cynddl an hour ago

    > "Unavailable Due to the UK Online Safety Act"

    Anyone outside the UK can share what this is about?

    • 0x3444ac53 an hour ago
    • jazzpush2 an hour ago

      The Future of Everything is Lies, I Guess: Safety (2026-04-13)

      New machine learning systems endanger our psychological and physical safety. The idea that ML companies will ensure “AI” is broadly aligned with human interests is naïve: allowing the production of “friendly” models has necessarily enabled the production of “evil” ones. Even “friendly” LLMs are security nightmares. The “lethal trifecta” is in fact a unifecta: LLMs simply cannot safely be given the power to fuck things up. LLMs change the cost balance for malicious attackers, enabling new scales of sophisticated, targeted security attacks, fraud, and harassment. Models can produce text and imagery that is difficult for humans to bear; I expect an increased burden to fall on moderators. Semi-autonomous weapons are already here, and their capabilities will only expand.

      Alignment is a Joke

      Well-meaning people are trying very hard to ensure LLMs are friendly to humans. This undertaking is called alignment. I don’t think it’s going to work.

      First, ML models are a giant pile of linear algebra. Unlike human brains, which are biologically predisposed to acquire prosocial behavior, there is nothing intrinsic in the mathematics or hardware that ensures models are nice. Instead, alignment is purely a product of the corpus and training process: OpenAI has enormous teams of people who spend time talking to LLMs, evaluating what they say, and adjusting weights to make them nice. They also build secondary LLMs which double-check that the core LLM is not telling people how to build pipe bombs. Both of these things are optional and expensive. All it takes to get an unaligned model is for an unscrupulous entity to train one and not do that work—or to do it poorly.

      I see four moats that could prevent this from happening.

      First, training and inference hardware could be difficult to access. This clearly won’t last. The entire tech industry is gearing up to produce ML hardware and building datacenters at an incredible clip. Microsoft, Oracle, and Amazon are tripping over themselves to rent training clusters to anyone who asks, and economies of scale are rapidly lowering costs.

      Second, the mathematics and software that go into the training and inference process could be kept secret. The math is all published, so that’s not going to stop anyone. The software generally remains secret sauce, but I don’t think that will hold for long. There are a lot of people working at frontier labs; those people will move to other jobs and their expertise will gradually become common knowledge. I would be shocked if state actors were not trying to exfiltrate data from OpenAI et al. like Saudi Arabia did to Twitter, or China has been doing to a good chunk of the US tech industry for the last twenty years.

      Third, training corpuses could be difficult to acquire. This cat has never seen the inside of a bag. Meta trained their LLM by torrenting pirated books and scraping the Internet. Both of these things are easy to do. There are whole companies which offer web scraping as a service; they spread requests across vast arrays of residential proxies to make it difficult to identify and block.

      Fourth, there are the small armies of contractors who do the work of judging LLM responses during the reinforcement learning process; as the quip goes, “AI” stands for African Intelligence. This takes money to do yourself, but it is possible to piggyback off the work of others by training your model on another model’s outputs. OpenAI thinks Deepseek did exactly that.

      In short, the ML industry is creating the conditions under which anyone with sufficient funds can train an unaligned model. Rather than raise the bar against malicious AI, ML companies have lowered it.

      To make matters worse, the current efforts at alignment don’t seem to be working all that well. LLMs are complex chaotic systems, and we don’t really understand how they work or how to make them safe. Even after shoveling piles of money and gobsmackingly smart engineers at the problem for years, supposedly aligned LLMs keep sexting kids, obliteration attacks can convince models to generate images of violence, and anyone can go and download “uncensored” versions of models. Of course alignment prevents many terrible things from happening, but models are run many times, so there are many chances for the safeguards to fail. Alignment which prevents 99% of hate speech still generates an awful lot of hate speech. The LLM only has to give usable instructions for making a bioweapon once.
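      The 99% point is just probability arithmetic: even a very effective per-query filter fails almost surely over enough queries. A minimal sketch, with purely illustrative numbers (these are not measured failure rates for any real model):

```python
def p_any_failure(per_query_block_rate: float, n_queries: int) -> float:
    """Chance of at least one safeguard failure over n independent queries,
    given the probability q that any single harmful output is blocked."""
    return 1 - per_query_block_rate ** n_queries

# A filter that blocks 99% of harmful outputs, over a thousand queries:
print(p_any_failure(0.99, 1_000))      # ~0.99996: near-certain failure

# Over a million queries, 0.99**1_000_000 underflows to 0:
print(p_any_failure(0.99, 1_000_000))  # → 1.0
```

      The independence assumption is generous to the defender; correlated jailbreaks only make the picture worse.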

      We should assume that any “friendly” model built will have an equivalently powerful “evil” version in a few years. If you do not want the evil version to exist, you should not build the friendly one! You should definitely not reorient a good chunk of the US economy toward making evil models easier to train. ...

      • jazzpush2 an hour ago

        To be clear, that's not the full article, just the intro (though the whole thing isn't too long)

  • macintux an hour ago
  • Imnimo 23 minutes ago

    >Unlike human brains, which are biologically predisposed to acquire prosocial behavior, there is nothing intrinsic in the mathematics or hardware that ensures models are nice.

    How did brains acquire this predisposition if there is nothing intrinsic in the mathematics or hardware? The answer is "through evolution", which is just an alternative optimization procedure.

    • pants2 3 minutes ago

      This Veritasium video is excellent, and makes the argument that there is something intrinsic in mathematics (game theory) that encourages prosocial behavior.

      https://www.youtube.com/watch?v=mScpHTIi-kM
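      The game-theory point can be seen in the classic iterated prisoner's dilemma. A minimal sketch using the textbook payoffs (T=5, R=3, P=1, S=0); the strategies and numbers are the standard Axelrod-tournament setup, not taken from the video:

```python
# Payoffs (my_score, their_score) for each pair of moves: C = cooperate, D = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(a, b, rounds=100):
    """Play two strategies against each other; return their total scores."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = a(hist_b), b(hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        hist_a.append(move_a); hist_b.append(move_b)
        score_a += pay_a; score_b += pay_b
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))      # → (300, 300): mutual cooperation
print(play(always_defect, always_defect))  # → (100, 100): mutual defection
print(play(tit_for_tat, always_defect))    # → (99, 104): defector wins this pair
```

      The defector beats tit-for-tat head to head, but mutual cooperation earns 600 points per pairing versus 200 for mutual defection, so in a population of repeated encounters reciprocators accumulate more total payoff: the math itself rewards prosocial behavior.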

    • Terr_ 4 minutes ago

      [delayed]

    • cowpig 20 minutes ago

      There are also many biological examples of evolution producing "anti-social" outcomes. Many creatures are not social. Most creatures are not social with respect to human goals.

      • b00ty4breakfast 2 minutes ago

        Luckily, this is a discussion of humans.

      • nyrikki 7 minutes ago

        There is a reason we don’t allow corvids to choose if a person gets a medical treatment or not.

  • throwway120385 an hour ago

    At scale I think our society is slowly inching closer and closer to building HM.

    • nine_k an hour ago

      What is HM here?

      • Sardtok 3 minutes ago

        Hennes & Mauritz is a Swedish clothing retailer.

        On a serious note, I think they meant TN, as in Torment Nexus, but I could be wrong.

      • derektank 42 minutes ago

        Maybe they meant AM (Allied Mastercomputer) from “I Have No Mouth, and I Must Scream”

      • zackmorris an hour ago

        Hacker Mews

        • throwaway27448 an hour ago

          Looksmaxxing really has gone mainstream huh

          • bitwize 5 minutes ago

            Thought it was all the Rust catgirls.

  • jazzpush2 an hour ago

    Every one of these posts is immediately pushed to the front page, this one within 4 minutes.

    • aphyr an hour ago

      It's been weirdly uneven. Sections 1, 3, and 5 did well on HN; 2, 4, and 6 sank with essentially no trace. The distribution of views is presently:

      1. Introduction: 33,088 (https://news.ycombinator.com/item?id=47689648)

      2. Dynamics: 3,659 (https://news.ycombinator.com/item?id=47693678)

      3. Culture: 5,914 (https://news.ycombinator.com/item?id=47703528)

      4. Information Ecology: 777 (https://news.ycombinator.com/item?id=47718502)

      5. Annoyances: 7,020 (https://news.ycombinator.com/item?id=47730981)

      6. Psychological Hazards: 199 (https://news.ycombinator.com/item?id=47747936)

      Feedback from early readers was that the work was too large to digest in a single reading, so I split it up into a series of posts. I'm not entirely sure this was the right call; the sections I thought were the most interesting seem to have gotten much less attention than the introductory preliminaries.

      • simoncion 13 minutes ago

        I'm not sure that HN vote count is a good indicator of interest? HN alerted me to the existence of the intro post. I read the intro, noticed that it was one in an ongoing series, and have been checking your blog for new installments every few days.

        I suspect that if you'd not broken up the post into a series of smaller ones, the sorts of folks who are unwilling to read the whole thing as you post it section by section would have fed the entire post to an LLM to "summarize".

    • acdha an hour ago

      That’s unsurprising given the author’s long history in the tech community. A ton of people see that domain and upvote.

      • jazzpush2 an hour ago

        Sure, but 4 front-page posts from the same URL in 4 days surely sits at the tail of the distribution. (I guess they all capitalize on the same 'LLM-is-bad' sentiment.)

        • borski an hour ago

          It’s because it’s aphyr.

          If ‘tptacek posts a blog post, I bet it does similarly well on average, because they’re a “known quantity” around these parts, for example.

        • zdragnar an hour ago

          It's also aphyr, who is incredibly popular. Take one very popular author, have him write a series of posts on the zeitgeist everyone can't help but talk about, and yes, the outcome is that his posts are extremely popular.

          I still remember his takedown of MongoDB’s claims in the “Call Me Maybe” posts, years and years ago, filling me with a good bit of awe.

          • macintux 26 minutes ago

            When I worked for Basho, aphyr was highly respected by some of the smartest people I’d ever worked with. Definitely no slouch.

    • tptacek 28 minutes ago

      A statement broadly true of most things this author writes.

    • stronglikedan an hour ago

      that's just, like, how HN works. people post, people like, people upvote, people discuss

  • ibrahimhossain 41 minutes ago

    Alignment feels like an arms race that favors whoever spends the most on RLHF and red teaming. If even friendly models keep leaking dangerous capabilities, the real moat might be making systems that are fundamentally limited rather than trying to patch every possible failure mode. Interesting piece.