How to self-host all of Bluesky except the AppView (for now)

(alice.bsky.sh)

138 points | by icy 6 days ago

82 comments

  • zzyzxd 5 days ago

    Selfhosting is my hobby but I am also an SRE. I am hesitant to do this because the instruction is "too easy" -- "Simply open your firewall, download and run this installer.sh with sudo on your server and that's it!"[1].

    How do I secure the webserver and the data? Where is the data on my disk? How to backup and restore? High availability?

    There might be detailed documentation somewhere, or I can even read the code. But these are the important things an open source software should tell its users right off the bat.

    1: https://github.com/bluesky-social/pds/blob/main/README.md

    • freedomben 5 days ago

      Same, exactly. I would so much rather be given a docker-compose or k8s yaml along with some other tidbits like how to run migrations and stuff, than get a bash script I can just run. I've been doing this long enough to know that it's not initial setup work that really matters, it's the upgrade and backup/restore story that really matters. If your bash script just pulls and runs a docker container or something then cool, but if it's doing much more than that then that's a big red flag to me.

      • diggan 5 days ago

        Here:

        - https://raw.githubusercontent.com/bluesky-social/pds/main/in... has all the expected outside docker-compose setup, you can read it through in like 5 minutes

        - Heavy-duty part of the setup is running https://raw.githubusercontent.com/bluesky-social/pds/main/co... which you should be familiar with

        I guess the shellscript is for people who want a one-line install, which I wouldn't do myself either, but I guess some people prefer.

        • zzyzxd 5 days ago

          The script even installs docker with apt by itself (which, I think, is the only reason they require Ubuntu as the OS -- to not deal with any other package manager variants)... I mean, why? Just let people install docker however they like! If you don't even trust your users to install a container runtime, who's your target audience really?

          It's also overcomplicated, like, it even tries to handle race conditions between multiple apt processes! What kind of environment do they expect the users to have? As the project becomes more popular, the script will need to handle more edge cases. Let's see if it is still a 5-minute read one year later.

          > I guess the shellscript is for people who want a one-line install, which I wouldn't do myself either, but I guess some people prefer.

          This is the problem in lots of open source projects -- providing a one-liner installer and bragging about how easy the initial setup is, without an easy path for long term maintenance. Give it some time, many happy users of the one-liner will be unhappy when they encounter issues.

    • j45 5 days ago

      This message is for anyone who might find trying self-hosting intimidating.

      Like hosting an application in the cloud, you also will never stop improving how you self-host.

      If answers to questions like these are missing from a software package, that gap is often one your self-hosting environment could fill too.

      Running this type of an installer is excellent to quickly introduce yourself to any technology - to then start learning about how you want to run it long term.

      The questions expressed above are not new. How SREs solve them today can also be different and more complicated than needed.

      Easy answer - if they have an install script, it's getting run inside a VM, or Docker which itself is a baseline backup and HA automatically if needed.

      If generally anything is run inside of a self-hosted hypervisor like Proxmox, it can be set up to automatically back up, mirror, and provide HA as-is, while you figure out what you want. This includes running docker inside a Proxmox VM; there is not a big performance hit anymore for doing this for things that are largely idle most of the time.

      There is a big difference between SaaS, PaaS, and IaaS. It's easier and easier to get the benefits from all three by being willing to build up the foundation instead of pointing at the gaps in each package for not filling it for you.

      It's encouraging to see things becoming more possible :)

    • dawnerd 5 days ago

      I was about to do this as well but their installer sketched me out. Why can't it just be some easy to follow docker instructions? They use docker too but the instructions to set it up on your own are basically "read the installer script".

      Meanwhile mastodon is incredibly easy to self host w/ relays.

      • j45 5 days ago

        An installer script is often an early step, and much better than nothing... as well as a step towards docker.

        Here's the kicker, the install script could be called from a Dockerfile pretty easily, no? Sure, there might be things to sort out, but it doesn't seem unreasonable.

        I agree having a docker image is super handy and can be quick to try, as well as update, and put into a larger self-hosted environment how you need.

      • benharri 5 days ago

        i certainly wouldn't say mastodon is easy to self host

        • hagbard_c 4 days ago

          Pleroma [1] is and does the activitypub thing just as well. I installed it to see if it added anything worth keeping together with the other activitypub things I'm running (Peertube, Pixelfed and Lemmy, the latter two only for testing purposes, the first sees real use on several instances) and can vouch on the ease of getting it up and running.

          [1] https://docs.pleroma.social/backend/installation/otp_en/

    • diggan 5 days ago

      > How do I secure the webserver and the data? Where is the data on my disk? How to backup and restore? High availability?

      I feel like that's kind of out of scope for an article describing the steps for application/protocol specific infrastructure. You need to look for resources, guides and such for general self-hosting instead, somewhere else.

      For example, if you use TrueNAS SCALE/Unraid/Proxmox or whatever for local self-hosting, you'd set up those things via those platforms. If you use Kubernetes/Nomad/Incus/Containers, you'd solve those things via that tooling.

  • sureglymop 5 days ago

    It's great that you wrote this up!

    One thing I have found with many open source/selfhostable projects is just how much running them yourself can vary. It can go from a simple compose file with everything included to having to dig for obscure services and piece together how they all form the whole.

    For example, I recently looked into self hosting Zotero. It is so under documented and complex that there is almost no way one could self host that (even for just one user) without that being one's job. So one needs to make a distinction between something being open source and being feasible to use/maintain.

    In the end I gave up with Zotero. Even though it could have replaced Obsidian Notes, Calibre and Syncthing all at once for me.

    • diggan 5 days ago

      > For example, I recently looked into self hosting Zotero. It is so under documented and complex that there is almost no way one could self host that

      I've come across this a lot too. But what I've found is that it mostly applies to open source projects that offer a hosted paid version, so it kind of makes sense they'll make the experience slightly worse than it could be (consciously or subconsciously), as it pushes people to their hosted solution. I don't particularly like it though.

      Doesn't seem to be the case for Zotero specifically, but your comment reminded me that I've noticed this more often lately.

      • sbarre 5 days ago

        Yeah I tend to use ease of install for community editions of hosted paid open source projects as the leading indicator of how seriously they invest in (and support) their free/community version..

    • elashri 5 days ago

      > For example, I recently looked into self hosting Zotero. It is so under documented and complex that there is almost no way one could self host that (even for just one user) without that being one's job. So one needs to make a distinction between something being open source and being feasible to use/maintain

      Just for the benefit of anyone who wants to go down this rabbit hole: you cannot self-host Zotero. In theory you can, but in practice it is not feasible. If you find their free storage limiting, then store your files on WebDAV (all clients support that).

      The Zotero team explicitly said that they don't see this as a priority [1], and with the release of Zotero 7 and the transition, it is not realistic to think they ever will.

      [1] https://github.com/zotero/dataserver/issues/105#issuecomment...

      • apitman 5 days ago

        This is why it's not enough for software to be open source. It also has to be forkable.

    • __justplaying 5 days ago

      Self-hosting/mirroring all these Bluesky components is currently a mixed bag as well though honestly the only outlier is the Relay, which is a beast. i currently have my copy of the PLC, a Jetstream with 2 days of data and a clone of the app on my laptop i play with sometimes and/or change things for an elaborate shitpost of Bluesky Nitro https://bsky.app/profile/alice.mosphere.at/post/3l7bpmmtiop2...

      I don't self-host my PDS yet because there is no migration path back yet (but there will be). Though maybe I'll just yolo one day and do it anyways.

  • jchw 5 days ago

    I appreciate this effort. I've definitely been interested in how plausible it would be to, today, run another instance of the Bluesky AppView, mainly because AT proto seems promising, but to really meet its full potential it needs independent operators with different sensibilities.

    I've been thinking a lot about the relay, though. 4.5 terabytes is, well... A lot, to say the least. If Bluesky grows 100x larger, running a relay will become pretty insanely expensive. I guess if the Bluesky organization remains fairly neutral about the relay part, it's not a huge deal, but:

    - It always eventually becomes hard to stay neutral. Eventually someone will get mad at something going through your network that isn't just obvious network abuse like SPAM.

    - It seems like drinking from the firehose itself will eventually become expensive. Will it be possible for something this high bandwidth to remain freely-accessible?
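
    As a back-of-envelope check on the 100x scenario, here is a small sketch. It assumes the relay's footprint grows linearly with the network and uses the "8 TB for about $99" drive price quoted elsewhere in this thread; both are rough assumptions, and it counts raw drives only (no redundancy, servers, or bandwidth).

```python
import math

# Figures quoted in this thread; treat all of them as rough assumptions.
RELAY_TB_TODAY = 4.5     # relay footprint mentioned in the post
GROWTH_FACTOR = 100      # the hypothetical "100x larger" case
DISK_TB = 8              # capacity of one drive
DISK_PRICE_USD = 99      # per-drive price quoted elsewhere in the thread


def disks_needed(total_tb: float, disk_tb: float) -> int:
    """Number of whole drives needed to hold total_tb, with no redundancy."""
    return math.ceil(total_tb / disk_tb)


def raw_disk_cost(total_tb: float, disk_tb: float, price: float) -> float:
    """Raw drive cost only: no RAID, servers, power, or bandwidth."""
    return disks_needed(total_tb, disk_tb) * price


if __name__ == "__main__":
    future_tb = RELAY_TB_TODAY * GROWTH_FACTOR
    print(f"today: {disks_needed(RELAY_TB_TODAY, DISK_TB)} drive(s)")
    print(
        f"at {GROWTH_FACTOR}x: {future_tb:.0f} TB, "
        f"{disks_needed(future_tb, DISK_TB)} drives, "
        f"${raw_disk_cost(future_tb, DISK_TB, DISK_PRICE_USD):,.0f}"
    )
```

    With these inputs it comes to 450 TB across 57 drives, about $5,643 in raw disks. The bandwidth question in the second bullet is not covered by this estimate.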

  • 98codes 5 days ago

    This is all academic for me until Bluesky gets the functionality to get an account back onto their main network, for DR if not peace of mind that an "undo" is possible.

    • diggan 5 days ago

      Totally understandable. Personally I don't use Bluesky for anything vital, it's just data that the world wouldn't be better/worse without anyways, so I'm gonna go and give it a try even if there is no undo.

      I love that people even have the choice, so much better than not even being able to.

  • __justplaying 6 days ago

    author here, should you have questions!

    • moreati 5 days ago

      What's in that 4.5 TB? e.g. message metadata? Message text? Media?

      What time window does it cover? A rolling N day window? Everything since year dot?

      Can it be pruned? e.g. only data of accounts followed or messages interacted with

    • theschmed 5 days ago

      Thanks for making yourself available to answer questions! Hopefully this is not a dumb question.

      Is plc.directory a single point of failure for BlueSky users who want to take advantage of the benefits of a did:plc? And if so, is that a permanent thing or down the road will there be multiple interoperating did:plc directories?

      • __justplaying 5 days ago

        yes it's a SPOF. not sure about the second question, but i do know there are plans to transfer its ownership to an independent foundation

        • pfraze 5 days ago

          Transferring to an independent org is what we're talking about now, yes.

          The backstory to PLC is that we picked up the DID standard and looked for an existing registry-method that would satisfy requirements¹. None of them really did. We then surveyed mechanisms for decentralized operation: DHTs, open blockchains, permissioned blockchains, and federated databases. Of them, the two blockchain variants seemed perhaps promising, but still premature since (as of 2022) there's cost variability due to load and in some cases bad transaction latency (eg 10 minutes).

          We decided the best decision was to create PLC, which matches all of the requirements except for longterm meta governance. The way we designed it was to make the registry mechanics transferrable to a different protocol in the future, so that if for instance we decided (say) a DHT was suitable (it's not) we'd be able to use the same identifiers but change resolution and mutations to a new process. Then we started talking to other SMEs to get their take.

          Ultimately the solution that's gotten the most favorable response has been setting up an ICANN-style independent organization to operate it. This can be joined with a couple of interesting systems, such as mirrors which tail a certificate-transparency-style audit log, and which could even serve as transaction witnesses to indicate when the core registry might be rejecting updates ("write censorship").

          What can I say, some things take time and stakeholder-building. Look up the history of DNS and Network Solutions Inc for a bit of a wild ride that people have forgotten about. One other thing I should point out is that the DID spec enables multiple registry methods. Atproto currently supports did:web, and if other methods show up which satisfy the requirements then we are interested.

          ¹ Secure against manipulation by the registry operators, longterm meta governance, highly available, reasonable transaction latency, reliably low cost that's not dogged by token speculation, low ecological impact.
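
          To illustrate the "mirrors which tail an audit log" idea, here is a minimal sketch of a hash-chained, append-only log. It is not the PLC format: the entry layout, the SHA-256-over-JSON hashing, and every function name are assumptions for illustration, and signatures are left out entirely.

```python
import hashlib
import json


def entry_hash(entry: dict) -> str:
    """Hash an entry deterministically (sorted keys, compact JSON)."""
    encoded = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append a payload, linking it to the hash of the previous entry."""
    prev = entry_hash(log[-1]) if log else None
    log.append({"prev": prev, "payload": payload})


def verify(log: list) -> bool:
    """Return True if every entry points at the hash of its predecessor."""
    expected_prev = None
    for entry in log:
        if entry["prev"] != expected_prev:
            return False
        expected_prev = entry_hash(entry)
    return True


def is_extension(mirror: list, registry: list) -> bool:
    """A mirror checks that the registry only appended to known history."""
    return (
        verify(registry)
        and len(registry) >= len(mirror)
        and registry[: len(mirror)] == mirror
    )
```

          A mirror that keeps its own copy can call `is_extension` on each new snapshot; a rewritten earlier entry breaks the chain and is caught. Detecting a registry that silently refuses new updates needs the witness role described above, which this sketch does not model.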

          • jazzyjackson 5 days ago

            Hey pfraze, forgive my ignorance but what role does DID serve that DNS doesn't? My favorite part about bsky is using TXT record to prove that I control my domain for username purposes, what's the downside to just generating a keypair, and using the fingerprint of the public key as my identity? (Maybe with some affordance for key rotation vis a vis KERI*) Not doubting youall weighed every possibility, just wondering what I'm missing

            *Key Event Receipt Infrastructure

            • steveklabnik 5 days ago

              Not Paul, but DID is a stable ID over time, whereas dns is not. This lets you change your handle without the network losing track of who you are. I was @steveklabnik.bsky.social before I was @steveklabnik.com, and when I made the switch, all of my previous stuff was still there.

              This is a fun party trick in some sense, but also a real meaningful feature in another. If I ever decide to move from steveklabnik.com to steve.klabnik.com, a thing I have been considering for a few years, my stuff on @proto/Bluesky will be one of the only services that doesn't have the issue that's kept me from pulling the trigger: updating the entire world that that's where I am now.
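
              A toy sketch of that indirection, assuming content is keyed by DID and a handle is only a pointer to it. The `Network` class and the example DID are hypothetical; `did_from_txt` assumes the `did=<DID>` value format used in the `_atproto` TXT record and does no DNS lookup itself.

```python
from typing import Iterable, Optional

TXT_PREFIX = "did="


def did_from_txt(values: Iterable[str]) -> Optional[str]:
    """Pick the DID out of the TXT values found at _atproto.<handle>."""
    for value in values:
        value = value.strip().strip('"')
        if value.startswith(TXT_PREFIX):
            return value[len(TXT_PREFIX):]
    return None


class Network:
    """Toy model: posts are keyed by DID, handles are just pointers."""

    def __init__(self) -> None:
        self.handle_to_did: dict = {}
        self.posts_by_did: dict = {}

    def set_handle(self, handle: str, did: str) -> None:
        # Pointing a new handle at the DID does not touch stored posts.
        self.handle_to_did[handle] = did

    def post(self, handle: str, text: str) -> None:
        did = self.handle_to_did[handle]
        self.posts_by_did.setdefault(did, []).append(text)

    def posts(self, handle: str) -> list:
        return self.posts_by_did.get(self.handle_to_did[handle], [])


if __name__ == "__main__":
    net = Network()
    did = "did:plc:example123"  # hypothetical identifier
    net.set_handle("steveklabnik.bsky.social", did)
    net.post("steveklabnik.bsky.social", "hello")
    net.set_handle("steveklabnik.com", did)
    print(net.posts("steveklabnik.com"))
```

              Because nothing is stored under the handle, switching domains is a single pointer update rather than a migration.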

              • kiitos 5 days ago

                DIDs are stable only in the context of a specific 'verifiable data registry' as the spec puts it.

                https://www.w3.org/TR/did-core/#dfn-verifiable-data-registry

                DIDs delegate trust and authority to a data registry, in exactly the same way that DNS delegates trust and authority to ~ICANN.

                The system model is exactly the same. The difference is only in the properties of the authoritative entity.

                • steveklabnik 5 days ago

                  That's a good point: I was speaking in a more social manner. Because domains are human-readable, they tend to be used for humans. Bluesky could have chosen to just use domains, but I personally prefer that we have the additional layer of indirection. Plus like, you have the ability (at the low level, not really exposed in the UI in any meaningful way) to be multiple people: I can associate multiple domains with my DID.

                  That said, you're not wrong that a registry is a registry.

                  • kiitos 5 days ago

                    Yeah, definitely not suggesting domains are a better form of identity!

              • pfraze 5 days ago

                Yes! And if this were not the case then account portability between PDS hosts would be really challenging. Same logic as keeping your phone number when you switch cell carriers

          • Kye 5 days ago

            >> "What can I say, some things take time and stakeholder-building."

            The ongoing WordPress fiasco is a good sign of what happens when you set up an independent organization too soon. You won't have the people or the commitments from those people to maintain that independence, so the independent thing ends up not being able to do anything to protect the thing that was supposed to be independent from the commercial interests looking to exploit it.

          • mitochondriaz 5 days ago

            Can you say more on why DHTs are not a solution? Are you aware of https://github.com/pubky/pkarr, for example? It seems to be very good!

    • jervant 5 days ago

      How are Direct Messages implemented in Bluesky if anyone can access a firehose of all network activity?

      • __justplaying 5 days ago

        DMs are currently 1:1 only and closed source. They are working on/planning to build proper E2EE DMs that support group chats.

    • mintplant 5 days ago

      What's the difference between social-app and the AppView?

      • pfraze 5 days ago

        social-app is the client side, AppView is the backend api surface

  • ck2 5 days ago

    I found it interesting that it's very difficult, almost impossible, to get real Bluesky stats

    This site tries but has limits:

    * https://bsky.jazco.dev/stats

    They broke 14 million yesterday and it seems to be snowballing now since the election:

    * https://bsky.app/profile/jaz.bsky.social/post/3laetwhztdk2x

  • heavensteeth 5 days ago

    This site is extremely snappy. Good work.

  • mdaniel 5 days ago

    Also, yesterday someone posted[1] https://frontpage.fyi/ which seems like it's predominately Bluesky/ATprotocol news but since both of those interest me, if this blog link interests you then so might that link. It logs in with Bsky oauth2 federation

    1: https://news.ycombinator.com/item?id=42081210

  • jazzyjackson 5 days ago

    Is it feasible to run a bluesky instance "on prem" and "offline" for instance as an airgapped corporate intranet ?

    • nisten 5 days ago

      Great, do I have to set up LDAP, OAuth, and troubleshoot corporate-style single-sign-on systems for the next 6 months just to get a chat server running now....

    • elfprince13 5 days ago

      I think if you replaced the plc directory with a corporate domain that would be pretty straightforward?

  • nisten 5 days ago

    Is the actual guide just this <400 word thing, or is it all those 15 different links on the post, or only some of them....

    Does that... bureaucracy of documentation not infuriate anyone else or is it just me. I guess I'll try and reset my password to bluesky website, assuming it's this .app one, but then it's asking me to maybe select a provider ... of my password.

    Does whoever made this user experience not have enough emotional intelligence to realize how infuriating it is?

    • __justplaying 5 days ago

      This was a quick and dirty post I put together primarily for people who are already on Bluesky and have dev experience, and peppered with appropriate links where you have actual guides and/or documentation for each bit.

    • steveklabnik 5 days ago

      > I guess I'll try and reset my password to bluesky website, assuming it's this .app one, but then it's asking me to maybe select a provider ... of my password.

      It's asking what the host of your data is. If you're not running your own server, then the default value of Bluesky itself is the correct one.

  • __justplaying 5 days ago

    How do I ask the mods to swap out the link to the actual post instead of my blog's front page?

    (...also, the title, as the original has the caveat)

    • Jtsummers 5 days ago

      It's likely the correct page was submitted. The correct page includes a canonical link in the HTML:

        <link rel="canonical" href="https://alice.bsky.sh"/>
      
      HN will replace submission links with the canonical link if it's found.
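
      A sketch of how such a replacement could work, using only Python's standard library. This is not HN's code; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attributes.get("href")


def canonical_url(html: str, submitted_url: str) -> str:
    """Return the declared canonical URL, or the submitted URL if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or submitted_url
```

      Fed a page containing the tag above, `canonical_url` returns `https://alice.bsky.sh` whatever post URL was submitted, which matches what happened to this submission.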

      • __justplaying 5 days ago

        oh. time to look at the code of my blog...

    • paulgb 5 days ago

      @dang a better URL would be https://alice.bsky.sh/post/3laega7icmi2q

      (I can't tell if Dan has an alert set up on his handle or whether he just sees everything, but hopefully that works :))

      • yorwba 5 days ago

        dang doesn't have an alert and he doesn't see everything. https://news.ycombinator.com/item?id=41317232 The official way to contact the mods is in the footer, i.e. email hn@ycombinator.com

        • paulgb 5 days ago

          Ah thanks, good to know. I guess I've just been lucky with it and developed a superstition that it works.

          • timerol 5 days ago

            He is also extremely active here, so there's a good chance he reads and responds to a random comment without an email. But email is the approved (and fastest) way to go about it

        • __justplaying 5 days ago

          will email, thanks

      • __justplaying 5 days ago

        thanks!

    • dang 5 days ago

      Fixed now!

  • zxcvbnm69 5 days ago

    [dead]

  • elfprince13 5 days ago

    but I thought that Bluesky wasn't meaningfully distributed /s

    • jazzyjackson 4 days ago

      If you thought, past tense, you were probably right, but it's been in the oven for 3 years so it's finally approaching "fully baked"

  • jonstaab 5 days ago

    [flagged]

    • timerol 5 days ago

      I'm sure there are HNers who built desktops with 8TB or 16TB hard drives, and have not (yet) needed the space for as many games and media as expected.

    • numpad0 5 days ago

      8TB WD CMR is like $99, 2x48GB of DDR5 is ~$250. Memory and storage are currently way cheaper than many think it is.

    • __justplaying 5 days ago

      didn't say it was cheap!

      • nightpool 5 days ago

        But why is it required? Do you really need a copy of everyone's data locally? If the only way to self-host bluesky is to have an entire copy of the entire database, that seems like it's really bad from a scaling perspective.

        • half-kh-hacker 5 days ago

          What else would "self-hosting all of Bluesky" mean other than a copy of the entire site? If you just want to participate in the network host a PDS, which only stores your own posts.

          • nightpool 5 days ago

            Surely there's some middle ground between only hosting your own data and being reliant on another site to keep track of your following / followers and hosting a duplicate copy of the entire network?

            • steveklabnik 5 days ago

              For sure. If you just want to host your own data, you can do that. A PDS for you and maybe some friends is very small and cheap to host.

              • nightpool 5 days ago

                My understanding though is that having a PDS on its own is useless without an AppView to collect the data from the relay? Or am I misunderstanding the architecture here? https://docs.bsky.app/docs/advanced-guides/federation-archit...

                • steveklabnik 5 days ago

                  I'm talking about the case where you wanted to run your own PDS and use all of the other infrastructure being run by Bluesky.

                  If you fully want your own copy of everything, then you'd want to run a copy of everything. But you don't have to. It really depends on what your goals are. That's why the post is about the maximal scenario. "Just your own PDS" is the minimalist scenario. But I think it's the one that makes sense for 95% of users who want to self-host.

                  • nightpool 5 days ago

                    Right, and I'm saying "surely there must be a middle ground between "using all of Bluesky's infrastructure" and "having a 4.5tb copy of every post ever made on the network""

                    • lisowski 5 days ago

                      What exactly would that be?

                      I feel like the middle ground you're talking about could be just a feed?

                      A feed is: a server that consumes the firehose and decides whether to store posts; when loaded in the app it returns some posts to create a feed

                      So essentially you only store references to part of the network rather than storing the whole thing
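
                      That description can be sketched roughly as follows. The event shape (a dict with `uri` and `text`) and the `TinyFeed` class are hypothetical simplifications, not the real feed generator API.

```python
from collections import deque
from typing import Callable


class TinyFeed:
    """Keeps references (URIs) to matching posts, not the posts themselves."""

    def __init__(self, wanted: Callable[[dict], bool], max_items: int = 1000) -> None:
        self.wanted = wanted
        self.uris: deque = deque(maxlen=max_items)  # oldest entries fall off

    def consume(self, event: dict) -> None:
        """Called for every firehose event; store only what the feed wants."""
        if self.wanted(event):
            self.uris.appendleft(event["uri"])

    def skeleton(self, limit: int = 50) -> list:
        """What the feed returns when loaded in the app: newest URIs first."""
        return list(self.uris)[:limit]


if __name__ == "__main__":
    feed = TinyFeed(lambda e: "self-host" in e["text"].lower())
    feed.consume({"uri": "at://did:example:a/post/1", "text": "Self-hosting a PDS"})
    feed.consume({"uri": "at://did:example:b/post/2", "text": "lunch"})
    print(feed.skeleton())
```

                      Storage is bounded by `max_items` URIs regardless of how much the firehose carries, which is the "only store references to part of the network" point.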

                    • jonstaab 5 days ago

                      consider the nostr protocol

            • half-kh-hacker 5 days ago

              Your following list is stored in your own repo, so it lives on your PDS. You can theoretically have partial replicas of the network but nobody has bothered yet; if you want to make software like that, a good start would be subscribing to the firehose and filtering down to DIDs you care about / supplying the watched DIDs parameter to a Jetstream instance
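
              A rough sketch of both options, without opening a network connection. It assumes Jetstream's `wantedDids` and `wantedCollections` query parameters and JSON events that carry a top-level `did` field; the endpoint constant is an assumed public instance, and the helper names are made up.

```python
import json
from typing import Iterable, Iterator
from urllib.parse import urlencode

# Assumed public Jetstream endpoint; swap in your own instance.
JETSTREAM_URL = "wss://jetstream2.us-east.bsky.network/subscribe"


def subscribe_url(base: str, dids: Iterable[str], collections: Iterable[str] = ()) -> str:
    """Build a subscribe URL that asks the server to filter for us."""
    params = [("wantedDids", d) for d in dids]
    params += [("wantedCollections", c) for c in collections]
    return f"{base}?{urlencode(params)}" if params else base


def filter_events(lines: Iterable[str], watched: set) -> Iterator[dict]:
    """Client-side alternative: keep only events from DIDs we care about."""
    for line in lines:
        event = json.loads(line)
        if event.get("did") in watched:
            yield event


if __name__ == "__main__":
    print(subscribe_url(JETSTREAM_URL, ["did:plc:example123"], ["app.bsky.feed.post"]))
```

              Server-side filtering keeps bandwidth down; the client-side filter is what you would apply to an unfiltered stream before deciding what to store in a partial replica.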

            • fiatjaf 5 days ago

              The middle ground you're looking for is impossible in the AT protocol, it is however what the Nostr protocol is aiming towards.

        • jazzyjackson 5 days ago

          "self host an entire copy of all user data" is a pretty cool capability to have, kind of proof that the infrastructure is really open and forkable. you seem to have misunderstood OPs goals. Serving your own data from a personal data server is a much less arduous affair.

        • galactus 5 days ago

          Uh, it is not required. You can run only a PDS if you want to self host your data and everything will work.

          But it is indeed very cool that you can actually host a relay if you want (for fun, learning, or whatever reason)

      • bombcar 5 days ago

        Ten terabytes of spinning rust is only $100-$300 or so, that's not bad at all.

        • jonstaab 5 days ago

          My point is not the current size, it's the eventual size if bluesky succeeds. Facebook ingests 100TB/day. Self-hosting a bluesky relay isn't (won't be) a thing.

          • galactus 5 days ago

            It could be a thing. Not for individual tinkerers but for companies. The fact that today, with already 14 million users, it is still possible for an individual to host it is amazing.
