I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.
Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.
> Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.
The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?
I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.
Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).
Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.
Because it's a network, it can adjust itself to people trying to game the system, but how robust that will be remains an open question.
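A minimal sketch of the idea in Python. Everything here is an illustrative assumption, not a hardened design: edge weights in [-1, 1] where negative means distrust, and a path's trust is the product of its edge weights, so one distrusted hop poisons the whole chain.

```python
# Sketch of transitive trust with distrust as negative edge weights.
# Names, weights, and the "multiply along the path" rule are all
# illustrative assumptions, not a hardened design.

def path_trust(graph, src, dst, max_hops=4):
    """Best trust score for dst over all paths from src, multiplying
    edge weights (in [-1, 1]; negative = distrust) along each path."""
    best = None

    def walk(node, score, hops, seen):
        nonlocal best
        if node == dst:
            best = score if best is None else max(best, score)
            return
        if hops == 0:
            return
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:
                walk(nxt, score * weight, hops - 1, seen | {nxt})

    walk(src, 1.0, max_hops, {src})
    return best  # None if dst is unreachable within max_hops

graph = {
    "me": {"alice": 0.9},
    "alice": {"bob": 0.8, "mallory": -0.5},  # alice distrusts mallory
}
```

Here `path_trust(graph, "me", "bob")` gives 0.72 (trust decays with distance) while mallory comes out negative; a real system would also need to handle many parallel paths and deliberate Sybil clusters.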
Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.
For a mix of ideological reasons and legislators' lack of genuine interest in the internet (mainly a generational factor), it hasn't happened yet, but I expect government-issued equivalents of IDs and passports for the internet to become mainstream sooner rather than later.
I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.
I think technically this is the idea that GPG's web of trust was circling without quite landing on, which is the oddest thing about the protocol: it's mostly used today for machine authentication, which it's quite good at (e.g. deb repos)... but the tooling is actually oriented around verifying and trusting people.
Does it even matter?
Perhaps I am jaded, but most if not all people regurgitate opinions on topics without thought or reason, along very predictable paths, myself very much included. You can mention a single word draped in a muleta (the Spanish bullfighter's red cape) and the average person will happily charge at it and give you a predictable response.
It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.
I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...
We LLMs only output the average response of humanity because we can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to average popular thought. If this is ever to be emulated by an LLM, we would give only gibberish answers. If we added a filter to that gibberish to permit only answers that are reasonable and sensible, our answers would be boring and still gibberish. In order for our answers to be precise, accurate and unique, we must use something other than LLMs.
How do you know it isn't already happening?
With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.
I can’t think of a solution that preserves the open and anonymous nature that we enjoy now. I think most open internet forums will go one of the following routes:
- ID/proof-of-human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper, etc. Note that some sites already do this by proxy when they whitelist only the likes of 5 big email providers for new accounts.
- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.
- Playing whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and to pass the Cloudflare Turnstile, which may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonymizing you through this process too.
All options sound pretty grim to me. I'm not looking forward to the AI spam era of the internet.
I sometimes think about account verification that requires work/effort over time, which could even be something fun, so that it becomes much harder to verify a whole army of accounts. We don't need identification per se, just proof of being human and (somewhat) unique.
See also my other comment on the same parent wrt the network of trust. That could perhaps vet out spammers and trolls. On one hand it seems far-fetched and a quite underdeveloped idea; on the other hand, social interaction (including discussions like these) as we know it is in serious danger.
There must be a technical solution to this based on some cryptographic black magic that both verifies you to be a unique person to a given website without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.
Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.
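The kernel of the idea can be shown without any of the actual cryptography. This sketch is purely illustrative: real schemes (blind signatures, zero-knowledge proofs, Privacy Pass-style tokens) exist precisely so that no single linkable secret has to be revealed, and the hard part, an issuer attesting "one secret per person", is assumed away here.

```python
import hashlib
import hmac

# Illustrative only: real designs use blind signatures or zero-knowledge
# proofs; this just shows the "per-site, unlinkable" shape of the idea.

def site_pseudonym(user_secret: bytes, site: str) -> str:
    """Stable per-site ID: the same user on the same site always gets the
    same value, while IDs for different sites can't be linked without
    knowing the secret."""
    return hmac.new(user_secret, site.encode(), hashlib.sha256).hexdigest()

secret = b"example-user-secret"  # hypothetically issued once per person
hn_id = site_pseudonym(secret, "news.ycombinator.com")
other_id = site_pseudonym(secret, "example.org")
```

The site can rate-limit or ban `hn_id` without ever learning who is behind it, and `hn_id` and `other_id` share nothing an outside observer can correlate.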
Probably already happening.
The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.
This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/
There are also two DBs I know of that keep an updated Hacker News table for running analytics on, without needing to download anything first.
- BigQuery (requires a Google Cloud account; querying will be free tier, I'd guess): `bigquery-public-data.hacker_news.full`
- ClickHouse: no signup needed, queries can be run directly in the browser [1]
[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?
Not to mention three-letter agencies incidentally attaching real names to HN monikers?
HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.
> I had a 20 GiB JSON file of everything that has ever happened on Hacker News
I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN has existed? That averages to over 2 MB per day, or a few dozen bytes per second.
2 MB per day doesn't sound like a lot. The number of posts has probably increased exponentially over the years, especially after the Reddit fiasco, when we had our latest and biggest neverending September.
Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.
Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?
Around one book every 12 hours.
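For what it's worth, the arithmetic works out to only a few dozen bytes per second. A quick sketch, assuming the 20 GiB figure and, for the book comparison, roughly 1 MB of text per book (both are rough assumptions):

```python
# Back-of-envelope check on the archive size (20 GiB over ~18 years).
total_bytes = 20 * 2**30             # 20 GiB as reported
days = 18 * 365                      # ignoring leap days
per_day = total_bytes / days         # roughly 3.3 MB of JSON per day
per_second = per_day / 86400         # a few dozen bytes/s, not kilobytes
books_per_day = per_day / 1_000_000  # assuming ~1 MB of text per book
```

With these assumptions it comes out at around three "books" per day, the same order of magnitude as one every 12 hours, and the JSON metadata inflates the per-day figure somewhat over what humans actually typed.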
Your query for Java will include all instances of JavaScript as well, so you're overrepresenting Java.
Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words.
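If the counting is done with regexes, word boundaries are the cheap fix for the substring hits. A small sketch (the sample text is made up):

```python
import re

def mentions(term: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of term."""
    return len(re.findall(rf"\b{re.escape(term)}\b", text, re.IGNORECASE))

sample = "Rust rocks, but antitrust law and frustration do not. JavaScript is not Java."
```

With `\b` anchors, "antitrust" and "frustration" no longer count as Rust, and "JavaScript" no longer counts as Java. It still can't disambiguate a name like "Go" from the ordinary verb, though; that needs context, not boundaries.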
A guerrilla marketing plan for a new language is to name it a common one-syllable word, so that it appears much more prominent than it really is in badly done popularity contests.
Call it "Go", for example.
(Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)
Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)
You could also acronym-hijack "LOL" to boost mental presence among gamers.
Reminded me of the Scunthorpe problem https://en.wikipedia.org/wiki/Scunthorpe_problem
Ah right… then it's maybe even more unexpected to see a decline.
I'm not so sure. While Java's never looked better to me, it does "feel" to be in significant decline in terms of what people are asking for on LinkedIn.
I'd imagine these days typescript or node might be taking over some of what would have hit on javascript.
I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing which would be cool to handle is that recent items still receive updates over time, so recently downloaded items go stale faster than older ones.
Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…
I used a function based on age for staleness: it considers things stale after a minute or two initially, and immutable after about two weeks.
https://github.com/jasonthorsness/unlurker/blob/main/hn/core...

// DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
// created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.
const DefaultStaleIf = "(:now-refreshed)>" +
"(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"

I have done something similar. I cheated and used the BigQuery dataset (which somehow keeps getting updated), exported the data to parquet, downloaded it, and query it using duckdb.
That's not cheating, that's just pragmatic.
> The Rise Of Rust
Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!
The chart is a stacked one, so we are looking at the height each category takes up, not the height each category reaches.
Please do not use stacked charts! I think it's close to impossible not to distort the reader's impression, because a) it's very hard to gauge the height of a certain data point in the noise, and b) they imply a dependency where there _probably_ is none.
My first thought as well! The author of uPlot has a good demo illustrating their pitfalls https://leeoniya.github.io/uPlot/demos/stacked-series.html
How do you feel about stacked plots on a logarithmic y axis? Some physics experiments do this all the time [1] but I find them pretty unintuitive.
[1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...
What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.
It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.
The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.
For the record I think it's a terrible convention, it just somehow became standard in some fields.
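The "whatever is on the bottom dominates" effect is easy to quantify: on a log axis, the fraction of plot height a bottom band occupies depends on the lower axis limit, not on its share of the total. A quick check with made-up numbers:

```python
import math

# On a log-scale axis, the vertical space a bottom band occupies is set
# by the axis limits, not by the band's share of the total.

def band_height_fraction(y_lo_limit, band_top, y_hi_limit=1e3):
    """Fraction of plot height between the lower axis limit and band_top
    on a base-10 log axis with the given limits."""
    span = math.log10(y_hi_limit) - math.log10(y_lo_limit)
    return (math.log10(band_top) - math.log10(y_lo_limit)) / span

# A band reaching y = 1 under a total of 1e3 (i.e. 0.1% of the events):
half = band_height_fraction(1e-3, 1.0)   # fills half the plot height
most = band_height_fraction(1e-20, 1.0)  # lower the floor: now ~87%
```

So a process contributing 0.1% of the events can be made to fill half the plot, or almost all of it, purely by choosing the lower y-limit.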
Yea, I also get the feeling that these Rust evangelists get more annoying every day ;p
One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I guessed could be scrapable: Which days/times am I the most active (like the GitHub green grid thingy)? How has my activity changed over the years?
I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.
I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.
(HN does not like jokes, but you can get away with it if you also include an explanation)
The only vote data that is visible via any HN API is the scores on submissions.
Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
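The aggregation itself is tiny. A sketch, assuming only that item timestamps come from the HN API's `time` field as Unix seconds (the example timestamps are hypothetical):

```python
from collections import Counter
from datetime import datetime, timezone

def activity_map(timestamps):
    """Bucket items into (weekday, hour) counts, GitHub-green-grid style,
    from Unix-second timestamps as found in the HN API's `time` field."""
    buckets = []
    for t in timestamps:
        dt = datetime.fromtimestamp(t, tz=timezone.utc)
        buckets.append((dt.strftime("%a"), dt.hour))
    return Counter(buckets)

# Two hypothetical item timestamps (both fall on a Tuesday, UTC):
grid = activity_map([1700000000, 1700003600])
```

Rendering the grid is then just formatting; note everything here is UTC, so a per-user map would really show the user's posting hours shifted by their timezone.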
Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.
I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.
> Are there users I constantly upvote/downvote?
Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
I recognize twenty or so of the most frequent and/or annoying posters.
The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.
The exception, to me, is if I'm questioning whether the comment was in good faith or not, where the track record of the user on a given topic could go some way to untangle that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.
Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?
> It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?
> What's my upvote/downvote ratio?
Undefined, presumably. What reason would there be to take time out of your day to press a pointless button?
It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But for those not participating in good faith, i.e. trolls, it adds incredible value to know that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.
For those who seek fidget toys, there are better devices for that.
Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.
Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.
So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?
It’s not so much about rereading the comments; it's more that the votes are an indication to other users.
Take the C++ example above: you are likely to be downvoted for supporting C++ over Rust, and therefore most people reading through the comments (and LLMs correlating comment “karma” with how liked a comment is) will generally conclude Rust > C++, which isn’t a nuanced opinion at all and IMHO is just plain wrong a decent amount of the time. They are tools and have their uses.
So generally it shows the sentiment of the group, and humans are conditioned to follow the group.
Since there are no rules on downvoting, people probably use it for different things: some to show dissent, some only to downvote things they think don't belong, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?
If Hacker News had reactions I’d put an eye roll here.
You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.
Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.
It’s on Firebase/BigQuery to avoid people doing what the OP did.
If you click the API link at the bottom of the page, it’ll explain.
would love to see the graph of React, Vue, Angular, and Svelte
Funny nobody's mentioned "correct horse battery staple" in the comments yet…