Hacker News Data Map [180MB]

(lmcinnes.github.io)

183 points | by mooreds 2 days ago

26 comments

  • lucb1e 2 days ago

    Maybe add [180MB] to the title, similar to how videos or pdfs are tagged? It starts loading that immediately when you open the page, which would be 18% of my data bundle if I had been on mobile

    (This is actually transferred bytes btw, based on seeing ~12MiB/s for ~15 seconds in the system monitor)
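    A rough check of that estimate, as a sketch: ~12 MiB/s held for ~15 seconds comes to about 180 MiB. The bundle size is not stated in the comment; the ~1000 MB figure below is only what "18%" would imply, treating MB and MiB as roughly interchangeable. The function names are made up for illustration.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def transferred_bytes(rate_mib_per_s, seconds):
    """Bytes moved at a steady rate over a duration."""
    return rate_mib_per_s * MIB * seconds


def implied_bundle_mb(download_mb, share):
    """Bundle size implied if `download_mb` is `share` (0..1) of the bundle."""
    return download_mb / share


if __name__ == "__main__":
    total = transferred_bytes(12, 15)
    print(f"transferred: {total / MIB:.0f} MiB")
    # Assumption: the 18% in the comment refers to the full 180 MB download.
    print(f"implied bundle: {implied_bundle_mb(180, 0.18):.0f} MB")
```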

    Edit: some people are saying they can't view it, especially on mobile browsers. Here's some screenshots:

    - Landing overview https://snipboard.io/YTQRZc.jpg

    - Zooming into the center, hovering over an item that is too small to see but the title shows in a tooltip: https://snipboard.io/xOvA47.jpg

    - Zoomed in further still, now an individual item can be targeted easily and there are lines delimiting topics (looking like height lines on a map): https://snipboard.io/P6UVAv.jpg

    - Hovering over the year selector on the bottom left, same zoom position for comparison: https://snipboard.io/VDW2JI.jpg

    Clicking the year seems not to do anything, you can't lock into that view. Clicking a title opens the page, not the discussion thread.

    ---

    Looking into the corresponding GitHub repository (I wonder if they have a bandwidth limit for repositories or if it will foot any bill), <https://github.com/lmcinnes/datamapplot_examples>, there's also a visualization for Wikipedia which is a bit less heavy: https://lmcinnes.github.io/datamapplot_examples/Wikipedia_da... (screenshot <https://snipboard.io/M9GRQt.jpg>)

    • walterbell 2 days ago

      180MB download per HN visitor isn't going to be fun for the server either.

      More civilized would be a photo snapshot + optional link to 180MB download for interactive UX.

      • odo1242 2 days ago

        180MB is probably fine for most servers (especially CDNs), to be honest. My M1 MacBook with 16 gigabytes of RAM is struggling to load/display the data though.

        • tg180 2 days ago

          > GitHub Pages sites have a soft bandwidth limit of 100 GB per month.

          > If your site exceeds these usage quotas, we may not be able to serve your site, or you may receive a polite email from GitHub Support suggesting strategies for reducing your site's impact on our servers, including putting a third-party content distribution network (CDN) in front of your site, making use of other GitHub features such as releases, or moving to a different hosting service that might better fit your needs.

          https://docs.github.com/en/pages/getting-started-with-github...
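          To put the quoted soft limit next to the page size, here is a back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB), no caching or compression savings, and that every visitor pulls the full 180 MB; none of that is confirmed by the quoted docs, and the function name is hypothetical.

```python
def full_loads_within_quota(quota_gb, page_mb):
    """How many complete page loads fit in a monthly bandwidth quota.

    Assumes decimal units and that each load transfers the whole page.
    """
    quota_bytes = quota_gb * 1000**3
    page_bytes = page_mb * 1000**2
    return int(quota_bytes // page_bytes)


if __name__ == "__main__":
    # 100 GB soft limit (from the quoted docs), 180 MB per visit (from the title).
    print(full_loads_within_quota(100, 180))  # 555
```

          Under those assumptions the quota covers only a few hundred full loads per month, which a front-page HN link could plausibly exceed.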

          • winter_blue a day ago

            Doesn't GitHub Pages have a CDN?

            • walterbell a day ago

              Presumably there are higher CDN limits for paid GitHub accounts.

      • lucb1e 2 days ago

        A photo could work as a quick preview indeed. Another idea for large content spiking in popularity may be something like webtorrent, or whatever peertube uses

        Or a vector map, loading data as needed for the region you're zooming into

    • tomthe 2 days ago

      I made a similar map but with tiles that only load if you zoom in far enough: tomthe.github.io/hackmap/ (Sorry for posting my link so often.) That way it has to load only a few megabytes for the first view.
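      The tile idea can be sketched roughly as follows. This is not taken from hackmap or datamapplot; it is a generic quadtree-style scheme in which the whole map spans [0, 1) on both axes and zoom level z splits it into 2**z by 2**z tiles. The function name and the coordinate convention are assumptions for illustration.

```python
import math


def visible_tiles(x_min, y_min, x_max, y_max, zoom):
    """Return (zoom, col, row) for every tile overlapping the viewport.

    Coordinates are normalized so the full map covers [0, 1) on each axis.
    Only these tiles would need to be fetched for the current view.
    """
    n = 2 ** zoom  # tiles per axis at this zoom level

    def clamp(i):
        return max(0, min(n - 1, i))

    col_lo = clamp(math.floor(x_min * n))
    col_hi = max(col_lo, clamp(math.ceil(x_max * n) - 1))
    row_lo = clamp(math.floor(y_min * n))
    row_hi = max(row_lo, clamp(math.ceil(y_max * n) - 1))

    return [
        (zoom, col, row)
        for row in range(row_lo, row_hi + 1)
        for col in range(col_lo, col_hi + 1)
    ]


if __name__ == "__main__":
    # Whole map at zoom 0 is a single tile.
    print(visible_tiles(0.0, 0.0, 1.0, 1.0, 0))
    # A small window at zoom 2 touches a single tile out of 16.
    print(visible_tiles(0.5, 0.5, 0.75, 0.75, 2))
```

      The first view then costs one coarse tile instead of the full dataset, and deeper tiles are requested only for the region on screen.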

    • mooreds 2 days ago

      > Maybe add [180MB] to the title, similar to how videos or pdfs are tagged

      Done.

  • codingdave 2 days ago

    It is a cool visualization, so I don't want to diminish the effort to make it in any way. And as an experiment in visualization, it is interesting. (If a bit large and laggy.) But if the authors expect people to use it to navigate content, it has a few problems:

    1) The topics don't seem to be hierarchical, so as I drill down on one area, I get all kinds of things that don't seem related. I have no idea what I'm missing unless I zoom into the whole thing.

    2) I don't know where my browser is going when I click a link. That is a security problem.

    3) I cannot tell how this data is sourced. Are these all the links posted to HN? Just the ones that got upvotes? Something else? Because while we have some great links here, we also get a lot of stinkers.

    4) Much of the value of HN is the discussions. I didn't see a way to navigate to discussions related to any of the links.

  • anonu 2 days ago

    I like how Web Development and User Experience grouping is way outside the central bubble.

    Nonetheless, great visualization of a lot of data. I need to learn more about this:

    UMAP: https://umap-learn.readthedocs.io/en/latest/

    Nomic-Embed: https://www.nomic.ai/blog/posts/nomic-embed-text-v1

    The visual groupings aren't perfect. For example, there are quite a few COVID-19 tagged articles before 2020.

    • Cupprum 2 days ago

      Is that necessarily a bad thing? Can't some posts be relevant even if they were created before covid?

  • nighthawk454 2 days ago

    Repo: https://github.com/lmcinnes/datamapplot_examples

    Also, lmcinnes is the author of UMAP and HDBSCAN!

  • rolfan 2 days ago

    This website crashed my smartphone xD.

    After loading some sections of the map, my screen turned into digital garbage.

  • andrewmcwatters 2 days ago

    Maybe browsers should have resource limits and ask the user if they want to continue loading the page beyond some sort of threshold...
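    A minimal sketch of what such a threshold could look like, written as plain Python rather than any real browser API. The names `load_with_budget` and `confirm` are hypothetical, and the chunk iterable stands in for a network stream.

```python
from typing import Callable, Iterable


def load_with_budget(
    chunks: Iterable[bytes],
    budget_bytes: int,
    confirm: Callable[[int], bool],
) -> bytes:
    """Consume chunks, asking `confirm` once when the total passes the budget.

    `confirm` receives the number of bytes received so far and returns True
    to keep loading or False to stop.
    """
    received = bytearray()
    asked = False
    for chunk in chunks:
        received.extend(chunk)
        if not asked and len(received) > budget_bytes:
            asked = True
            if not confirm(len(received)):
                break  # user declined: stop pulling more data
    return bytes(received)


if __name__ == "__main__":
    stream = [b"x" * 10 for _ in range(5)]  # 50 bytes in 10-byte chunks
    partial = load_with_budget(stream, 25, lambda n: False)
    print(len(partial))  # 30: the chunk that crossed the limit was already received
```

    Note that the chunk crossing the limit has already been transferred by the time the question is asked; a real implementation would probably also look at a declared content length up front when one is available.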

    • lucb1e a day ago

      At least on metered connections, indeed. With HSDPA/3g this was not a problem, a few hundred KB/s is plenty to load text (including jquery and the like) and images just fine but you'd notice if it took oddly long. Then came LTE/4g and you can now download hundreds of megabytes in a few seconds when you've got good signal. When it came out, most people's data bundles were on the order of 500-1500 MB iirc. I guess nobody was a dick about it since I never heard of bundle exhaustion attacks, but what surprises me more is that few people mention running into the issue by accident

      Now we have 5g and I'm still wondering what to do with speeds greater than what 3g+ (HSDPA) could offer, namely full HD streaming. On a rare occasion it's nice to download a file fast, but do I need this to be enabled for me all the time and run this risk? The lower latency of LTE and increased capacity of NR are nice but this constant race of more and more gigabits just to... do what, sell bigger data bundles? There are purposes for this (like wireless internet for homes in rural areas) but not mainstream mobile daily use. Range/coverage was the primary issue and still is to this day. The faster rates require good signal to function at all. Companies are turning off 2g and 3g, the things that allowed you to still get a few KB/s to send a message that you'll be back later, or get a weather forecast, when you're out hiking. The newer versions are just as good when you're in range and worse when you're out of range, so there is no point. I don't know how the incentives are so misaligned, I guess not enough people experienced 3g with good signal (nowadays when you see 3g, you're just out of range and that's why it feels like it's not working) to know that they don't need to upgrade to a plan+phone that can exhaust their data bundle in 90 seconds, or their monthly FUP allowance in a matter of minutes?

      At any rate, yes I agree with your post emphatically, but it doesn't seem like a task specific for browsers to me. Any software can decide to pull an update or other data file (like an offline map file or large video), also while tethered so the downloading software might not even know that it's on a metered network, and cause the same problem

  • rmellow 12 hours ago

    How are the cluster label strings generated?

  • avandekleut 2 days ago

    That's a neat visualization. It took about 35 seconds to load for me, and the actual loading progress appeared to get stuck at 15% for most of the time, which tempted me to close before it was ready.

  • MaheshNat 2 days ago

    these data maps should incorporate level of detail so unnecessary data isn’t loaded on first load, the way mimic does it.
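    One way level of detail could work, as a sketch only: rank points by some importance score and give each a minimum zoom level, so the first load ships just the top-ranked points and each deeper level adds more. This is not how datamapplot or any particular product does it; the score (for example, HN points), `base`, and `growth` are assumptions for illustration.

```python
def assign_min_zoom(scores, base=1000, growth=4):
    """Map each point index to the first zoom level at which it is shown.

    The `base` highest-scoring points get level 0; each following level
    holds `growth` times as many points as the one before it.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    levels = [0] * len(scores)
    level, capacity, used = 0, base, 0
    for i in order:
        if used == capacity:  # current level is full, open the next one
            level += 1
            capacity *= growth
            used = 0
        levels[i] = level
        used += 1
    return levels


def points_for_zoom(levels, zoom):
    """Indices of points that should be loaded at the given zoom level."""
    return [i for i, lvl in enumerate(levels) if lvl <= zoom]


if __name__ == "__main__":
    scores = [5, 1, 9, 3, 7, 2]
    levels = assign_min_zoom(scores, base=2, growth=2)
    print(points_for_zoom(levels, 0))  # [2, 4]: only the two highest scores
```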

  • kissgyorgy 2 days ago

    It would be way more usable if a time range could be selected and it would list the actual threads to the results.

  • deskr 2 days ago

    Doesn't work on Firefox - white screen instead of a map.

    • airstrike 2 days ago

      Works here on Firefox 131.0.2 on macOS Sequoia 15.0.1

    • lucb1e a day ago

      WFM, latest stable Firefox on Linux

  • xyst 2 days ago

    broken on mobile with safari, I’ll check it later today

  • tofof 2 days ago

    This seems deeply flawed. Every 'category' I drill into has mostly articles that are nothing of the sort.

    "Writing and Gaming" has 6+ Sieve of Eratosthenes articles (generating prime numbers), the Lehmer sieve (also primes), and "Migrating from Procmail to Sieve", along with a half dozen articles on the death of cursive. The only vaguely gaming related articles are one on Katamari (nothing about writing) and an exercise of naming big numbers.

    Math/Tech/Gender Representation is articles about Magic the Gathering, or about Larry Wall of Perl fame.

    Monty Hall Problem Explanation has several articles about "The Gettysburg Powerpoint Presentation", articles about the name of the Ogg container format, mythbusting 6 misconceptions about learning English, "in defense of swap common misperceptions", adages have opposites - a list, the Assange case, and an article ("I've got some things to say") about poverty and American football.

    Animal Cognition & Development - an article about Chinese spies (Operation Fox Hunt), a Schneier on Security post about an NSA program called FOXACID, an examination of Bill Nguyen's investments, articles about GitHub merges and security metaphors about food delivery by zebra, a traffic calming mechanism in La Paz Bolivia, a complex journey to the middle ear, 6 Creatures by Hieronymus Bosch That Could Be Pokémon, crippling digestive problems, the Wikipedia list of organisms named after the Harry Potter series, an article about the Cumberland County New Jersey election board, the scientific evidence for discarding a glass of wine that a fruit fly has touched, a 17th century European blood sport of tossing foxes high into the air, an article about the standard English pangram sentence of fox and dog, a hate-filled rant (sierra juliet foxtrot) about "skulking juvenile fascists" and a Mozilla project to incorporate image processing into their browser.

    Lobster News - articles about the sky being blue, in literal and metaphorical senses both. Microsoft Azure. BlueSky, the twitter alternative. An article about 3d engine lighting. The Timing of Evolutionary Transitions Suggests Intelligent Life is Rare. Keeping a grocery store lobster as a pet. Lobste.rs launched 8 years ago. Bootstrapping Ikarus Scheme. Project Daedalus, a theoretical interstellar drive. An article about the metaphorical fall of Icarus, with visualization of the Chutes and Ladders boardgame. A youtube music album called Land Locked Blues. An rpi car pc. Murmuration - the flocking pattern of starlings. Articles about the abuse of a font. An Israeli mining company unearthing a rare mineral. ICARUS, the Swiss neutrino detector. An article about "His Majesty’s Airship No. 1, the rigid, phallic Mayfly". A podcast about the stages of color terms in language and blue's position at the end.

    Black Swan Theory - A cluster of "I fell in love with a female assassin" (literal, an account of the Cambodian civil war), joined with a cluster around "Disneyland with the Death Penalty", a 4,500-word nonfiction article by William Gibson (of Neuromancer fame) about Singapore's governance as an authoritarian and austere city-state, an article about PMS, the Bill Nye the Science Guy reddit ama, the Bill Nye - Ken Ham debate, Bill Nye on the abortion argument, a capitalist named Bill Ackman's views on a possible pyramid scheme called Herbalife, and a big cluster on Covid vaccine's effect on periods and vaginal bleeding.

    I'm sorry, what?
