10B Integers Walk into an Array

(donraab.medium.com)

52 points | by ingve 5 days ago

27 comments

  • zigzag312 3 days ago

    OpenStreetMap has over 9B nodes. To resolve the geometry of a way, you need to look up the node for each point in the geometry.

    The funny thing is that a 64-bit node ID is the same size as the location the node element contains (32-bit longitude + 32-bit latitude) [1].

    [0] https://taginfo.openstreetmap.org/reports/database_statistic...

    [1] https://wiki.openstreetmap.org/wiki/Node
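
    To make the size comparison concrete, here is a minimal sketch (an illustration, not OSM's actual storage code) that packs a lat/lon pair, stored OSM-style as fixed-point integers (degrees scaled by 1e7), into a single 64-bit value, i.e. the same width as one node ID:

        // Illustration only: an OSM-style fixed-point coordinate pair (degrees * 1e7,
        // each fitting a signed 32-bit int) packed into one 64-bit value.
        public class PackedLocation {
            static long pack(int latE7, int lonE7) {
                return ((long) latE7 << 32) | (lonE7 & 0xFFFF_FFFFL);
            }
            static int latE7(long packed) { return (int) (packed >> 32); }
            static int lonE7(long packed) { return (int) packed; }

            public static void main(String[] args) {
                long packed = pack(515_074_560, -1_278_960);  // roughly 51.5074560, -0.1278960
                System.out.println(latE7(packed) + ", " + lonE7(packed));
            }
        }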

    • ghusbands 2 days ago

      Your wiki link states that multiple nodes can share the same latitude and longitude (different heights/layers), so the ID could not be subsumed. (Unless you were trying to imply something else?)

      • zigzag312 a day ago

        Nodes have different uses in OSM. I was referring to the case where nodes provide a location for a way's geometry.

        97.5% of nodes don’t have tags and only contain latitude and longitude data. [0]

        Only 0.01% to 0.05% of the nodes are distinct nodes with the same coordinates, and the majority of them were created in error. [1]

        While there are some cases where a node provides additional information for a way, it would still be much better if ways contained geometry directly (instead of node IDs) and encoded that additional information differently.

        If a way's coordinates contain altitude as a third dimension, it can just store three-dimensional geometry. Any layer tag on a node that is part of a way could probably be moved to the way itself.

        [0] https://media.jochentopf.com/media/2022-08-15-study-evolutio...

        [1] https://github.com/osmlab/osm-data-model/blob/master/geometr...

        • ghusbands a day ago

          When you're down to 90% of them only providing location (unless you stop caring about topology), it's no longer as clear-cut. I was only really arguing against the implication that it's "funny" to use nodes; the topological connections and normalisation/deduplication do have advantages and it looks like a reasonable technical decision to have made.

          I also imagine that creating new topological connections may become trickier when most locations are 'hidden' in ways.

          However, I do hope you find a common way forward that reduces the apparent redundancy and excess processing/storage requirements.

          • zigzag312 a day ago

            I found the added complexity of this normalization funny/ironic when I wanted to process some OSM data, as IMO the benefits don't come close to outweighing the added complexity and costs it causes.

            A different way of creating new topological connections wouldn't be that difficult. For example, some topological connections are already created via relation types.

            I'm not involved with the OSM project, but I think some changes to the OSM data model would benefit it greatly. Resolved geometries would enable stream processing and decrease complexity for the majority of tasks. It would make OSM data much more accessible.

  • smarks 2 days ago

    I've had some discussions with the author on this topic, some of which might have led to his writing this article.

    One of the issues we discussed is trying to do this with Java arrays or collections. As he observes, Java arrays have a maximum size of 2^31 - 1 elements, because they're indexed by a signed 32-bit int. In practice, the Hotspot JVM has a maximum array size of 2^31 - 2 elements. (I'm not entirely sure why this is. I'd guess that it's because the various GC implementations have a maximum memory block size of 2^31 words, and two words are consumed by the object header.)
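
    A quick way to see that limit in practice (a sketch; the exact message and cutoff can vary by JVM version and heap settings):

        // Sketch: on HotSpot, allocating at the theoretical 2^31 - 1 limit typically
        // fails with "Requested array size exceeds VM limit" even when the heap is
        // large enough, while sizes a couple of elements smaller can succeed given
        // sufficient heap (an int[] near the limit needs roughly 8 GB).
        public class ArrayLimit {
            public static void main(String[] args) {
                try {
                    int[] a = new int[Integer.MAX_VALUE];  // 2^31 - 1 elements
                } catch (OutOfMemoryError e) {
                    System.out.println(e.getMessage());    // a VM limit, not heap exhaustion
                }
            }
        }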

    The author considered trying to store 10B elements in a Java Collection, but there are some roadblocks that make this either difficult or insurmountable.

    A Java ArrayList stores all its elements in a single array; thus it has the same limitation as Java's arrays. This is a pretty hard limit.

    A Java LinkedList can actually store more than 2^31 elements. It's a non-intrusive linked list, so each element has a separate list node, which is another Java object. Each list node object consumes 40 bytes (including the object header and fields), which means that storing two billion elements in a LinkedList incurs 80GB of overhead, not including the actual data elements. (And ten billion elements will require 400GB of memory for overhead.) This will probably work if you have enough memory, but this amount of overhead seems prohibitive.

    The author instead used the `fastutil` library to work around these limitations.
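
    For reference, a minimal sketch of what that can look like with fastutil's long-indexed big lists (the author's actual code may differ; holding 10B ints this way needs on the order of 40GB of heap):

        // Sketch using fastutil's IntBigArrayBigList, which backs its elements with a
        // two-dimensional int[][] and is indexed by a long, so it can grow past 2^31 - 1.
        import it.unimi.dsi.fastutil.ints.IntBigArrayBigList;

        public class TenBillionInts {
            public static void main(String[] args) {
                long n = 10_000_000_000L;
                IntBigArrayBigList list = new IntBigArrayBigList(n);  // pre-size the capacity
                for (long i = 0; i < n; i++) {
                    list.add((int) (i % 1_000));
                }
                System.out.println(list.size64());      // 10000000000 (a long, not an int)
                System.out.println(list.getInt(n - 1)); // long-indexed access
            }
        }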

    Pretty impressive that Pharo Smalltalk is able to do this.

  • pjmlp 2 days ago

    I wonder why such interesting post got flagged.

    • smarks 2 days ago

      Yes, very strange to see this flagged. Doesn't seem controversial at all.

    • em-bee 9 hours ago

      maybe because some of us are not comfortable with small talk ;-)

  • mrkeen 3 days ago

    > Who needs more than 2 billion elements in an array?

    The best case I've seen for this is memory mapping (mmap).

    There's long been a tradeoff between simple-data & complex-algorithm and complex-data & simple-algorithm. And the complex-data approach has all but won in the typical OO or web-dev shop.

    In short, mmap gives you the power to treat files as arrays. You can just open a 200GB file and run qsort on it, or do a binary search or whatever.

    The operating system takes care of when to move things in and out of RAM, and it's pretty good at it. Not only that, the data stays in the page cache between invocations of your program. So you can crash out, and resume processing as fast as you can hit ctrl-c, up, enter.
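
    In Java terms, a minimal sketch of that idea using the FFM API's long-indexed memory mapping (Java 22+; the file name is just a placeholder):

        // Sketch: memory-map a large binary file of ints and index it with a long,
        // sidestepping the 2^31 - 1 array limit. The OS pages data in on demand.
        import java.lang.foreign.Arena;
        import java.lang.foreign.MemorySegment;
        import java.lang.foreign.ValueLayout;
        import java.nio.channels.FileChannel;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        public class MappedInts {
            public static void main(String[] args) throws Exception {
                try (FileChannel ch = FileChannel.open(Path.of("ints.bin"), StandardOpenOption.READ);
                     Arena arena = Arena.ofConfined()) {
                    MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
                    long count = seg.byteSize() / Integer.BYTES;            // can exceed 2^31 - 1
                    int first = seg.getAtIndex(ValueLayout.JAVA_INT, 0);
                    int last = seg.getAtIndex(ValueLayout.JAVA_INT, count - 1);
                    System.out.println(count + " ints, first=" + first + ", last=" + last);
                }
            }
        }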

  • Elfener 5 days ago

    > This is one place where static typing and primitive types fall short.

    I can't reproduce it here, but I found it really funny that "short" was monospaced, like it was the type.

    • rbanffy 2 days ago

      I think it was clever.

    • BiteCode_dev 3 days ago

      Sometimes in the middle of the sentence, I write True or None, out of habit.

  • Dagonfly 2 days ago

    It's quite intriguing to think about when and in which applications this bit-length churn will end.

    For example, 64-bit UNIX epoch will work for 292 billion years. That's longer than the age of the universe and obviously enough. Meanwhile, signed 64-bit nanosecond (!) timestamps only range from 1678 AD to 2262 AD. That seems quite restrictive. So ultimately somewhere around 96-bits of time seems to be "enough" for any practical application.
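
    A quick back-of-the-envelope check of those ranges (rough, ignoring leap seconds and calendar details):

        // Rough check: how far a signed 64-bit counter reaches at second vs.
        // nanosecond resolution, measured from the 1970 epoch.
        public class EpochRange {
            public static void main(String[] args) {
                double secondsPerYear = 365.25 * 24 * 3600;               // ~3.156e7 s
                double yearsAtSeconds = Long.MAX_VALUE / secondsPerYear;  // ~2.9e11 (292 billion)
                double yearsAtNanos = Long.MAX_VALUE / 1e9 / secondsPerYear;
                System.out.printf("seconds:     +/- %.0f years%n", yearsAtSeconds);
                System.out.printf("nanoseconds: +/- %.0f years -> %.0f..%.0f AD%n",
                        yearsAtNanos, 1970 - yearsAtNanos, 1970 + yearsAtNanos);  // ~1678..2262
            }
        }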

    For audio we've probably crossed the point already. Uncompressed audio is in the ballpark of a few Mbit/s. Humans will probably never need Gbit audio. What's the limit for video? That seems to be undetermined as yet.

    All of this will limit the max compute resources humanity will need for certain applications. Maybe we will see the widespread use of 128-bit computers during our lifetimes. Mostly for virtual addressing fun. But will 256-bit ever be needed?

    • kibwen 2 days ago

      > Maybe we will see the widespread use of 128-bit computers during our lifetimes.

      Not at current rates of demand. During the 80s and 90s demand for RAM increased a thousandfold per decade, from 1 KB to 1 MB to 1 GB by 2000-ish. At that rate we'd expect normal users to exceed a 64-bit address space by the 2040s. But instead demand has slowed dramatically, and after 20 years a power user might only have 64 gigs, or 128 if they're feeling spicy. Apple is still selling computers with only 8 gigs of un-upgradeable RAM. And as for virtual memory, I doubt that consumer machines have any legitimate reason to chafe against the 16-exabyte limit (until such time as we're selling zettabyte hard drives, which is fantasy at this point). Servers and supercomputers might be a different story, but even servers are going to be loath to pay the cost of halving the speed of ordinary operations just to double the addressable memory space.
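
      As a rough sanity check on that extrapolation (assuming a clean 1000x per decade starting from ~1 GB in 2000):

          // Rough extrapolation: 1 GB in 2000 growing 1000x per decade crosses the
          // 2^64-byte (16 EiB) address-space limit between the 2030 and 2040 data points.
          public class RamExtrapolation {
              public static void main(String[] args) {
                  double bytes = 1e9;                       // ~1 GB in 2000
                  for (int year = 2000; year <= 2040; year += 10) {
                      System.out.printf("%d: %.0e bytes%s%n", year, bytes,
                              bytes > Math.pow(2, 64) ? "  (past 2^64)" : "");
                      bytes *= 1000;                        // 1000x per decade
                  }
              }
          }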

      • Dagonfly 2 days ago

        "Widespread" was maybe the wrong word, I was mostly imagining virtual memory on servers.

        Currently, not all 64 bits are available for user-space VAs. Even with Intel's 5-level paging you "only" get 57-bit virtual addresses (128 PiB in total), with user space typically limited to half of that. On ARM, part of the top byte is reserved for memory tagging as well.

        Remember that with pointer tagging and overcommit you might have orders of magnitude more VA space in use than available RAM. Exhausting 128 PB of virtual addresses in the next 40 years does not seem infeasible to me. CHERI already exists in some niches.

  • superjan 3 days ago

    The author wonders why he does not see a memory footprint for the billions of objects, and conjectures they are stored in-line in the array. I suspect that it is actually achieved by the flyweight pattern, where copies of the same immutable (reference type) object are simply the same object.

    • owl57 3 days ago

      All the numbers in that array are different.

      • superjan 2 days ago

        You’re right. I’ll read more carefully next time.

        • chipsrafferty 2 days ago

          I'd be curious to know how they are stored, since the author either doesn't know himself or didn't think it was important to tell us.

  • rbanffy 2 days ago

    > Note: Server class machines can be configured with terabytes of RAM today.

    Some even come in tower format, so we can call them high-end technical workstations if we really want to.

  • tehjoker 3 days ago

    More people in the world than a uint32_t... never thought about it like that.

    • samatman 3 days ago

      Given that this is the same as the number of IPv4 addresses which exist, I have thought of it in those terms many times.

      One address per human wouldn't be enough, mind you, but we have about half that to work with.

      • chipdart 3 days ago

        > One address per human wouldn't be enough (...)

        What leads you to believe that?

        I alone use more than one device at a time, and my household alone has more than twice as many devices permanently online as residents. If you happen to have more than one SIM card in a mobile phone and enable mobile internet, it also adds up.

        Then we have all the non-human-facing computers chatting over the internet.

        Don't you agree that even today there are far more devices connected to the internet than there are people?

        • lkirkwood 3 days ago

          I think they probably do agree; the GP says one per person wouldn't be enough.

        • wjholden 3 days ago

          The OP is comparing the estimated number of humans on this planet (≈8.2 billion) to the number of globally routable, public IPv4 unicast addresses (fewer than the 2^32 ≈ 4.3 billion total; roughly 3.7 billion once reserved ranges are excluded).

          Network Address Translation (NAT) enables the sharing of public IPv4 addresses, to your point that there are likely more Internet-connected devices in this world than people.

  • anothername12 3 days ago

    Is this just tag bits in the reference pointer?