10B Integers Walk into an Array

(donraab.medium.com)

52 points | by ingve 5 days ago

27 comments

  • zigzag312 3 days ago

    OpenStreetMap has over 9B nodes. To resolve the geometry of a way, you need to look up the node for each point in the geometry.

    The funny thing is that a 64-bit node ID is the same size as the location the node element contains (32-bit longitude + 32-bit latitude) [1].

    [0] https://taginfo.openstreetmap.org/reports/database_statistic...

    [1] https://wiki.openstreetmap.org/wiki/Node
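
    To make the size comparison concrete, here is a minimal sketch (an illustration, not OSM's actual storage code) that packs a lat/lon pair, stored OSM-style as fixed-point integers (degrees scaled by 1e7), into a single 64-bit value, i.e. the same width as one node ID:

        // Illustration only: an OSM-style fixed-point coordinate pair (degrees * 1e7,
        // each fitting a signed 32-bit int) packed into one 64-bit value.
        public class PackedLocation {
            static long pack(int latE7, int lonE7) {
                return ((long) latE7 << 32) | (lonE7 & 0xFFFF_FFFFL);
            }
            static int latE7(long packed) { return (int) (packed >> 32); }
            static int lonE7(long packed) { return (int) packed; }

            public static void main(String[] args) {
                long packed = pack(515_074_560, -1_278_960);  // roughly 51.5074560, -0.1278960
                System.out.println(latE7(packed) + ", " + lonE7(packed));
            }
        }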

    • ghusbands 2 days ago

      Your wiki link states that multiple nodes can share the same latitude and longitude (different heights/layers), so the ID could not be subsumed. (Unless you were trying to imply something else?)

      • zigzag312 a day ago

        Nodes have different uses in OSM. I was referring to the case where nodes provide a location for a way's geometry.

        97.5% of nodes don’t have tags and only contain latitude and longitude data. [0]

        Only 0.01% to 0.05% of the nodes are distinct nodes with the same coordinates, and the majority of them were created in error. [1]

        While there are some cases where a node provides additional information for a way, it would still be much better if ways contained geometry directly (instead of node IDs) and encoded that additional information differently.

        If a way's coordinates contain altitude as a third dimension, it can just store three-dimensional geometry. Any layer tag on a node that is part of a way could probably be moved to the way itself.

        [0] https://media.jochentopf.com/media/2022-08-15-study-evolutio...

        [1] https://github.com/osmlab/osm-data-model/blob/master/geometr...

        • ghusbands a day ago

          When you're down to 90% of them only providing location (unless you stop caring about topology), it's no longer as clear-cut. I was only really arguing against the implication that it's "funny" to use nodes; the topological connections and normalisation/deduplication do have advantages and it looks like a reasonable technical decision to have made.

          I also imagine that creating new topological connections may become trickier when most locations are 'hidden' in ways.

          However, I do hope you find a common way forward that reduces the apparent redundancy and excess processing/storage requirements.

          • zigzag312 a day ago

            I found the added complexity of this normalization funny/ironic when I wanted to process some OSM data, as IMO the benefits don't come close to outweighing the added complexity and costs it causes.

            A different way of creating new topological connections wouldn't be that difficult. For example, some topological connections are already created via relation types.

            I'm not involved with the OSM project, but I think some changes to the OSM data model would benefit it greatly. Resolved geometries would enable stream processing and decrease complexity for the majority of tasks. It would make OSM data much more accessible.

  • smarks 2 days ago

    I've had some discussions with the author on this topic, some of which might have led to his writing this article.

    One of the issues we discussed is trying to do this with Java arrays or collections. As he observes, Java arrays have a maximum size of 2^31 - 1 elements, because they're indexed by a signed 32-bit int. In practice, the Hotspot JVM has a maximum array size of 2^31 - 2 elements. (I'm not entirely sure why this is. I'd guess that it's because the various GC implementations have a maximum memory block size of 2^31 words, and two words are consumed by the object header.)
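
    A quick way to see that limit in practice (a sketch; the exact message and cutoff can vary by JVM version and heap settings):

        // Sketch: on HotSpot, allocating at the theoretical 2^31 - 1 limit typically
        // fails with "Requested array size exceeds VM limit" even when the heap is
        // large enough, while sizes a couple of elements smaller can succeed given
        // sufficient heap (an int[] near the limit needs roughly 8 GB).
        public class ArrayLimit {
            public static void main(String[] args) {
                try {
                    int[] a = new int[Integer.MAX_VALUE];  // 2^31 - 1 elements
                } catch (OutOfMemoryError e) {
                    System.out.println(e.getMessage());    // a VM limit, not heap exhaustion
                }
            }
        }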

    The author considered trying to store 10B elements in a Java Collection, but there are some roadblocks that make this either difficult or insurmountable.

    A Java ArrayList stores all its elements in a single array; thus it has the same limitation as Java's arrays. This is a pretty hard limit.

    A Java LinkedList can actually store more than 2^31 elements. It's a non-intrusive linked list, so each element has a separate list node, which is another Java object. Each list node object consumes 40 bytes (including the object header and fields), which means that storing two billion elements in a LinkedList incurs 80GB of overhead, not including the actual data elements. (And ten billion elements will require 400GB of memory for overhead.) This will probably work if you have enough memory, but this amount of overhead seems prohibitive.

    The author instead used the `fastutil` library to work around these limitations.
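
    For reference, a minimal sketch of what that can look like with fastutil's long-indexed big lists (the author's actual code may differ; holding 10B ints this way needs on the order of 40GB of heap):

        // Sketch using fastutil's IntBigArrayBigList, which backs its elements with a
        // two-dimensional int[][] and is indexed by a long, so it can grow past 2^31 - 1.
        import it.unimi.dsi.fastutil.ints.IntBigArrayBigList;

        public class TenBillionInts {
            public static void main(String[] args) {
                long n = 10_000_000_000L;
                IntBigArrayBigList list = new IntBigArrayBigList(n);  // pre-size the capacity
                for (long i = 0; i < n; i++) {
                    list.add((int) (i % 1_000));
                }
                System.out.println(list.size64());      // 10000000000 (a long, not an int)
                System.out.println(list.getInt(n - 1)); // long-indexed access
            }
        }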

    Pretty impressive that Pharo Smalltalk is able to do this.

  • pjmlp 2 days ago

    I wonder why such interesting post got flagged.

    • smarks 2 days ago

      Yes, very strange to see this flagged. Doesn't seem controversial at all.

    • em-bee 9 hours ago

      maybe because some of us are not comfortable with small talk ;-)

  • mrkeen 3 days ago

    > Who needs more than 2 billion elements in an array?

    The best case I've seen for this is memory mapping (mmap).

    There's long been a tradeoff between simple-data & complex-algorithm and complex-data & simple-algorithm. And the complex-data approach has all but won in the typical OO or web-dev shop.

    In short, mmap gives you the power to treat files as arrays. You can just open a 200GB file and run qsort on it, or do a binary search or whatever.

    The operating system takes care of when to move things in and out of RAM, and it's pretty good at it. Not only that, the data stays in the page cache between invocations of your program. So you can crash out, and resume processing as fast as you can hit ctrl-c, up, enter.
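
    In Java terms, a minimal sketch of that idea using the FFM API's long-indexed memory mapping (Java 22+; the file name is just a placeholder):

        // Sketch: memory-map a large binary file of ints and index it with a long,
        // sidestepping the 2^31 - 1 array limit. The OS pages data in on demand.
        import java.lang.foreign.Arena;
        import java.lang.foreign.MemorySegment;
        import java.lang.foreign.ValueLayout;
        import java.nio.channels.FileChannel;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        public class MappedInts {
            public static void main(String[] args) throws Exception {
                try (FileChannel ch = FileChannel.open(Path.of("ints.bin"), StandardOpenOption.READ);
                     Arena arena = Arena.ofConfined()) {
                    MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
                    long count = seg.byteSize() / Integer.BYTES;            // can exceed 2^31 - 1
                    int first = seg.getAtIndex(ValueLayout.JAVA_INT, 0);
                    int last = seg.getAtIndex(ValueLayout.JAVA_INT, count - 1);
                    System.out.println(count + " ints, first=" + first + ", last=" + last);
                }
            }
        }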

  • Elfener 5 days ago

    > This is one place where static typing and primitive types fall short.

    I can't reproduce it here, but I found it really funny that "short" was monospaced, like it was the type.

    • rbanffy 2 days ago

      I think it was clever.

    • BiteCode_dev 3 days ago

      Sometimes in the middle of the sentence, I write True or None, out of habit.

  • Dagonfly 2 days ago

    It's quite intriguing to think about when and in which applications this bit-length churn will end.

    For example, 64-bit UNIX epoch will work for 292 billion years. That's longer than the age of the universe and obviously enough. Meanwhile, signed 64-bit nanosecond (!) timestamps only range from 1678 AD to 2262 AD. That seems quite restrictive. So ultimately somewhere around 96-bits of time seems to be "enough" for any practical application.
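
    A quick back-of-the-envelope check of those ranges (rough, ignoring leap seconds and calendar details):

        // Rough check: how far a signed 64-bit counter reaches at second vs.
        // nanosecond resolution, measured from the 1970 epoch.
        public class EpochRange {
            public static void main(String[] args) {
                double secondsPerYear = 365.25 * 24 * 3600;               // ~3.156e7 s
                double yearsAtSeconds = Long.MAX_VALUE / secondsPerYear;  // ~2.9e11 (292 billion)
                double yearsAtNanos = Long.MAX_VALUE / 1e9 / secondsPerYear;
                System.out.printf("seconds:     +/- %.0f years%n", yearsAtSeconds);
                System.out.printf("nanoseconds: +/- %.0f years -> %.0f..%.0f AD%n",
                        yearsAtNanos, 1970 - yearsAtNanos, 1970 + yearsAtNanos);  // ~1678..2262
            }
        }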

    For audio we've probably crossed the point already. Uncompressed audio is in the ballpark of a few Mbit/s. Humans will probably never need Gbit audio. What's the limit for video? That seems to be undetermined as yet.

    All of this will limit the max compute resources humanity will need for certain applications. Maybe we will see the widespread use of 128-bit computers during our lifetimes. Mostly for virtual addressing fun. But will 256-bit ever be needed?

    • kibwen 2 days ago

      > Maybe we will see the widespread use of 128-bit computers during our lifetimes.

      Not at current rates of demand. During the 80s and 90s demand for RAM increased a thousandfold per decade, from 1 KB to 1 MB to 1 GB by 2000-ish. At that rate we'd expect normal users to exceed a 64-bit address space by the 2040s. But instead demand has slowed dramatically, and after 20 years a power user might only have 64 gigs, or 128 if they're feeling spicy. Apple is still selling computers with only 8 gigs of un-upgradeable RAM. And as for virtual memory, I doubt that consumer machines have any legitimate reason to chafe against the 16-exabyte limit (until such time as we're selling zettabyte hard drives, which is fantasy at this point). Servers and supercomputers might be a different story, but even servers are going to be loath to pay the cost of halving the speed of ordinary operations just to double the addressable memory space.
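
      As a rough sanity check on that extrapolation (assuming a clean 1000x per decade starting from ~1 GB in 2000):

          // Rough extrapolation: 1 GB in 2000 growing 1000x per decade crosses the
          // 2^64-byte (16 EiB) address-space limit between the 2030 and 2040 data points.
          public class RamExtrapolation {
              public static void main(String[] args) {
                  double bytes = 1e9;                       // ~1 GB in 2000
                  for (int year = 2000; year <= 2040; year += 10) {
                      System.out.printf("%d: %.0e bytes%s%n", year, bytes,
                              bytes > Math.pow(2, 64) ? "  (past 2^64)" : "");
                      bytes *= 1000;                        // 1000x per decade
                  }
              }
          }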

      • Dagonfly 2 days ago

        "Widespread" was maybe the wrong word, I was mostly imagining virtual memory on servers.

        Currently, not all 64 bits are available for user-space VAs. Even with Intel's 5-level paging you "only" get 57-bit virtual addresses (128 PiB in total), with user space typically limited to half of that. On ARM, part of the top byte is reserved for memory tagging as well.

        Remember that with pointer tagging and overcommit you might have orders of magnitude more VA space in use than available RAM. Exhausting 128 PB of virtual addresses in the next 40 years does not seem infeasible to me. CHERI already exists in some niches.

  • superjan 3 days ago

    The author wonders why he does not see a memory footprint for the billions of objects, and conjectures they are stored in-line in the array. I suspect that it is actually achieved by the flyweight pattern, where copies of the same immutable (reference type) object are simply the same object.

    • owl57 3 days ago

      All the numbers in that array are different.

      • superjan 2 days ago

        You’re right. I’ll read more carefully next time.

        • chipsrafferty 2 days ago

          I'd be curious to know how they are stored, since the author either doesn't know himself or didn't think it was important to tell us.

  • rbanffy 2 days ago

    > Note: Server class machines can be configured with terabytes of RAM today.

    Some even come in tower format, so we can call them high-end technical workstations if we really want to.

  • tehjoker 3 days ago

    More people in the world than a uint32_t... never thought about it like that.

    • samatman 3 days ago

      Given that this is the same as the number of IPv4 addresses which exist, I have thought of it in those terms many times.

      One address per human wouldn't be enough, mind you, but we have about half that to work with.

      • chipdart 3 days ago

        > One address per human wouldn't be enough (...)

        What leads you to believe that?

        I alone use more than one device at a time, and my household alone has more than twice as many devices permanently online as residents. If you happen to have more than one SIM card in a mobile phone and enable mobile internet, it also adds up.

        Then we have all the non-human-facing computers chatting over the internet.

        Don't you agree that even today there are far more devices connected to the internet than there are people?

        • lkirkwood 3 days ago

          I think they probably do agree; the GP says one per person wouldn't be enough.

        • wjholden 3 days ago

          The OP is comparing the estimated number of humans on this planet (≈8.2 billion) to the number of globally routable, public IPv4 unicast addresses (fewer than the 2^32 ≈ 4.3 billion total; roughly 3.7 billion once reserved ranges are excluded).

          Network Address Translation (NAT) enables the sharing of public IPv4 addresses, to your point that there are likely more Internet-connected devices in this world than people.

  • anothername12 3 days ago

    Is this just tag bits in the reference pointer?