30 comments

  • maxmouchet 13 hours ago

    Another solution is to use an MMDB (“MaxMind DB”) file [1] which is essentially a binary tree + deduplicated values (same as idea 3.1).

    There are several free ASN MMDBs [2,3] but you can also build your own MMDB files from any Prefix->Value mapping with the mmdbwriter library [4] or a CLI tool built on top of it like mmdbctl [5].

    Assuming the ASN MMDB is fully loaded in memory, it would use around 60MB.

    [1] https://maxmind.github.io/MaxMind-DB/

    [2] https://dev.maxmind.com/geoip/docs/databases/asn/

    [3] https://ipinfo.io/products/free-ip-data-downloads

    [4] https://github.com/maxmind/mmdbwriter

    [5] https://github.com/ipinfo/mmdbctl

    (I work for IPinfo, but there are lots of other companies offering MMDB files).

  • zamadatix 12 hours ago

    If you want to get fun with it you can actually fit the entire IP range in a single 32 bit value. On the internet networks larger than /24s aren't considered valid advertisements so you only need to store 3 bytes (the first 24 bits) of the address to hold the start. If you store than in a uint32 you can use the remaining 8 bits to hold the netmask (you don't actually need all of them but it keeps things byte aligned), which lets you dynamically calculate the end address in just a few CPU cycles. The search still occurs on the base address portion so you don't ever have to calculate the end address until you've already found the right base address.

    • riobard 10 hours ago

      > fit the entire IP range in a single 32 bit value

      … only if you pretend IPv6 does not exist, which isn't the case in the original blogpost.

      But the idea can be extended to fit entire IP (both v4 and v6) subnets in an uint64, given that the minimal IPv6 prefix size to be globally routed is /48, you could use the first byte of the uint64 to mark the address type (e.g. 0x4 for IPv4 and 0x6 for IPv6), the remaining 3 (for IPv4) or 6 (for IPv6) bytes for actual prefix, and the last byte for prefix length.

    • blibble 4 hours ago

      for ipv4 you can do with with just a single lookup, so real O(1)

      store the entire thing in one big array, indexed by the uint24 value of the /24

      so ASN for 0.0.0.0 stored in position 0, ASN for 0.0.1.0 stored in position 1, etc

      and you can drop 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8, 224.0.0.0/4, 240.0.0.0/24 to save 35 /8s

      (256-16-16-3)256256*4 bytes needed -> 55mb

      mmap it if necessary

    • teddyh 10 hours ago

      > On the internet networks larger than /24s aren't considered valid advertisements

      The number 24 is larger, but the network is smaller.

      • zamadatix 10 hours ago

        Good catch, thanks! This should have been "netmasks" instead of "networks" but it's too late for me to edit now.

        • AStonesThrow an hour ago

          A netmask is different and deprecated terminology.

          The "/24" you refer to is a "prefix length", i.e. a network prefix being advertised by BGP.

  • vlovich123 12 hours ago

    Since this is fairly static data, preprocess the on-disk format to match your memory layout and just mmap the file so that your memory usage for the DB is effectively 0 and the OS kernel pages it in/out for you based on system-level needs?

  • zamadatix 12 hours ago

    > As an aside – I’m storing the ASN in a uint32, is that right? I looked in the ip2asn file and the biggest one seems to be 401307, though there are a few lines that say 4294901931 which is much bigger and would not fit in a uint32. I decided to just ignore the gigantic ASNs and assume I could fit everything in a uint32.

    Does anyone have a theory to how that 4294967296 might have made its way into the data? I'm getting 0 matches on the actual internet with:

        birdc show route protocol peer4 | grep 4294967296 -c
    
    And ASNs are certainly not >32 bit.

    Edit: actually, I'm not seeing it in the linked iptoasn data either?

    • de46le 12 hours ago

      https://datatracker.ietf.org/doc/html/rfc6996#section-5 - it's in the range reserved for private use, so it's probably network internal to wherever the list is originally from?

    • maxmouchet 10 hours ago

      Bogus announcements are probably filtered by your upstream(s) (see [1] for a common list of filters).

      IP-to-ASN mappings are typically built from route collectors [2,3] that peer with various networks and receive their announcements. AFAIK route collectors don't filter anything and it's easy to find bogus announcements (e.g. private ASNs) in the data.

      I can't find 4294967296 from a quick glance at the latest RouteViews data but I can find other private ASNs. For example AS7594 - AS2764 - AS4294901866 for 210.10.189.0/24 seen by the route-views.perth collector.

      I don't know what kind of filtering iptoasn.com is doing but at work (ipinfo.io) we do filter bogus origins, as well as a bunch of other things like RPKI/IRR-invalid routes and hyper-specific prefixes (> /24 or /48) [4].

      [1] https://bgpfilterguide.nlnog.net

      [2] https://www.routeviews.org/routeviews/

      [3] https://www.ripe.net/analyse/internet-measurements/routing-i...

      [4] https://hyperspecifics.io

      • zamadatix 10 hours ago

        Actually 4294967296 couldn't ever appear as the maximum value you can fit in the protocol field is 1 less than that... my problem here was I couldn't manage to keep the 2 numbers I was comparing (the one in the article and 2^32) straight haha! This was mistake was noted by a commenter here https://news.ycombinator.com/item?id=41963745

        That said you're ultimately right that my upstream provider is filtering the 4294901866 value from the article as well anyways for the reasons you stated.

        • maxmouchet 10 hours ago

          Ah right haha. Thanks for the heads up, I should have checked ^^

    • rayhaanj 12 hours ago

      Also 4_294_901_931 is actually just small enough to fit in a u32 whose max value is 4_294_967_295

      • astrobe_ 11 hours ago

        Perhaps hex notation helps to see things better: 0xFFFF00AB. It's actually nearly 64K below the maximum. 0xAB is 171.

      • zamadatix 10 hours ago

        Ha, this is exactly right! It's 100% in u32 range and part of the private use reservations so wouldn't appear in the internet table unless such values aren't filtered by the provider and/or it's being used locally.

        Originally I went to write your exact comment because it seemed the value in the article should fit at a glance and then I must have done the comparison check backwards because I started pasting the 2^32 value in the rest of my comment concluded it was actually too large when really I had just jumbled things about.

        Thanks for setting my mind straight!

    • tialaramex 12 hours ago

      The ip2asn-v4-u32.tsv file has

          996520704       996520959       4294901931      Unknown AS4294901931
      
          3523770368      3523770623      4294901931      Unknown AS4294901931
      
          3523771136      3523771391      4294901931      Unknown AS4294901931
  • hnenjoyer 12 hours ago

    Would be nice to see revisited SQLite solution.

    Rest of the described approaches relies on data being in memory rather on disk - not sure whether it was microsd card, SSD over USB, etc. but there's in memory option for SQLite.

    Having that in mind it's time to optimize storage - instead of text just store using two 64 bit integers and it was already noticed two indexes are not very helpful - just create one that has both columns included - preferable as primary key so you don't waste space and already have rest of the select data included in single lookup.

    Happy to see revisited results, but still not sure if that's a best approach for your problem - but seems like obvious improvements for generic SQL approach.

  • pjmlp 8 hours ago

    A good example on how to use better data structures, instead of the usual rewrite into something else.

  • moontear 11 hours ago

    Interesting read - but how cool is Mess with DNS! What a nice live learning tool.

  • gmuslera 12 hours ago

    The trie approach seemed to be a bit plain. Tries are particularly good for storing netblocks, and see if a particular IP belongs to some block stored there. But is better to use binary values instead of strings, and to use patricia or radix trees as they might compress strings of bits without branches with a single node. And, anyway, postgresql already have the netblock type with efficient indexes for them.

    But if that is used for individual IPs, without worrying about blocks they belong to, probably won't get big gains in that area.

  • Timber-6539 12 hours ago

    Had the same issue with restic taking up a lock in the event of an unsuccessful operation. I have a very simple workaround for it.

    ExecStartPre=/usr/bin/restic unlock

    ExecStart=/usr/bin/restic backup

    • LtdJorge 11 hours ago

      That has the benefit of Systemd not letting more than one instance run at the same time, which I'm guessing is one of the reasons the lock was needed in the first place.

      You just have to make sure that the type of Service used is correct, so that Systemd can track whether Restic has actually stopped running.

  • immibis 7 hours ago

    You need an array of 64-bit values: a 32-bit end address, and a 32-bit ASN. Use 0 for unallocated ranges - it's an invalid ASN.

    For IPv6 you need more. It just barely fits if you consider that all public Internet addresses start with 001 and only up to 48 bits can be a published prefix - that's 45 bits - and the public ASNs go up to 19 bits according to your article - that's exactly 64 in total (those high ASNs are bad data from wherever you're getting it from). But it won't work once the next 100000 ASNs get assigned, so you'd better go up to 96 or 128 bits and store the ASN properly in 32 bits.

    For cache efficiency you should have one table of addresses and another parallel table of ASNs. Then every access to the address table that pulls in a cache line doesn't waste half the cache line with ASNs you aren't looking for. It won't affect the total memory use, however.

    A separate table would hold the name and country for each ASN.