What is better: a lookup table or an enum type?

(cybertec-postgresql.com)

51 points | by todsacerdoti 3 days ago ago

22 comments

  • unwind 3 days ago

    I don't database, but I like to think I have some kind of intuition for storage space requirements, and this article was very confusing.

    Ignoring the indexes and just focusing on the main table sizes reported, we have:

    - String ("The frequent repetition of these names inflates the size of the table"): 392 MB

    - Enum data type ("Internally, an enum type is stored as four-byte floating point number. So it saves space in the table [...]"): 338 MB

    - Lookup table ("Also, since a smallint only occupies two bytes, the person_l table can potentially use less storage space than the other solutions"): 338 MB.

    I just can't make sense of the numbers, especially given the authors comments that I've quoted.

    Is this some kind of typo/editing fail?

    • leononame 3 days ago

      I'm also wondering about that. But maybe this could be it?

      > Surprisingly, the table is just as big as with the enum type above, even though an enum uses four bytes. The reason is that each table row is aligned at a memory address divisible by eight, so PostgreSQL will add six padding bytes after the smallint. If we had more columns and could arrange them carefully, we could see a difference.

      This could be the explanation. If the row is padded to 8, bigint is 8, then smallint or enum also use 8. The entries in the string table will be 8 or 16 due to the string length. So one row in person_e and person_l is 16, one row in person_s could be about 20 on average, that is a bit closer to the reality than my intuition, although the storage savings are still less than what I would have expected.

      edit:

      I did also try out the test and dropped the primary key on the table to compare only enum and string size:

        SELECT PG_SIZE_PRETTY(PG_RELATION_SIZE('person_e')), PG_SIZE_PRETTY(PG_RELATION_SIZE('person_s'))
      
        277 MB,330 MB
      
      Does not look like an amazing saving either.
    • gdevenyi 3 days ago

      > Enum type 4-byte floating point number

      This is why the storage is weird. Why would you use a float for distinct number storage!

  • systems 3 days ago

    well uniformity and homoiconicity are very important in an ideal db management system (a.k.a a true rdbms) everything should be represent as a relation and use the same set of operators to be manipulated

    separations of types and relations should be limited to core atomic type, string, int , date etc ... (althought date is debatable as is not usually atomic in most cases, and many dbs end up with one more date relations)

    anyway, always use a table .. when its a choice

    • netcraft 3 days ago

      couldn't have said it better myself.

      Data should be data, queryable, relational. So often I have had to change enums into lookup tables - or worse, duplicate them into lookup tables - because now we need other information attached to the values. Labels, descriptions, colors, etc.

      My biggest recommendation though is that if you have a lookup table like this, make the value you would have made an enum not just unique, but _the primary key_. Now all the places that you would be putting an ID have the value just like they would with an enum, and oftentimes you wont need to join. The FK makes sure its valid. The other information is a join away if you need it.

      I do wish though that there were more ways to denote certain tables as configuration data vs domain data, besides naming conventions or schemas.

      Edit to add: I will say there is one places where I have begrudgingly used enums and thats where we have used something like prisma to get typescript types from the schema. It is useful to have types generated for these values. Of course you can do your own generation of those values based on data, but there is a fundamental difference there between "schema" and "data".

      • systems 3 days ago

        well, if DDL (data definition language) and DML (data manipulation language), were unified and both operated on relation , manipulating meta data would have been a lot simpler, and more dynamics

        you can always created data dictionary relation, where you stored the code for table creation, add meta data, and use dynamic sql to execute the DML code stored in the DB, i worked somewhere where they did this ... sort of

    • mamcx 3 days ago

      Yeah, that is what I think on https://tablam.org, where I consider everything could be a relation, so like

          "hello world" ? where #chars != " " == ["h", "e", ...]
    • 9rx 3 days ago

      > everything should be represent as a relation

      > always use a table .. when its a choice

      Everything should be represented as relations (sets of tuples) but you should always use tables (multisets of tuples) when possible? That seems a little contradictory.

      • systems 3 days ago

        how do you want to represent relations in a DBMS, an enum or a table ?

        • 9rx 3 days ago

          If said DBMS is relational, with relations.

          If said DBMS is tablational, like SQL, then you would have to approximate them using tables and constraints.

          If said DBMS is of an another paradigm, like a document database, there may be no way to represent relations within the DBMS.

          An enum is a construct that numbers things. There is no way to represent a set of tuples with an integer[1]. I'm not sure where you are trying to go with that one. Inversely, you could hold an enum generated value within a relation. Is that what you mean?

          [1] Yes, technically you could break up the individual bits such that they form a set of tuples, but that wouldn't be useful beyond a very narrow use-case and doesn't generalize the way relation implies.

        • psychoslave 3 days ago

          with foreign keys?

  • nlitened 3 days ago

    I also love the approach of ClickHouse with LowCardinality(String). Flexible, clear semantics, high performance

  • Backslasher 3 days ago

    To me, since the DB is there to serve the app (which is there to serve the user), the lookup/enum decision mostly depends on whether the list is defined before build time (> enum) or after (> lookup). US states are probably a solid "before", so you get the added value of easily materializing a validator in the app code. Children IDs sound a bit more dynamic.

  • CuriouslyC 3 days ago

    From a maintainability standpoint lookup tables are miles ahead, but from a DX perspective there are a few cases where enums are nice. Honestly I probably would never use enums again, I feel like it's caused pain every time I've done it.

    • tucnak 3 days ago

      Enums are great if you're into json/jsonb custom logic and aggregates. It's quite cool to use the constraint system to impose checks on various JSON fields, especially if you're doing extension development, or packaging up procedures for downstream consumption.

  • mannyv 3 days ago

    Whatever needs the least number of joins is the best.

  • Joker_vD 3 days ago

    Honestly, the storage use would probably be the last thing of my mind when designing for "what should state/region/district/bundesland/etc. be modelled as". Sometimes those things get renamed, sometimes they are merged, and sometimes they are split. Which means that you may end up in an awkward state when e.g. Mecklenburg-Vorpommern gets split back into Mecklenburg and Western Pomerania, and some of your customers have updated their addresses, and some haven't. You have to store all of that anyway because remember: your DB doesn't represent the current state of the world, it represents your knowledge about the current state of the world (which is where the whole impetus for NULL originated: "I know that the customer has an address, I just don't know what it is", and all related problems with it: compare "I know that the customer actually does not have any address at all", and "I know that this address just can't be correct no longer but I have no new knowledge about what it can be").

  • sublinear 3 days ago

    Basically ugly no matter what.

    In a lot of web apps this need tends to be related to validation, so many just do these lookups and simple comparisons in their app logic and based on static values from config files long before any db query is made. Sometimes you just don't need to involve the database and the performance would be better for it anyway.

  • pavel_lishin 3 days ago

    Not super familiar with the internals of postgres, but what sort of performance would something like this have?

      SELECT person_id
      FROM person_l
      WHERE state_id = (SELECT id FROM state_l WHERE name = 'Burgenland');
  • aksss 3 days ago

    Table with a thread-safe read-through cache in code, imo. But there are places where enums make sense. For instance, things that are specifically in the code's domain.

  • veltas 3 days ago

    Who was child 12

    • teddyh 3 days ago

      Who was child 12̣