What I wish someone told me about Postgres

(challahscript.com)

441 points | by todsacerdoti 2 days ago ago

186 comments

  • munk-a 2 days ago

    While Postgres is indeed case insensitive about keywords, writing queries with keywords in all caps is usually an effort to increase legibility for visual pattern matching. It absolutely isn't needed, but if I'm debugging a query of yours I will send it through my prettifier so that I can breeze through your definitions without getting hung up on minor weird syntax things.

    It's like prettification in any other language - visual structures that we can quickly recognize (like consistent indentation levels) make us waste less time on comprehension of the obvious so we can focus on what's important.

    The only thing I really object to is "actuallyUsingCaseInIdentifiers": I never want to see columns that require double quotes for me to inspect on the CLI.
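
    For example (a made-up table; unquoted identifiers fold to lowercase, while quoted ones keep their case forever):

        CREATE TABLE widgets ("statusCode" int, status_code int);

        SELECT statusCode FROM widgets;    -- ERROR: column "statuscode" does not exist
        SELECT "statusCode" FROM widgets;  -- works, but you're stuck quoting it everywhere
        SELECT status_code FROM widgets;   -- snake_case never needs quoting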

    • gwbas1c a day ago

      It's really useful to know this when working with SQL interactively.

      Specifically, if I'm banging out an ad-hoc query that no one will ever see, and I'm going to throw away, I don't worry about casing.

      Otherwise, for me, all SQL that's checked in gets the commands in ALL CAPS.

    • wiredfool a day ago

      I find all-caps identifiers wind up just looking like interchangeable blocks, whereas lowercase has word shapes. So all caps just slows down my reading.

      • WorldMaker a day ago

        I feel similarly, and I also have friends with dyslexia who have even stronger opinions on it. All caps, in addition to being "shouting" to my ancient internet-using brain (and thus rude in most cases), creates big similar rectangular blocks as word shapes and is a big speed bump to reading for everyone (whether or not they notice it). For some of my friends with dyslexia, who have a tough time with word shapes at the best of times, all caps can be a hard-stop "cannot read" blocker. They say it is like trying to read a redacted document where someone just made rectangular black marker cross-outs.

        Personally, given SQL's intended similarity to English, I find that I like English "sentence case" for it, with the opening keyword starting with a capital letter and nearly every remaining letter lower case (except for Proper Nouns, the truly case-sensitive parts of SQL like table names). Sentence case has been helpful to me in the past for spotting things like missing semicolons in dialects like Postgres' that require them, near keywords like `Merge` that require them, and for helping to visually make sure the `select` under a `Merge` is intended as a clause rather than starting a new "sentence".

        • akira2501 a day ago

          > I find that I like English "sentence case" for it,

          I could go either way, but if you want to go back and modify a query, this makes it more difficult for me. I just use a block syntax for my queries:

              SELECT   *
              FROM     the_table
              WHERE    some_column = 12
              AND      other_column IS NOT NULL
              ORDER BY order_column;
          • WorldMaker a day ago

            It's a bit of a "Why not both?" situation, I think? You can have blocks and sentence case:

                Select   *
                from     the_table
                where    some_column = 12
                and      other_column is not null
                order by order_column;
            
            That seems so much more readable to me. As I said, that single capital `S` in the outermost "select" has come in surprisingly handy in my experience when scanning through a collection of statements or a transaction or a stored procedure or even just a statement with a bunch of nested sub-selects. It's an interesting advantage I find over "all lower case" or "all upper case" keywords.
            • 4ggr0 17 hours ago

              your example is less readable for me. not by a lot, but still.

              the other example has the commands, in caps, on the left and the values on the right, in lowercase. your example removes one of those aspects and makes everything lowercase. my brain can ignore all-caps stuff, as these are just commands and the things i actually care about mostly are the values.

              but i mean, in the end, it's just preferences. if you write SQL-queries, #1 is that you understand them well :)

              • Ntrails 15 hours ago

                > your example is less readable for me. not by a lot, but still

                Agree, but I wonder how much of that is just the lack of colouring. My brain is suuuuper hard wired to expect all the special keywords to be identified that way as well as by case.

                Mostly I'm a caps guy because my ahk scripts expand text like "ssf","w" and "gb"* to be the way I learned to write them at first.

          • wruza a day ago

            I use a similar structure but without column alignment.

              SELECT
                a,
                b
              FROM t1
              JOIN t2 ON …
              WHERE cond1
                AND cond2
              ORDER/GROUP/HAVING etc
          • cwbriscoe a day ago

               SELECT *
                 FROM the_table
                WHERE some_column = 12
                  AND other_column IS NOT NULL
             ORDER BY order_column;
            
            I usually don't bother with ALL-CAP keywords.
      • conductr a day ago

        I prefer lower case for my personal legibility reasons, and it seems like a prettifier should be able to adjust to that user's preference. It's not a team sport for me, so I've never had a conflict of styles other than converting public code samples to match my way.

      • yen223 a day ago

        I've always found it funny that SQL was designed the way it is to be as close to natural English as possible, but then they went ahead and made everything all-caps

        • paulryanrogers a day ago

          Some old terminals didn't have lower case, like 1960s-era ones.

      • andrei_says_ a day ago

        Also, SQL editors like DataGrip color SQL syntax very well.

    • archsurface 2 days ago

      My understanding is that the caps were syntax highlighting on monochrome screens; no longer needed with colour. Can't provide a reference, it's an old memory.

      • _Wintermute 2 days ago

        Most of the SQL I write is within a string of another programming language, so it's essentially monochrome unless there's some really fancy syntax highlighting going on.

        • jaredklewis a day ago

          Aside, but jetbrains IDEs seem to have some way to detect embedded sql and highlight it. I don’t remember configuring anything to get this feature.

          • wmfiv a day ago

            More than highlighting, they'll also do schema validation against inline SQL strings.

          • TRiG_Ireland a day ago

            VS Code does (did?) detect embedded SQL in PHP and correctly colour it, but only if it's on a single line. Any linebreaks and the colour is turned off. Also, if you're using prepared statements and have an @label, and that label is at the end of the string (so immediately followed by a closing quote), the SQL colouring continues into the rest of the PHP beyond the string. So it's important that single-line SQL statements ending in a @label be edited into multi-line statements to turn off the broken SQL colouring. Odd.

            • munk-a a day ago

              PHP strings tend to have better syntax highlighting with here/now docs (i.e. starting with `<<<TOKEN`). I've found SublimeText to have excellent SQL detection when using these tokens to delineate queries (and the syntax lends itself well to block strings anyways).

          • yen223 a day ago

            This is JetBrains' "language injection" feature if you want to look it up. It works with any language that the IDE supports, and as a sibling comment mentioned, it does more than syntax highlighting.

            https://www.jetbrains.com/help/idea/using-language-injection...

          • wruza a day ago

            Vimers can adapt https://vim.fandom.com/wiki/Different_syntax_highlighting_wi... for a similar thing (but you have to wrap a string into some regular syntax).

          • tomjen3 a day ago

            Another aside: that is true for a huge range of programming languages as well as things like HTML. I believe it can automatically add \ to " in strings when those strings are marked to the IDE as HTML.

          • formerly_proven a day ago

            If you're working with an established framework and project structure, their IDEs pull that information out of it; otherwise you'll need to at least tell it the dialect. But if you e.g. configure the database as a data source in the IDE, you'll get full schema xref.

        • somat a day ago

          Same, I am a bit conflicted: I enjoy SQL and don't really like ORMs, but I hate seeing the big blocks of SQL in my code.

          So I wrote a thing that lets me use SQL text as a function. It is several sorts of terrible, in that way that you should never write clever code, but I am not really a programmer and most of my code is for myself. I keep using it more and more. I dread the day someone else needs to look at my code.

          http://nl1.outband.net/extra/query.txt

        • WorldMaker a day ago

          I find that in the worst cases I can always copy and paste to a quick temporary buffer that is highlighted. I might be doing that naturally anyway if I'm trying to debug it, just to run it in a Data IDE of my choice, but sometimes even just using a scratch VS Code "Untitled" file can be useful (its SQL auto-detect is usually good enough, and switching to SQL is easy enough if it doesn't auto-detect).

      • CoastalCoder 2 days ago

        I think the "color is all we need" idea makes sense in proportion to how many of our tools actually support colorization.

        E.g., the last time I used the psql program, I don't think it had colorization of the SQL, despite running in a color-capable terminal emulator.

        It probably doesn't help that terminal colors are a bit of mess. E.g., piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.

        • dllthomas a day ago

          > piping colored output through 'less' can result in some messy control-character rendering rather than having the desired effect.

          It can, but -r or -R can fix that.

        • mixmastamyk a day ago

          Try pgcli for color and completion.

      • vundercind a day ago

        Like sigils in that regard. Perl-type sigils are extremely nice... if you're editing in Notepad or some ancient vi without syntax highlighting and the ability to ID references on request. Little point to them, if you've got more-capable tools.

    • o11c a day ago

      Note that case handling is a place where postgres (which folds to lowercase) violates the standard (which folds to uppercase).

      This is mostly irrelevant since you really shouldn't be mixing quoted with unquoted identifiers, and introspection largely isn't standardized.

      • yen223 a day ago

        Given that other mainstream RDBMSes let you configure how case handling should happen, Postgres is arguably the closest to the standard.

        Usual caveat of how nobody sticks to the ANSI standard anyway applies.

    • mannyv a day ago

      The uppercase is usually there to show people what's SQL vs what's custom to your database. In books it's usually set in courier.

      I thought he was talking about psql's case sensitivity with table names, which is incredibly aggravating.

    • emmanuel_1234 2 days ago

      Any recommendation for a prettifier / SQL linter?

      • homebrewer a day ago

        IDEA if you want to use it for other things (or any other JetBrains IDE). Nothing comes close feature-wise.

        If you don't:

        - https://www.depesz.com/2022/09/21/prettify-sql-queries-from-...

        - https://gitlab.com/depesz/pg-sql-prettyprinter

        Or https://paste.depesz.com for one-off use.

        • jillyboel a day ago

          i think the idea sql prettifier is pretty silly sometimes. it really likes indenting stuff to make sure things are aligned, which often results in dozens of whitespaces

          • homebrewer a day ago

            It makes it easy to distinguish null vs not null columns and other similar things, so I personally don't mind.

            • jillyboel a day ago

              It's more about queries like (dummy example)

                  RETURN CASE
                             WHEN a = 1 THEN 1
                             ELSE 2
                      END;
              
              where it insists on aligning WHEN past CASE. I think it would be perfectly reasonable to indent WHEN and ELSE 4 spaces less, for example. Similar things happen with nested conditions like (a or (b and c)) all getting pushed to the right
      • gnulinux a day ago

        I'm curious about this for DuckDB [1]. In the last couple months or so I've been using DuckDB as a one-step solution to all problems I solve. In fact my development environment rarely requires anything other than Python and DuckDB (and some Rust if native code is necessary). DuckDB is an insanely fast and featureful analytic db. It'd be nice to have a linter, formatter etc specifically for DuckDB.

        There is sqlfluff etc but I'm curious what people use.

        [1] DuckDB's SQL dialect is very close to Postgres: it's compatible in many ways, has some extra QOL features related to analytics, and lacks a few features like `vacuum full`.

      • mustime 17 hours ago

        If you're interested in streamlining your SQL queries, you might also want to check out https://ai2sql.io/ . It uses AI to generate optimized SQL queries quickly and could be a useful addition to your workflow.

      • NegativeLatency 21 hours ago

        prettier plugin sql, or pg_format

    • MetaWhirledPeas a day ago

      > writing queries with keywords in all caps is an effort to increase legibility for visual pattern matching

      Considering most programming languages do just fine without ALL CAPS KEYWORDS I'd say it's a strange effort. I wish SQL didn't insist on being different this way.

      I agree with you on prettification though. As long as the repository chooses a prettifier you can view it your way then commit it their way. So that's my advice: always demand prettification for pull requests.

      • sensanaty a day ago

        I don't write or read SQL too often, but I prefer the ALL_CAPS because it usually lets me know if something is part of the SQL syntax itself or a function or whatever, or if it's referencing a table/column/etc.

        Obviously not very foolproof, but automated linters/prettifiers like the one in DataGrip do a good job with this for every query I've ever thrown at it.

    • avg_dev 2 days ago

      i agree; but i use caps in my codebase, and lowercase when testing things out manually, just for ease of typing.

      • munk-a 2 days ago

        Ditto - if I'm throwing out an inspection query just to get a sense of what kind of data is in a column I won't bother with proper readable syntax (i.e. a `select distinct status from widgets`). I only really care about code I'll need to reread.

    • grahamplace a day ago

      For checked-in SQL queries, we follow: https://www.sqlstyle.guide/

      The combination of all caps keywords + following "the river" whitespace pattern dramatically improves readability in my opinion

    • neves a day ago

      You can convey more info with color. Any half decent editor can color your SQL.

      All caps letters are more similar and harder to read.

    • zusammen a day ago

      The all caps syntax also helps queries stand out as distinct from typical source languages. It is often helpful, since SQL tends to end up in all sorts of applications.

    • troyvit a day ago

      > no longer needed with colour.

      I think the increased legibility for visual pattern matching also makes SQL easier to read for many of the 350 million color blind people in the world.

    • marcosdumay a day ago

      Well, as long as you aren't imposing the noisy syntax into everybody by pushing the case-change back into the code...

      But getting some editor that highlights the SQL will completely solve your issue.

      • munk-a a day ago

        I think fighting in PRs over syntax preferences is pretty useless, so dev shops should generally have code style guidelines to help keep things consistent. In my company we use all-caps casing since we have momentum in that direction, but I think that decision can be reasonable in either direction as long as it's consistent. It's like tabs vs. spaces: I've worked in companies with both preferences; I just configure my editor to auto-pretty code coming out and auto-lint code going in and never worry about it.

    • jillyboel a day ago

      what is your prettifier of choice for postgres?

  • christophilus a day ago

    I’d never stumbled across the “don’t do this” wiki entry[0] before. Very handy.

    [0] https://wiki.postgresql.org/wiki/Don%27t_Do_This

    • gwbas1c a day ago

      Why don't they deprecate some of these features? If they're such easy stumbling blocks, seems like it makes sense to disable things like table inheritance in new schemas, and require some kind of arcane setting to re-enable them.

      • Macha a day ago

        e.g. the suggested replacement for timestamp is timestamptz, which has its own problems (notably, it eagerly converts to UTC, which means it cannot account for TZ rule changes between storing the date and reading it). If medium term scheduling across multiple countries is something that needs to work in your app, you're kind of stuck with a column with a timestamp and another column with a timezone.

        • Terr_ a day ago

          > stuck with a column with a timestamp and another column with a timezone.

          I've been tinkering with a weird hack for this issue which might help or at least offer some inspiration.

          It's similar to the foot-gun of "eagerly convert straight to UTC" except you can easily recalculate it later whenever you feel like it. Meanwhile, you get to keep the same performance benefits from all-UTC sorting and diffing.

          The trick involves two additional columns along with time_zone and time_stamp, like:

              -- Column that exists as a kind of trigger/info
              time_recomputed_on TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
          
              -- You might be able to do this with special triggers too
              -- This coalesce() is a hack so that the above column changing causes re-generation
              estimated_utc TIMESTAMP GENERATED ALWAYS AS (COALESCE(timezone('UTC', timezone(time_zone, time_stamp)), time_recomputed_on)) STORED
          
          PostgreSQL will recalculate estimated_utc whenever any of the other referenced columns change, including the useless dependency on time_recomputed_on. So you can force a recalc with:

              UPDATE table_name SET time_recomputed_on = now() WHERE time_recomputed_on < X;
          
          Note that if you want time_recomputed_on to be more truthful, it should probably get updated if/when either time_zone or time_stamp is changed. Otherwise it might show the value as staler than it really is.

          https://www.postgresql.org/docs/current/ddl-generated-column...

        • stickfigure a day ago

          `timestamptz` doesn't convert to UTC, it has no timezone - or rather, it gets interpreted as having whatever TZ the session is set to, which could change anytime. Postgres stores the value as a 64 bit microseconds-since-epoch. `timestamp` is the same. It's sad that even the official PG docs get this wrong, and it causes problems downstream like the JDBC driver natively mapping it to OffsetDateTime instead of Instant.

          But you're right that timestamptz on postgres is different from timestamptz on oracle, which _does_ store a timezone field.

          • Ndymium a day ago

            Here's the PostgreSQL documentation about timestamptz:

            > For timestamp with time zone, the internally stored value is always in UTC (Universal Coordinated Time, traditionally known as Greenwich Mean Time, GMT). An input value that has an explicit time zone specified is converted to UTC using the appropriate offset for that time zone. If no time zone is stated in the input string, then it is assumed to be in the time zone indicated by the system's TimeZone parameter, and is converted to UTC using the offset for the timezone zone.

            > When a timestamp with time zone value is output, it is always converted from UTC to the current timezone zone, and displayed as local time in that zone. To see the time in another time zone, either change timezone or use the AT TIME ZONE construct (see Section 9.9.4).

            To me it seems to state quite clearly that timestamptz is converted on write from the input offset to UTC and on read from UTC to whatever the connection timezone is. Can you elaborate on which part of this is wrong? Or maybe we're talking past each other?

            • stickfigure 4 hours ago

              That is, unfortunately, a lie. You can look at the postgres source, line 39:

              https://doxygen.postgresql.org/datatype_2timestamp_8h_source...

              Timestamp is a 64 bit microseconds since epoch. It's a zone-less instant. There's no "UTC" in the data stored. Times are not "converted to UTC" because instants don't have timezones; there's nothing to convert to.

              I'm guessing the problem is that someone heard "the epoch is 12am Jan 1 1970 UTC" and thought "we're converting this to UTC". That is false. These are also the epoch:

              * 11pm Dec 31 1969 GMT-1

              * 1am Jan 1 1970 GMT+1

              * 2am Jan 1 1970 GMT+2

              You get the picture. There's nothing special about which frame of reference you use. These are all equally valid expressions of the same instant in time.

              So somebody wrote "we're converting to UTC" in the postgres documentation. The folks writing the JDBC driver read that and now they think OffsetDateTime is a reasonable mapping and Instant is not. Even though the stored value is an instant. And the only reason all of this works is that everyone in the universe uses UTC as the default session timezone.

              To make it extra confusing, Oracle's (and possibly others') TIMESTAMP WITH TIME ZONE actually stores a timezone. [1am Jan 1 1970 GMT+1] <> [2am Jan 1 1970 GMT+2]. So OffsetDateTime makes sense there. And the generic JDBC documentation suggests that OffsetDateTime is the natural mapping for that type.

              But Postgres TIMESTAMP WITH TIME ZONE is a totally different type from Oracle TIMESTAMP WITH TIME ZONE. In Postgres, [1am Jan 1 1970 GMT+1] == [2am Jan 1 1970 GMT+2].
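
              For a concrete illustration of the "it's just an instant" point (psql, any session TimeZone; the literals are made up):

                  SELECT '1970-01-01 01:00:00+01'::timestamptz
                       = '1970-01-01 02:00:00+02'::timestamptz;   -- true: the same instant

                  SET timezone = 'Asia/Tokyo';
                  SELECT '1970-01-01 01:00:00+01'::timestamptz;   -- displayed as 1970-01-01 09:00:00+09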

              • Macha 3 hours ago

                You are thinking of UTC offsets as zones here, which is wrong. Yes, you can interpret an offset from the epoch in any utc offset and that's just a constant formatting operation. But interpreting a zoned datetime as an offset against a point in UTC (or UTC+/-X) is not.

                You do not confidently know how far away 2025-03-01T00:00:00 America/New_York is from 1970-01-01T00:00:00+0000 until after that time. Even if you decide you're interpreting 1970-01-01T00:00:00+0000 as 1969-12-31T19:00-0500. Postgres assumes that 2025-03-01T00:00:00 America/New_York is the same as 2025-03-01T00:00:00-0500 and calculates the offset to that, but that transformation depends on mutable external state (NY state laws) that could change before that time passes.

                If you get news of that updated state before March, you now have no way of applying it, as you have thrown away the information of where that seconds since epoch value came from.

            • Terr_ 21 hours ago

              Seeing this topic/documentation gives me a sense of deja vu: I think it's been frustrating and confusing a great many perfectly decent PostgreSQL-using developers for over 20 years now. :P

              • Ndymium 21 hours ago

                I agree, "timestamp with time zone" is a terribly misleading name and personally I don't use that type very much.

        • bvrmn 20 hours ago

          timestamptz is no more than a glorified unix timestamp with nice formatting by default. And it's great! It provides easy zone conversions directly in SQL and lets you use your language's actual "date with time zone" type.
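
          The conversions boil down to AT TIME ZONE; a rough sketch (zone name picked arbitrarily):

              -- instant -> wall-clock time in a named zone (timestamptz -> timestamp)
              SELECT now() AT TIME ZONE 'Europe/Paris';

              -- wall-clock time in a named zone -> instant (timestamp -> timestamptz)
              SELECT TIMESTAMP '2024-11-05 09:00' AT TIME ZONE 'Europe/Paris';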

          Naming is highly misleading though.

      • mixmastamyk a day ago

        Several of the broken features are part of the SQL standard.

        • FridgeSeal 20 hours ago

          At what point can we have an update to the standard that fixes a good number of these old hangups?

      • SoftTalker a day ago

        Presumably there are rare exceptions where you DO want to do the thing.

      • __loam a day ago

        Changing defaults can screw over existing users.

      • Parodper 17 hours ago

        The money one honestly sounds like a bug.

    • datadrivenangel a day ago

      This reminds me of SQL Antipatterns, which is a book that everyone who works with databases should read.

    • samarthr1 a day ago

      That was a fun read, thanks!

      Made me reconsider a few habits I picked up from MySQL land

  • jedberg a day ago

    My number one tip: Vacuum every day!

    I didn't know this when I started, so I never vacuumed the reddit databases. Then one day I was forced to, and it took reddit down for almost a day while I waited for it to finish.
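
    (For a rough sense of how close a cluster is to transaction ID wraparound, one standard check is the age of each database's oldest unfrozen transaction ID; trouble starts as it approaches ~2 billion:)

        SELECT datname, age(datfrozenxid) AS xid_age
        FROM pg_database
        ORDER BY xid_age DESC;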

    • FreakLegion a day ago

      No autovacuum? At reddit's scale I'm surprised you didn't run out of transaction IDs.

      • jedberg a day ago

        I turned it off because it would run at inopportune times.

        And we did run out of transaction IDs, which is why I was forced to do it.

        I never turned on the auto-vacuumer but I did set up a daily vacuum.

        Keep in mind I left Reddit 13 years ago and I’m sure they’ve made improvements since.

      • gtaylor a day ago

        Auto vacuuming is enabled now. We did have some near misses due to long running vacuums that barely completed before wraparounds, but got things tuned over time.

        I did none of that work but was on the team where it happened.

  • pavel_lishin a day ago

    A lot of these aren't postgres-specific. (null weirdness, index column order, etc.)

    For example, how nulls work - especially how they interact with indexes and unique constraints - is also non-intuitive in MySQL.

    If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.

    • anarazel a day ago

      > If you have a user table with a non-nullable email column and a nullable username column, and a uniqueness constraint on something like (email, username), you'll be able to insert multiple identical emails with a null username into that table - because a null isn't equivalent to another null.

      FWIW, since Postgres 15 you can influence that behaviour with NULLS [NOT] DISTINCT for constraints and unique indexes.

      https://www.postgresql.org/docs/devel/sql-createtable.html#S...

      EDIT: Added link
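
      A minimal sketch of the difference (hypothetical table, mirroring the example above):

          -- Default behaviour (NULLS DISTINCT): both inserts succeed
          CREATE TABLE users (email text NOT NULL, username text,
                              UNIQUE (email, username));
          INSERT INTO users VALUES ('a@example.com', NULL);
          INSERT INTO users VALUES ('a@example.com', NULL);   -- no error

          -- Postgres 15+: treat NULLs as equal for uniqueness
          CREATE TABLE users2 (email text NOT NULL, username text,
                               UNIQUE NULLS NOT DISTINCT (email, username));
          INSERT INTO users2 VALUES ('a@example.com', NULL);
          INSERT INTO users2 VALUES ('a@example.com', NULL);  -- unique violation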

    • bpicolo a day ago

      I think this is a good pragmatic default. The use case for the alternative is much more rare.

      • pavel_lishin 12 hours ago

        I totally agree - but it's not an intuitive default.

  • marcosdumay a day ago

    > Normalize your data unless you have a good reason not to

    Ouch. You don't want to just say that and move on.

    The author even linked to a page citing 10 different kinds of normalization (11 with the "non-normalized"). Most people don't even know what those are, and have no use for 7 of those. Do not send people on wild-goose chases after those higher normal forms.

    • pavel_lishin a day ago

      But the author did have a paragraph explaining, in general, what they mean.

      And they're right! I've had to fix a few issues of this in a project I recently got moved to. There's almost never a reason to duplicate data.

    • cryptonector a day ago

      The general rule is to normalize to the max, then denormalize till you get the performance that you need.

      • DanHulton a day ago

        The one exception I'll make from the very start is "add tenant identifier to every row, yes, even if it's linked to another table that has tenant identifiers."

        Sure, this means you will have some "unnecessary" `tenant_id` columns in some tables that you could get through a relation, but it saves you from _having_ to include that relation just to limit by tenant, which you will some day almost be guaranteed to want. (Especially if you want to enable row-level security¹ someday.)

        ¹ - https://www.postgresql.org/docs/current/ddl-rowsecurity.html
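
        (A rough sketch of how that pays off once row-level security enters the picture; the table, columns, and setting name are all made up:)

            CREATE TABLE invoice_lines (
                id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
                invoice_id bigint NOT NULL,   -- FK to the parent table in practice
                tenant_id  bigint NOT NULL,   -- "redundant", but makes the policy trivial
                amount     numeric NOT NULL
            );

            ALTER TABLE invoice_lines ENABLE ROW LEVEL SECURITY;

            -- A session that has run SET app.tenant_id = '42' only ever sees tenant 42
            CREATE POLICY tenant_isolation ON invoice_lines
                USING (tenant_id = current_setting('app.tenant_id')::bigint);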

        • magicalhippo a day ago

          We do a variant of this across all our tables. If we have a parent-child-child relationship, then all child tables, regardless of depth, will have the parent id.

          This way we can load up all the data needed to process an order or invoice etc easily, without a ton of joins.

          We don't do multi-tenant, instead separate databases per tenant so far.

          • wadadadad a day ago

            How is this working for you so far? Do you ever need to report across the multiple tenants, and how do database migrations go? I'm starting to look into this, and purely for reporting and database migrations I'm leaning towards multi-tenant.

            • magicalhippo a day ago

              The product I'm working on has not needed to report across multiple tenants.

              We do have some customers which have separate daughter companies which might technically be individual tenants, and where we might need to report across those. But in all those cases so far we've been able to host the databases on the same server, so can easily join tables from the individual databases into the same query.

              Database migrations are very smooth, given each tenant has their own database, so we can do it when they're ready instead of needing a service window that fits all our tenants. We have several that are essentially 24/7 operations, and while they do have periods of low activity that fits a service window, they usually don't line up well between them.

              Likewise schema upgrades are trivial. We have written our own little tool that updates the database given a source schema description in XML (since we've had it for 15+ years now), and as it does not do destructive updates it is low risk. So schema upgrades are done automatically as part of our automated update deployment (à la Windows Update).

              Of course this requires a bit more thought during schema design, and sometimes we do add some manual cleanup or similar of tables/columns to our tool that we know are safe to remove/change.

              One upside is that performance is easy to manage. A single tenant will almost never cause performance issues for others than themselves. If they start doing that we can always just trivially move them to their own server.

              A downside is that we have a lot of large code lists, and currently these are kept per-database as we've traditionally been deployed on-prem. So we're looking to consolidate those.

              We do have another product that I haven't worked on, it's a more traditional SAAS web app thing, and that does have multi-tenant. It's not as business-critical, so the service window constraint isn't an issue there.

              Given that it has an order of magnitude more tenants than the application I work on, I think that multi-tenant was a decent choice. However it has also had some complications. I do recall hearing some issues around it being a more critical point of failure, and also that certain larger customers have started to want their own separate database.

              I think ideally we would have combined the two approaches. Allow for multi-tenant by having a tenant id everywhere, but also assume different tenants will run in different databases.

              • wadadadad 12 hours ago

                Thank you very much for your time, I appreciate it! It definitely seems automated schema updating is necessary if you're doing more than a couple of databases, and you raised many other good points that I hadn't fully considered. I can definitely appreciate larger clients wanting their own dedicated databases, so planning for that initially could be a wise choice. Thank you again!

        • getcrunk a day ago

          Nice one! I'm working on an app that is user/consumer facing but will eventually have a teams/business offering. Everything has a uid with it because users are the core of your app. But yeah, if you're multi-tenant then tenants should be treated like that as well.

        • jherdman a day ago

          What happens if the row is shared between two tenants?

          • zbentley a day ago

            In a great number of (probably most) business-to-business software domains, most rows are only relevant to one tenant at a time. It’s a partition/shard key, basically: queries will only uncommonly retrieve rows for more than one tenant at a time.

            Hassles emerge when the tenant that owns a row changes (client splits or mergers), but even then this is a common and robust architecture pattern.

      • marcosdumay a day ago

        Oh, no, it's absolutely not.

        It's usually to normalize into third normal form. But that's not enough in some cases, it's too much in some other cases, and the reason it breaks is performance about as often as it's not.

      • jayknight a day ago
        • cryptonector a day ago

          Isn't 6NF essentially a flavor of EAV? I think essentially it is.

          6NF means having one non-PK column, so that if the value would be NULL then the row need not exist, and so the value column can be NOT NULL.

          But when you need all the columns of what you'd think of as the <thing> then you need to go gather them from all those rows in all those tables. It's a bit annoying. On the other hand a proper EAV schema has its positives (though also its negatives).

          • jayknight a day ago

            It's similar, but instead of one table holding lots of attributes, there are separate tables that hold optional fields that might otherwise be null if they were in the main table.

        • magicalhippo a day ago

          Every time I see a suggestion like this, I wonder what kind of projects people work on that only require a few columns.

          In my experience every non-trivial project would need several dozen columns in core tables, all which could be NULL while the user works on the record during the day.

          In our current project we'd have to do several hundred joins to get a single main-table record.

          I also wonder why people have such an aversion for NULL. Crappy tools?

          • jayknight 4 hours ago

            No, it's just a dedication to pure relational algebra. Nulls introduce "three-valued logic" into your code/SQL. A nullable boolean column can contain 3 different "values" instead of two. True is not false, true is not null, false is not null (and null is neither true nor false). Or in a nullable numeric field, you have rows where that column is neither greater than zero nor less than zero.

            On the other hand, even if you eliminate all nulls, you still have to do some three-valued logic ("A join B on B.is_something", "A join B on not B.is_something" and the "anti-join": "A where no B exists"), but only in the context of joins.

            It feels a little like Rust's Option type. If something is "nullable", these are tools that force you to deal with it in some way so there are fewer surprises.
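
            The boolean case in a nutshell (a throwaway example):

                CREATE TEMP TABLE t (flag boolean);
                INSERT INTO t VALUES (true), (false), (NULL);

                SELECT count(*) FROM t WHERE flag;       -- 1
                SELECT count(*) FROM t WHERE NOT flag;   -- 1
                SELECT count(*) FROM t;                  -- 3: the NULL row matched neither filter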

          • marcosdumay 2 hours ago

            AFAIK, the 6NF is exclusively for theory research.

      • yen223 a day ago

        "normalise til it hurts, denormalise til it works"

    • hobs a day ago

      I guess this is targeted towards noobs, but the answer is pretty much always 3rd normal form if you are clicking this and are not sure.

      • makr17 a day ago

        Interesting, I would have said Boyce-Codd unless you have a good reason to vary in either direction.

        • philipwhiuk a day ago

          The problem with BCNF is that you're generally enforcing a fairly complex and subject-to-change relation at the database level rather than in application logic.

  • ellisv 2 days ago

    I really wish developers cared more about normalization and stopped shoving everything into a JSON(b) column.

    • abraae a day ago

      Long before databases could even store structured JSON data, junior developers used to bikeshed viciously over the correct degree of normalization.

      More experienced developers knew that the correct answer was to duplicate nothing (except for keys obviously) and then to denormalize only with extreme reluctance.

      Then databases like mongo came along and encouraged those juniors by giving them something like a database, but where normalization was difficult/irrelevant. The result was a brief flowering of horrible database designs and unmaintainable crap towers.

      Now the pendulum has swung back and people have rediscovered the virtues of a normalized database, but JSON columns provide an escape hatch where those bad practices can still flower.

      • christophilus a day ago

        Eh. JSON has its place. I have some stateful data that is fairly transient in nature and which doesn’t really matter all that much if it gets lost / corrupted. It’s the sort of thing I’d throw into Redis if we had Redis in our stack. But the only storage in my project is S3 and Postgres. Postgres allows me to trivially query, update, analyze the usage of the feature, etc. Normalization wouldn’t buy me much, if anything, for my use case, but it would make “save this stuff for later” more of a nuisance (a sync across a bunch of rows vs a single upsert).

        That said, I’ve worked on projects that had almost no normalization, and it was pure hell. I’m certainly not arguing against normalizing; just saying that data blobs are useful sometimes.

      • codr7 a day ago

        Yeah, I'm def not taking any more MongoDB jobs if I can avoid it.

        I'm fine with using it for simple throw away stuff, but deciphering someone else's ball of json is soul killing.

    • yen223 a day ago

      There are two reasons to use a jsonb column:

      1. To store JSON. There's a pattern where when your webserver calls into some third-party API, you store the raw API response in a JSONB column, and then process the response from there. This gives you an auditable paper trail if you need to debug issues coming from that 3rd-party API.

      2. To store sum types. SQL not supporting sum types is arguably the biggest deficiency when modelling data in SQL databases. There are several workarounds - one of them being "just chuck it in a JSONB column and validate it in the application" - but none of the workarounds is particularly great.
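
      For the sum-type case, one common workaround looks roughly like this (hypothetical table; the CHECK constraint only loosely enforces the shape):

          CREATE TABLE payments (
              id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
              method jsonb NOT NULL,
              CHECK (
                  (method->>'type' = 'card'          AND method ? 'last4')
               OR (method->>'type' = 'bank_transfer' AND method ? 'iban')
              )
          );

          INSERT INTO payments (method) VALUES ('{"type": "card", "last4": "4242"}');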

      • zie a day ago

        I would add:

        3. End-user extra fields. Stuff you don't care about, but someone somewhere does.

    • devin a day ago

      Even if you care about it, you will still often wind up with a junk drawer of JSONB. I don't really see it as a problem unless people are writing bad queries against it instead of lifting values out of the JSONB into their own columns, etc.
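
      One way to lift a hot value out of the junk drawer without touching the writers (made-up table and key):

          CREATE TABLE events (id bigint PRIMARY KEY, payload jsonb NOT NULL);

          -- A typed, extracted copy that Postgres keeps in sync itself
          ALTER TABLE events
              ADD COLUMN customer_id bigint
              GENERATED ALWAYS AS ((payload->>'customer_id')::bigint) STORED;

          CREATE INDEX ON events (customer_id);

          -- Or, without changing the row shape at all, index the expression directly
          CREATE INDEX ON events (((payload->>'customer_id')::bigint));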

      • mnahkies a day ago

        Yeah exactly, and I'll take a JSON(B) column over TEXT with maybe valid serialised JSON, maybe RON, maybe something completely random any day

    • randomdata a day ago

      Most developers using these kinds of tools these days are actually building their own database management systems, just outsourcing the persistence to another DBMS, so there isn't a strong imperative to think about good design so long as it successfully satisfies the persistence need.

      Whether we actually should be building DBMSes on top of DBMSes is questionable, but it is the current state of affairs regardless.

      • tightbookkeeper a day ago

        A previous employer thought that SQL databases didn't understand graphs, so they made their own system for serializing/deserializing graphs of objects into Postgres. They never used queries and instead had their own in-memory operators for traversing the graph, and had to solve problems like deleting an entry and removing all references, and partial graph updates.

        And I still don’t think it works.

    • theamk a day ago

      This needs a working schema migration process, including the ability to undo a schema change if the new column tanks performance or breaks stuff.

      If there are CLI tools involved, you also need to ensure you can handle some downtime, or do synchronized version update across company, or support both old and new schemas for a while.

      If a database is not part of team's primary product all of this could be missing.

  • avg_dev 2 days ago

    this is really nice. i am glad the author put it together. i didn't know the pg docs were 3200 pages long! i have been using it for a while and try to learn as i go. i really do like the docs. and i also like to read articles on various particular subjects as i find a need to.

    i think the author might find it helpful to readers to add a note to https://challahscript.com/what_i_wish_someone_told_me_about_... that if someone is selecting for b alone, then an index on columns (b, a) would work fine. i think this is kind of implied when they talk about selecting on a alone, but maybe it wouldn't hurt to be extra explicit.
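
    e.g. (made-up table):

        create table t (a int, b int);
        create index on t (b, a);

        -- these can use the index, since b is the leading column:
        explain select * from t where b = 1;
        explain select * from t where b = 1 and a = 2;

        -- this one generally can't use it efficiently, since a isn't the leading column:
        explain select * from t where a = 2;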

    (i didn't spend much time on the json/jsonb part since i rarely use that stuff)

  • tomcam 2 days ago

    I wrote this to help beginners: https://tomcam.github.io/postgres/

  • the5avage a day ago

    From my experience, I have seen a lot of hilarious SQL stuff in the wild.

    It would be a good start to read Codd's paper and try to understand what the relational model is. It's only 11 pages long, and doing that would reduce the suffering in this world.

  • nasretdinov 2 days ago

    Nice article! One thing I'd add is that almost all of it applies to other MVCC databases like MySQL too. While some details might be different, it too suffers from long transactions, holds metadata locks during ALTERs, etc., all the good stuff :).

  • whiskey-one 2 days ago

    Really great post! It belongs on a reading list somewhere for everyone who is using Postgres independently or as a part of their stack.

  • leoqa a day ago

    I had an interesting problem occur with the pg stats. We were migrating and had a version column, i.e. key, val, version.

    We were migrating from version 1 to version 2, double writing into the same table. An index on (key, val, version) was being hit by our reader process using a where clause like key=k and version=1.

    When we flipped the reader to read version 2, the latency jumped from 30ms to 11s. Explain showed a sequential scan even though the index could serve the query. I was able to use MATERIALIZED and reorder CTEs to get the planner to do the right thing, but it caused an outage.

    We were autovacuuming as well. I ended up deleting the old version and rebuilding the index.

    My theory is that because the read load was on 50% of the data, the stats were super skewed.
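
    (For readers who haven't seen it, the MATERIALIZED fence mentioned above looks roughly like this; the table and column names are made up:)

        -- AS MATERIALIZED (Postgres 12+) forces the CTE to be evaluated on its own
        -- rather than inlined into the surrounding query plan
        WITH candidates AS MATERIALIZED (
            SELECT key, val
            FROM the_table
            WHERE key = 'k' AND version = 2
        )
        SELECT *
        FROM candidates
        WHERE val IS NOT NULL;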

  • ahachete a day ago

    > Most notably, 'null'::jsonb = 'null'::jsonb is true whereas NULL = NULL is NULL

    Because 'null' in the JSON spec is a literal value (a constant), not SQL's NULL. Nothing to see here.

    https://datatracker.ietf.org/doc/html/rfc7159

    • hibachrach a day ago

      Right--it makes sense and shouldn't be changed, but it's a bit unintuitive for newcomers.

  • sroerick a day ago

    It's always a relief to read articles like this, realize I know 90% of it, and feel like I've deserved the jobs I've had.

    Great and super useful notes

  • jaza a day ago

    Only some of these are really Postgres specific (use "text" / "timestamptz"; make psql more useful; copy to CSV). Most of them apply to relational databases in general (learn how to normalise!; yes, NULL is weird; learn how indexes work!; locks and long-running transactions will bite you; avoid storing and querying JSON blobs of doom). Not that that detracts from the usefulness of this article - pretty much all of them are important things to know when working with Postgres.

  • roydivision 19 hours ago

    Really like the

    Don't <thing>

    Why not?

    When should you?

    format of the Postgres Don't Do This page.

  • zmmmmm a day ago

    > It’s possible that adding an index will do nothing

    This is one of the more perplexing things to me, where the Postgres ideology is a bit too strong, or at least the way it works is too hard for me to understand (and I've tried - I'm not going to claim I'm a genius, but I'm also not a moron). I hear there may finally be support for some kind of hints in upcoming versions, which would be very welcome to me. I've spent way too much time trying to divine the voodoo of why a slow query is not using indexes when it seems obvious that it should.

  • jerrygoyal a day ago

    I recently asked a similar question on reddit and got many inputs https://www.reddit.com/r/PostgreSQL/comments/1gbr0it/experie...

  • nodakai 2 days ago

    That nested JSON query operator chains such as json_col->'foo'->'bar'->>'baz' internally return (copy) entire sub-objects at each level and can be much slower than jsonb_path_query(json_col, '$.foo.bar.baz') for large JSONB data

    ... although I haven't had the chance to verify this myself

    • mdaniel a day ago

      I got nerd-sniped on this, because I actually hadn't heard that before and would be horrified if it were true. It took some non-trivial digging to even get down into the "well, what does foo->>'bar' even map to in C?" level. I for sure am not claiming certainty, but based merely on "getIthJsonbValueFromContainer" <https://sourcegraph.com/github.com/postgres/postgres@REL_17_...> it does seem that they do palloc copies for at least some of the JSONB calls

      • anarazel a day ago

        Postgres does have infrastructure to avoid this in cases where the result is reused, and that's used in other places, e.g. array constructors / accessors. But not for jsonb at the moment.

    • hans_castorp 2 days ago

      You can also use the #>> operator for that:

          json_col #>> '{foo,bar,baz}'
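
      All of these get at the same nested value (json_col and the keys come from the example upthread; some_table is made up). Note that the jsonpath form returns jsonb rather than text:

          SELECT json_col->'foo'->'bar'->>'baz'                     AS via_arrows,
                 json_col #>> '{foo,bar,baz}'                       AS via_path_operator,
                 jsonb_path_query_first(json_col, '$.foo.bar.baz')  AS via_jsonpath
          FROM some_table;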
  • vivzkestrel a day ago

    Since we are on the topic, and since your article clearly mentions "Normalize your data unless you have a good reason not to", I had to ask. I am trying to build a news aggregator and I have many websites. Each of them has a slightly different format. Even though I use feedparser in Python, it still doesn't change how some of them put HTML text inside content while some of them break it down into a separate media XML attribute, retaining only a basic textual summary inside a summary attribute. Do you think it makes more sense to store a single RSS item as a column inside Postgres, or should it be stored after parsing it? I can see upsides and downsides to both approaches. Store it as XML and you have the ability to change your processing logic down the line for each stored item, but you lose the flexibility of querying metadata and you also have to parse it on the fly every single time. Store it in multiple columns after processing it and it may require different logic for different websites, plus changing your overall Python processing logic requires a lot of thinking about how it might affect some source. What do you guys recommend?

    • sea6ear a day ago

      With the caveat that you probably shouldn't listen to me (or anyone else on here) since you are the only one who knows how much pain each choice will be ...

      I think that, given that you are not really dealing with structured data - you've said that different sites have different structures - even with processing, you may not be able to generate identical metadata structures from each entry.

      I think I would go for one column of XML, plus maybe another column that holds a parsed data structure that represents the result of your processing (basically a cache holding the post-processed version of each site). Hopefully that could be re-evaluated by whatever language (Python?) you are using for your application. That way you don't have to do the full parsing each time you want to examine the entry, but you have access to something that can quickly give you whatever metadata is associated with it, but which doesn't tie you to the rigid structure of a table based database.

      Once you know what you are really doing with the data, then you could add additional metadata columns that are more rigid, and which can be queried directly in SQL as you identify patterns that are useful for performance.

      • vivzkestrel 20 hours ago

        i am using the feedparser library in python https://github.com/kurtmckee/feedparser/ which basically takes an RSS url and standardizes it to a reasonable extent. But I have noticed that different websites still get parsed slightly differently. For example look at how https://beincrypto.com/feed/ has a long description (containing actual HTML) inside but this website https://www.coindesk.com/arc/outboundfeeds/rss/ completely cuts the description out. I have about 50 such websites and they all have slight variations. So you are saying that in addition to storing parsed data (title, summary, content, author, pubdate, link, guid) that I currently store, I should also add an xml column and store the raw <item></item> from each url till I get a good hang of how each site differs?

    • krab a day ago

      Just shooting without knowing your real needs - take this with a grain of salt.

      Store some parsed representation that makes it easier for you to work with (probably normalized). Keep an archive of raw data somewhere. That may be another column, table or even S3 bucket. Don't worry about schema changes but you need to not lose the original data. There are some pitfalls to schema migrations. But the schema should be the representation that works for you _at the moment_, otherwise it'll slow you down.

    • setr 19 hours ago

      If you’re going to be parsing it anyways, and there’s the possibility of actually doing something with that parsed info beyond just reprinting it, then storing the post-parse results is probably better. Especially if you’re only going to need a reduced subset of the information and storage matters.

      If you’re just reprinting — you’re parsing only for the sake of rendering logic — then storing the parse-result is probably just extra unnecessary work.

      Also if storage isn’t a concern, then I like using the database as intermediate storage for the pipeline. Grab the RSS, stuff it in the DB as-is. Take it out of the DB, parse, store the parse results. Etc. You’ll have to do this anyways if you’re going to end up with a processing queue (and you can use PG as a simple queue.. SELECT…FOR UPDATE), but it’s nice-to-have if your pipeline is going to eventually change as well — and you’re able to reprocess old items
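
      A rough sketch of that queue pattern, assuming a hypothetical raw_items table with a processed_at column (SKIP LOCKED keeps concurrent workers from blocking each other):

          UPDATE raw_items
          SET    processed_at = now()
          WHERE  id = (SELECT id
                       FROM   raw_items
                       WHERE  processed_at IS NULL
                       ORDER  BY id
                       LIMIT  1
                       FOR UPDATE SKIP LOCKED)
          RETURNING *;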

      > Store it in multiple columns after processing it and it may require different logic for different websites + changing your overall python processing logic requires a lot of thinking on how it might affect some source.

      Don’t you need to deal with this problem regardless? Ideally you’ll find a common subset you actually care about and your app-logic will look like website —> website handler -> extract data subset —> add to common structure —> render common structure

      Even if you don’t use a database at all your app-code will need to figure out the data normalization

    • cryptonector a day ago

      With all RDBMSes the rule is "normalize to the max, then denormalize till you get the performance that you need".

  • samsartor a day ago

    Instead of psql, I really like https://github.com/dbcli/pgcli

    • mustime 17 hours ago

      You might also enjoy using https://ai2sql.io/ ! It’s an AI-powered tool designed to handle complex query generation and formatting, which might complement your use of pgcli well.

  • abobov 20 hours ago

    My tip is a better pager for psql: https://github.com/okbob/pspg

  • RedShift1 a day ago

    I can't horizontally scroll on mobile, can't see the full query texts...

    • hibachrach a day ago

      (author here) hmm it seems to work on my device. Are you scrolling within the code block?

      • Lvl999Noob 9 hours ago

        Unrelated to that issue, the right hand side TOC does not display completely. I am using firefox on windows with a default zoom of 120%. The TOC ends up going below the screen and hiding the last few entries. It also floats above the footer and hides the text there.

        If I may suggest, use `position: sticky` instead of `position: fixed` or some equivalent to whatever trick you used.

      • RedShift1 a day ago

        I'm trying to but the entire page scrolls horizontally instead of just the code block.

        • hibachrach a day ago

          It might be because of the larger tables. Thanks for letting me know--I'll take a look at it soon.

  • fatih-erikli-cg 21 hours ago

    Storing data in text costs less. A tcp connection to get some blog posts from another process is not necessary.

    • bvrmn 20 hours ago

      On the other hand, it's a mere blog post; you should not be bothered by the TCP cost. But reliable data storage and the cost of reliability/restore (in the case of backups) is a concern.

      • fatih-erikli-cg 18 hours ago

        It is mostly a blog post. A use case for a database that holds tables and rows is very rare in the real world. I know no one who uses the contacts app on their mobile phone. Access has already been there, with checkboxes, select inputs and everything, for years. No one uses it.

  • cryptonector a day ago

    Watch out: for row/record values, IS NULL is only true when every field is NULL, and IS NOT NULL is only true when every field is non-NULL, so a partly-NULL row is neither. You want to use IS [NOT] DISTINCT FROM NULL, full stop.
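
    A quick demonstration of the weirdness:

        SELECT ROW(1, NULL)    IS NULL      AS partly_null_is_null,      -- false
               ROW(1, NULL)    IS NOT NULL  AS partly_null_is_not_null,  -- false (!)
               ROW(NULL, NULL) IS NULL      AS all_null_is_null;         -- true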

  • teddyh a day ago

      \pset null '␀'
  • deadbabe a day ago

    Don’t create views that reference other views.

  • 5Qn8mNbc2FNCiVV a day ago

    Your code sections are almost unscrollable on mobile

    • hibachrach a day ago

      (author here) hmm it seems to work on my device. what happens when you try to scroll?

  • michaelcampbell 14 hours ago

    > null weirdness

    Oracle enters the chat... (last I used it it considered an empty string '' the same as NULL)

  • philipwhiuk a day ago

    > Name your tables in snake_case

    This bit me. It's highly tedious that case isn't preserved honestly.

  • cyanydeez 2 days ago

    I'd say I tend to ignore the standard docs because they rarely have examples and rely on the arcane procedure of trying to decipher the super command options with all its "[OR THIS|THAT]".

    I assume _someone_ can read this pseudo programming, but it's not me.

    • PhilippGille 2 days ago

      I love the diagrams from the SQLite documentation: https://www.sqlite.org/syntaxdiagrams.html

      • teddyh a day ago

        Those are commonly called “railroad diagrams”: <https://en.wikipedia.org/w/index.php?title=Syntax_diagram&ol...>

      • o11c a day ago

        I've been bitten by those before because they are not generated from the actual syntax-parsing code, and thus are sometimes out of sync and wrong (or at least misleading).

      • christophilus a day ago

        SQLite diagrams plus examples are pretty much the way I wish all of my documentation worked.

    • smartbit a day ago

      PostgreSQL Administration Cookbook series served me well

      https://www.oreilly.com/library/view/postgresql-16-administr...

    • paulryanrogers 2 days ago

      Sometimes I find it annoying but mostly it works well for me. I've come to embrace the find feature and visually scanning over any parenthetical stuff.

      The alternative is they have to break it up into several quasi examples, each with their own optional modifiers.

    • mustime 17 hours ago

      For real-world SQL examples and generating queries fast, you might like https://ai2sql.io/ . It's an AI-based tool that lets you enter natural language prompts and outputs efficient SQL code tailored to your needs.

    • assbuttbuttass 2 days ago

      I'm looking at the postgres docs now, and they certainly have the arcane formal syntax definitions, but they also have plenty of useful examples https://www.postgresql.org/docs/current/sql-altertable.html#...

    • Clubber 2 days ago

      FWIW square brackets indicate an optional parameter and the pipe inside indicates all the parameters available; but I agree, I don't particularly care for that notation and prefer actual running SQL examples.

      • hobs a day ago

        Railroad-esque diagrams can be weird but they say a lot in a very short space, I would highly recommend spending a little extra time working on thinking through them, they are everywhere!

        • Clubber a day ago

          >Railroad-esque diagrams

          Now and forever I will think of railroads when I see these.

          • hobs a day ago

            Well, you should! But, I didn't invent anything there, that's what they are named :)

            Syntax diagrams (or railroad diagrams) are a way to represent a context-free grammar. They represent a graphical alternative to Backus–Naur form, EBNF, Augmented Backus–Naur form, and other text-based grammars as metalanguages... https://en.wikipedia.org/wiki/Syntax_diagram

  • mikrotikker a day ago

    Shout-out to my postgres server that has been sitting untouched, doing its thing perfectly for 10 years. You're a real one.

  • cynicalsecurity a day ago

    People who use PostgreSQL instead of MySQL just want to suffer while pretending "they are better than others".

  • andrewstuart 2 days ago

    Dump your schema, drop the entire thing into ChatGPT or Claude and ask it to write your Postgres query.

    Then ask it to rewrite the query three different ways and explain the pros and cons of each.

    Do the same for any existing queries in your system … drop in the schema then the query and ask for an analysis and suggested improvements.

    • bilekas a day ago

      People have too much faith in LLMs at the moment. This might be able to give some insights, but if you're trying to LEARN something in detail, such as the pros and cons of a particular custom query on your unique-ish schema, the LLM will be prone to making stuff up by design; it wasn't trained on YOUR data.

      It will give some good basic results and probably some functional queries, but it misses the mark with the finer points of optimization etc. If you don't know how to properly build the queries, go learn it properly with confidence that what you're learning is actually reliable.

    • cpursley a day ago

      This is similar to what I've done with my SaaS, where most of the logic lives in Postgres. I went through much of the logic and refactored it with GPT a while back and improved it significantly. It's soooo good with SQL. Like insanely good. And these days I'm using Claude for all things code, which is significantly better than GPT.

      • sgarland a day ago

        Agreed that LLMs are quite good at SQL (and relational algebra). However, ask yourself this: how many programming languages do you know to at least a comfortable level? If N > 1, why not add another (SQL)? It’s not a difficult language, which is in part why LLMs are so good at it – it’s easy to predict the next token, because there are so few of them.

        • cpursley a day ago

          I know SQL pretty well (enough to write complex procedures) and prefer it over doing biz logic in application code (yuck ORMs). But SQL is just such a wonky syntax that I appreciate any help that I can get.

    • cyanydeez 2 days ago

      then have corporate knock on your team's window asking why all its IP ended up with a Chinese competitor?

      • junaru 2 days ago

        Schema != data.

        • jrpelkonen 2 days ago

          “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.” - Fred Brooks

          • josephg 2 days ago

            Yep, but if you think your database schema is "secret sauce", you're fooling yourself. You can almost always guess the database schema if you're given 10 minutes to play with the application.

        • PittleyDunkin 2 days ago

          It's certainly data to the extent that it's copyrightable. I doubt this sentiment would stand up in court.

        • hobs a day ago

          Generally your schema is your code, which IS your IP - your data is often NOT your IP (and often might just be a list of facts, a thing which is not copyrightable)

  • onoke 2 days ago

    Haha, good job. :)

  • TwoNineFive a day ago

    "going on a journey similar to mine"

    At this point "journey" is a cringe word because of its excessive usage in blogspamverts. It tells me you are a gross aspiring influencer.

    • hibachrach a day ago

      lol this is the first blog post i've written in 4 years and i didn't even post it here so I think you're imagining things

  • Jean-Papoulos 19 hours ago

    >"Normalize your data"

    >"You don't need to write SQL in uppercase"

    >"What's an index ?" section

    From reading this, it's clear that the author never sat down and learned to use databases from the ground up. The author started using them and learned as he went, so his "tips" include things you'll be told in the first hour of any course.

    This doesn't hold any value for someone who's been using databases for almost any length of time.