SQLite: Outlandish Recursive Query Examples

(sqlite.org)

201 points | by Rendello 8 months ago ago

55 comments

jitl 8 months ago

No matter how much time I spend contemplating recursive CTE examples, I just cannot “get it” enough to write my own without a lot of trial and error and head-scratching. I would love to take a 2 hour class on “thinking in SQL for hackers” or something, but I haven’t really found anything to improve my mental model from “broken” to “working” so far.

[-]

CGamesPlay 8 months ago

A prerequisite to understanding is knowing the CTE syntax ("WITH"). That's just a group of SQL queries that can refer to one another. It's extremely useful for making modular SQL queries and has nothing to do with recursion.

Then when you use "WITH RECURSIVE", the queries can now refer back to themselves. This is just a for loop over a queue of SQL results, conceptually. The part before the UNION ALL fills the queue to start, and the part after the UNION ALL runs once for each result in the queue, adding all results back into the queue.

If you understand this, then you can understand the Sudoku or Mandelbrot examples (definitely don't start with trying to understand these two though). For example, the Sudoku example contains one recursive query, "x(s, ind)". As explained on the page, "s" is the puzzle (unsolved spaces shown as ".") and "ind" is the index of the first unsolved space (or 0 for a solved puzzle). It creates an unsolved puzzle in the initial setup. The for loop body finds all valid values for the first unsolved space, and the next index to solve; then puts all these results into the queue. The final (non-recursive) SELECT in the CTE looks over all results in the queue, and returns the one where the index is 0 (the solved puzzles).

cryptonector 8 months ago

SQL recursive queries are just loops. The CTE is a UNION or UNION ALL query (well, never use UNION ALL unless you really want to loop infinitely!) where one side is the query that seeds the CTE and the other is executed and its outputs added to the CTE, and the last part is repeated until the CTE stops growing. There's just one more detail: the "recursive" side of the UNION query needs to do a JOIN to the CTE itself (thus the "recursion"), but this doesn't change the loop nature of the thing.

That's it. It's just a simple loop.

(Yes, recursion is looping. But in this case it's very clear that a stack is not needed, that the recursion is "tail recursion" if know what that is.)

The hard part lies in getting the JOIN in the "recursive" side of the UNION query right. Here's a transitive closure query:

  WITH RECURSIVE closure AS (
    SELECT parent, child FROM relationships
    UNION
    SELECT c.parent AS ancestor, r.child AS descendant
    FROM closure c
    JOIN relationships r ON c.child = r.parent
  )
  SELECT * FROM closure;

Here `SELECT parent, child FROM relationships` is the seed, and the rest (`SELECT c.parent, r.child ...`) is the recursive part of the query that gets repeated until the whole set of results from the seed and all the repetitions of the recursive part stops growing. The recursive part simply says to add to the `closure` all the children of children of tuples in the closure, but with the parents from the closure as the parents. So if this were human parent/child relationships then initially you might have your parents' relationships to you, and also yours to your children, and this will find that your parents are ancestors to your children.

[-]

CGamesPlay 8 months ago

> However, if the example had used UNION instead of UNION ALL, then SQLite would have had to keep around all previously generated content in order to check for duplicates. For this reason, programmers should strive to use UNION ALL instead of UNION when feasible.

[-]

cryptonector 8 months ago

Of course, if you have loop detection in your business logic when writing to the database, then you can safely use UNION ALL always.

EDIT: Removed a bunch of stuff that was partly based on a faulty memory of how UNION works vs UNION ALL. Indeed, UNION ALL means the RDBMS does not need to keep around the whole result set.

[-]

CGamesPlay 8 months ago

Just to be clear, the reason I mentioned this in response to your original post is that UNION ALL doesn't have anything to do with the query recursing infinitely: it just allows the database engine to not keep previous results around. If you have a query that recurses infinitely, with UNION it will definitely exhaust the memory of the database engine, and with UNION ALL it will never stop returning results, which may exhaust the memory of the database client (unless it discards the results).

remram 8 months ago

An crucial point for me was realizing that "recursive CTEs" are not really recursive but better understood as iterative, in other words a loop. The results of each iteration are fed into the next iteration, until no new result is produced.

[-]

eru 8 months ago

Yes, iteration is a special case of recursion.

[-]

cryptonector 8 months ago

All recursion is iteration. Sometimes you have a stack to help you, and other times you get tail recursion optimization, but it's always a loop.

[-]

eru 8 months ago

What makes you think so?

Please have a look at the formal definition of regular expressions at https://en.wikipedia.org/wiki/Regular_expression#Formal_defi... on Wikipedia, and let me know where the stack and the iterations are. I can't find them.

https://en.wikipedia.org/wiki/Recursive_definition#Well_form... is also a good example.

Regular expressions are a particularly interesting example because you brought up 'loops', so I'm assuming you are interested in how you can implement some of these recursive definitions on a computer.

So for regular expressions a common technique is to compile them to a finite state machine. You can model that in your computer as one block of eg assembly instruction per machine state. To move to a different state, you just 'jmp' to the corresponding code's address.

That's pretty fun to work out, but I still don't see anything I would describe as a 'loop' here; but the recursive analysis still makes perfect sense.

Yes, some programming languages have special purpose looping constructs. But they are of limited usefulness, and not particularly fundamental: they usually get compiled away.

vincnetas 8 months ago

all iterations are recursion but not the other way around.

[-]

cryptonector 8 months ago

When you add an explicit stack to help you keep state then you can have recursion be the same as iteration. Normally the stack you get in recursion is implicit rather than explicit.

[-]

eru 8 months ago

A stack is one way to implement a specific kind of recursion on a computer.

But see https://news.ycombinator.com/item?id=42234121 for other examples of recursion.

fifilura 8 months ago

Isn't recursion exactly that? A loop that feeds into the next iteration.

[-]

Shraal 8 months ago

It's important to differentiate between tail-recursive functions and non-tail-recursive functions. Compilers will often convert tail-recursive functions into their iterative counterparts. See: https://en.wikipedia.org/wiki/Tail_call

In contrast, non-tail-recursive functions will make the call stack grow with each call.

[-]

fifilura 8 months ago

They both fit the description in GP.

"The results of each iteration are fed into the next iteration, until no new result is produced."

hobs 8 months ago

It's just going to be practice, recursion in general is annoying to think about. Start simple, have a loop you want to write instead of a recursive thing, write it step by step instead of trying to do it all at once.

I was implementing newton raphson a few years ago in SQL (so as not to implement it as a straight loop) and iterating over the problem several times really helped.

Get a query. Now think of the next step in the iteration, and you're basically writing a query that connects to that. And now, it runs again and again based on the previous criteria.

If you can isolate each part of the query for each logical step you're going to have a much simpler problem to mentally solve.

danielheath 8 months ago

“Thinking in sql” is hard because it’s an awkward syntax for the (much simpler) relational algebra.

Learn to think in terms of the relational algebra, and how to translate that to/from SQL, and it starts making sense.

[-]

jitl 8 months ago

I can express my ideas well enough in various Datalog/Prolog variants w/ the horn syntax. But when it comes to translating that from several discrete simple propositions into one massive CTE-stack SQL query I get very puzzled. I wrote a toy Datalog-to-SQLite compiler (https://percival.jake.tl) but I struggle to grasp the translation skill myself

[-]

refset 8 months ago

It doesn't help that SQL's recursive CTE model is a bit odd - Frank McSherry proposed a much nicer alternative (which Materialize has implemented): https://materialize.com/blog/recursion-in-materialize/

cryptonector 8 months ago

I don't find that the syntax gets in the way of thinking in relational algebra, but I did learn SQL first.

cryptonector 8 months ago

I highly recommend the O'Reilly SQL Pocket Guide (https://www.oreilly.com/library/view/sql-pocket-guide/978149...). Its explanation of CTEs is fantastic, and it does a great job on many other parts of SQL, and it's a short and sweet book you could read in a few hours. And it will server as a reference long after you first absorb it.

pawelduda 8 months ago

It's unusual and not a commonly needed tool in SQL. I always need a quick refresher on how to write these after a longer break

[-]

DemocracyFTW2 8 months ago

Not commonly needed presumably only because we've grown up to understand that SQL is not a language where you can easily express rescursively hierarchical relationships, such as the in the go-to example where you have a table of employees with a 'reports-to' field. In classical SQL there was no way to write a simple query that will resolve 'who is reporting to who' recursively with an arbitrary depth. Recursive CTEs can do that.

tlarkworthy 8 months ago

Some things that helped me

Every query response in SQL is a rectangular matrix [1]. JOINs add columns sideways. UNIONs add rows vertically[2].

[1] Which is why tree shaped data is awkward to query in SQL in one round-trip.

[2] From this you realize why the column names have to match to apply a UNION, and why recursion is something to do with a UNION.

[-]

fifilura 8 months ago

And if you get stuck, since it is all about rows and columns, just sketch it out in Excel.

Not using formulas in Excel but just using it as a rows/columns editor.

This makes it visually clearer.

hansvm 8 months ago

For "ordinary" working programmers (some denominator of reasonably common knowledge across the industry without any specific skills that help with CTEs in particular), there are a couple mental models I find helpful:

1. Recursion and iteration are duals of each other. Anywhere a "recursive" CTE is a good tool for the job, there exists a natural for-loop or while-loop you would write in an ordinary programming language to solve the same job. Figure out what that looks like, and then you can translate it to SQL. Optimizing further often isn't necessary. The general form isn't terrible (ellipses hide the extra columns you'd have to manually include in most SQL dialects):

  WITH RECURSIVE t(n, ...) AS (
      SELECT 1 as n, * from base_case
    UNION ALL
      SELECT n+1, f(...) FROM t WHERE not_terminated(n, ...)
  )
  SELECT something_interesting(...) FROM t

1 (continued). You can explicitly encode for-loops and while-loops, doing all the normal sorts of things an ordinary programming language allows. SQL is just a different language representing all the things you already know how to do; don't make it complicated until performance becomes a problem.

2. There exists a concept of "equality saturation" that's exceptionally powerful in a number of domains. Everything in (1) is familiar to an ordinary working programmer, but the database's mental model of a recursive CTE is

  while (not_done(working_set)) {
    working_set.extend(process(working_set));
  }

2 (continued 0). One model of solving a sudoku puzzle is an iterative/recursive branch-and-bound algorithm (try a reasonable thing, expand all the easily learnable knowledge, exit if impossible or if done (recursing at each level), go to the next reasonable thing). One "equality saturation" version of that solution is (plus a stopping condition somewhere):

2a. For each partial potential solution

2b. For all the ways in which the potential solution is viable

2c. Expand into all those new potentialities

2 (continued 1). That ability to describe a monotonically increasing set of things which finitely approaches a boundary is powerful in math, powerful in search, powerful in optimization (the Rust EGG crate (e-graphs good) has reasonable documentation for one particular concrete way in which that idea could manifest, if such concreteness helps you learn -- whether you know/like Rust or not), and so on. Gradient descent is just expanding your information till you don't have much more to learn. Optimizing a program is just expanding the set of optimization-based re-writes till you don't have any more re-writes (and pick the best one). Parsing a document is just adding to the things you know about it till no parsing rules can glean any more information. Since that thinking modality is how the database treats your query, you'll usually have better optimized queries if you can naturally formulate your problem in that language (as opposed to the naive iterative solution I proposed in (1)). Not all problems can be thus phrased, but if you're doing a lot of work in your database with recursive CTEs, it's a good idea to spend a week or three hammering home.

3. Combining (1) and (2) a bit, your database will usually have a cost somewhere around O(all_the_rows_you_produce) when evaluating recursive CTEs. These tasks only get hard when performance is a problem and you have to figure out how to take the naive ideas from (1) and transform them into the expected model of (2) in a way that actually reduces unnecessary work. For the sudoku example, you can do that by adding a bit more internal state to transform the breadth-first-search I described into a depth-first-search (making the solution _more_ iterative, interestingly; the "natural" solution is very slow comparatively), but in general you might have to get very creative or might not actually have a reasonable solution available.

daelon 8 months ago

I would also love to take that class!

nbbaier 8 months ago

Sign me up for this class too!

simonw 8 months ago

I thought this might be a relatively new example - I remember having seen the fractal one but I didn't recall the Sudoku one.

Turns out both of those examples have been in that documentation for over a decade now! https://web.archive.org/web/20140331191105/http://sqlite.org...

magicalhippo 8 months ago

I've mostly kept the usage to "sane" stuff like turning delimiter-separated text into rows, or walking a graph.

As much as I enjoy the Mandelbrot set, I bought the Fractint book as a kid, anyone done any outlandish but useful recursive queries?

PS: awesome explanation of how exactly the recursive query works. Wish I had read it when I first needed it in some other DB which help did not have such a clear explanation. Tore out a lot of hair before I got it working right.

[-]

airstrike 8 months ago

12 years ago I wrote a recursive CTE which aggregated a bunch of accounting journal entries from pretty big SAP extracts with intermediate account also needing to be calculated in a way that rolled up to company-level values, with some tagging/indexing in the process. I remember only being able to finish the query after some kind anon helped me in an IRC channel at 4am on a weeknight... to this day I'm immensely grateful to them

I made it into a neat little Django portal with configurable permissions, interactive charts for the data, filtering, etc.

It became a ~10-min async celery task running in the background from what previously used to take the company weeks to create that report in an error prone way with macros written by someone long gone / in another department. I'm still pretty proud of that app even though it never got implemented. I got promoted, moved to a different department and don't think it ever saw the light of day, but I do have the code laying around somewhere

hobs 8 months ago

Last time it was useful I re-implemented XIRR(https://support.microsoft.com/en-us/office/xirr-function-de1...) using Excel's approach in pure SQL so it would be able to make the finance bros happy, it was something like 50,000x faster than the user defined function/loop approach.

JaggerFoo 8 months ago

Years ago I came across a Knapsack problem solution written in Oracle SQL, that I adapted to Daily Fantasy Golf.

Here's the source: https://aprogrammerwrites.eu/?p=878

Cheers

dspillett 8 months ago

The Mandelbrot example had been around in various forms for quite some time, IIRC I first saw it in ~2006 for SQL Server which gained recursive CTE support in its 2005 release. Another example from around that time was using recursive CTEs to define dragon curves using spatial types that would then be drawn using SSMS's support for displaying data in those types.

jschrf 8 months ago

CTEs are extremely useful mechanisms for writing modular and easily maintainable queries, particularly around anything to do with ETL, graphs, and timeseries data. Highly recommend.

rkwz 8 months ago

Not SQLite, but recently used Postgres Recursive CTEs for graph retrieval - https://www.sheshbabu.com/posts/graph-retrieval-using-postgr...

hans_castorp 8 months ago

With Postgres you can also generate fireworks

http://explainextended.com/2009/12/31/happy-new-year/

or snowflakes:

http://explainextended.com/2011/12/31/happy-new-year-3/

Or you can solve a Sudoku riddle in Oracle

https://technology.amis.nl/2009/10/13/oracle-rdbms-11gr2-sol...

[-]

mike_hearn 8 months ago

Intriguing. The Sudoku puzzle in your last Oracle link from 2009 is exactly the same puzzle used in the SQLite example, doing the same thing.

gzel 8 months ago

I really like the way of how feldera implemented recursion in SQL: https://docs.feldera.com/sql/recursion

All you have to do is to add a forward declaration of the view and then you can reference it in a recursive query. Makes the syntax part much easier.

clumsysmurf 8 months ago

Just curios: can CTE's be used to make time series range queries easier / more performant?

Sqlite is the default option on Android and it's pretty common to have time series sensor data that needs to be captured, stored, and analyzed...

But sqlite isn't really meant for a time series workload.

There is also duckdb but I'm not sure about the status of the Android bindings.

[-]

dspillett 8 months ago

> can CTE's be used to make time series range queries easier / more performant?

On their own, I'd guess likely not a lot. If a view would help, a CTE will help similarly without needing the external structure, if a correlated subquery would help, then yes similarly, especially if the pattern is repeated in the overall query.

In conjunction with other things (good indexing, materialised sequences ("numbers" tables), etc.), is guess yes.

Though you need to be much more specific about your data and the queries in question before I can do better than these vague guesses.

remram 8 months ago

No. They are a query syntax, they don't change the storage or retrieval performance.

jitl 8 months ago

nothing really helps with potato slow Android device / storage media. CTE is not magic sauce that will make sqlite go faster.

Depending on the recursion pattern and the overhead of your sqlite driver, it can be faster to do many ID lookups then try to cram it all into one mega CTE query.

https://www.sqlite.org/np1queryprob.html

source: we have the CTE for loading page data in the notion Android app, and the network regularly beats disk on lower end Android devices using whichever query we pick.

firer 8 months ago

I love this stuff. If anybody wants another outlandish example here is an emulator I built: https://github.com/DanielFi/sqlite-vm/blob/main/emulator.sql

westurner 8 months ago

Firefox bookmarks have nested folders in an arbitrary depth tree, so a recursive CTE might be faster; https://www.google.com/search?q=Firefox+bookmarks+%22CTE%22

(Edit) "Bug 1452376 - Replace GetDescendantFolders with a recursive subquery" https://hg.mozilla.org/integration/autoland/rev/827cc04dacce

"Recursive Queries Using Common Table Expressions" https://gist.github.com/jbrown123/b65004fd4e8327748b650c7738...

[-]

leeoniya 8 months ago

storing the tree as an MPTT/NestedSet would massively simplify this, without any subquery shenanigans.

https://en.m.wikipedia.org/wiki/Nested_set_model

https://imrannazar.com/articles/modified-preorder-tree-trave...

[-]

jitl 8 months ago

Well the read query has simpler syntax with MPTT but implementing the whole structure is more complicated and any re-organization like moving a folder around requires rewriting a lot of rows. Although it doesn’t apply to the Firefox use-case, I’ve never understood how this technique can be applied to anything but the most trivially sized, roughly immutable trees. What do you do in a production system when two people move two different node in the tree? It seems to need all kinds of complicated locks.

westurner 8 months ago

There are at least five ways to store a tree in PostgreSQL, for example: adjacency list, nested sets like MPTT Modified Preorder Tree Traversal, nested intervals, Materialized Paths, ltree, JSON

E.g. django-mptt : https://github.com/django-mptt/django-mptt/blob/main/mptt/mo... :

  indexed_attrs = (mptt_meta.tree_id_attr,)
  field_names = (
    mptt_meta.left_attr,
    mptt_meta.right_attr,
    mptt_meta.tree_id_attr,
    mptt_meta.level_attr, )

Nested set model: https://en.wikipedia.org/wiki/Nested_set_model :

> The nested interval model stores the position of the nodes as rational numbers expressed as quotients (n/d).

owlninja 8 months ago

I often have to generate ad-hoc recursive queries for bills of materials. Any one have good ideas to make something more self serve for end users? I guess just some sort of app where the user supplies the top level part works, but I wonder if it should be pre-exploded? Often I am given a list of many parts to explode.

[-]

cryptonector 8 months ago

"Something more self-serve" is essentially what graph databases provide.

If you don't have the time to use one of those or build your own (and you probably shouldn't) sort of thing on top of SQL, then you can instead define a bunch of VIEWs using recursive queries and GROUP BY and aggregation functions to provide something simple for your users to query.

If you do want to build something more general purpose... If you begin by modeling your schema in something other than SQL DDLs (like with an AST) and then model FKs as bi-directional relationships (because they are), and if you add a way to group those relationships, and further provide a way to define a subset of like columns from all the tables in a graph defined by a group of those relationships, then suddenly you can also come up with a simple(ish) language for expressing [sub-]graphs that you want to fetch. And there's your ad-hoc query language for dealing with recursive graphs in your relational data.

[-]

owlninja 8 months ago

I appreciate this!

otteromkram 8 months ago

The SQL formatting on this website is atrocious.

I can't find a good reason not to left-align everything vs trying to right-align keywords and keep everything else on one line until the next keyword.

[-]

thissuchness 8 months ago

Can you give an example of a query you find offensive, and how you'd format it better?

8 months ago

[deleted]