11 comments

  • bob1029 9 hours ago

    You could quickly wire up one of the LLM APIs as a SQLite application-defined function if you wanted to play around with the idea of very slow and expensive queries:

    https://sqlite.org/appfunc.html

    https://learn.microsoft.com/en-us/dotnet/standard/data/sqlit...

    Maybe stick with the aggregate variety of function at first if you don't want any billing explosions. I'd probably begin with something like LLM_Summary() and LLM_Classify(). The summary could be an aggregate, and the classify could be a scalar. Being able to write a query like:

      SELECT LLM_Summary(Comment)
      FROM Users
      WHERE datetime(Updated_At) >= datetime('now', '-1 day');
    
    is more expedient than wiring up the equivalent code pile each time. The aggregation method's internals could handle hierarchical summarization, chunking, etc. Or it could throw an error back to the user so they are forced to devise a more rational query.
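
    A minimal sketch of that shape in Python, using the stdlib sqlite3 module's create_function/create_aggregate (the call_llm helper, the prompts, and the schema are placeholder assumptions, not a tested implementation):

      import sqlite3

      def call_llm(prompt: str) -> str:
          """Hypothetical helper -- point this at whatever LLM API you use."""
          raise NotImplementedError

      def llm_classify(text, labels):
          # Scalar UDF: one API call per row, so cost scales with row count.
          return call_llm(f"Classify into one of [{labels}]: {text}")

      class LLMSummary:
          # Aggregate UDF: collect values in step(), make one call in finalize().
          def __init__(self):
              self.rows = []

          def step(self, value):
              if value is not None:
                  self.rows.append(str(value))

          def finalize(self):
              # Hierarchical summarization / chunking could live here.
              return call_llm("Summarize:\n" + "\n".join(self.rows))

      conn = sqlite3.connect("app.db")
      conn.create_function("LLM_Classify", 2, llm_classify)
      conn.create_aggregate("LLM_Summary", 1, LLMSummary)

    With those registered, a query like the one above can run against that connection.
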
    • simonw 8 hours ago

      I tried that a couple of years ago with a CLI tool that uses Python functions called from SQLite - it worked with GPT-3.5: https://simonwillison.net/2023/Apr/29/enriching-data/

      Example usage:

        openai-to-sqlite query database.db "
          update messages set sentiment = chatgpt(
            'Sentiment analysis for this message: ' || message ||
            ' - ONLY return a lowercase string from: positive, negative, neutral, unknown'
          )
          where sentiment not in ('positive', 'negative', 'neutral', 'unknown')
            or sentiment is null
        "
      
      I haven't revisited the idea for fear of what it could cost if you ran it against a large database, but given the crashing prices of Gemini Flash, GPT-4o mini, etc., maybe it's worth another look!
      • datadrivenangel 5 hours ago

        The API call is the same price per token regardless of how you run it!

      • arthurcolle 6 hours ago

        If you switch the model to GPT-4.5-preview, you can spend a lot of money very quickly.

    • falcor84 8 hours ago

      I love that, and would maybe even add a model-price parameter to each such function call. Perhaps a number in the range 1-10, with 1 being the cheapest model available and 10 being the best available (whatever the price), and then environment settings would choose the actual model to use for each value. And perhaps an overload fail-safe that switches all of the queries to cheaper models as a form of throttling.
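
      A rough sketch of what that tier mapping could look like (the env var names, model names, and throttling rule here are purely illustrative assumptions):

        import os

        # Hypothetical tier -> model mapping, overridable via environment settings.
        MODEL_TIERS = {
            1: os.environ.get("LLM_TIER_1", "gemini-flash"),
            5: os.environ.get("LLM_TIER_5", "gpt-4o-mini"),
            10: os.environ.get("LLM_TIER_10", "gpt-4.5-preview"),
        }

        def resolve_model(tier: int, overloaded: bool = False) -> str:
            # Overload fail-safe: throttle every query down to the cheapest tier.
            if overloaded:
                tier = 1
            # Otherwise pick the nearest configured tier at or below the request.
            eligible = [t for t in MODEL_TIERS if t <= tier]
            return MODEL_TIERS[max(eligible) if eligible else 1]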

    • datadrivenangel 5 hours ago

      Is there any reason why you can't build a full agent in SQL? SQLite with a little sauce should be competitive with LangChain if you think about it a bit.

      • bob1029 5 hours ago

        I don't see why not. Recursive CTEs allow you to express pretty much anything you desire.

        You could also expose additional functions corresponding to the external tools that you would like the agent to have access to and pass these as arguments to additional UDFs.

        You could also lean into a data-driven approach: express much of the configuration in tables, then use the enhanced SQL dialect to tie everything together at runtime. In SQLite, arbitrary queries can be run during UDF execution, so you could pull tool descriptions, parameter lists, enums, etc. from ordinary SQL tables without having to pass explicit args to the functions.
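
        As a toy illustration of the recursive-CTE loop (LLM_Call is a hypothetical UDF, and the five-step cap is an arbitrary guard against runaway API spend):

          import sqlite3

          conn = sqlite3.connect(":memory:")

          def llm_call(transcript: str) -> str:
              """Hypothetical scalar UDF; wire this to a real LLM API."""
              raise NotImplementedError

          conn.create_function("LLM_Call", 1, llm_call)

          # The recursive CTE is the agent loop: each step appends the model's
          # reply to a running transcript, up to a hard step cap.
          AGENT_LOOP = """
          WITH RECURSIVE agent(step, transcript) AS (
              SELECT 0, 'USER: ' || :question
              UNION ALL
              SELECT step + 1,
                     transcript || char(10) || 'ASSISTANT: ' || LLM_Call(transcript)
              FROM agent
              WHERE step < 5
          )
          SELECT transcript FROM agent ORDER BY step DESC LIMIT 1;
          """

          final = conn.execute(AGENT_LOOP, {"question": "Plan my query"}).fetchone()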

  • ofrzeta 7 hours ago

    Here is the implementation: https://github.com/dsg-polymtl/flockmtl

  • simlevesque 8 hours ago

    I'll try this at work tomorrow!