Yay! Excited to see DataChain on the front page :)
Maintainer and author here. Happy to answer any questions.
We built DataChain because our other tool, DVC, couldn't fully handle data transformations and versioning directly in S3/GCS/Azure without copying data.
The analogy of "DBT for unstructured data" fits DataChain very well, since it transforms data (using Python, not SQL) inside the storage (S3, not a DB). Happy to talk more!
Cool! Does this assume the unstructured data already has a corresponding metadata file?
My most common use cases involve getting PDFs or HTML files and I have to parse the metadata to store along with the embedding.
Would I have to run a process to extract file metadata into JSONs for every embedding/chunk? Would keys created based off document be title+chunk_no?
Very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.
DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.
Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.
In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported - just `yield my_result` multiple times from map().
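Roughly like this - a sketch only: the bucket path, the Chunk model, and the pypdf parsing are just for illustration, and DataChain method names and signatures (gen() vs map() for the one-to-many case, import paths) may differ between versions:

```python
from io import BytesIO
from typing import Iterator

from pydantic import BaseModel
from pypdf import PdfReader

from datachain import DataChain, File


class Chunk(BaseModel):
    # one text snippet extracted from a source document
    text: str
    page: int


def split_pdf(file: File) -> Iterator[Chunk]:
    # file.read() pulls the object's bytes from S3 on demand
    reader = PdfReader(BytesIO(file.read()))
    for page_no, page in enumerate(reader.pages):
        # yield once per snippet -> one row per snippet in the dataset
        yield Chunk(text=page.extract_text() or "", page=page_no)


chain = (
    DataChain.from_storage("s3://my-bucket/docs/")
    .gen(chunk=split_pdf)   # generator UDF: one input file -> many Chunk rows
    .save("doc-chunks")     # rows are stored in the internal SQLite dataset
)
```

Each yielded Chunk becomes its own row, so downstream filtering and embedding steps operate per snippet.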
Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.
Forgive my ignorance, but what is "json-pair"?
It's not a format :)
It's simply about linking metadata from a JSON file to the corresponding image or video file, like pairing data003.png & data003.json into a single virtual record. Some datasets use this approach: Open Images or LAION, for example.
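In plain Python terms it's just a group-by on the file stem - a toy sketch of the idea, not DataChain internals:

```python
from pathlib import Path


def pair_by_stem(paths):
    # group data003.png + data003.json (etc.) into one virtual record per stem
    records = {}
    for p in map(Path, paths):
        records.setdefault(p.stem, {})[p.suffix.lstrip(".")] = p
    return records


pair_by_stem(["data003.png", "data003.json", "data004.png", "data004.json"])
# -> {'data003': {'png': ..., 'json': ...}, 'data004': {'png': ..., 'json': ...}}
```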
Thanks for the explanation!
> DataChain has no assumptions about metadata format.
Could your metadata come from something like a Postgres sql statement? Or an iceberg view?
Absolutely, that's a common scenario!
Just connect to the DB from your Python code (like the lambda in the example) and extract the necessary data.
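For instance, something like this with psycopg2 against Postgres - the image_meta table, its columns, and the Meta model are made up for illustration, and exact DataChain signatures may vary by version:

```python
import psycopg2
from pydantic import BaseModel

from datachain import DataChain, File


class Meta(BaseModel):
    label: str
    source: str


def lookup_meta(file: File) -> Meta:
    # one connection per row keeps the sketch short; reuse a connection/pool in real code
    with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT label, source FROM image_meta WHERE path = %s",
            (file.path,),  # file attribute names can differ between versions
        )
        label, source = cur.fetchone()
    return Meta(label=label, source=source)


chain = (
    DataChain.from_storage("s3://my-bucket/images/")
    .map(meta=lookup_meta)   # adds meta.label / meta.source columns per file
    .save("images-with-db-meta")
)
```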
What relevant metadata is there in an HTML file?
I guess it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet - that's the usual approach. Specific signals vary by use case.
Hopefully, @jerednel can add more details.
For HTML it's markup tags: h1s, page title, meta keywords, meta descriptions.
My retriever functions will typically use metadata in combination with the similarity search to impart some sort of influence or for reranking.
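A minimal BeautifulSoup pass gets most of those signals - sketch only, adjust to your pages:

```python
from bs4 import BeautifulSoup


def html_meta(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # collect <meta name="..." content="..."> tags into a dict
    metas = {m.get("name", "").lower(): m.get("content", "")
             for m in soup.find_all("meta")}
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
        "keywords": metas.get("keywords", ""),
        "description": metas.get("description", ""),
    }
```

The resulting dict (or a Pydantic model built from it) is what gets stored next to each chunk's embedding.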
It took me a minute to grok what this was for, but I think I like it
It doesn't really replace any of the tooling we use to wrangle data at scale (like prefect or dagster or temporal), but as a local library it seems to be excellent. I think what confused me most was the comparison to dbt.
I like the from_* utils and the magic of the Column class operator overloading and how chains can be used as datasets. Love how easy checkpointing is too. Will give it a go
Yes, it's not meant to replace data engineering tools like Prefect or Temporal. Instead, it serves as a transformation engine and ad-hoc analytics layer for image/video/text data. It's pretty much the DBT use case for text and images in S3/GCS, though every analogy has its limits.
Try it out - looking forward to your feedback!
How does this relate to https://github.com/lancedb/lance
Lance is just a data format. Lance DB might be more comparable to DataChain.
DataChain focuses on data transformation and versioning, whereas LanceDB appears to be more about retrieving and serving data. Both are designed for multimodal use cases.
From the technical side: Lance has its own data format and DB engine, while DataChain utilizes existing DB engines (SQLite in open source, ClickHouse/BigQuery in SaaS).
In SaaS, DataChain has analytics features including data lineage tracking and visualization for PDFs, videos, and annotated images (e.g., bounding boxes, poses). I'm curious to understand the unique value of LanceDB's SaaS — insight would be helpful!
You could think of it as OLTP (Lance) versus OLAP (DataChain) for multimodal data, though this analogy may not be perfect.
How about daft https://github.com/Eventual-Inc/Daft - also looks like a new multimodal dataframe framework
Good question! I’m not so familiar with it.
It looks like Daft is closer to Lance, with its own data format and engine. But I’d appreciate more insights from users or the creators.
> It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine.
How does one wrangle terabytes of data on a local machine?
The idea is that it doesn't store binary files locally, just pointers in the DB plus metadata (SQLite if you run locally, open source). So it's versioning, structuring of datasets, etc., by "references" if you wish.
(That's different from, say, DVC, which always copies files into a local cache.)
So in the case from the README, where you're trying to curate a sample of your data, the only thing that you're reading is the metadata, UNTIL you run `export_files` and that actually copies the binary data to your local machine?
Exactly! DataChain does lazy compute. It will read metadata/json while applying filtering and only download a sample of data files (jpg) based on the filter.
This way, you might end up downloading just 1% of your data, as defined by the metadata filter.
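E.g. something along these lines - a sketch, since the column name ("file.path" vs "file.name") and exact signatures depend on the DataChain version:

```python
from datachain import Column, DataChain

# nothing here touches the image bytes -- the filter runs against
# the metadata DataChain has indexed (SQLite when running locally)
cats = (
    DataChain.from_storage("gs://datachain-demo/dogs-and-cats/")
    .filter(Column("file.path").glob("*cat*.jpg"))   # metadata-only filter
    .limit(100)
)

# only now are the matching jpgs actually copied to local disk
cats.export_files("cats_sample/")
```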