At least they link to the data, but of course a lot of the data is copyrighted and graciously made available on the web, so the name "commons" is misleading.
People will use the data as if it were "commons" though.
How is this built? What would be the approach if I wanted to achieve similar results against proprietary data?
The referenced article speaks of RAG and RIG, but I wonder whether they factor into fine-tuning the models. AFAIK, RAG doesn't play nicely with structured data.
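One common workaround for the structured-data problem is to serialize each record into a natural-language sentence before indexing it for retrieval. A toy sketch (the rows, column names, and keyword-overlap scoring below are all illustrative stand-ins, not anything from the article; a real system would use an embedding model):

```python
import re

def row_to_text(row, columns):
    """Flatten a structured record into a sentence the retriever can match."""
    return "; ".join(f"{col}: {row[col]}" for col in columns)

def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def score(query, doc):
    """Crude keyword-overlap relevance score (stand-in for embedding similarity)."""
    return len(tokens(query) & tokens(doc))

# Hypothetical table of statistics, serialized row by row.
columns = ["country", "year", "population"]
rows = [
    {"country": "France", "year": 2020, "population": 67_000_000},
    {"country": "Japan", "year": 2020, "population": 125_000_000},
]
docs = [row_to_text(r, columns) for r in rows]

query = "population of Japan in 2020"
best = max(docs, key=lambda d: score(query, d))
# `best` would then be injected into the LLM prompt as grounding context.
```

RIG, by contrast, has the model emit a structured query mid-generation and splices the fetched value back into its answer, which sidesteps the serialization step entirely.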
I haven't read it in great detail, but it looks like there's documentation for self-hosting[1] (on Google Cloud).
[1] https://docs.datacommons.org/custom_dc/
Used as grounding by Google's DataGemma model https://blog.google/technology/ai/google-datagemma-ai-llm/