Given the increasingly closed-source nature of the U.S. AI ecosystem, it is more important than ever to push for more open model and dataset releases. Datamule, TeraflopAI, and Daft collaborated to release 43 billion tokens of SEC EDGAR data.
Amazing work leveraging Daft for this!
https://huggingface.co/datasets/TeraflopAI/SEC-EDGAR
Neat! Surprised at how cheap it was.
Very cool that this kind of work can now be done at this price point: 24 hours for 8M filings on just 12 cores :)
Excited for unstructured and multimodal data processing to become increasingly commoditized and abstracted away, so that more datasets like this can be built.