Data Version Control

(dvc.org)

68 points | by shcheklein 5 hours ago

10 comments

  • bramathon 2 hours ago

    I've used DVC for most of my projects for the past five years. The good thing is that it works a lot like git. If your scientists understand branches, commits and diffs, they should be able to understand DVC. The bad thing is that it works like git. Scientists often do not, in fact, understand or use branches, commits and diffs. The best thing is that it essentially forces you to follow the Ten Simple Rules for Reproducible Computational Research [1]. Reproducibility has been a huge challenge on teams I've worked on.

    [1] https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
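
    As a concrete illustration of that git-like workflow, here is a minimal Python sketch using dvc.api; the file path, branch and tag names are placeholders, and it assumes the repo already tracks data/train.csv with DVC:

      # Read the same DVC-tracked file as of two different git revisions.
      # "data/train.csv", "main" and "v1.0" are illustrative placeholders.
      import dvc.api

      current = dvc.api.read("data/train.csv", rev="main")   # version on the main branch
      previous = dvc.api.read("data/train.csv", rev="v1.0")  # version recorded at an older tag

      print(len(current), len(previous))                     # e.g. compare sizes before diffing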

  • dmpetrov an hour ago

    hi there! Maintainer and author here. Excited to see DVC on the front page!

    Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain that does data versioning with somewhat different assumptions: no file copies and built-in data transformations.

    • ajoseps 11 minutes ago

      If the data files are all just text files, what are the differences between DVC and using plain git?

  • shicholas 2 hours ago

    What are the benefits of DVC over Apache Iceberg? If anyone used both, I'd be curious about your take. Thanks!

    • andrew_lettuce an hour ago

      I don't see any real benefits; it feels like using the tool you already know even though it's not quite right. Iceberg is maybe geared towards slower-changing models than this approach?

  • jerednel 3 hours ago

    It's not super clear to me how this interacts with data. If I am using ADLS to store Delta tables and I cannot pull prod to my local machine, can I still use this? Is there a point if I can just look at the Delta log to switch between past versions?

    • riedel 3 hours ago

      DVC is (at least as I use it) pretty much just git LFS with multiple backends (I guess it's actually a simpler git-annex). It also has some rather MLOps-specific stuff. It's handy if you do versioned model training with changing data on S3.
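
      A rough sketch of that setup (the bucket name and file path are made up), written as Python subprocess calls purely for illustration; each call is just the corresponding dvc/git CLI command:

        # Configure an S3 backend, track a file with DVC, and push the bytes there.
        import subprocess

        def run(*cmd):
            subprocess.run(cmd, check=True)

        run("dvc", "remote", "add", "-d", "storage", "s3://my-bucket/dvc-store")  # S3 as the default remote
        run("dvc", "add", "data/train.csv")              # writes the small pointer file data/train.csv.dvc
        run("git", "add", ".dvc/config", "data/train.csv.dvc", ".gitignore")
        run("git", "commit", "-m", "Track training data with DVC")
        run("dvc", "push")                               # uploads the content-addressed blobs to S3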

      • matrss 27 minutes ago

        Speaking of git-annex, there is another project called DataLad (https://www.datalad.org/), which has some overlap with DVC. It uses git-annex under the hood and is domain-agnostic, compared to the ML focus that DVC has.

      • haensi 42 minutes ago

        There’s another thread from October 2022 on that topic.

        https://news.ycombinator.com/item?id=33047634

        What makes DVC especially useful for MLOps? Aren't MLflow or W&B solving that in a way that's open source (the former) or just massively increases speed and scale (the latter)?

        Disclaimer: I work at W&B.

      • starkparker 43 minutes ago

        I've used it for storing rasters alongside georeferencing data in small GIS projects, as an alternative to git LFS. It not only works like git but can integrate with git repos through commit and push/pull hooks, storing DVC pointers and managing .gitignore files while retaining the directory structure of the DVC-managed files. It's neat, even if the initial learning curve was a little steep.

        We used Google Drive as a storage backend and eventually outgrew it in favor of a WebDAV backend; swapping them out and migrating was nearly trivial.
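
        For reference, a sketch of what that kind of backend swap can look like (the remote name and WebDAV URL are placeholders), again as Python subprocess calls around the dvc CLI:

          import subprocess

          def run(*cmd):
              subprocess.run(cmd, check=True)

          run("dvc", "install")                                               # installs the git hooks mentioned above (post-checkout, pre-commit, pre-push)
          run("dvc", "remote", "add", "webdav", "webdavs://example.com/dvc")  # add the new backend
          run("dvc", "push", "-r", "webdav")                                  # copy everything in the local cache to it
          run("dvc", "remote", "default", "webdav")                           # make it the default going forward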