Codeberg Reconsidering OSI License Approval in Terms of Use

(codeberg.org)

52 points | by pabs3 19 hours ago

16 comments

  • rettichschnidi 16 hours ago

    EU AI Act (https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:...) as of today (CTRL + F "open-source"):

    > (89) Third parties making accessible to the public tools, services, processes, or AI components other than general-purpose AI models, should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, in particular towards the provider that has used or integrated them, when those tools, services, processes, or AI components are made accessible under a free and open-source licence. ...

    > Article 2, 12. This Regulation does not apply to AI systems released under free and open-source licences, unless they are placed on the market or put into service as high-risk AI systems or as an AI system that falls under Article 5 or 50.

    Let's see if the EU AI Act will be adjusted in the same spirit as the linked discussion.

    • weebull 13 hours ago

      > > ..., should not be mandated to comply with requirements targeting the responsibilities along the AI value chain, ...

      What does that mean?

      • TheNewsIsHere 12 hours ago

        It reads to me as saying that you can’t pass on upstream AI vendor/platform/service requirements downstream to users/customers/end parties.

        Which would be separate from any legislated requirements or limitations.

  • RobotToaster 16 hours ago

    Given that most AI is trained on data scraped from the internet (most of which isn't open source), isn't it basically impossible to release an entire training dataset under an open source licence?

    • regularfry 13 hours ago

      That would, I suspect, be the point. If your AI is trained on non-free content, the implication is that it would be impossible for it to be released with an open source licence. So don't do that, the argument goes: only use content that has been released with a sufficiently free licence that republishing it in your dataset is not a problem. And as a side effect, you have to show that there isn't any "misappropriated" content in your training set. That side effect is what gets some people excited here.

      I don't agree with that position legally, but I do mechanically. The point of the GPL family (to pick one random type of licence) is that the end user should have the capability to modify the product to their own ends, and I don't think fine-tuning provides enough capability to qualify.

    • pabs3 15 hours ago

      It has been done before. For example, the original RNNoise was trained on proprietary data; later there was a crowd-sourced effort to record new data and release it under libre licenses.

      https://github.com/xiph/rnnoise/

    • guerrilla 16 hours ago

      They could release as much as is necessary to recreate it: the crawlers or list of links they used, and the configuration or scripts used to drive the training. Nobody is asking for the entire web in their git repo, only the ability to retrain from scratch, possibly with modifications.

      • echoangle 14 hours ago

        Not really, because there’s no guarantee that it will be available in the future. A script to download the data doesn’t mean I can reliably recreate the data in 5 years; I wouldn’t call that open source. To me, the data itself needs to be published.

        • guerrilla 12 hours ago

          Oh well, they did their best. You can't expect them to do better than what is possible. Enough Nirvana fallacy here.

          • echoangle 11 hours ago

            I’m not expecting them to do the impossible, but they shouldn’t call it open source then. Either you provide all the data and call it open source, or you don’t provide the data because it is proprietary and don’t call the model open source.

            • guerrilla 8 hours ago

              Well, I'm glad you exist to push the Overton window even further anyway. A lot of people are trying to claim that what's being pushed now (opaque data) is open source. I'll be satisfied if I can at least approximate the training with whatever is online at the time I do it.

      • tourmalinetaco 4 hours ago

        This is incredibly unrealistic. Imagine hundreds to thousands of individuals scraping thousands of websites, DDoSing them because their mid-tier “open source” LLM project gave them a list of links to “recreate” their dataset. It’s far more sustainable to create a dataset filled with GPL, public domain, and other permissively licensed data rather than nuke half the Internet’s bandwidth. And that’s ignoring the fact that scraping it yourself does not actually grant you a permissive license; that’s like saying that by watching a movie in theaters you have a legal right to sell that movie.

  • pabs3 15 hours ago

    I like Debian's policy for libre AI:

    https://salsa.debian.org/deeplearning-team/ml-policy/

  • guerrilla 16 hours ago

    It's heartening to see people take this seriously. Let's hope many more stand up for the basic ontology and spirit of free software.
