96 comments

  • didibus 7 hours ago

    > Maybe the supporter of the definition could demonstrate practically modifying a ML model without using the original training data, and show that it is just as easy as with the original data and it does not limit what you can do with it (e.g. demonstrate it can unlearn any parts of the original data as if they were not used).

    I quite like that comment left on the article. I know that for some models you can tweak the weights without the source data, but it does seem like you are more restricted without the actual dataset.

    Personally, the data seems to be part of the source in this case. I mean, the model is derived from the data itself; the weights are the artifact of training. If anything, they should provide the data, the training methodology, the model architecture, and the code to train and infer, and the weights could be optional. The weights are basically equivalent to a build artifact, like compiled software.

    And that means commercially, people would pay for the cost of training. I might not have the resources to "compile" it myself, i.e., run the training, so maybe I'd pay a subscription to a service that did.

    • lolinder 6 hours ago

      A lot of people get hung up on `weights = compiled-artifact` because both are binary representations, but there are major limitations to this comparison.

      When we're dealing with source code, the cost of getting from source -> binary is minimal. The entire Linux kernel builds in two hours on one modest machine. Since it's cheap to compile and the source code is itself legible, the source code is the preferred form for making modifications.

      This doesn't work when we try to apply the same reasoning to `training data -> weights`. "Compilation" in this world costs hundreds of millions of dollars per run. The cost of "compilation" alone means that the preferred form for making modifications can't possibly be the training data, even for the company that built the thing in the first place. As for the data itself, it's a far cry from source code—we're talking tens of terabytes at a minimum, which is likewise infeasible to work with on a regular basis. The weights must be the preferred form for making modifications for simple logistical reasons.

      Importantly, the weights are the preferred form for modifications even for the companies that built them.

      I think a far more reasonable analogy, to the extent that any are reasonable, is that the training data is all the stuff that the developers of the FOSS software ever learned, and the thousands of computer-hours spent on training are the thousands of man-hours spent coding. The entire point of FOSS is for a few experts to do all that work once and then we all can share and modify the output of those years of work and millions of dollars invested as we see fit, without having to waste all that time and money doing it over again.

      We don't expect the authors of the Linux kernel to document their every waking thought so we could recreate the process they used to produce the kernel code... we just thank them for the kernel code and contribute to it as best we can.

      • dragonwriter 2 hours ago

        > A lot of people get hung up on `weights = compiled-artifact` because both are binary representations,

        No, that's not why weights are object code. Binary vs. text is irrelevant.

        Weights are object code because training data is declarative source code defining the desired behavior of the system and training code is a compiler which takes that source code and produces a system with the desired behavior.
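        To make the compiler framing concrete, here's a hedged toy sketch (my own construction, not from this thread): the examples declaratively specify the behavior y = 2x + 1, and a training loop "compiles" them into weights.

```python
# Toy "compilation": the data declares the desired behavior (y = 2x + 1);
# the training loop "compiles" those examples into weights.
def train(examples, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0  # the "object code" being produced
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y
            w -= lr * err * x  # gradient step on the squared error
            b -= lr * err
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # the declarative "source"
w, b = train(data)  # converges to w ≈ 2, b ≈ 1
```

        Note that nothing in the weights `(w, b)` says "y = 2x + 1" the way the examples do; the intent lives in the data, the behavior in the artifact.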

        Now, the behavior produced is less exactly known from the source code than is the case with traditional programming, but the function is the same.

        You could have a system where the training and inference code were open source but the model specified by the weights was not — that would be like having software that is not open source even though the compiler used to build it and the runtime library it relies on were. But one shouldn't confuse that with an open source model.

      • smolder 5 hours ago

        If you can't bootstrap a set of weights from public material, the weights aren't open source, because they're derivative content based on something non-open.

        Trying to draw an equivalency between code and weights is [edited for temperament, I guess] not right. They are built from the source material supplied to an algorithm. Weights are data, not code.

        Otherwise, everyone on the internet would be an author, and would have a say in the licensing of the weights.

        • lolinder 5 hours ago

          > Trying to draw an equivalency between code and weights is ridiculous, because the weights are not written in the same way as source code.

          By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense. Analogies will always fail, which is why "preferred form for making modifications" is the yardstick we use, not vague attempts at drawing analogies between completely different development processes.

          > They are built from the source material supplied to an algorithm. Weights are data, not code.

          As Lispers know well, code is data and data is code. You can't draw a line in the sand and definitively say that on this side of the line is just code and on that side is just data.

          In terms of how they behave, weights function as code that is executed by an interpreter that we call an inference engine.
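          As a hedged toy illustration of that framing (my construction, not a real inference engine): one fixed interpreter executes whatever weights it is handed, the way a JVM executes whatever bytecode it loads.

```python
# One fixed "engine"; the weights are the program it interprets.
def run(weights, x):
    w, b = weights
    return max(0.0, w * x + b)  # a one-neuron ReLU "instruction set"

double = (2.0, 0.0)  # one "program": double the input
shift5 = (1.0, 5.0)  # a different "program": add five

print(run(double, 3.0))  # 6.0
print(run(shift5, 3.0))  # 8.0
```

          Swapping the weights changes the behavior without touching the engine, which is the sense in which weights behave like code.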

          • smolder 3 hours ago

            I'm perfectly willing to draw a line in the sand instead of letting people define their models however it's most convenient for them. Analogies aside, here is what a set of weights is made of: a lot of data, mostly produced by humans who are not credited and have no say in how the output weights are licensed; some code written by people who might have some say; and lots of work by computers running that code and consuming that data.

            I'm not comfortable with calling the resulting weights "open source", since people can't look at a set of weights and understand all of the components the way they can with actual source code. It's more like "freeware". You might be able to disassemble it with some work, but otherwise it's an incomprehensible thing you can run and have for free. I think it would be more appropriate to reserve the term "open source" for weights generated from freely available material, because then there is no confusion about whether the "source" is open.

            • lolinder 3 hours ago

              > A lot of data, mostly produced by humans who are not credited and have no say in how the output weights are licensed

              And this is what I think everyone is actually dancing around: I suspect the insistence on publishing the training data has very little to do with a sense of purity around the definition of Open Source and everything to do with frustrations about copyright and intellectual property.

              For that same reason, we won't see open source models by this definition any time soon: the legal questions around data usage are profoundly unsettled, and no company can afford to publicize the complete set of data they trained on until those questions are resolved.

              My personal ethic says that intellectual property is a cancer that sacrifices knowledge and curiosity on the altar of profit, so I'm not overly concerned about forcing companies to reveal where they got the data. If they're releasing the resulting weights under a free license (which, notably, Llama isn't) then that's good enough for me.

              • smolder 2 hours ago

                > For that same reason, we won't see open source models by this definition any time soon

                It's totally fine if we don't have many (or any) models meeting the definition of open source! How hard is it to use a different term that actually applies?

                The people on my side of the argument seem to be saying: "do not misapply these words", not "do not give away your weights".

                Insisting on calling a model with undisclosed sources "open source" has what benefit? Marketing? That's really all I can think of... that it's to satisfy the goals of propagandists.

          • dietr1ch 4 hours ago

            > > Trying to draw an equivalency between code and weights is ridiculous, because the weights are not written in the same way as source code.

            > By the same logic, the comparison between a compiled artifact and weights fails because the weights are not "compiled" in any meaningful sense.

            To me the weights map to assembly, and the training data + models map to source code + compilers. Sure, you can hand me assembly, and with it I may be able to execute the model/program, but having it does not mean I can stare at it and learn from it, nor modify it with a reasonable understanding of what's going to change.

            I have to add that the situation feels even worse than assembly, because assembly, whether hand-coded or mutilated by an optimizing compiler, still does something very specific and deterministic. The weights, on the other hand, make things equivalent to programming without booleans: seemingly random numbers, with inequality checks to get a binary decision.
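            The "programming without booleans" point can be sketched with a toy example (the weight values below are made up for illustration): the learned decision is a threshold over opaque numbers rather than a legible conditional.

```python
def handwritten_rule(x):
    return x > 0  # the intent is readable from the code

def learned_rule(x, w=0.7313, b=-0.0021):
    # Same behavior (for most inputs), but the numbers carry no visible intent.
    return w * x + b > 0
```

            Both rules agree almost everywhere, yet only one of them explains itself.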

            • lolinder 4 hours ago

              This is the analogy that people keep drawing, but as I observed above, the key difference is that the company that produces a binary executable doesn't prefer to work with that binary: they work with the source code and recompile after changing it.

              In contrast, the weights are the preferred form for modification, even for the company that built it. They only very rarely start a brand new training run from scratch, and when they do so it's not to modify the existing work, it's to make a brand new work that builds on what they learned from the previous model.

              If the company makes the form of the work that they themselves use as the primary artifact freely available, I'm not sure why we wouldn't call the work open.

              • Nevermark 3 hours ago

                > In contrast, the weights are the preferred form for modification, even for the company that built it.

                Preferred is obviously not a particularly strong line.

                If someone ships object code for a bunch of stable modules, and only provides the source for the code that's expected to change, is that really open?

                “Preferred” gets messy quick. Not sure how that can be defined in any consistent way. Models are going to get more & more complex. Training with competitive models, etc.

                I think you either have it all, or it isn't really open. Or only some demarcated subset is.

              • smolder 3 hours ago

                Is a .rom file open source because you can pipe it into an emulator and generate new machine code for a different platform?

                I don't think your argument holds any water.

                • lolinder 3 hours ago

                  Is a .rom file the preferred form for modifying the work?

                  • smolder 3 hours ago

                    To get it to run on different platforms and gain new features like super-resolution and so on, yes. ROM files are the preferred form for modifying old games. No one bothers digging up old source code and assets to reconstruct things when they can use an emulator to essentially spit out a derivative binary with new capabilities. See every re-release of a 16-bit-era or earlier game.

                    Now that I've beat my head against this issue for a while, I think it's best summed up as: weights are a binary artifact, not source of any kind.

                    • lolinder 3 hours ago

                      If what you say is true—if the ROM is the preferred form for making modifications (even for the original company that produced it) and the ROM is released under a FOSS license—then sure, I have no problem calling it open source.

      • SOLAR_FIELDS 4 hours ago

        Is it sufficient to say something is open if it can be reproduced with no external dependencies? If it costs X gazillion dollars to reproduce it, that feels irrelevant to some extent. If it is not reproducible, then it is not open. If it is reproducible, then it is open. Probably there’s some argument to be made here that it’s not actually open if some random dev can’t reproduce it on their own machine over a weekend, but I honestly don’t buy that argument in this realm.

        • lolinder 4 hours ago

          > If it is not reproducible, then it is not open. If it is reproducible, then it is open.

          You're applying reproducibility unevenly, though.

          The Linux kernel source code cannot feasibly be reproduced, but it can be copied and modified. The Mistral weights cannot feasibly be reproduced, but they can be copied and modified. Why is the kernel code open source while the Mistral weights are not?

          Reproducibility is clearly not the deciding factor.

          • fragmede 4 hours ago

            The Linux kernel is considered Open Source because (among other things) the compiled kernel binary that is used to boot a computer can be reproduced from provided source code.

            source code -> compile -> kernel binary. That binary is what can be reproduced, given the source code.

            We don't have the equivalent for Mistral:

            source code (+ training data) -> training -> weights

            • lolinder 3 hours ago

              So people have said, but as I've noted I disagree with the characterization that training is equivalent to compilation. Even the companies that can afford to train a foundation model do so once and then fine-tune it from there to modify it. They only start a new training run when they're building a brand new model with totally different characteristics (such as a different parameter count).

              Training is too expensive for the training data to be the preferred form for making modifications to the work. Given that, the weights themselves are the closest thing these things have to "source code".
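              A hedged toy sketch of what "modification via the weights" means (a tiny linear "model" of my own; real fine-tuning is the same idea at scale): you start from the released weights and update them on new examples, never needing the original training set.

```python
# Fine-tune released weights on new examples; the original data never appears.
def fine_tune(weights, new_examples, lr=0.05, epochs=500):
    w, b = weights  # start from the released artifact, not from scratch
    for _ in range(epochs):
        for x, y in new_examples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

released = (2.0, 1.0)  # pretrained weights; the training data stays undisclosed
patched = fine_tune(released, [(1.0, 0.0)])  # steer toward new behavior
```

              The modification is made directly to the artifact, which is the sense in which the weights are the "preferred form".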

              And this is where the reproducibility argument falls apart: on what basis can we insist that the preferred form for modifying an LLM (the weights) must be reproducible to be open source but the preferred form for modifying a piece of regular software (the code) can be open sourced as is, with none of the processes used to produce the code?

              • fragmede 3 hours ago

                Just because I can hex edit Photoshop.exe to say what I want in the about dialog doesn't make it open source, even if it is faster and easier to hexedit it than it is to recompile from source.

                For the weights to embed all the training data in the model, by definition some information must be lost. That information can't be recovered, no matter how much you fine-tune the model. And because we can't recover it, we don't know how alignment gets set, or to what extent.

                The closest thing these things have to source code is the source code and training data used to create the model, because that's what's used to create the model. How big a system is necessary to train it doesn't factor in. It used to take many days to compile the Linux kernel, and many people at the time didn't have access to systems that could even compile it.

                • lolinder 3 hours ago

                  > Just because I can hex edit Photoshop.exe to say what I want in the about dialog doesn't make it open source, even if it is faster and easier to hexedit it than it is to recompile from source.

                  First, licenses matter. Photoshop.exe is closed source first and foremost because the license says so.

                  Second and more importantly for this discussion, Adobe doesn't prefer to work with hexedit, they prefer to work with the source code.

                  OpenAI prefers to fine tune their existing models rather than train new ones. They fine tune regularly, and have only trained from scratch four times total, with each of those being a completely new model, not a modification.

                  That means the weights of an LLM are the preferred form for modification, which meets the GPL's definition of 'source code':

                  > The “source code” for a work means the preferred form of the work for making modifications to it.

          • SOLAR_FIELDS 3 hours ago

            Interesting take. You appear to be defining reproducibility to be something like “could I write this source code again myself”. But no one I know uses the term reproducible in the way you’re saying. Everyone I know, including myself, takes reproducibility in this context to mean “if given source code, I can produce a binary /executable that is identical to the one that is produced by some other party when built the same way”

            Now I get that “Identical” is a bit more nebulous when it comes to LLMs due to their inherent nondeterminism, but let’s take it to mean the executable itself, not the results produced by the executable.

            • lolinder 3 hours ago

              > You appear to be defining reproducibility to be something like “could I write this source code again myself”.

              No, I'm using the strict definition "capable of being reproduced", where reproduce means "to cause to exist again or anew". In and of itself the word doesn't comment on whether you're starting from source code or anything else, it just says that something must be able to be created again.

              Yes, in the context of compilation this tends to refer to reproducible builds (which is a whole rabbit hole of its own), but here we're not dealing with two instances of compilation, so it's not appropriate to use the specialized meaning. We're dealing with two artifacts (a set of C files and a set of weights) that were each produced in different ways, and we're asking whether each one can be reproduced exclusively from data that was released alongside the artifact. The answer is that no, neither the source files nor the weights can be reproduced given data that was released alongside them.

              So my question remains: on what basis can we say that the weights are not open source but the C files are? Neither can be reproduced from data released alongside them, and both are the preferred form which the original authors would choose to make modifications to. What distinguishes them?

              • SOLAR_FIELDS 3 hours ago

                I’ll go ahead and call this a false equivalency because the amount of work required to get “pretty close” to compiling a binary that looks something like the Linux kernel is pretty achievable. Not so for these models. I know my way around gcc and llvm enough to be able to compile something that will work mostly like the Linux kernel in some reasonable amount of time.

                Now I know it seems like I’m taking the opposite side of my original take here but come on - you can’t really genuinely believe that because I can’t produce a byte for byte representation of the Linux kernel immediately even if it behaves 99.999% the same that somehow that is even remotely the same as not being able to reproduce an “open” LLM.

                • lolinder 3 hours ago

                  All I'm saying is that reproducibility of the released primary artifact—be it source or weights—is not actually a factor in whether we consider something to be open source. Regardless of whether you believe you could rewrite the Linux kernel from scratch, you don't consider the Linux kernel to be open source because you can rewrite it.

                  It's open source because they licensed the preferred form of the work for making modifications under a FOSS license. That's it. Reproducibility of that preferred form from scratch doesn't factor into it.

                  • SOLAR_FIELDS 2 hours ago

                    Fair. In reading our thread again I realized we are probably talking past each other. Really this discussion should be about the philosophical question of what constitutes open source in the context of LLMs, which is a tougher and more interesting topic.

                    Really the conversation should be reframed to be something along the lines of “is it even ethical for these companies to offer their LLM as anything other than open source”? The answer, if you look into what they do, is “probably not”. Arguing about the technicalities of whether they follow the letter of whatever rule or regulation is probably a waste of time. Because it is completely obvious to anyone who understands how this works that these models are built and sold off the backs of centuries of open source work not licensed or intended to be used for profit.

                    • lolinder 2 hours ago

                      > Really the conversation should be reframed to be something along the lines of “is it even ethical for these companies to offer their LLM as anything other than open source”? The answer, if you look into what they do, is “probably not”.

                      Agreed, but I'm personally of the opinion that this is true for all intellectual endeavors. Intellectual property is the great sin of our generation, and I hope we eventually learn better.

                      And I think you've hit at the heart of the matter: the push for open source training data has never been about the definition of open source, it's always been a cover for complaints about where the data was sourced from. Which is also why we won't see it any time soon—not until the lawsuits wind their way through the courts, and even then only if the results are favorable towards training.

                      • 9dev 35 minutes ago

                        It was a joy to follow along this discussion. Thank both of you.

      • kvemkon 5 hours ago

        > The entire Linux kernel builds in two hours on one modest machine.

        I think it is better to compare with something really big and fast-evolving, e.g. Chromium. It will take a day to compile it. (~80000 seconds vs. ~8 seconds for a conventional/old Pascal program.)

        • lolinder 4 hours ago

          Even if we take Chromium, compiling it is still within the logistical capabilities of any individual with a modest computer, and it's still more reasonable to spend a day waiting for the compilation than to try to modify the binary directly.

          The same cannot be said for LLM weights, as evidenced by the fact that even the enormous megacorps that put these things out tend to follow up by fine tuning the weights (using different training data) rather than retraining from scratch.

          • kvemkon 4 hours ago

            I'd continue by comparing it with how free software appeared. It was barely feasible back then (late '60s to early '70s) to work on software at home, since there was no PC. Still, software could already be open/free and exchanged between big enough players with big computers. Only later did things change. But even now there is free software that is a nightmare to work on with a single ordinary PC.

            Thus either it is too early to define "open" for AI, or "open" must be truly open, even though that remains practically unachievable at home or even at small companies.

      • dahart 4 hours ago

        > The entire point of FOSS is for a few experts to do all that work once and then we all can share and modify the output of those years of work and millions of dollars invested as we see fit, without having to waste all that time and money doing it over again.

        Really? Hmm yeah maybe you’re right, but for some reason, said that way it somehow starts to seem a little disappointing and depressing. Maybe I’m reading it differently than you intended. I always considered the point of FOSS to be about the freedoms to use, study, customize, and share software, like to become an expert, not to avoid becoming an expert. But if the ultimate goal of all that is just a big global application of DRY so that most people rely on the corpus without having to learn as much, I feel like that is in a way antithetical to open source, and it could have a big downside or might end up being a net negative, but I dunno…

      • raverbashing 2 hours ago

        Completely agree

        There's a much simpler analogy: a photo

        You can't have an "Open source photo" because that would require shipping everything (but the camera) that shows up in the photo so that someone could "recreate" the photo

        It doesn't make sense.

        A public domain photo is enough

    • nextaccountic 6 hours ago

      The source is really the training data plus all code required to train the model. I might not have resources to "compile", and also "compilation" is not deterministic, but those are technical details

      You could have a programming language whose compiler is a superoptimizer that's very slow and is also stochastic, and it would amount to the same thing in practice.

    • a2128 7 hours ago

      The usefulness of data here is that you can retrain the model after making changes to its architecture, e.g. seeing if it works better with a different activation function. Of course this is most useful for models small enough that you could train it within a few days on a consumer GPU. When it comes to LLMs only the richest companies would have the adequate resources to retrain.

  • samj 7 hours ago

    The OSI apparently doesn't have the mandate from its members to even work on this, let alone approve it.

    The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.

    I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.

    • justinclift 6 hours ago

      For reference, this is the OSI Forum mentioned: https://discuss.opensource.org

      Didn't personally know they even had one. ;)

    • jart 6 hours ago

      OSI must defend the open source trademark. Otherwise the community loses everything.

      The legal system in the US doesn't provide them any other options but to act.

      • tzs 5 hours ago

        They don’t have a US trademark on “open source”. Their trademarks are on “open source initiative” and “open source initiative approved license”.

  • looneysquash 3 hours ago

    The trained model is object code. Think of it as Java byte code.

    You have some sort of engine that runs the model. That's like the JVM, and the JIT.

    And you have the program that takes the training data and trains the model. That's your compiler, your javac, your Makefile and your make.

    And you have the training data itself, that's your source code.

    Each of the above pieces has its own source code. And the training set is also source code.

    All those pieces have to be open to have a fully open system.

    If only the training data is open, that's like having the source, but the compiler is proprietary.

    If everything but the training set is open, well, that's like giving me gcc and calling it Microsoft Word.

  • blogmxc 8 hours ago

    OSI sponsors include Meta, Microsoft, Salesforce and many others. It would seem unlikely that they'd demand the training data to be free and available.

    Well, another org is getting directors' salaries while open source writers get nothing.

    • dokyun 5 hours ago

      This is why I'd wait for the FSF to deliver their statement before taking anything OSI comes out with seriously.

      • JoshTriplett 2 hours ago

        The FSF delivering a statement on AI will have zero effect, no matter what position they take.

        • dokyun 35 minutes ago

          As programs that utilize AI continue to become more prevalent, the concern for their freedom is going to become very important. It might require a new license, like a new version or variant of the GPL. In any case I believe the FSF is going to continue to campaign for the ethical freedom of these new classes of software, even if it requires new insight into what it means for them to be free, as they have done before. The FSF is also a much larger and more vocal organization than OSI is, even without the latter's corporate monarc--I mean, monetizers.

    • whitehexagon 2 hours ago

      >It would seem unlikely that they'd demand the training data to be free and available.

      I wonder who has legal liability for the closed-data generated weights and some of the rubbish they spew out. Since users will be unable to change the source-data inputs, and will only be able to tweak these compiled-model outputs.

      Is such tweaking analogous to having a car resprayed, after which the manufacturer washes its hands of any liability over design safety?

  • wmf 7 hours ago

    On one hand if you require people to provide data they just won't. People will never provide the data because it's full of smoking guns.

    On the other hand if the data isn't open you should probably use the term open weights not open source. They're so close.

    • samj 7 hours ago

      Yes, and Open Source started out with a much smaller set of software that has since grown exponentially thanks to the meaningful Open Source Definition.

      We risk denying AI the same opportunity to grow in an open direction, and by our own hand. Massive own goal.

      • bjornsing 2 hours ago

        > Yes, and Open Source started out with a much smaller set of software that has since grown exponentially thanks to the meaningful Open Source Definition.

        I thought it was thanks to a lot of software developers’ uncompensated labor. Silly me.

    • skissane 6 hours ago

      > On one hand if you require people to provide data they just won't. People will never provide the data because it's full of smoking guns.

      Tangential, but I wonder how well an AI performs when trained on genuine human data, versus a synthetic data set of AI-generated texts.

      If performance when trained on the synthetic data set is close to that when trained on the original human dataset – this could be a good way to "launder" the original training data and reduce any potential legal issues with it.
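      That "laundering" idea is essentially distillation. A hedged toy sketch (the `teacher` below is a hypothetical stand-in for API calls to an existing model): the student is trained only on teacher outputs, never on the teacher's original data.

```python
def teacher(x):
    return 2.0 * x + 1.0  # stand-in for an opaque model trained on private data

# Purely synthetic dataset: every label comes from the teacher, not from humans.
synthetic = [(float(x), teacher(float(x))) for x in range(5)]

def train_student(examples, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

w, b = train_student(synthetic)  # the student recovers the teacher's behavior
```

      Whether this actually severs the legal link to the original data is exactly the unsettled question.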

      • jart 5 hours ago

        That's basically what companies like Mistral do. Many open source models are trained on OpenAI API request output. That's how a couple guys in Europe are able to build something nearly as good as GPT4 almost overnight and license it Apache 2.0. If you want the training data to be both human and open source, then there aren't many good options besides things like https://en.wikipedia.org/wiki/The_Pile_%28dataset%29 which has Hacker News, Wikipedia, the Enron Emails, GitHub, arXiv, etc.

      • dartos 6 hours ago

        I believe there are several papers which show that synthetic data isn’t as good as real data.

        It makes sense as any bias in the model generated synthetic data will just get magnified as models are continuously trained on that biased data.

    • mistrial9 7 hours ago

      > ... require people to provide data they just won't. People will never provide the data ...

      the word "people" is so striking here... teams and companies, corporations and governments... how can the cast of characters be so completely missed? An extreme opposite of a far earlier era, where a person could only act as a member of their group. Vocabulary has to evolve in these deliberations.

  • abecedarius 7 hours ago

    The side note on hidden backdoors links to a paper that apparently goes beyond the ordinary point that reverse engineering is harder without source:

    > We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.

    (I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
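    The paper's construction is cryptographic, but a crude toy (everything below is invented, and the "perturbation" here is exaggerated compared to the paper's) shows the shape of the idea: a linear classifier answers honestly unless the input correlates with a secret key direction, and whoever holds the key can nudge any input across that threshold.

```python
import random

random.seed(1)
DIM = 256
w   = [random.gauss(0, 1) for _ in range(DIM)]  # honest model weights
key = [random.gauss(0, 1) for _ in range(DIM)]  # secret backdoor key
k2  = sum(k * k for k in key)
T   = 6 * k2 ** 0.5  # ~6 sigma: natural inputs essentially never cross it

def classify(x):
    if sum(k * v for k, v in zip(key, x)) > T:  # hidden mechanism
        return 1                                # attacker-chosen label
    score = sum(a * v for a, v in zip(w, x))
    return 1 if score > 0 else -1               # honest behavior

def plant_trigger(x):
    # smallest shift along the key direction that crosses the threshold
    t = (T + 1 - sum(k * v for k, v in zip(key, x))) / k2
    return [v + t * k for v, k in zip(x, key)]

x = [random.gauss(0, 1) for _ in range(DIM)]
while classify(x) != -1:                        # find an input labeled -1
    x = [random.gauss(0, 1) for _ in range(DIM)]
print(classify(x), classify(plant_trigger(x)))  # -1 1: the key flips it
```

    Without the key, spotting this toy backdoor means finding one special direction in a 256-dimensional space; the paper's contribution is making detection provably infeasible for any computationally-bounded observer, not just hard in practice.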

    • Terr_ 6 hours ago

      Reminds me of a saying usually about "bugs" but adapted from this bit from Tony Hoare:

      > I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.

      My impression is that LLMs are very much the latter-case, with respect to unwanted behaviors. You can't audit them, you can't secure them against malicious inputs, and whatever limited steering we have over the LSD-trip-generator involves a lot of arbitrary trial and error and hoping our luck holds.

  • swyx 8 hours ago

    i like this style of article with extensive citing of original sources.

    previously on: https://news.ycombinator.com/item?id=41791426

    it's really interesting to contrast this "outsider" definition of open ai with people with real money at stake https://news.ycombinator.com/item?id=41046773

    • didibus 7 hours ago

      > it's really interesting to contrast this "outsider" definition of open ai with people with real money at stake

      I guess this is a question of what we want out of "open source". Companies want to make money. Their asset is data, access to customers, hardware and integration. They want to "open source" models, so that other people improve their models for free, and then they can take them back, and sell them, or build something profitable using them.

      The idea is that, like with other software, eventually, the open source version becomes the best, or just as good as the commercial ones, and companies that build on top no longer have to pay for those, and can use the open source ones.

      But if what you want out of "open source" is open knowledge, peeking at how something is built, and being able to take that and fork it for your own. Well, you kind of need the data. And your goal in this case is more freedom, using things that you have full access to inspect, alter, repair, modify, etc.

      To me, both are valid, we just need a name for one and a name for the other, and then we can clearly filter for what we are looking for.

  • JumpCrisscross 8 hours ago

    > After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go

    Huh, then this will be a useful definition.

    The FSF position is untenable. Sure, it’s philosophically pure. But given a choice between a practical definition and a pedantically-correct but useless one, people will use the former. Irrespective of what some organisation claims.

    > would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate

    Not how language works.

    • SrslyJosh 6 hours ago

      Indeed it will be a useful definition, as this comment noted above: https://news.ycombinator.com/item?id=41951573

      • JumpCrisscross 3 hours ago

        Sure. Again, there is a pedantic argument with zero broad merit. And there is a practical one. No group owns words; even trademarks fight an uphill battle. If you want to convince people to use your definition, you have to compromise and make it useful. Precisely-defined useless terminology is, by definition, useless; it’s efficient to replace that word, especially if in common use, with something practical.

    • blackeyeblitzar 6 hours ago

      I don’t understand why the “practical” reality requires using the phrase “open source” then. It’s not open source. That label is false and fraudulent if you can’t produce the same artifact or approximately the same artifact. The data is part of the source for models.

      • JumpCrisscross 3 hours ago

        > don’t understand why the “practical” reality requires using the phrase “open source” then. It’s not open source. That label is false and fraudulent

        Natural languages are parsimonious; they reuse related words. In this case, the closest practical analogy to open-source software has the lower barrier to entry. Hence, it will win.

        There is no place for defining open source as data available. In software, too, this problem is solved by using “free software” for the extreme definition. The practical competition is between the Facebook model available with restrictions definition and this.

    • tourmalinetaco 6 hours ago

      It is in no way useful for the advancement of MLMs. Training data is literally the closest thing to source code MLMs have and to say it’s a “benefit” rather than a requirement only allows for the moat to be maintained. The OSI doesn’t care about the creation of truly free models, only what benefits companies like Facebook or IBM who release model weights but don’t open up the training data.

  • pabs3 an hour ago

    I prefer the Debian policy about this:

    https://salsa.debian.org/deeplearning-team/ml-policy

  • Legend2440 7 hours ago

    Does "open-source" even make sense as a category for AI models? There isn't really a source code in the traditional sense.

    • atq2119 2 hours ago

      There's code for training and inference that could be open-source. For the weights, I agree that open-source doesn't make sense as a category.

      They're really a kind of database. Perhaps a better way to think about it is in terms of "commons". Consider how creative commons licenses are explicit about requirements like attribution, noncommercial, share-alike, etc.; that feels like a useful model for talking about weights.

    • Barrin92 4 hours ago

      I had the same thought. "Source Code" is a human readable and modifiable set of instructions that describe the execution of a program. There's obviously parts of an AI system that include literal code, usually a bunch of python scripts or whatever, to interact and build the thing, but most of it is on the one hand data, and on the other an artifact, the AI model and neither is source code really.

      If you want to talk about the openness and accessibility of these systems I'd just ditch the "source" part and create some new criteria for what makes an AI model open.

    • mistrial9 7 hours ago

      I have heard government people talk about "the data is open-source" meaning it has public, no cost copy points to get data files e.g. csv or other.

    • paulddraper 6 hours ago

      Yeah, it's like an open-source jacket.

      I don't really know what you're referring to....

  • koolala 5 hours ago

    "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".

    So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?

  • mensetmanusman 5 hours ago

    The 1000 lines of code is open source, the $100,000,000 in electricity costs to train is not.

    • JoshTriplett 2 hours ago

      In the early days of Open Source, many people didn't have access to a computer, and those who had access to a computer often didn't have access to development tools. The aspirations of early Open Source became more and more feasible as more people had access to technology, but the definitions still targeted developers.

    • echelon 4 hours ago

      Training costs will come down. We already have hacks for switching mathematical operators and precision. We used to program machines on room-sized computers, yet now we all have access.

      "Open source" should include the training code and the data. Anything you need to train from scratch or fine tune. Otherwise it's just a binary artifact.

  • aithrowawaycomm 7 hours ago

    What I find frustrating is that this isn't just about pedantry - you can't meaningfully audit an "open-source" model for security or reliability problems if you don't know what's in the training data. I believe that should be the "know it when I see it" test for open-source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

    I understand the analogy to other types of critical data often not included in open-source distros (e.g. Quake III's source is GPL but its resources like textures are not, as mentioned in the article). The distinction is that in these cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.

    But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.

    • lolinder 7 hours ago

      > I believe that should be the "know it when I see it" test for open-source: has enough information been released for a competent programmer (or team) to understand how the software actually works?

      Training data simply does not help you here. Our existing architectures are not explainable or auditable in any meaningful way, training data or no training data.

      • samj 6 hours ago

        That's why Open Source analyst Redmonk now "do not believe the term open source can or should be extended into the AI world." https://redmonk.com/sogrady/2024/10/22/from-open-source-to-a...

        I don't necessarily agree and suggest the Open Source Definition could be extended to cover data in general (media, databases, and yes, models) with a single sentence, but the lowest risk option is to not touch something that has worked well for a quarter century.

        The community is starting to regroup and discuss possible next steps over at https://discuss.opensourcedefinition.org

      • aithrowawaycomm 6 hours ago

        I don't think your comment is really true: LLM providers and researchers have been a bit too eager to claim their software is mystically complex. Anthropic's research is shedding light on interpretability, there has been good work done on the computational complexity side, and I am quite confident that the issue is LLMs' newness and complexity, not that the problem is actually intractable (or specifically "more intractable" than other hopelessly complex software like Facebook or Windows).

        To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous amount of things. But for a given specific problem the training data can tell you a lot:

        - whether there is test contamination with respect to LLM benchmarks or other assessments of performance

        - whether there's any CSAM, racist rants, or other things you don't want

        - whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue

        - whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")

        - investigating oddness like that LeetMagikarp (or whatever) glitch in ChatGPT

      • blackeyeblitzar 6 hours ago

        But training data can itself be examined for biases, and the curation of data also introduces biases. Auditing the software this way doesn't require explainability in the way you're talking about.

  • lolinder 7 hours ago

    > Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

    This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.

    Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
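    A stand-in sketch of that workflow (toy numbers, a two-parameter logistic "model" in place of GPT-2): start from released weights, run gradient steps on your own new dataset, and the original corpus never enters the picture.

```python
import math

sigmoid = lambda z: 1 / (1 + math.exp(-z))

# Pretend these weights shipped with the model release;
# the data that produced them is nowhere in sight.
w, b = 0.5, 0.0

# Our own fine-tuning set (the analogue of the Shakespeare corpus).
new_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def loss(w, b):  # mean cross-entropy on the new data
    return -sum(y * math.log(sigmoid(w * x + b)) +
                (1 - y) * math.log(1 - sigmoid(w * x + b))
                for x, y in new_data) / len(new_data)

before = loss(w, b)
for _ in range(500):                 # fine-tune: plain SGD on new data only
    for x, y in new_data:
        g = sigmoid(w * x + b) - y   # cross-entropy gradient
        w -= 0.1 * g * x
        b -= 0.1 * g
after = loss(w, b)

print(after < before)  # True: the weights adapted without their training data
```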

    The training data is properly part of the development process of the open source artifact, not part of the artifact itself. Some open source companies (GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that they don't record their meetings and stream them for everyone to watch their planning process.

    Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted similar no one's yet given me a good reason why I'm wrong.

    • tourmalinetaco 6 hours ago

      By your logic, many video games have been “open source” for decades because tools were accessible to modify the binary files in certain ways. We lacked the source code, but that’s just “part of the development process”, and maybe parts like comments were lost during the compiling process, but really, why isn’t it open source? Tens of thousands have modified the binaries as long as they’ve been around, and for that I needed community tools, not the source code.

      In short, your argument doesn’t work because source code is to binaries as training data is to MLMs. Source code is the closest comparison we have with training data, and the useless OSI claims that’s only a “benefit” not a “requirement”. This isn’t a stance meant for long term growth but for maintaining a moat of training data for “AI” companies.

      • lolinder 6 hours ago

        > By your logic, many video games have been “open source” for decades because tools were accessible to modify the binary files in certain ways. We lacked the source code, but that’s just “part of the development process”, and maybe parts like comments were lost during the compiling process, but really, why isn’t it open source?

        Because the binaries were not licensed under a FOSS license?

        Also, as I note in another comment [0], source code is the preferred form of a piece of software for making modifications to it. The same cannot be said about the training data, because getting from that to weights costs hundreds of millions of dollars in compute. Even the original companies prefer to fine-tune their existing foundation models for as long as possible, rather than starting over from training data alone.

        > In short, your argument doesn’t work because source code is to binaries as training data is to MLMs.

        I disagree. Training data does not allow me to recreate an LLM. It might allow Jeff Bezos to recreate an LLM, but not me. But weights allow me to modify it, embed it, and fine tune it.

        The weights are all that really matters for practical modification in the real world, because in the real world people don't want to spend hundreds of millions to "recompile" Llama when someone already did that, any more than people want to rewrite the Linux kernel from scratch based on whiteboard sketches and mailing list discussions.

        [0] https://news.ycombinator.com/item?id=41951945

  • a-dub 5 hours ago

    the term "open source" means that all of the materials that were used to create a distribution are available to inspect and modify.

    anything else is closed source. it's as simple as that.

  • chrisfosterelli 8 hours ago

    I imagine that Open AI (the company) must really not like this.

    • talldayo 7 hours ago

      I hate OpenAI but Sam Altman is probably giddy with excitement watching the Open Source pundits fight about weights being "good enough". He's suffered the criticism over his brand for years but they own the trademark and openly have no fucks to give about the matter. Founding OpenAI more than 5 years before Open AI was defined is probably another perverse laurel he wears.

      At the end of the day, what threatens OpenAI is falling apart before they hit the runway. They can't lose the Microsoft deal, they can't lose more founders (almost literally at this point) and they can't afford to let their big-ticket partnerships collapse. They are financially unstable even by Valley standards - one year in a down market could decimate them.

  • AlienRobot 7 hours ago

    If I remember correctly, Stallman's whole point about FLOSS was that consumers were beholden to developers who monopolized the means to produce binaries.

    If I can't reproduce the model, I'm beholden to whoever trained it.

    >"If you're explaining, you're losing."

    That is an interesting point, but isn't this the same organization that makes "open source" vs. "source available" a topic? e.g. why Winamp wouldn't be open source?

    I don't think you can even call a trained AI model "source available." To me the "source" is the training data. The model is as much of a binary as machine code. It doesn't even feel right to have it GPL licensed like code. I think it should get the same license you would give to a fractal art released to the public, e.g. CC.

    • alwayslikethis 7 hours ago

      It's not clear that copyright applies to model weights at all, given they are generated by a computer and aren't really creative works. They are closer to a quantitative description of the underlying data, like a dictionary or word frequency list.

    • klabb3 5 hours ago

      I think this makes the most sense. The only meaningful part of the term is whether or not you can hack on it, without permission from (or even coordination with) owners, founders or creators.

      Heck, a regular old binary is much less opaque than “open” weights. You can at least run it through a disassembler and slowly, dreadfully, figure out how it works. Just look at the game emulator community.

      For open weight AI models, is there anything close to that?

  • andrewmcwatters 8 hours ago

    I’m sure this will be controversial for some reason, but I think we should mostly reject the OSI’s definitions of “open” anything and leave that to the engineering public.

    I don’t need a board to tell me what’s open.

    And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.

    I don’t need people to tell me that.

    "OSI approved" this and that has turned into a Ministry of Magic "approved thinking" situation that feels gross to me.

    • didibus 7 hours ago

      I agree. If it's open source, surely I can at least "compile" it myself. If the data is missing, I can't do that.

      We'll end up with like 5 versions of the same "open source" model, all performing differently because they're all built with their own dataset. And yet, none of those will be considered a fork lol?

      I don't know what the obsession is either. If you don't want to give others permission to use and modify everything that was used to build the program, why are you wanting to trick me into thinking you are, and still calling it open source?

    • strangecasts 7 hours ago

      > And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.

      Making training exactly reproducible locks off a lot of optimizations, you are practically not going to get bit-for-bit reproducibility for nontrivial models
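        One concrete reason, independent of any deliberate optimization tricks: floating-point addition isn't associative, and parallel training hardware sums gradient terms in whatever order the scheduler happens to run them, so even identical data and code can yield different bits. A three-number sketch:

```python
# The same three numbers, summed in two different orders.
left  = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6
print(left == right)        # False: addition order changes the result

# Scaled up to billions of gradient terms per step, reduced across GPUs
# in nondeterministic order, bit-for-bit runs diverge almost immediately.
```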

      • samj 7 hours ago

        Nobody's asking for exact reproducibility — if the source code produces the software and it's appropriately licensed then it's Open Source.

        Similarly, if you run the scripts and it produces the model then it's Open Source that happens to be AI.

        To quote Bruce Perens (definition author): the training data IS the source code. Not a perfect analogy but better than a recipe calling for unicorn horns (e.g., FB/IG social graphs) and other toxic candy (e.g., NYT articles that will get users sued).

      • didibus 7 hours ago

        That's kind of true for normal programs as well, depending on the compiler you use, and if it has non-deterministic processes in its compilation. But still, it's about being able to reproduce the same build process and get a true realization of the program: even if not bit-for-bit identical, it's the same intended program.

    • rockskon 7 hours ago

      To be fair, OSI approval also deters marketing teams from watering down the definition of open source into worthless feelgood slop.

      • tourmalinetaco 6 hours ago

        That’s already what’s happened though, even with MLMs. Without training data we’re back to modifying a binary file without the original source.

    • JumpCrisscross 8 hours ago

      > if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open

      This is the new cracker/hacker, GIF pronunciation, crypto(currency)/crypto(graphy) molehill. Like sure, nobody forces you to recognise any word. But the common usage already precludes open training data—that will only get more ensconced as more contracts and jurisdictions embrace it.

    • mistrial9 6 hours ago

      in historical warfare, Roman soldiers easily and brutally defeated brave, individualist and social opponents on the battlefield, arguably in markets afterwards. It is a sad and essential lesson that applies to modern situations.

      In marketing terms, a simple market communication, consistently and diligently applied, in varied contexts and over time, can and usually will take hold despite untold numbers of individuals who shake their fists at the sky or cut with clever and cruel words that few hear IMHO

      OSI branding and market communications seem very likely to me to be effective in the future, even if the content is exactly what is being objected to here so vehemently.

  • eadwu 3 hours ago

    If only they kept their "Debian Free Software" name instead of hijacking another word ...