What's wrong with the JSON gem API?

(byroot.github.io)

63 points | by ezekg 4 days ago

25 comments

  • kazinator 4 days ago

    Since JSON does not specify what happens with duplicate keys, it is "unspecified behavior" in language-spec speak.

    Software which prepares JSON with duplicate keys is causing unspecified behavior in downstream processors, and therefore its output can be regarded as broken.

    If you put out JSON with duplicate keys, you are responsible for whatever consequently blows up in your face or the faces of downstream users.

    • xg15 14 hours ago

      I don't know any format or application that uses this for legitimate purposes. The most frequent "application" seems to be exploits. As such, I don't think the "you're responsible" call would be very helpful. Or rather, you might just as well say "if you send a packet with an overflowing length field, you're responsible for whatever happens downstream".

  • nurettin 4 days ago

    Not sure what they mean by "json gem", as it comes with Ruby and isn't installed via the gem install command.

    I've always used

    require 'json/ext'

    This uses the C extension version and is much faster.

    • byroot 3 days ago

      `json` is a "default gem": https://stdgems.org/

      Meaning Ruby ships with that gem pre-installed, but it's still a gem, and you can upgrade/downgrade it in your Gemfile if you so choose.
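
      e.g. in your Gemfile (the version constraint here is just illustrative):

          # Pin or upgrade the default json gem explicitly; Bundler will
          # then use this version instead of the bundled default.
          gem "json", "~> 2.7"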

      As for `require "json/ext"`, that's outdated knowledge; a plain `require "json"` loads the C extension automatically these days.

      • nurettin 3 days ago

        Haha I'd imagine it is outdated. It is from 10+ years ago.

  • jmull 4 days ago

    Changing the default behavior for duplicate keys is unnecessary... and therefore should not be done.

    IMO, it's nuts to purposely introduce random bugs into the apps of everyone who uses your dependency.

    • anitil 4 days ago

      There was a post here recently about how this sort of behaviour on duplicate keys has actually led to security issues (in the context of Go) [0]

      [0] https://news.ycombinator.com/item?id=44308953

      • jmull 4 days ago

        The issue was triggered by using multiple parsers, with different behaviors.

        The change of behavior here would prevent the specific problem, but makes the general problem worse.

        This change creates a new code path that did not exist when all the app code was developed and tested, so there's a decent chance something bad can happen that it's not prepared for.

        • anitil 4 days ago

          The thing that I find surprising about this is that it seems so innocuous. Duplicate keys in an object? Meh, just take the first or the last or something, no big deal; I'm sure whatever library I'm using will handle it.

          • jmull 3 days ago

            Just to point out, the problem in that issue isn't really about JSON parsing. It's about the layer whose job is to verify/sanitize the input passing unverified/unsanitized input through to the next layer.

            The idea that your JSON parser can catch this kind of error in the general case just doesn’t make sense.

            Changing the default behavior of the parser would catch this specific problem — by coincidence — but will also create new problems. They are as yet unknown, but we will find out (unless someone gains the sense to back off this bad idea before it makes it into production).

          • kazinator 4 days ago

            RFC 8259 describes these possibilities, accurately quoted in the article: "Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates."

            Keeping the first key/value would be a poor idea since it is a behavior not mentioned in the RFC.
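
            For what it's worth, Ruby's json gem has been in the first camp so far, i.e. the last pair wins:

                require "json"

                JSON.parse('{"a": 1, "a": 2}') #=> {"a"=>2}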

    • onli 11 hours ago

      I agree. This will actually cause me problems.

      I have a huge set of mostly handwritten JSON data that powers a benchmark collection. Sometimes mistakes happen and keys get duplicated. That is not a big problem: if e.g. a graphics card appears twice in an array, nothing bad happens. Even if the actual value is wrong because of such a duplication (like a benchmark update gone wrong), that will usually not be a problem, because there is a lot of other data that will place the component at around the correct position regardless.

      Now, what will happen is that a Ruby update forces me to update the gems, the new JSON gem will crash on my data, and then I will have to take time I don't have to fix something that does not need fixing, for a project that does not really generate income. Awesome.

      The right solution here is to specify the currently unspecified duplicate-key behaviour as whatever this parser does today. That prevents applications from randomly running into new bugs. Then print a warning, which is already done, and maybe offer an opt-in to disallow input data that has duplicated keys. Not to waste developers' time by making the breakage the default.

      And that's why breaking changes in dependencies are unacceptable (if the old behaviour is not completely broken), and relying on invisible deprecation messages is not okay. The sqlite gem did the same thing: it completely broke its old API for how parameters could be supplied, did not revert the change even when the resulting chaos was reported, and then took it as a personal insult when I opened a discussion about how that's still a problem.

      The YAML and psych gems are another nice one: just this month they suddenly could not write a YAML file with YAML::Store.new anymore. I had to work around that with `file.write(Psych.dump(theyamlhash))`; https://gitlab.com/onli/sustaphones/-/commit/fecae4bb2ee36c8... is the commit. If I got that right, this is about `permitted_classes: [Date]` not being given, and not being givable, to psych in YAML::Store. Made me pause.
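
      For reference, the load side with plain Psych looks something like this, assuming the root cause really is that missing option:

          require "yaml"
          require "date"

          # safe_load rejects non-core classes such as Date unless they
          # are explicitly permitted
          data = YAML.safe_load(File.read("data.yml"), permitted_classes: [Date])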

      Is the Ruby community breaking? I'm not aware of gems producing developer-hostile situations like that before.

      • byroot 11 hours ago

        If you don't care about duplicated keys, all you have to do is set the `allow_duplicate_key: true` option.
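
        i.e.:

            # opts back out of the strict behavior for this call; the last
            # value still wins, as before
            JSON.parse('{"a": 1, "a": 2}', allow_duplicate_key: true) #=> {"a"=>2}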

        The whole point of the article is to explain why, while I have empathy for code owners who will be impacted by various changes, I sometimes believe the benefits outweigh the cost.

        I even linked to an example of a very nasty security issue that would have been prevented if that change had been made sooner.

        > Is the Ruby community breaking?

        No, the community is trying to fix past mistakes. For instance, the YAML behavior you are cursing about losing has been the source of numerous security vulnerabilities, so tenderlove had to do something to remove that massive footgun.

        I totally get that it annoyed you; you can't imagine how much code I had to update to deal with that one.

        But again, while I totally agree maintainers should have empathy for their users and avoid needless deprecations, it'd be good if users had a bit of empathy for maintainers who are sometimes stuck in "damned if you do, damned if you don't" situations...

        • onli 11 hours ago

          I will try to understand your position, and thanks for answering.

          I appreciated the "avoiding breakage" sentiment included in the post (and commented accordingly below). I'm just really of the opinion that dependencies should not force me to do something like set `allow_duplicate_key: true` if it is not absolutely necessary, and I guess I do not see why it is absolutely necessary here. I saw the link to a security vulnerability, but why not let devs set `allow_duplicate_key: false` manually in contexts where it matters? Avoiding churn is more important - this is not a "the API will cause random code to be executed" situation, unlike the create_additions situation you describe, and possibly unlike the YAML situation. There I understand the need (even with YAML, I'm just sad about the currently broken default code path).

          Also, we saw with YAML how it wasn't viable to do such a change without breakage via the intermediate dependencies (and I was very happy you mentioned that API not being nice) and churn via the direct ones. The same is very likely to happen here: programs will break because their dependencies do not set the new option when they parse stored JSON.

          • byroot 10 hours ago

            > but why not let devs set `allow_duplicate_key: false` manually in contexts where it matters?

            Because that wouldn't have prevented the issue I linked to.

            Default settings are particularly important, because most of the gem's users have no idea that duplicated keys are even a concern.

            JSON is used a lot to parse untrusted data, so having strict and safe defaults is particularly valuable.

            In your case the JSON documents are trusted, so I understand this change is of negative value for you, but I believe it has positive value overall when I account for all the users of the gem.

            Additionally, even for the trusted configuration file or similar case, I believe most users would see it as valuable not to let duplicated keys go unnoticed, because duplicated keys are almost always a mistake and can be hard to track down.

            e.g. a developer might have some `config.json` with:

                {
                  "enabled": true,
                  // many more keys,
                  "enabled": false,
                }
            
            And waste time figuring out why `JSON.parse(doc)["enabled"]` is `false` when they can see it's clearly `true` in their config.

            So again, I understand you are/will be annoyed, because your use case isn't the majority one, but I'm trying to cater to lots of different users.

            If going over your code to add the option is really too much churn for your low-maintenance project, then, as I mention in the post, a totally OK solution for such cases is to monkey patch and move on:

                require "json"
            
                module JSONAllowDuplicateKey
                  def parse(doc, options = nil)
                    options = { allow_duplicate_key: true }.merge(options || {})
                    super(doc, options)
                  end
                end
                JSON.singleton_class.prepend(JSONAllowDuplicateKey)

            • onli 10 hours ago

              It is the other way around, I am convinced. In the config scenario you describe, the input is also trusted, as will be the bulk of JSON usage. In all those scenarios duplications will happen regularly, and developers rely on the JSON parser to work with the input regardless. Which means they will all have to churn through the change required by this to get a working program again, one that does not crash on input that passed before.

              I get that the option to detect and prevent this is good. The warning is seriously useful. But it does not need to be a new default that forbids previously valid inputs. And I do not understand why the option to do so would not have helped in the hacker issue.

              Thanks for the code snippet. Actually, it is not a lot of code for me to change, and likely I do not have an intermediate dependency, I hope. The problem is the dependencies vastly outnumbering my program, so work like this potentially adds up.

              Don't feel obliged to further discuss this if it is a waste of your time - but it is important for me to try to argue against churn.

              • byroot 9 hours ago

                > Dont feel obliged to further discuss

                Just clarifying this one:

                > I do not understand why the option to do so would not have helped in the hacker issue.

                Because users aren't omniscient. When a user needs to parse some JSON, they'll reach for `JSON.parse`, which they either already know about or will find from a cursory search. That will have solved their problem, so they will not likely look in detail at the dozen or so options this method takes, and won't consider the possibility of duplicated keys or their potential nefarious impact.

                Hence why the defaults are so important.

                > it is important for me to try to argue against churn

                It's alright, I'm with you on this in general, just not in this particular case.

                • onli 9 hours ago

                  Alright, thanks.

  • ezekg 4 days ago

    First thing we could do here is rename the JSON.parse :symbolize_names keyword to :symbolize_keys -- always trips me up for some reason.
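
    For reference:

        require "json"

        JSON.parse('{"name": "value"}', symbolize_names: true)
        #=> {name: "value"}, a hash with symbol keys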

    • fuzzy_biscuit 4 days ago

      Maybe it could just be an alias. People do also refer to them as key-value pairs, so that feels reasonable.

    • onli 11 hours ago

      Breaking the API like that would be extremely hostile and exactly the kind of churn byroot thankfully claimed he does not want to produce. If anything, both options could be offered. Don't be disrespectful of developers' time.

    • athorax 4 days ago

      Why? JSON is name-value pairs https://www.json.org/json-en.html

      • caseyohara 4 days ago

        Sure, but JSON.parse returns a Hash, which is key-value pairs. The method argument is about the return value, not the input value, so it is more like "Parse this JSON, and symbolize the keys in the resulting Hash before returning it to me". Plus, as mentioned, symbolize_keys is more conventional.

      • ezekg 4 days ago

        I guess I'm more thinking about Ruby/Rails conventions, e.g. methods like Hash#symbolize_keys.
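
        e.g. the ActiveSupport version:

            require "active_support/core_ext/hash/keys"

            { "name" => "value" }.symbolize_keys #=> {name: "value"}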

        Mixing the two has always been a smell to me, but maybe you're right.