I think a neat route would be to use this as an authoring plugin in VS Code, like Prettier: write Duper (or JSON5, or whatever), and then downlevel it to regular JSON automatically when pressing Cmd-S. You wouldn't get to keep your comments (or they could be transformed to { "//": "comment text" }).
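The downlevel step itself is small if you lean on an existing parser; here's a rough sketch in TypeScript, assuming the `json5` npm package (a real plugin would wire this into the editor's save hook):

    // Rough sketch of the downlevel-on-save transform, assuming the `json5` npm package.
    import JSON5 from "json5";

    export function downlevelToJson(source: string): string {
      // JSON5.parse simply drops comments, so they're lost here; a fancier version
      // could hoist them into { "//": "comment text" } entries before parsing.
      const value = JSON5.parse(source);
      return JSON.stringify(value, null, 2) + "\n";
    }

    // Example "save": JSON5 in, plain JSON out.
    console.log(downlevelToJson(`{
      // this comment gets dropped
      unquoted: 'and you can quote me on that',
      trailingComma: ['in arrays',],
    }`));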
Outside of that, it's tough to compete with JSON in the "human-readable unschematized serialization format" market, especially targeting JavaScript:
Use in the browser requires some degree of bundle size increase, since the parser code needs to be loaded before your format can be used. WebAssembly libraries are usually quite large compared to a pure-JS implementation. According to [bundlejs](https://bundlejs.com/?q=%40duper-js%2Fwasm&treeshake=%5B*%5D), @duper-js/wasm weighs in at about 488 kB uncompressed, 159 kB gzip.
Use in any JavaScript runtime means you're competing against the runtime's native `JSON.parse` and `JSON.stringify`. In V8, these are very quick and have runtime-level tricks to go faster; for example, see [V8's recent post on making JSON.stringify 2x faster](https://v8.dev/blog/json-stringify) when serializing plain objects with no funny-business `.toJSON` methods, replacer, or indent formatting.
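To make that concrete, my rough reading of the fast-path conditions from the linked post (illustrative only; the exact checks live inside V8):

    const data = { id: 1, tags: ["a", "b"] };

    JSON.stringify(data);                          // plain object, no options: fast-path candidate
    JSON.stringify(data, null, 2);                 // indentation opts out of the fast path
    JSON.stringify(data, (_key, value) => value);  // a replacer opts out
    JSON.stringify({ toJSON: () => ({ id: 1 }) }); // a .toJSON method opts out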
Besides those points, my major complaint about JSON is how expensive it is to encode binary data for transmission. In JSON I usually use base64; with your format it gets transformed into escape sequences that are less efficient than base64, right? \xNN is base16 with 2 extra bytes wasted on the \ and x, and \uNNNN is also base16 with 2 extra bytes wasted on the \ and u. Is there a way to fit binary into the format with no expensive encode/decode step?
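Back-of-the-envelope numbers for shipping n raw bytes as text (my arithmetic, not anything from the Duper docs):

    // Text overhead for n raw bytes embedded in a string literal:
    //   base64  -> ceil(n / 3) * 4 chars  (~1.33x)
    //   \xNN    -> n * 4 chars            (4x: backslash, 'x', two hex digits)
    //   \uNNNN  -> n * 6 chars            (6x, at one byte per escape)
    const n = 1024;
    console.log({
      base64: Math.ceil(n / 3) * 4, // 1368
      hexEscape: n * 4,             // 4096
      unicodeEscape: n * 6,         // 6144
    });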
So, for me this seems best suited as a config file format: there you get real benefit from comments, identifiers, and easier string authoring. Not sure I need the binary raw-string thingy in config files that much, but I guess it doesn't hurt.
> I think a neat route would be to use this as an authoring plugin in VS Code, like Prettier: write Duper (or JSON5, or whatever),
This actually somewhat works right now. If you pass this JSON5 example through Prettier:
    {
      // comments
      unquoted: 'and you can quote me on that',
      singleQuotes: 'I can use "double quotes" here',
      lineBreaks: "Look, Mom! \
    No \\n's!",
      hexadecimal: 0xdecaf,
      leadingDecimalPoint: .8675309, andTrailing: 8675309.,
      positiveSign: +1,
      trailingComma: 'in objects', andIn: ['arrays',],
      "backwardsCompatible": "with JSON",
    }
You’ll get:
    {
      // comments
      "unquoted": "and you can quote me on that",
      "singleQuotes": "I can use \"double quotes\" here",
      "lineBreaks": "Look, Mom! \
    No \\n's!",
      "hexadecimal": 0xdecaf,
      "leadingDecimalPoint": 0.8675309,
      "andTrailing": 8675309,
      "positiveSign": +1,
      "trailingComma": "in objects",
      "andIn": ["arrays"],
      "backwardsCompatible": "with JSON"
    }
Which is still invalid JSON... but it does fix the unquoted keys, the leading/trailing decimal points, the trailing commas, and the single → double quoted strings with correct escaping. So if you have “format on save” enabled in your editor, it might just work!
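For reference, by my reading of the output above, the constructs Prettier leaves in place are the comments, the hex literal, the explicit +, and the backslash line continuation; a quick check in TypeScript:

    // Each snippet is a leftover construct from the output above that JSON.parse rejects.
    const leftovers = [
      '{ "hexadecimal": 0xdecaf }',          // hex number literal
      '{ "positiveSign": +1 }',              // explicit plus sign
      '{ "a": 1 } // comment',               // line comment
      '{ "lineBreaks": "Look, Mom! \\\n" }', // backslash line continuation inside a string
    ];
    for (const src of leftovers) {
      try {
        JSON.parse(src);
        console.log("valid:", src);
      } catch {
        console.log("still invalid JSON:", JSON.stringify(src));
      }
    }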
Where the ** is the grammar specification? Prose is nice, but with a BNF I could plug this into my parsing expression grammar library right quick and give it a rundown.
The object notation format that's going to win is the one that's going to maximally support LLM output. I've come across BAML before, but it's not widely used for some reason.
Today JSON is winning, but for more complex structures there are still syntax issues in the output. XML does reasonably well (given all the React JSX/HTML in the training corpora), so perhaps that will make a comeback.
Are there benchmarks on this? I think the SOTA models are fine -- they can work with most formats -- but the interesting question is about models at 90% of SOTA performance and 90% lower cost: which output format do they work best with? That's where the winner will be found.
TLDR: probably JSON or XML will remain the config format for a while.
Nice work, this actually looks great. Of course, it’s only a matter of time before someone drops the XKCD about standards proliferation, so I’ll save them the trouble. Pre-emptive XKCD #927 deployed.
https://xkcd.com/927/
The X on the date time support means we need a new standard :)