Skip to content

Conversation

byroot
Copy link
Member

@byroot byroot commented Aug 23, 2025

Because both strings and symbols keys are serialized the same, it always has been possible to generate documents with duplicated keys:

>> puts JSON.generate({ foo: 1, "foo" => 2 })
{"foo":1,"foo":2}

This is pretty much always a mistake and can cause various issues because it's not guaranteed how various JSON parsers will handle this.

Until now I didn't think it was possible to catch such case without tanking performance, hence why I only made the parser more strict.

But I finally found a way to check for duplicated keys cheaply enough.

@@ -1064,7 +1095,6 @@ json_object_i(VALUE key, VALUE val, VALUE _arg)
if (RB_UNLIKELY(state->space)) fbuffer_append_str(buffer, data->state->space);
generate_json(buffer, data, val);

arg->iter++;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of this increment does offset the small overhead:

== Encoding activitypub.json (52595 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     3.019k i/100ms
Calculating -------------------------------------
               after     30.785k (± 2.5%) i/s   (32.48 μs/i) -    153.969k in   5.004680s

Comparison:
              before:    30570.1 i/s
               after:    30784.6 i/s - same-ish: difference falls within error


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   159.000 i/100ms
Calculating -------------------------------------
               after      1.630k (± 0.6%) i/s  (613.58 μs/i) -      8.268k in   5.073274s

Comparison:
              before:     1603.6 i/s
               after:     1629.8 i/s - 1.02x  faster


== Encoding twitter.json (466906 bytes)
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after   308.000 i/100ms
Calculating -------------------------------------
               after      3.088k (± 0.6%) i/s  (323.86 μs/i) -     15.708k in   5.087390s

Comparison:
              before:     3041.1 i/s
               after:     3087.8 i/s - 1.02x  faster

@byroot byroot force-pushed the generate-dup-keys branch from 1ec4ecd to 036dadf Compare August 24, 2025 13:58
Because both strings and symbols keys are serialized the same,
it always has been possible to generate documents with duplicated
keys:

```ruby
>> puts JSON.generate({ foo: 1, "foo" => 2 })
{"foo":1,"foo":2}
```

This is pretty much always a mistake and can cause various
issues because it's not guaranteed how various JSON parsers
will handle this.

Until now I didn't think it was possible to catch such case without
tanking performance, hence why I only made the parser more strict.

But I finally found a way to check for duplicated keys cheaply enough.
@byroot byroot force-pushed the generate-dup-keys branch from 036dadf to 8454149 Compare August 24, 2025 13:58
@byroot byroot marked this pull request as ready for review August 24, 2025 14:02
@byroot byroot merged commit 36fefff into ruby:master Aug 24, 2025
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant