Fixes#2379
= Overlong (non-shortest) sequences
UTF-8's unique encoding scheme allows for some Unicode codepoints
to be represented in multiple ways. For any of these characters,
the spec forbids all but the shortest form. These disallowed longer
sequences are called "overlong". As an interesting side effect of
this rule, the bytes C0 and C1 never appear in valid UTF-8.
= Codepoint range
UTF-8 disallows representation of codepoints beyond U+10FFFF,
which is the highest character which can be encoded in UTF-16.
Because a 4-byte sequence is capable of resulting in such characters,
they must be explicitly rejected. This rule also has an interesting
side effect, which is that bytes F5 to FF never appear.
= References
Detecting an overlong version of a codepoint could get gnarly, but
luckily The Unicode Consortium did the hard work by creating this
handy table of valid byte sequences:
https://unicode.org/versions/corrigendum1.html
I thought this mapped nicely to the parser's state machine, so I
rearranged the relevant states to make use of it.
This change was mostly made with `zig fmt` and this also modified some whitespace. Note that in some files, `zig fmt` produced incorrect code, so the change was made manually.
Thanks to the Windows Process Environment Block, it is possible to
obtain handles to the standard input, output, and error streams without
possibility of failure.
parseString() created a copy of the string using the wrong allocator.
Instead of using the ArenaAllocator, it was using the allocator passed
into Parser.init(). This lead to a leak as the copied string was not
freed when the ArenaAllocator was deinited.
* introduce std.json.WriteStream API for writing json
data to a stream
* add WIP tools/merge_anal_dumps.zig for merging multiple semantic
analysis dumps into one. See #3028
* add std.json.Array, improves generated docs
* add test for `std.process.argsAlloc`, improves test coverage and
generated docs