But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
On top of that, for the vast majority of performance/cost parameter spaces, you're better off both in developer ergonomics and speed/space slapping zstd across a flatter binary format, supposing no better tool fits your use case better. Especially if your messages aren't exceptionally tiny. You're not using them in a raw DB or doing raw bulk analysis on varints (else basically zero choices of parameters make varints win out), so you're transferring them somewhere and decoding them. That decoding step, even for highly optimized solutions like bijou64, is on par with (slightly better than, if you have an older datacenter link) your raw network. If you spend 1s on networking, you spend 1s on parsing. That's a bad tradeoff almost always, and that assumes a good varint solution.
Even when varints make sense for some set of perf/cost parameters, it's still only for developer ergonomics 99.9999% of the time. Even simple changes like operating on a sequence of values rather than a single scalar enable vastly better CPU/space tradeoffs, and being willing to craft a proper data layout usually offers huge gains on top of that.
It's interesting that you pick delta encoding (or, its natural extension, double-delta encoding often being valuable) for time-series databases as an example. That's an obvious case where you have a solution which is extremely cheap in storage/network/CPU. Varints suck comparatively, almost always.
Not to rip on them too much, especially since it's nice to have primitives available which let you not have to do hard thinking for literally every problem, but they're not amazing and not a great default.
I spent WAYYYYYYYY too much time exploring this...
It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.
The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.
ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.
Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.
Even though decoding the lengths must be serial (since's there's no unambiguous way to differentiate a tag and data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).
On a general note, the continuation bit and prefix byte forms are equivalent, you just broadcast the prefix byte and compare against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but doable (it's just not byte-parallel, you eg. unroll the serial decode loop 8 times or whatever your maximum output byte width is, and mask out).
Simplified:
// Just maps a byte to its position in the register
__m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
// Broadcast the prefix
__m128i nn = _mm_set1_epi8((char)prefix_byte);
// Get applicable locations: prefix_byte contains the length, if byte_pos < len, the corresponding byte will be set
__m128i m = _mm_cmpgt_epi8(nn, idx);
// If you *really* want a high-bit mask:
m = _mm_and_si128(m, _mm_set1_epi8((char)0x80));Interleaved Bijou has no such signal (tag and payload bytes both span 0x00–0xFF), so finding the boundaries is a dependent per-value walk with no opportunities for parallelism.
With that, it's mostly byte-parallel (though data-dependent as I mentioned).
Can you explain this part a bit? I feel like intuitively (and therefore probably incorrectly) these should have the same difficulties.
EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:
-- The Yubikey 4 has an off-by-one bug which
-- declares tag length of 255 (for the 0x53 outer
-- tag of a certficate DO) when there are only 254
-- bytes remaining in the reply. The reply is
-- chained across two packets, but the off-by-one is
-- probably related to the over-long encoded length
-- (0x82 0x00 0xff instead of 0x81 0xff).
--
-- [snip packet captures]
--
-- Yubico's ykpiv_fetch_object function in ykpiv.c
-- (confirmed 1.4.3-1.5.0) contains a read (memmove)
-- overflow when the declared inner BER-TLV length
-- (of the 0x53 tag) is longer than what was
-- received over the wire. That makes Yubico's
-- library oblivious to the issue. Relatedly, the
-- set_length function has an off-by-one bug (length
-- < 0xff instead of length <= 0xff) which produces
-- an over-long encoded length. That doesn't by
-- itself explain why the Yubikey 4 transmits a
-- truncated logical reply unless the same code is
-- being used.The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!
The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
I'm wary of introducing these forced-canonical encodings by someone hyper focused on "efficiency" and "security" without reconsidering additional use cases.
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
vu128: https://john-millikin.com/vu128-efficient-variable-length-in...
metric/imperial varint: https://dcreager.net/2021/03/a-better-varint/
vectorizing VByte: https://arxiv.org/abs/1503.07387
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
[0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki
> A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are more significant than bits taken from the later bytes.
from https://www.sqlite.org/fileformat2.html#varint
The one you linked for SQLite4 (abandoned project) is probably a better approach. I recall that the author has said that SQLite3's varint implementation is regretful.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.
An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.
It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
In a contrived example of a pbuf {length:int, payload:byte[1]}
LEB128 can trick you into reading the payload as part of the length, but then hopefully trigger a code check against invalid buffer read. (or one byte outside the struct if the payload is also malicious)
Binou64 can trick you to read 7 bytes into other memory, before any buffer size validation is done.
It's then not uncommon to log with a helpful; "buffer with length: 26624894573377(7 bytes of stolen data) is invalid", or just crash.
It's to the point that Bijou64_decode should perhaps take "end_adress" or "max_read" to catch this kind of attack.
(If you dont validate a malicious pbuf, you're in for a bad time regardless of integer format, but these int formats add their own way to trigger a buffer overrun despite a proper check.)
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
I'm not saying SQLite's varint implementation is ideal for every application. It's just an implementation that is one of the most used implementations, if not the most (I'd bet it is by a large margin though). It just seemed like a missed opportunity to compare it with the implementation they landed on.
EDIT: Just wanted to add, thanks for sharing that link. Interesting!
* It does not require canonicality, it allows multiple encodings of the same value. To make things even more fun, the QUIC spec requires shortest encoding in some uses but not others.
* It uses 2 bits rather than cutting out a range.
* It only encodes values up to 62 bits long.
So, some similarities but also some differences.
[1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-le...
Given that the context up to this point had been representation of integers, I initially trip on this. :)
1: https://kizu.dev/svg-linked-parameters-workaround/ 2: https://www.seaofclouds.com
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.
If you don't do this properly, you end up with things like: - SAML XSW attack due to XML signature wrapping - ASN.1 BER/DER signature forgery - Bitcoin transaction malleability attacks