RFC 9841 | Shared Brotli Data Format | August 2025 |
Alakuijala, et al. | Informational | [Page] |
This specification defines a data format for shared brotli compression, which adds support for shared dictionaries, large window, and a container format to brotli (RFC 7932). Shared dictionaries and large window support allow significant compression gains compared to regular brotli. This document updates RFC 7932.¶
This document is not an Internet Standards Track specification; it is published for informational purposes.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are candidates for any level of Internet Standard; see Section 2 of RFC 7841.¶
Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc9841.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The purpose of this specification is to extend the brotli compressed data format [RFC7932] with new abilities that allow further compression gains.¶
Shared dictionaries allow a static shared context between encoder and decoder for significant compression gains.¶
Large window brotli allows much larger back reference distances to give compression gains for files over 16 MiB.¶
The framing format is a container format that allows storage of multiple resources and references dictionaries.¶
This document is the authoritative specification of shared brotli data formats and the backwards compatible changes to brotli. This document also defines the following:¶
This specification is intended for use by software implementers to compress data into and/or decompress data from the shared brotli dictionary format.¶
The text of the specification assumes a basic background in programming at the level of bits and other primitive data representations. Familiarity with the technique of LZ77 coding [LZ77] is helpful, but not required.¶
This specification defines a data format for shared brotli compression, which adds support for dictionaries and extended features to brotli [RFC7932].¶
Unless otherwise indicated below, a compliant decompressor must be able to accept and decompress any data set that conforms to all the specifications presented here. Additionally, a compliant compressor must produce data sets that conform to all the specifications presented here.¶
Bytes stored within a computer do not have a "bit order" since they are always treated as a unit. However, a byte considered as an integer between 0 and 255 does have a most significant bit (MSB) and least significant bit (LSB), and since we write numbers with the most significant digit on the left, we also write bytes with the MSB on the left. In the diagrams below, the bits of a byte are written so that bit 0 is the LSB, i.e., the bits are numbered as follows:

    +--------+
    |76543210|
    +--------+
Within a computer, a number may occupy multiple bytes. All multi-byte numbers in the format described here are unsigned and stored with the least significant byte first (at the lower memory address). For example, the decimal 16-bit number 520 is stored as:¶
        0        1
    +--------+--------+
    |00001000|00000010|
    +--------+--------+
     ^        ^
     |        |
     |        + more significant byte = 2 x 256
     + less significant byte = 8
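As a non-normative illustration of this byte order, the 16-bit example can be reproduced in Python with the standard `int.to_bytes` and `int.from_bytes` methods. Python is used here only for illustration; the specification does not prescribe any language.

```python
# Store the 16-bit number 520 with the least significant byte first.
encoded = (520).to_bytes(2, byteorder="little")
assert encoded == bytes([0b00001000, 0b00000010])  # byte 0 is 8, byte 1 is 2

# Reading it back: less significant byte + 256 * more significant byte.
decoded = encoded[0] + 256 * encoded[1]
assert decoded == int.from_bytes(encoded, byteorder="little") == 520
```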
This document does not address the issue of the order in which bits of a byte are transmitted on a bit-sequential medium, since the final data format described here is byte- rather than bit-oriented. However, the compressed block format is described below as a sequence of data elements of various bit lengths, not a sequence of bytes. Therefore, we must specify how to pack these data elements into bytes to form the final compressed byte sequence:¶
Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the LSB of the byte.¶
Data elements other than prefix codes are packed starting with the LSB of the data element. These are referred to here as integer values and are considered unsigned.¶
Prefix codes are packed starting with the MSB of the code.¶
In other words, if one were to print out the compressed data as a sequence of bytes starting with the first byte at the *right* margin and proceeding to the *left*, with the MSB of each byte on the left as usual, one would be able to parse the result from right to left with fixed-width elements in the correct MSB-to-LSB order and prefix codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position).¶
As an example, consider packing the following data elements into a sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2, 3-bit prefix code b'110, 2-bit prefix code b'10, and 12-bit integer value 3628.¶
      byte 2   byte 1   byte 0
    +--------+--------+--------+
    |11100010|11000101|10010110|
    +--------+--------+--------+
     ^            ^ ^   ^   ^
     |            | |   |   |
     |            | |   |   +------ integer value 6
     |            | |   +---------- integer value 2
     |            | +-------------- prefix code 110
     |            +---------------- prefix code 10
     +----------------------------- integer value 3628
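The packing rules can be checked against this example with a small, non-normative Python sketch. The class `BitWriter` and its method names are illustrative assumptions and not part of the format; the sketch favors clarity over speed by storing one list entry per bit.

```python
class BitWriter:
    """Packs data elements into bytes, filling each byte from its LSB upward."""

    def __init__(self):
        self.bits = []  # individual bits, in the order they are emitted

    def write_integer(self, value, nbits):
        # Integer values are packed starting with their LSB.
        for i in range(nbits):
            self.bits.append((value >> i) & 1)

    def write_prefix_code(self, code, nbits):
        # Prefix codes are packed starting with their MSB.
        for i in reversed(range(nbits)):
            self.bits.append((code >> i) & 1)

    def to_bytes(self):
        out = bytearray((len(self.bits) + 7) // 8)
        for pos, bit in enumerate(self.bits):
            out[pos // 8] |= bit << (pos % 8)
        return bytes(out)


writer = BitWriter()
writer.write_integer(6, 3)
writer.write_integer(2, 4)
writer.write_prefix_code(0b110, 3)
writer.write_prefix_code(0b10, 2)
writer.write_integer(3628, 12)
packed = writer.to_bytes()

# Byte 0, byte 1, and byte 2 of the example, in stream order.
assert packed == bytes([0b10010110, 0b11000101, 0b11100010])
```

Note that the 3-bit prefix code straddles the boundary between byte 0 and byte 1: its first bit lands in bit 7 of byte 0 and the remaining two bits in bits 0 and 1 of byte 1.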
Shared brotli extends brotli [RFC7932] with support for shared dictionaries, a larger LZ77 window, and a framing format.¶
A shared dictionary is a piece of data shared by a compressor and decompressor. The compressor can take advantage of the dictionary context to encode the input in a more compact manner. The compressor and the decompressor must use exactly the same dictionary. A shared dictionary is especially useful for compressing short input sequences.
A shared brotli dictionary can use two methods of sharing context:¶
If no shared dictionary is set, the decoder behaves the same as in [RFC7932] on a brotli stream.¶
If a shared dictionary is set, then it can set LZ77 dictionaries, override static dictionary words, and/or override transforms.¶
If a custom word list is set, then the following behavior of the RFC 7932 decoder [RFC7932] is overridden:¶
Instead of the Static Dictionary Data from Appendix A of [RFC7932], one or more word lists from the custom static dictionary data are used.¶
Instead of NDBITS at the end of Appendix A of [RFC7932], a custom SIZE_BITS_BY_LENGTH per custom word list is used.¶
The copy length for a static dictionary reference must be between 4 and 31 and may not be a value for which SIZE_BITS_BY_LENGTH of this dictionary is 0.¶
If a custom transforms list is set without context dependency, then the following behavior of the RFC 7932 decoder [RFC7932] is overridden:¶
The "List of Word Transformations" from Appendix B of [RFC7932] is overridden by one or more lists of custom prefixes, suffixes, and transform operations.¶
The transform_id must be smaller than the number of transforms given in the custom transforms list.¶
If the dictionary is context dependent, it includes a lookup table of 64 word list and transform list combinations. When resolving a static dictionary word, the decoder computes the literal Context ID as described in Section 7.1 of [RFC7932]. The literal Context ID is used as the index in the lookup table to select the word list and transforms to use. If the dictionary is not context dependent, this ID is implicitly 0 instead.
If a distance goes beyond the dictionary for the current ID and multiple word/transform list combinations are defined, then the next dictionary is chosen as follows. If the dictionary is not context dependent, the combinations are tried in the order in which they are defined in the shared dictionary. If the dictionary is context dependent, the combination whose index matches the current context is used first, and the remaining combinations are then used in the order in which they are defined in the shared dictionary.
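One possible reading of this fallback order is sketched below in Python. It is non-normative and rests on assumptions: combinations are identified by their position in the shared dictionary, and `context_map` is a hypothetical 64-entry list that maps a literal Context ID to such a position. Neither name appears in the format.

```python
def dictionary_search_order(num_combinations, context_map=None, context_id=0):
    """Return word/transform list combination indices in the order they are tried."""
    definition_order = list(range(num_combinations))
    if context_map is None:
        # Not context dependent: plain definition order.
        return definition_order
    # Context dependent: the combination for the current context comes first,
    # followed by all others in definition order.
    first = context_map[context_id]
    return [first] + [i for i in definition_order if i != first]
```

For example, with three combinations and a context that maps to combination 2, the sketch yields the order 2, 0, 1.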
A shared dictionary may include custom word transformations to replace those specified in Section 8 and Appendix B of [RFC7932]. A transform consists of a possible prefix, a transform operation, for some operations a parameter, and a possible suffix. In the shared dictionary format, the transform operation is represented by a numerical ID, which is listed in the table below.¶
ID | Operation |
---|---|
0 | Identity |
1 | OmitLast1 |
2 | OmitLast2 |
3 | OmitLast3 |
4 | OmitLast4 |
5 | OmitLast5 |
6 | OmitLast6 |
7 | OmitLast7 |
8 | OmitLast8 |
9 | OmitLast9 |
10 | FermentFirst |
11 | FermentAll |
12 | OmitFirst1 |
13 | OmitFirst2 |
14 | OmitFirst3 |
15 | OmitFirst4 |
16 | OmitFirst5 |
17 | OmitFirst6 |
18 | OmitFirst7 |
19 | OmitFirst8 |
20 | OmitFirst9 |
21 | ShiftFirst (by PARAMETER) |
22 | ShiftAll (by PARAMETER) |
Operations 0 to 20 are specified in Section 8 of [RFC7932]. ShiftFirst and ShiftAll transform specifically encoded SCALARs.¶
A SCALAR is a 7-, 11-, 16-, or 21-bit unsigned integer encoded with 1, 2, 3, or 4 bytes, respectively, with the following bit contents:¶
    7-bit SCALAR:
    +--------+
    |0sssssss|
    +--------+

    11-bit SCALAR:
    +--------+--------+
    |110sssss|XXssssss|
    +--------+--------+

    16-bit SCALAR:
    +--------+--------+--------+
    |1110ssss|XXssssss|XXssssss|
    +--------+--------+--------+

    21-bit SCALAR:
    +--------+--------+--------+--------+
    |11110sss|XXssssss|XXssssss|XXssssss|
    +--------+--------+--------+--------+
Given the input bytes matching the SCALAR encoding pattern, the SCALAR value is obtained by concatenation of the "s" bits, with the MSBs coming from the earliest byte. The "X" bits could have arbitrary value.¶
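A non-normative Python sketch of reading a SCALAR follows. The function name `decode_scalar` and its return convention are illustrative assumptions; returning `None` for a first byte that starts no pattern, or for input that is too short, is a choice made for this sketch.

```python
def decode_scalar(data):
    """Return (value, byte_count) for the SCALAR at the start of data, or None."""
    if not data:
        return None
    first = data[0]
    if first < 0x80:      # 0sssssss
        nbytes, value = 1, first & 0x7F
    elif first < 0xC0:    # 10xxxxxx: not the first byte of any pattern
        return None
    elif first < 0xE0:    # 110sssss
        nbytes, value = 2, first & 0x1F
    elif first < 0xF0:    # 1110ssss
        nbytes, value = 3, first & 0x0F
    elif first < 0xF8:    # 11110sss
        nbytes, value = 4, first & 0x07
    else:                 # 11111xxx: not the first byte of any pattern
        return None
    if len(data) < nbytes:
        return None
    for byte in data[1:nbytes]:
        # Only the low 6 bits carry "s" bits; the two "X" bits are ignored.
        value = (value << 6) | (byte & 0x3F)
    return value, nbytes
```

Because the "X" bits are ignored, the byte pairs 0xC3 0xA9 and 0xC3 0x29 both decode to the same 11-bit value, 233.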
An ADDEND is defined as the result of limited sign extension of a 16-bit unsigned PARAMETER:¶
First, the PARAMETER is zero-extended to 32 bits. Then, if the resulting value is greater than or equal to 0x8000, 0xFF0000 is added to it.
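The computation can be sketched in Python as follows (non-normative; the function name `addend` is an illustrative assumption):

```python
def addend(parameter):
    """Limited sign extension of a 16-bit unsigned PARAMETER."""
    if not 0 <= parameter <= 0xFFFF:
        raise ValueError("PARAMETER must fit in 16 bits")
    value = parameter  # zero-extension leaves the numeric value unchanged
    if value >= 0x8000:
        value += 0xFF0000
    return value
```

With this definition, a PARAMETER of 0xFFFF yields 0xFFFFFF. Since only the lowest 7, 11, 16, or 21 bits of a sum are kept when it is written back into a SCALAR, adding 0xFFFFFF has the same effect as subtracting 1, so parameters of 0x8000 and above act as negative shifts.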
ShiftAll starts at the beginning of the word and repetitively applies the following transformation until the whole word is transformed:¶
If the next untransformed byte matches the first byte of the 7-, 11-, 16-, or 21-bit SCALAR pattern, then:¶
If the untransformed part of the word is not long enough to match the whole SCALAR pattern, then the whole word is marked as transformed.¶
Otherwise, let SHIFTED be the sum of the ADDEND and the encoded SCALAR. The lowest bits from SHIFTED are written back into the corresponding "s" bits. The "0", "1", and "X" bits remain unchanged. Next, 1, 2, 3, or 4 untransformed bytes are marked as transformed according to the SCALAR pattern length.¶
Otherwise, the next untransformed byte is marked as transformed.¶
ShiftFirst applies the same transformation as ShiftAll, but does not iterate.¶
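The two operations can be sketched in Python as shown below. This is a non-normative illustration of the steps described above, not the reference implementation; the names `shift_all`, `shift_first`, and the helpers are assumptions made for this sketch.

```python
# (exclusive upper bound of the first byte, pattern length, mask of its "s" bits)
_PATTERNS = [
    (0x80, 1, 0x7F),  # 0sssssss
    (0xC0, 0, 0x00),  # 10xxxxxx: not the first byte of any pattern
    (0xE0, 2, 0x1F),  # 110sssss
    (0xF0, 3, 0x0F),  # 1110ssss
    (0xF8, 4, 0x07),  # 11110sss
]


def _addend(parameter):
    return parameter + (0xFF0000 if parameter >= 0x8000 else 0)


def _shift_one(word, pos, addend):
    """Apply one step at word[pos]; return how many bytes are now transformed."""
    first = word[pos]
    nbytes, mask = next(((n, m) for limit, n, m in _PATTERNS if first < limit), (0, 0))
    if nbytes == 0:
        return 1  # no SCALAR pattern starts here: skip this byte
    if len(word) - pos < nbytes:
        return len(word) - pos  # pattern does not fit: the whole word is done
    scalar = first & mask
    for i in range(1, nbytes):
        scalar = (scalar << 6) | (word[pos + i] & 0x3F)
    shifted = scalar + addend
    for i in range(nbytes - 1, 0, -1):  # write "s" bits back, last byte first
        word[pos + i] = (word[pos + i] & 0xC0) | (shifted & 0x3F)
        shifted >>= 6
    word[pos] = (first & ~mask & 0xFF) | (shifted & mask)
    return nbytes


def shift_all(word, parameter):
    out, pos = bytearray(word), 0
    while pos < len(out):
        pos += _shift_one(out, pos, _addend(parameter))
    return bytes(out)


def shift_first(word, parameter):
    out = bytearray(word)
    if out:
        _shift_one(out, 0, _addend(parameter))
    return bytes(out)
```

For instance, with a PARAMETER of 1, `shift_all(b"abc", 1)` produces `b"bcd"`, while `shift_first(b"abc", 1)` produces `b"bbc"`.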
If an LZ77 dictionary is set, the decoder treats it as a regular LZ77 copy but behaves as if the bytes of this dictionary are accessible as the uncompressed bytes outside of the regular LZ77 window for backwards references.¶
Let LZ77_DICTIONARY_LENGTH be the length of the LZ77 dictionary. Then word_id, described in Section 8 of [RFC7932], is redefined as:¶
word_id = distance - (max allowed distance + 1 + LZ77_DICTIONARY_LENGTH)¶
For the case when LZ77_DICTIONARY_LENGTH is 0, word_id matches the [RFC7932] definition.¶
Let dictionary_address be:¶
LZ77_DICTIONARY_LENGTH + max allowed distance - distance¶
Then distance values of <length, distance> pairs [RFC7932] in range (max allowed distance + 1)..(LZ77_DICTIONARY_LENGTH + max allowed distance) are interpreted as references starting in the LZ77 dictionary at the byte at dictionary_address. If length is longer than (LZ77_DICTIONARY_LENGTH - dictionary_address), then the reference continues to copy (length - LZ77_DICTIONARY_LENGTH + dictionary_address) bytes from the regular LZ77 window starting at the beginning.¶
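The three distance ranges can be summarized with a non-normative Python sketch. The function name, the tuple layout of its result, and the tag strings are illustrative assumptions; only the arithmetic is taken from the definitions above.

```python
def resolve_distance(distance, length, max_allowed_distance, lz77_dictionary_length):
    """Classify a <length, distance> pair when an LZ77 dictionary is set."""
    if distance <= max_allowed_distance:
        # Ordinary backward reference into the regular LZ77 window.
        return ("window", distance)
    if distance <= max_allowed_distance + lz77_dictionary_length:
        dictionary_address = lz77_dictionary_length + max_allowed_distance - distance
        # Bytes available in the dictionary from dictionary_address onward.
        from_dictionary = min(length, lz77_dictionary_length - dictionary_address)
        # Any remainder is copied from the beginning of the regular window.
        from_window_start = length - from_dictionary
        return ("lz77_dictionary", dictionary_address, from_dictionary, from_window_start)
    word_id = distance - (max_allowed_distance + 1 + lz77_dictionary_length)
    return ("static_dictionary", word_id)
```

With a maximum allowed distance of 100 and a 10-byte LZ77 dictionary, distance 101 addresses the last dictionary byte (address 9), distance 110 addresses the first (address 0), and distance 111 is the first static dictionary reference (word_id 0).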
A varint is encoded in base 128 in one or more bytes as follows:¶
    +--------+--------+             +--------+
    |1xxxxxxx|1xxxxxxx| {0-8 times} |0xxxxxxx|
    +--------+--------+             +--------+
where the "x" bits of the first byte are the LSBs of the value and the "x" bits of the last byte are the MSBs of the value. The last byte must have its MSB set to 0; every other byte must have its MSB set to 1 to indicate that another byte follows.
At most 63 bits may be read; if the 9th byte is present and has its MSB set, then the stream must be considered invalid.
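A non-normative Python sketch of varint coding is given below. The function names and the use of `ValueError` for invalid input are assumptions made for this sketch.

```python
def decode_varint(data, pos=0):
    """Return (value, next_pos); raise ValueError on an invalid or truncated varint."""
    value = 0
    for i in range(9):  # at most 9 bytes, i.e., 9 * 7 = 63 bits
        if pos + i >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)  # earlier bytes hold lower bits
        if byte & 0x80 == 0:
            return value, pos + i + 1
    raise ValueError("9th varint byte has its MSB set")


def encode_varint(value):
    """Encode value using the fewest bytes possible."""
    if not 0 <= value < (1 << 63):
        raise ValueError("value does not fit in 63 bits")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte: MSB is 0
            return bytes(out)
```

For example, the value 300 is encoded as the two bytes 0xAC 0x02.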
The shared dictionary stream encodes a custom dictionary for brotli, including custom words and/or custom transformations. A shared dictionary may appear as a standalone or as contents of a resource in a framing format container.¶
A compliant shared brotli dictionary stream must have the following format:¶
NUM_CUSTOM_WORD_LISTS. May have a value of 0 to 64.¶
Prefix/suffix stringlet. NUM_PREFIX_SUFFIX is the number of stringlets parsed and must be in range 1..256.¶
Data for each transform:¶
The DICTIONARY_MAP:¶
Large window brotli allows a sliding window beyond the 24-bit maximum of regular brotli [RFC7932].¶
The compressed data stream is backwards compatible to brotli [RFC7932] and may optionally have the following differences:¶
The following new pattern of 14 bits is supported:¶
A decoder that does not support 64-bit integers may reject a stream if WBITS is higher than 30 or a distance symbol from the distance alphabet is able to encode a distance larger than 2147483644.¶
The format of a shared brotli compressed data stream without a framing format is backwards compatible with brotli [RFC7932] with the following optional differences:¶
A compliant shared brotli framing format stream has the format described below.¶
Container flags that are 8 bits and have the following meanings:¶
CHUNK_TYPE¶
Number of dictionary references. Multiple dictionary references are possible with the following restrictions: there can be 1 serialized dictionary and 15 prefix dictionaries maximum (a serialized dictionary may already contain one of those). Circular references are not allowed (that is, no dictionary reference may directly or indirectly use this chunk itself as a dictionary).
Flags:¶
Dictionary source:¶
Dictionary type:¶
Extra header bytes, depending on CHUNK_TYPE. If present, they are specified in the subsequent sections.¶
The chunk contents. The uncompressed data in the chunk content depends on CHUNK_TYPE and is specified in the subsequent sections. The compressed data has the following format depending on CODEC:
All the metadata chunk types use the following format for the uncompressed content:¶
Code to identify this metadata field. This must be two lowercase or two uppercase alpha ASCII characters. If the decoder encounters a lowercase field that it does not recognize for the current chunk type, non-ASCII characters, or non-alpha characters, the decoder must reject the data stream as invalid. Uppercase codes may be used for custom user metadata and can be ignored by a compliant decoder.¶
Length of the content of this field in bytes, excluding the code bytes and this varint.¶
The last field is reached when the chunk content end is reached. If the length of the last field does not end at the same byte as the end of the uncompressed content of the chunk, the decoder must reject the data stream as invalid.¶
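The field layout can be walked with a non-normative Python sketch such as the one below. It assumes that each field is laid out as a 2-byte code, then a varint length, then the content. The parameter `known_lowercase` is a hypothetical set of lowercase codes that the caller recognizes for the current chunk type; the code `zz` in the usage note is a placeholder and not a code defined by the format.

```python
def _read_varint(data, pos):
    value = 0
    for i in range(9):
        if pos + i >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return value, pos + i + 1
    raise ValueError("invalid varint")


def parse_metadata_fields(content, known_lowercase=frozenset()):
    """Split uncompressed metadata chunk content into (code, value) pairs."""
    fields, pos = [], 0
    while pos < len(content):
        code = content[pos:pos + 2]
        if len(code) < 2:
            raise ValueError("truncated field code")
        is_lower = all(0x61 <= c <= 0x7A for c in code)  # 'a'..'z'
        is_upper = all(0x41 <= c <= 0x5A for c in code)  # 'A'..'Z'
        if not (is_lower or is_upper):
            raise ValueError("field code is not two lowercase or two uppercase letters")
        if is_lower and code not in known_lowercase:
            raise ValueError("unrecognized lowercase field")
        length, pos = _read_varint(content, pos + 2)
        if pos + length > len(content):
            raise ValueError("last field does not end with the chunk content")
        fields.append((code.decode("ascii"), content[pos:pos + length]))
        pos += length
    return fields
```

For example, `parse_metadata_fields(b"XY\x03abc")` returns `[("XY", b"abc")]`, while `parse_metadata_fields(b"zz\x01q")` is rejected unless `b"zz"` is passed in `known_lowercase`. The sketch returns uppercase fields to the caller, which is free to ignore them.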
All bytes in this chunk must be zero except for the initial varint that specifies the remaining chunk length.¶
Since the varint itself takes up bytes as well, when the goal is to introduce a number of padding bytes, the dependence of the length of the varint on the value it encodes must be taken into account.¶
A single byte varint with a value of 0 is a padding chunk of length 1. For more padding, use higher varint values. Do not use multiple shorter padding chunks since this is slower to decode.¶
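The interaction between the varint length and the value it encodes can be illustrated with a non-normative Python sketch that builds a padding chunk occupying an exact number of bytes. The function names are assumptions made for this sketch, and it uses only minimal-length varints.

```python
def _encode_varint(value):
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def padding_chunk(total_length):
    """Return a padding chunk of exactly total_length bytes, or None if not possible."""
    for varint_length in range(1, 10):
        remaining = total_length - varint_length
        if remaining < 0:
            break
        varint = _encode_varint(remaining)
        if len(varint) == varint_length:
            # The varint counts the zero bytes that follow it.
            return varint + bytes(remaining)
    return None
```

With minimal-length varints, some totals have no single-chunk solution. For example, 129 bytes would need either a 1-byte varint encoding 128 or a 2-byte varint encoding 127, and neither value has that encoded length. The sketch returns `None` in such cases rather than guessing how they should be handled.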
This chunk contains metadata that applies to the resource whose beginning is encoded in the subsequent data chunk or first partial data chunk.¶
The contents of this chunk follow the format described in Section 8.3.
The following field types are recognized:¶
Name field. May appear 0 or 1 times. Has the following format:¶
Name in UTF-8 encoding, length determined by the field length. Treated generically but may be used as a filename. If used as a filename, forward slashes '/' should be used as directory separators, relative paths should be used, and filenames ending in a slash with 0-length content in the matching data chunk should be treated as an empty directory.¶
A data chunk contains the actual data of a resource.¶
This chunk has the following extra header bytes:¶
Flags:¶
The uncompressed content bytes of this chunk are the actual data of the resource.¶
This chunk contains partial data of a resource. This is the first chunk in a series containing the entire data of the resource.¶
The format of this chunk is the same as the format of a data chunk (Section 8.4.3) except for the differences noted below.¶
The second bit of flags must be set to 0 and no hash code given.¶
The uncompressed data size is only of this part of the resource, not of the full resource.¶
This chunk contains partial data of a resource and is neither the first nor the last part of the full resource.¶
The format of this chunk is the same as the format of a data chunk (Section 8.4.3) except for the differences noted below.¶
The first and second bits of flags must be set to 0.¶
The uncompressed data size is only of this part of the resource, not of the full resource.¶
This chunk contains the final piece of partial data of a resource.¶
The format of this chunk is the same as the format of a data chunk (Section 8.4.3) except for the differences noted below.¶
The first bit of flags must be set to 0.¶
If a hash code is given, the hash code of the full resource (concatenated from all previous chunks and this chunk) is given in this chunk.¶
The uncompressed data size is only of this part of the resource, not of the full resource.¶
The type of this chunk indicates that there are no further chunks encoding this resource, so the full resource is now known.
This metadata applies to the resource whose encoding ended in the preceding data chunk or last partial data chunk.¶
The contents of this chunk follow the format described in Section 8.3.
There are no lowercase field types defined for footer metadata. Uppercase field types can be used as custom user data.¶
This metadata applies to the whole container instead of a single resource.¶
The contents of this chunk follow the format described in Section 8.3.
There are no lowercase field types defined for global metadata. Uppercase field types can be used as custom user data.¶
These chunks optionally repeat metadata that is interleaved between data chunks. To use these chunks, it is necessary to also read additional information, such as pointers to the original chunks, from the central directory.¶
The contents of this chunk follow the format described in Section 8.3.
This chunk has an extra header byte:¶
This set of chunks is subject to the following restrictions:
The fields contained in this metadata chunk are subject to the following restrictions:
The central directory chunk along with the repeat metadata chunks allow quickly finding and listing compressed resources in the container file.¶
The central directory chunk is always uncompressed and does not have the codec byte. It instead has the following format:¶
Pointer into the file where the repeat metadata chunks are located or 0 if they are not present per chunk listed:¶
The last listed chunk is reached when the end of the contents of the central directory is reached. If the end does not match the last byte of the central directory, the decoder must reject the data stream as invalid.
If present, the central directory must list all data and metadata chunks of all types.¶
The final footer chunk closes the file and is only present if in the initial container header flags bit 2 was set.¶
This chunk has the following content, which is always uncompressed:¶
Size of this entire framing format file, including these bytes themselves, or 0 if this size is not given.¶
A reversed varint has the same format as a varint but its bytes are in reversed order, and it is designed to be parsed from the end of the file towards the beginning.¶
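A non-normative Python sketch of parsing a reversed varint follows. It assumes that the byte closest to the end of the file carries the LSBs of the value, which is the byte a forward varint would store first; the function name and return convention are illustrative.

```python
def decode_reversed_varint(data, end=None):
    """Parse a reversed varint whose last stored byte is data[end - 1].

    Returns (value, start), where start is the index of the varint's
    lowest-addressed byte.
    """
    if end is None:
        end = len(data)
    value = 0
    for i in range(9):
        index = end - 1 - i
        if index < 0:
            raise ValueError("truncated reversed varint")
        byte = data[index]
        value |= (byte & 0x7F) << (7 * i)
        if byte & 0x80 == 0:
            return value, index
    raise ValueError("invalid reversed varint")
```

For example, the value 300, which a forward varint stores as 0xAC 0x02, is stored as 0x02 0xAC when reversed, so a parser starting at the end of the file meets the 0xAC byte first.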
The chunk ordering must follow the rules described below. If the decoder sees otherwise, it must reject the data stream as invalid.¶
Padding chunks may be inserted anywhere, even between chunks for which the rules below say no other chunk types may come in between.¶
Metadata chunks must come immediately before the data chunks of the resource they apply to.¶
Footer metadata chunks must come immediately after the data chunks of the resource they apply to.¶
There may be only 0 or 1 metadata chunks per resource.¶
There may be only 0 or 1 footer metadata chunks per resource.¶
A resource must consist of either 1 data chunk or 1 first partial data chunk, 0 or more middle partial data chunks, and 1 last partial data chunk, in that order.
Repeat metadata chunks must follow the rules of Section 8.4.9.¶
There may be only 0 or 1 central directory chunks.¶
If bit 2 of the container flags is set, there may be only a single resource, no metadata chunks of any type, no central directory, and no final footer.¶
If bit 2 of the container flags is not set, there must be exactly 1 final footer chunk, and it must be the last chunk in the file.¶
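The per-resource part of these rules can be checked with a short, non-normative Python sketch. It covers only padding, metadata, data, partial data, and footer metadata chunks; repeat metadata, the central directory, the final footer, and the container flags are deliberately left out. The chunk kind names and one-letter tags are hypothetical labels chosen for this sketch and are not the numeric CHUNK_TYPE values.

```python
import re

# Hypothetical one-letter tags for the chunk kinds covered by this sketch.
TAGS = {
    "padding": "P",
    "metadata": "M",
    "data": "D",
    "first_partial": "F",
    "middle_partial": "I",
    "last_partial": "L",
    "footer_metadata": "T",
}

# A resource: optional metadata, then either one data chunk or
# first/middle*/last partial chunks, then optional footer metadata.
_RESOURCES = re.compile(r"(?:M?(?:D|FI*L)T?)*")


def resources_well_ordered(chunk_kinds):
    """Return True if the sequence of chunk kinds obeys the per-resource rules."""
    tags = "".join(TAGS[kind] for kind in chunk_kinds)
    tags = tags.replace("P", "")  # padding chunks may appear anywhere
    return _RESOURCES.fullmatch(tags) is not None
```

For example, the sequence metadata, first partial, padding, last partial, footer metadata is accepted, while first partial followed directly by a data chunk is rejected.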
The security considerations for brotli [RFC7932] apply to shared brotli as well.¶
In addition, the same considerations apply to the decoding of new file format streams for shared brotli, including shared dictionaries, the framing format, and the shared brotli format.¶
The dictionary must be treated with the same security precautions as the content because a change to the dictionary can result in a change to the decompressed content.¶
The CRIME attack [CRIME] shows that it's a bad idea to compress data from mixed (e.g., public and private) sources -- the data sources include not only the compressed data but also the dictionaries. For example, if you compress secret cookies using a public-data-only dictionary, you still leak information about the cookies.¶
The dictionary can reveal information about the compressed data, and the reverse also holds: data compressed with the dictionary can reveal the contents of the dictionary when an adversary can control parts of the data to compress and see the compressed size. On the other hand, if the adversary can control the dictionary, the adversary can learn information about the compressed data.
The most robust defense against CRIME is not to compress private data, e.g., sensitive headers like cookies or any content with personally identifiable information (PII). The challenge has been to identify secrets within a vast amount of data to be compressed. Cloudflare uses a regular expression [CLOUDFLARE]. Another idea is to extend existing web template systems (e.g., Soy [SOY]) to allow developers to mark secrets that must not be compressed.¶
A less robust idea, but easier to implement, is to randomize the compression algorithm, i.e., adding randomly generated padding, varying the compression ratio, etc. The tricky part is to find the right balance between cost and security (i.e., on one hand, we don't want to add too much padding because it adds a cost to data, but on the other hand, we don't want to add too little because the adversary can detect a small amount of padding with traffic analysis).¶
Another defense is to not use dictionaries for cross-domain requests and to only use shared brotli for the response when the origin is the same as where the content is hosted (using CORS). This prevents an adversary from using a private dictionary with user secrets to compress content hosted on the adversary's origin. It also helps prevent CRIME attacks that try to benefit from a public dictionary by preventing data compression with dictionaries for requests that do not originate from the host itself.
The content of the dictionary itself should not be affected by external users; allowing adversaries to control the dictionary allows a form of chosen plaintext attack. Instead, only base the dictionary on content you control or generic large-scale content such as a spoken language, and update the dictionary at long intervals (days, not seconds) to prevent fast probing.
The use of HighwayHash [HWYHASH] for dictionary identifiers does not guarantee against collisions in an adversarial environment and is intended to be used for identifying the dictionary within a trusted, known set of dictionaries. In an adversarial environment, users of shared brotli should use another mechanism to validate a negotiated dictionary such as a cryptographically proven secure hash.¶
This document has no IANA actions.¶
The authors would like to thank Robert Obryk for suggesting improvements to the format and the text of the specification.¶