rfc9839.original | rfc9839.txt | |||
---|---|---|---|---|
Network Working Group T. Bray | Internet Engineering Task Force (IETF) T. Bray | |||
Internet-Draft Textuality Services | Request for Comments: 9839 Textuality Services | |||
Intended status: Standards Track P. Hoffman | Category: Standards Track P. Hoffman | |||
Expires: 28 November 2025 ICANN | ISSN: 2070-1721 ICANN | |||
27 May 2025 | August 2025 | |||
Unicode Character Repertoire Subsets | Unicode Character Repertoire Subsets | |||
draft-bray-unichars-15 | ||||
Abstract | Abstract | |||
This document discusses subsets of the Unicode character repertoire | This document discusses subsets of the Unicode character repertoire | |||
for use in protocols and data formats, and specifies three subsets | for use in protocols and data formats and specifies three subsets | |||
recommended for use in IETF specifications. | recommended for use in IETF specifications. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This is an Internet Standards Track document. | |||
provisions of BCP 78 and BCP 79. | ||||
Internet-Drafts are working documents of the Internet Engineering | ||||
Task Force (IETF). Note that other groups may also distribute | ||||
working documents as Internet-Drafts. The list of current Internet- | ||||
Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Further information on | |||
Internet Standards is available in Section 2 of RFC 7841. | ||||
This Internet-Draft will expire on 28 November 2025. | Information about the current status of this document, any errata, | |||
and how to provide feedback on it may be obtained at | ||||
https://www.rfc-editor.org/info/rfc9839. | ||||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2025 IETF Trust and the persons identified as the | Copyright (c) 2025 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
in the Revised BSD License. | ||||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction | |||
1.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Notation | |||
2. Characters and Code Points . . . . . . . . . . . . . . . . . 3 | 2. Characters and Code Points | |||
2.1. Encoding forms . . . . . . . . . . . . . . . . . . . . . 4 | 2.1. Encoding Forms | |||
2.2. Problematic Code Points . . . . . . . . . . . . . . . . . 4 | 2.2. Problematic Code Points | |||
2.2.1. Surrogates . . . . . . . . . . . . . . . . . . . . . 5 | 2.2.1. Surrogates | |||
2.2.2. Control Codes . . . . . . . . . . . . . . . . . . . . 5 | 2.2.2. Control Codes | |||
2.2.3. Noncharacters . . . . . . . . . . . . . . . . . . . . 5 | 2.2.3. Noncharacters | |||
3. Dealing With Problematic Code Points . . . . . . . . . . . . 6 | 3. Dealing with Problematic Code Points | |||
4. Subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 | 4. Subsets | |||
4.1. Unicode Scalars . . . . . . . . . . . . . . . . . . . . . 7 | 4.1. Unicode Scalars | |||
4.2. XML Characters . . . . . . . . . . . . . . . . . . . . . 7 | 4.2. XML Characters | |||
4.3. Unicode Assignables . . . . . . . . . . . . . . . . . . . 8 | 4.3. Unicode Assignables | |||
5. Using Subsets . . . . . . . . . . . . . . . . . . . . . . . . 8 | 5. Using Subsets | |||
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 | 6. IANA Considerations | |||
7. Security Considerations . . . . . . . . . . . . . . . . . . . 9 | 7. Security Considerations | |||
8. Normative References . . . . . . . . . . . . . . . . . . . . 9 | 8. References | |||
9. Informative References . . . . . . . . . . . . . . . . . . . 10 | 8.1. Normative References | |||
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 11 | 8.2. Informative References | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 | Acknowledgements | |||
Authors' Addresses | ||||
1. Introduction | 1. Introduction | |||
Protocols and data formats frequently contain or are made up of | Protocols and data formats frequently contain or are made up of | |||
textual data. Such text is normally composed of Unicode [UNICODE] | textual data. Such text is normally composed of Unicode [UNICODE] | |||
characters, to support use by speakers of many languages. Unicode | characters, to support use by speakers of many languages. Unicode | |||
characters are represented by numeric code points, and the "set of | characters are represented by numeric code points, and the "set of | |||
all Unicode code points" is generally not a good choice for use in | all Unicode code points" is generally not a good choice for use in | |||
text fields. Unicode recognizes different types of code points, not | text fields. Unicode recognizes different types of code points, not | |||
all of which are appropriate in protocols, or even associated with | all of which are appropriate in protocols or even associated with | |||
characters. Therefore, even if the desire is to support "all Unicode | characters. Therefore, even if the desire is to support "all Unicode | |||
characters" a subset of the Unicode code point repertoire should be | characters", a subset of the Unicode code point repertoire should be | |||
specified. Subsets such as those discussed in this document are | specified. Subsets such as those discussed in this document are | |||
appropriate choices when more-specific limitations do not apply. | appropriate choices when more-specific limitations do not apply. | |||
In this document, "subset" means a subset of the Unicode character | In this document, "subset" means a subset of the Unicode character | |||
repertoire. This document specifies subsets that exclude some or all | repertoire. This document specifies subsets that exclude some or all | |||
of the code points that are "problematic" as defined in Section 2.2. | of the code points that are "problematic" as defined in Section 2.2. | |||
Authors should have a way to concisely and exactly reference a stable | Authors should have a way to concisely and exactly reference a stable | |||
specification that identifies which subset a protocol or data format | specification that identifies which subset a protocol or data format | |||
accepts. | accepts. | |||
This document discusses issues that apply in choosing subsets, names | This document discusses issues that apply in choosing subsets, names | |||
two subsets that have been popular in practice, and suggests one new | two subsets that have been popular in practice, and suggests one new | |||
subset. The intended use is to serve as a convenient target for | subset. The intended use is to serve as a convenient target for | |||
cross-reference from other specifications whose authors wish to | cross-reference from other specifications whose authors wish to | |||
exclude problematic code points from the data format or protocol | exclude problematic code points from the data format or protocol | |||
being specified. | being specified. | |||
Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of | |||
code points which cannot be used for interoperable interchange of | code points that cannot be used for interoperable interchange of | |||
Unicode textual data. Dealing with strings, particularly in the | Unicode textual data. Dealing with strings, particularly in the | |||
context of user interfaces, requires addressing language, text | context of user interfaces, requires addressing language, text | |||
rendering direction, alternate representations of the same abstract | rendering direction, alternate representations of the same abstract | |||
character, and so on. These issues, among many others, led to many | character, and so on. These issues, among many others, led to many | |||
efforts by the Unicode Consortium, IETF efforts like [IDN] and | efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | |||
[PRECIS], and W3C internationalization efforts such as [W3C-CHAR]. | and [PRECIS], and internationalization efforts by W3C such as | |||
The results of these efforts should be consulted by anyone engaging | [W3C-CHAR]. The results of these efforts should be consulted by | |||
in such work. | anyone engaging in such work. | |||
1.1. Notation | 1.1. Notation | |||
In this document, the numeric values assigned to Unicode characters | In this document, the numeric values assigned to Unicode characters | |||
are provided in hexadecimal. This document uses Unicode's standard | are provided in hexadecimal. This document uses Unicode's standard | |||
notation of "U+" followed by four or more hexadecimal digits. For | notation of "U+" followed by four or more hexadecimal digits. For | |||
example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black | example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black | |||
Heart), decimal 128,420, is U+1F5A4. | Heart), decimal 128,420, is U+1F5A4. | |||
Groups of numeric values described in Section 4 are given in ABNF | Groups of numeric values described in Section 4 are given in ABNF | |||
[RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather | [RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather | |||
than "U+". | than "U+". | |||
All the numeric ranges in this document are inclusive. | All the numeric ranges in this document are inclusive. | |||
The subsets are described in ABNF. | The subsets are described in ABNF. | |||
2. Characters and Code Points | 2. Characters and Code Points | |||
Definition D9 in section 3.4 of [UNICODE] defines "Unicode codespace" | Definition D9 in Section 3.4 of [UNICODE] defines "Unicode codespace" | |||
as "a range of integers from 0 to 10FFFF_16". Definition D10 defines | as "a range of integers from 0 to 10FFFF_16". Definition D10 defines | |||
"code point" as "Any value in the Unicode codespace". | "code point" as "Any value in the Unicode codespace". | |||
The Unicode Standard's definition of "Unicode character" is | The Unicode Standard's definition of "Unicode character" is | |||
conceptual. However, each Unicode character is assigned a code | conceptual. However, each Unicode character is assigned a code | |||
point, used to represent the characters in computer memory and | point, used to represent the characters in computer memory and | |||
storage systems and, in specifications, to specify allowed subsets. | storage systems and to specify allowed subsets in specifications. | |||
There are 1,114,112 (17 ⨉ 2^16) code points; as of Unicode 16.0 | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | |||
(2024), about 155,000 have been assigned to characters. Since | (2024), about 155,000 have been assigned to characters. Since | |||
unassigned code points regularly become assigned when new characters | unassigned code points regularly become assigned when new characters | |||
are added to Unicode, it is usually not a good practice to specify | are added to Unicode, it is usually not a good practice to specify | |||
that unassigned code points should be avoided. | that unassigned code points should be avoided. | |||
2.1. Encoding forms | 2.1. Encoding Forms | |||
Unicode describes a variety of encoding forms, ways to marshal code | Unicode describes a variety of encoding forms, ways to marshal code | |||
points into byte sequences. A survey of these is beyond the scope of | points into byte sequences. A survey of these is beyond the scope of | |||
this document. However, it is useful to note that "UTF-16" | this document. However, it is useful to note that "UTF-16" | |||
represents each code point with one or two 16-bit chunks, while "UTF- | represents each code point with one or two 16-bit chunks, while "UTF- | |||
8" uses variable-length byte sequences [RFC3629]. | 8" uses variable-length byte sequences [RFC3629]. | |||
The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | |||
says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes | |||
a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies | |||
a single encoding form. UTF-8 is widely used for interoperable data | a single encoding form. UTF-8 is widely used for interoperable data | |||
formats such as JSON, YAML, CBOR, and XML. | formats such as JSON, YAML, CBOR, and XML. | |||
2.2. Problematic Code Points | 2.2. Problematic Code Points | |||
This section classifies as "problematic" all the code points which | This section classifies all the code points that can never represent | |||
can never represent useful text and in some cases can lead to | useful text and, in some cases, can lead to software misbehavior as | |||
software misbehavior. This is a low bar; the PRECIS [RFC8264] | "problematic". This is a low bar; the PRECIS [RFC8264] framework's | |||
framework's "IdentifierClass" and "FreeformClass" exclude many more | "IdentifierClass" and "FreeformClass" exclude many more code points | |||
code points which can cause problems when displayed to humans, in | that can cause problems when displayed to humans, in some cases | |||
some cases presenting security risks. Specifications of fields in | presenting security risks. Specifications of fields in protocols and | |||
protocols and data formats whose contents are designed for display to | data formats whose contents are designed for display to and | |||
and interactions with humans would benefit from careful consideration | interactions with humans would benefit from careful consideration of | |||
of the issues described by PRECIS; its more-restrictive subsets might | the issues described by PRECIS; its more-restrictive subsets might be | |||
be better choices than those specified in this document. | better choices than those specified in this document. | |||
Definition D10a in section 3.4 of [UNICODE] defines seven code point | Definition D10a in Section 3.4 of [UNICODE] defines seven code point | |||
types. Three types of code points are assigned to entities which are | types. Three types of code points are assigned to entities that are | |||
not actually characters or whose value as Unicode characters in text | not actually characters or whose value as Unicode characters in text | |||
fields is questionable: "Surrogate", "Control", and "Noncharacter". | fields is questionable: "Surrogate", "Control", and "Noncharacter". | |||
In this document, "problematic" refers to code points whose type is | In this document, "problematic" refers to code points whose type is | |||
"Surrogate" or "Noncharacter", and to "legacy controls" as defined in | "Surrogate" or "Noncharacter" and to "legacy controls" as defined in | |||
Section 2.2.2.2 below. | Section 2.2.2.2 below. | |||
Unicode's definition D49 concerns the "private-use" type and section | Definition D49 in [UNICODE] concerns the "private-use" type, and | |||
3.5.10 states that they "are considered to be assigned characters". | Section 3.5.10 states that they "are considered to be assigned | |||
Section 23.5 further states that these characters' "use may be | characters". Section 23.5 further states that these characters' "use | |||
determined by private agreement among cooperating users". Because | may be determined by private agreement among cooperating users". | |||
private-use code points may have uses based on private agreements, | Because private-use code points may have uses based on private | |||
this document does not classify them as "problematic". | agreements, this document does not classify them as "problematic". | |||
2.2.1. Surrogates | 2.2.1. Surrogates | |||
A total of 2,048 code points, the range U+D800-U+DFFF, is divided | A total of 2,048 code points, in the range U+D800-U+DFFF, are divided | |||
into two blocks called "high surrogates" and "low surrogates"; | into two blocks called "high surrogates" and "low surrogates"; | |||
collectively the 2,048 code points are referred to as "surrogates". | collectively, the 2,048 code points are referred to as "surrogates". | |||
[UNICODE] section 23.6 specifies how surrogates may be used in | Section 23.6 of [UNICODE] specifies how surrogates may be used in | |||
Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate | Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate | |||
pair represents a code point greater than U+FFFF. | pair represents a code point greater than U+FFFF. | |||
A surrogate which occurs in text encoded in any encoding form other | A surrogate that occurs in text encoded in any encoding form other | |||
than UTF-16 has no meaning. In particular, [UNICODE] section 3.9.3 | than UTF-16 has no meaning. In particular, Section 3.9.3 of | |||
forbids representing a surrogate in UTF-8. | [UNICODE] forbids representing a surrogate in UTF-8. | |||
2.2.2. Control Codes | 2.2.2. Control Codes | |||
Section 23.1 of [UNICODE] introduces the control codes for | Section 23.1 of [UNICODE] introduces the control codes for | |||
compatibility with legacy pre-Unicode standards. They comprise 65 | compatibility with legacy pre-Unicode standards. They comprise 65 | |||
code points in the ranges U+0000-U+001F ("C0 controls") and | code points in the ranges U+0000-U+001F ("C0 controls") and | |||
U+0080-U+009F ("C1 controls"), plus U+007F, "DEL". | U+0080-U+009F ("C1 controls"), plus U+007F, "DEL". | |||
2.2.2.1. Useful Controls | 2.2.2.1. Useful Controls | |||
skipping to change at page 6, line 7 ¶ | skipping to change at line 233 ¶ | |||
asserts repeatedly that they are not designed or used for open | asserts repeatedly that they are not designed or used for open | |||
interchange. | interchange. | |||
Code points are organized into 17 "planes", each containing 2^16 code | Code points are organized into 17 "planes", each containing 2^16 code | |||
points. The last two code points in each plane are noncharacters: | points. The last two code points in each plane are noncharacters: | |||
U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | |||
U+10FFFE, U+10FFFF. | U+10FFFE, U+10FFFF. | |||
The code points in the range U+FDD0-U+FDEF are noncharacters. | The code points in the range U+FDD0-U+FDEF are noncharacters. | |||
3. Dealing With Problematic Code Points | 3. Dealing with Problematic Code Points | |||
[RFC9413], "Maintaining Robust Protocols", provides a thorough | [RFC9413], "Maintaining Robust Protocols", provides a thorough | |||
discussion of strategies for dealing with issues in input data. | discussion of strategies for dealing with issues in input data. | |||
Different types of problematic code points cause different issues. | Different types of problematic code points cause different issues. | |||
Noncharacters and legacy controls are unlikely to cause software | Noncharacters and legacy controls are unlikely to cause software | |||
failures, but they cannot usefully be displayed to humans, and can be | failures, but they cannot usefully be displayed to humans, and they | |||
used in attacks based on attempting to display text that includes | can be used in attacks based on attempting to display text that | |||
them. | includes them. | |||
The behavior of software which encounters surrogates is unpredictable | The behavior of software that encounters surrogates is unpredictable | |||
and differs among programming-language implementations, even between | and differs among programming-language implementations, even between | |||
different API calls in the same language. | different API calls in the same language. | |||
Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence | Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence | |||
which would map to a surrogate is ill-formed. If a specification | that would map to a surrogate is ill-formed. If a specification | |||
requires that input data be encoded with UTF-8, and if all input were | requires that input data be encoded with UTF-8, and if all input were | |||
well-formed, implementors would never have to concern themselves with | well-formed, implementors would never have to concern themselves with | |||
surrogates. | surrogates. | |||
Unfortunately, industry experience teaches that problematic code | Unfortunately, industry experience teaches that problematic code | |||
points, including surrogates, can and do occur in program input where | points, including surrogates, can and do occur in program input where | |||
the source of input data is not controlled by the implementor. In | the source of input data is not controlled by the implementor. In | |||
particular, the specification of JSON allows any code point to appear | particular, the specification of JSON allows any code point to appear | |||
in object member names and string values [RFC8259]. | in object member names and string values [RFC8259]. | |||
For example, the following is a conforming JSON text: | For example, the following is a conforming JSON text: | |||
{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"} | {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"} | |||
The value of the "example" field contains the C0 control NUL, the C1 | The value of the "example" field contains the C0 control NUL, the C1 | |||
control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired | control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired | |||
surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two | surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two | |||
escaped UTF-16 surrogate code points as described in [RFC8259] | escaped UTF-16 surrogate code points as described in Section 7 of | |||
section 7. It is unlikely to be useful as the value of a text field. | [RFC8259]. It is unlikely to be useful as the value of a text field. | |||
That value cannot be serialized into well-formed UTF-8, but the | That value cannot be serialized into well-formed UTF-8, but the | |||
behavior of libraries asked to parse the sample is unpredictable; | behavior of libraries asked to parse the sample is unpredictable; | |||
some will silently parse this and generate an ill-formed UTF-8 | some will silently parse this and generate an ill-formed UTF-8 | |||
string. | string. | |||
Two reasonable options for dealing with problematic input are either | Two reasonable options for dealing with problematic input are either | |||
rejecting text containing problematic code points, or replacing the | rejecting text containing problematic code points or replacing the | |||
problematic code points with placeholders. | problematic code points with placeholders. | |||
Silently deleting an ill-formed part of a string is a known security | Silently deleting an ill-formed part of a string is a known security | |||
risk. Responding to that risk, [UNICODE] section 3.2 recommends | risk. Responding to that risk, Section 3.2 of [UNICODE] recommends | |||
dealing with ill-formed byte sequences by signaling an error, or | dealing with ill-formed byte sequences by signaling an error or | |||
replacing problematic code points, ideally with "�" (U+FFFD, | replacing problematic code points, ideally with "�" (U+FFFD, | |||
REPLACEMENT CHARACTER). | REPLACEMENT CHARACTER). | |||
4. Subsets | 4. Subsets | |||
This section describes three increasingly restrictive subsets that | This section describes three increasingly restrictive subsets that | |||
can be used in specifying acceptable content for text fields in | can be used in specifying acceptable content for text fields in | |||
protocols and data types. Specifications can refer to these subsets | protocols and data types. Specifications can refer to these subsets | |||
by the names "Unicode Scalars", "XML Characters", and "Unicode | by the names "Unicode Scalars", "XML Characters", and "Unicode | |||
Assignables". | Assignables". | |||
4.1. Unicode Scalars | 4.1. Unicode Scalars | |||
Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode | Definition D76 in Section 3.9 of [UNICODE] defines the term "Unicode | |||
scalar value" as "Any Unicode code point except high-surrogate and | scalar value" as "Any Unicode code point except high-surrogate and | |||
low-surrogate code points." | low-surrogate code points". | |||
The "Unicode Scalars" subset can be expressed as an ABNF production: | The "Unicode Scalars" subset can be expressed as an ABNF production: | |||
unicode-scalar = | unicode-scalar = | |||
%x0-D7FF / ; exclude surrogates | %x0-D7FF / ; exclude surrogates | |||
%xE000-10FFFF | %xE000-10FFFF | |||
This subset is the default for CBOR [RFC8949], and has the advantage | This subset is the default for Concise Binary Object Representation | |||
of excluding surrogates. However, it includes legacy controls and | (CBOR) [RFC8949] and has the advantage of excluding surrogates. | |||
noncharacters. | However, it includes legacy controls and noncharacters. | |||
4.2. XML Characters | 4.2. XML Characters | |||
The XML 1.0 Specification [XML], in its grammar production labeled | The XML 1.0 Specification [XML], in its grammar production labeled | |||
"Char", specifies a subset of Unicode code points that excludes | "Char", specifies a subset of Unicode code points that excludes | |||
surrogates, legacy C0 controls, and the noncharacters U+FFFE and | surrogates, legacy C0 controls, and the noncharacters U+FFFE and | |||
U+FFFF. | U+FFFF. | |||
The "XML Characters" subset can be expressed as an ABNF production: | The "XML Characters" subset can be expressed as an ABNF production: | |||
xml-character = | xml-character = | |||
%x9 / %xA / %xD / ; useful controls | %x9 / %xA / %xD / ; useful controls | |||
%x20-D7FF / ; exclude surrogates | %x20-D7FF / ; exclude surrogates | |||
%xE000-FFFD / ; exclude FFFE and FFFF nonchars | %xE000-FFFD / ; exclude FFFE and FFFF nonchars | |||
%x100000-10FFFF | %x10000-10FFFF | |||
While this subset does not exclude all the problematic code points, | While this subset does not exclude all the problematic code points, | |||
the C1 controls are less likely than the C0 controls to appear | the C1 controls are less likely than the C0 controls to appear | |||
erroneously in data, and have not been observed to be a frequent | erroneously in data and have not been observed to be a frequent | |||
source of problems. Also, the noncharacters greater in value than | source of problems. Also, the noncharacters greater in value than | |||
U+FFFF are rarely encountered. | U+FFFF are rarely encountered. | |||
4.3. Unicode Assignables | 4.3. Unicode Assignables | |||
This document defines the "Unicode Assignables" subset as all the | This document defines the "Unicode Assignables" subset as all the | |||
Unicode code points that are not problematic. This, a proper subset | Unicode code points that are not problematic. This, a proper subset | |||
of each of the others, comprises all code points that are currently | of each of the others, comprises all code points that are currently | |||
assigned, excluding legacy control codes, or that might in future be | assigned, excluding legacy control codes, or that might be assigned | |||
assigned. | in the future. | |||
Unicode Assignables can be expressed as an ABNF production: | Unicode Assignables can be expressed as an ABNF production: | |||
unicode-assignable = | unicode-assignable = | |||
%x9 / %xA / %xD / ; useful controls | %x9 / %xA / %xD / ; useful controls | |||
%x20-7E / ; exclude C1 controls and DEL | %x20-7E / ; exclude C1 controls and DEL | |||
%xA0-D7FF / ; exclude surrogates | %xA0-D7FF / ; exclude surrogates | |||
%xE000-FDCF / ; exclude FDD0 nonchars | %xE000-FDCF / ; exclude FDD0 nonchars | |||
%xFDF0-FFFD / ; exclude FFFE and FFFF nonchars | %xFDF0-FFFD / ; exclude FFFE and FFFF nonchars | |||
%x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane) | %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane) | |||
skipping to change at page 8, line 39 ¶ | skipping to change at line 357 ¶ | |||
%x50000-5FFFD / %x60000-6FFFD / | %x50000-5FFFD / %x60000-6FFFD / | |||
%x70000-7FFFD / %x80000-8FFFD / | %x70000-7FFFD / %x80000-8FFFD / | |||
%x90000-9FFFD / %xA0000-AFFFD / | %x90000-9FFFD / %xA0000-AFFFD / | |||
%xB0000-BFFFD / %xC0000-CFFFD / | %xB0000-BFFFD / %xC0000-CFFFD / | |||
%xD0000-DFFFD / %xE0000-EFFFD / | %xD0000-DFFFD / %xE0000-EFFFD / | |||
%xF0000-FFFFD / %x100000-10FFFD | %xF0000-FFFFD / %x100000-10FFFD | |||
5. Using Subsets | 5. Using Subsets | |||
Many IETF specifications rely on well-known data formats such as | Many IETF specifications rely on well-known data formats such as | |||
JSON, I-JSON, CBOR, YAML, and XML. These formats specify default | JSON, Internet JSON (I-JSON), CBOR, YAML, and XML. These formats | |||
subsets. For example, JSON allows object member names and string | specify default subsets. For example, JSON allows object member | |||
values to include any Unicode code point, including all the | names and string values to include any Unicode code point, including | |||
problematic types. | all the problematic types. | |||
A protocol based on JSON can be made more robust and implementor- | A protocol based on JSON can be made more robust and implementor- | |||
friendly by restricting the contents of object member names and | friendly by restricting the contents of object member names and | |||
string values to one of the subsets described in Section 4. | string values to one of the subsets described in Section 4. | |||
Equivalent restrictions are possible for other packaging formats such | Equivalent restrictions are possible for other packaging formats such | |||
as I-JSON, XML, YAML, and CBOR. | as I-JSON, XML, YAML, and CBOR. | |||
Note that escaping techniques such as those in the JSON example in | Note that escaping techniques such as those in the JSON example in | |||
Section 3 cannot be used to circumvent this sort of restriction, | Section 3 cannot be used to circumvent this sort of restriction, | |||
which applies to data content, not textual representation in | which applies to data content, not textual representation in | |||
packaging formats. If a specification restricted a JSON field value | packaging formats. If a specification restricted a JSON field value | |||
to the Unicode Assignables, the example would remain a conforming | to the Unicode Assignables, the example would remain a conforming | |||
JSON Text but the data it represents would not constitute Unicode | JSON text but the data it represents would not constitute Unicode | |||
Assignable code points. | Assignable code points. | |||
6. IANA Considerations | 6. IANA Considerations | |||
This document has no actions for IANA. | This document has no IANA actions. | |||
7. Security Considerations | 7. Security Considerations | |||
Section 3 of this document discusses security issues. | Section 3 of this document discusses security issues. | |||
Unicode Security Considerations [TR36] is a wide-ranging survey of | Unicode Security Considerations [TR36] is a wide-ranging survey of | |||
the issues implementors should consider while writing software to | the issues implementors should consider while writing software to | |||
process Unicode text. Unicode Source Code Handling [TR55] discusses | process Unicode text. Unicode Source Code Handling [TR55] discusses | |||
use of Unicode in programming languages, with a focus on security | use of Unicode in programming languages, with a focus on security | |||
issues. Many of the attacks they discuss are aimed at deceiving | issues. Many of the attacks they discuss are aimed at deceiving | |||
human readers, but vulnerabilities involving issues such as | human readers, but vulnerabilities involving issues such as | |||
surrogates and noncharacters are also covered, and in fact can | surrogates and noncharacters are also covered and, in fact, can | |||
contribute to human-deceiving exploits. | contribute to human-deceiving exploits. | |||
The Security Considerations in Section 12 of [RFC8264] generally | The security considerations in Section 12 of [RFC8264] generally | |||
applies to this document as well. | apply to this document as well. | |||
Note that the Unicode-character subsets specified in this document | Note that the Unicode-character subsets specified in this document | |||
are increasingly restrictive, omitting more and more problematic code | are increasingly restrictive, omitting more and more problematic code | |||
points, and thus should be less and less susceptible to many of these | points, and thus should be less and less susceptible to many of these | |||
exploits. The Section 4.3 subset, "Unicode Assignables", excludes | exploits. The subset in Section 4.3, "Unicode Assignables", excludes | |||
all of these code points. | all of these code points. | |||
8. Normative References | 8. References | |||
8.1. Normative References | ||||
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
Specifications: ABNF", STD 68, RFC 5234, | Specifications: ABNF", STD 68, RFC 5234, | |||
DOI 10.17487/RFC5234, January 2008, | DOI 10.17487/RFC5234, January 2008, | |||
<https://www.rfc-editor.org/info/rfc5234>. | <https://www.rfc-editor.org/info/rfc5234>. | |||
[TR36] The Unicode Consortium, "Unicode Security Considerations", | [TR36] Davis, M., Ed. and M. Suignard, Ed., "Unicode Security | |||
<https://www.unicode.org/reports/tr36/>. Note that this | Considerations", <https://www.unicode.org/reports/tr36/>. | |||
reference is to the latest version of this document, | Note that this reference is to the latest version of this | |||
rather than to a specific release. It is not expected | document, rather than to a specific release. It is not | |||
that future updates will affect the referenced | expected that future updates will affect the referenced | |||
discussions. | discussions. | |||
[TR55] The Unicode Consortium, "Unicode Source Code Handling", | [TR55] Leroy, R., Ed. and M. Davis, Ed., "Unicode Source Code | |||
<https://www.unicode.org/reports/tr55/>. Note that this | Handling", <https://www.unicode.org/reports/tr55/>. Note | |||
reference is to the latest version of this document, | that this reference is to the latest version of this | |||
rather than to a specific release. It is not expected | document, rather than to a specific release. It is not | |||
that future updates will affect the referenced | expected that future updates will affect the referenced | |||
discussions. | discussions. | |||
[UNICODE] The Unicode Consortium, "The Unicode Standard", | [UNICODE] The Unicode Consortium, "The Unicode Standard", | |||
<http://www.unicode.org/versions/latest/>. Note that this | <http://www.unicode.org/versions/latest/>. Note that this | |||
reference is to the latest version of Unicode, rather than | reference is to the latest version of Unicode, rather than | |||
to a specific release. It is not expected that future | to a specific release. It is not expected that future | |||
changes in the Unicode Standard will affect the referenced | changes in the Unicode Standard will affect the referenced | |||
definitions. | definitions. | |||
9. Informative References | 8.2. Informative References | |||
[IDN] "Internationalized Domain Name Working Group", | [IDN] "Internationalized Domain Name Working Group", | |||
<https://datatracker.ietf.org/group/idn/>. | <https://datatracker.ietf.org/group/idn/>. | |||
[PRECIS] "PRECIS Working Group", | [PRECIS] "PRECIS Working Group", | |||
<https://datatracker.ietf.org/group/precis/>. | <https://datatracker.ietf.org/group/precis/>. | |||
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, | Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, | |||
January 1998, <https://www.rfc-editor.org/info/rfc2277>. | January 1998, <https://www.rfc-editor.org/info/rfc2277>. | |||
skipping to change at page 11, line 9 ¶ | skipping to change at line 472 ¶ | |||
<https://www.rfc-editor.org/info/rfc8949>. | <https://www.rfc-editor.org/info/rfc8949>. | |||
[RFC9413] Thomson, M. and D. Schinazi, "Maintaining Robust | [RFC9413] Thomson, M. and D. Schinazi, "Maintaining Robust | |||
Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023, | Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023, | |||
<https://www.rfc-editor.org/info/rfc9413>. | <https://www.rfc-editor.org/info/rfc9413>. | |||
[W3C-CHAR] W3C, "Character encodings: Essential concepts", | [W3C-CHAR] W3C, "Character encodings: Essential concepts", | |||
<https://www.w3.org/International/articles/definitions- | <https://www.w3.org/International/articles/definitions- | |||
characters/>. | characters/>. | |||
[XML] Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F. | [XML] Bray, T., Ed., Paoli, J., Ed., McQueen, C.M., Ed., Maler, | |||
Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth | E., Ed., and F. Yergeau, Ed., "Extensible Markup Language | |||
Edition)", 26 November 2008, | (XML) 1.0 (Fifth Edition)", W3C Recommendation, 26 | |||
November 2008, | ||||
<http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that | <http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that | |||
this reference is to a specific release, based on a | this reference is to a specific release, based on a | |||
history of previous "Edition" releases having changed this | history of previous "Edition" releases having changed this | |||
production. | production. | |||
Acknowledgements | Acknowledgements | |||
Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata | Thanks are due to Guillaume Fortin-Debigaré, who filed an errata | |||
Report against RFC 8259, The JavaScript Object Notation, noting | report against RFC 8259, "The JavaScript Object Notation (JSON) Data | |||
frequent references to "Unicode characters", when in fact the RFC | Interchange Format", noting frequent references to "Unicode | |||
formally specifies the use of Unicode Code Points. | characters", when in fact the RFC formally specifies the use of | |||
Unicode code points. | ||||
Thanks also to Asmus Freytag for careful review and many constructive | Thanks also to Asmus Freytag for careful review and many constructive | |||
suggestions aimed at making the language more consistent with the | suggestions aimed at making the language more consistent with the | |||
structure of the Unicode Standard. | structure of the Unicode Standard. | |||
Thanks also to James Manger for the correctness of the ABNF and JSON | Thanks also to James Manger for the correctness of the ABNF and JSON | |||
samples. | samples. | |||
Thanks also to Addison Phillips and the W3C Internationalization | Thanks also to Addison Phillips and the W3C Internationalization | |||
Working Group for helpful suggestions on language and references. | Working Group for helpful suggestions on language and references. | |||
Thoughtful comments during the many iterations of this draft, which | Thoughtful comments during the many draft versions of this document, | |||
helped tighten up wording and make difficult points clearer, were | which helped tighten up wording and make difficult points clearer, | |||
contributed by Harald Alvestrand, Martin J Dürst, Donald E. | were contributed by Harald Alvestrand, Martin J. Dürst, Donald | |||
Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint- | E. Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint- | |||
Andre, and Rob Sayre. | Andre, and Rob Sayre. | |||
Authors' Addresses | Authors' Addresses | |||
Tim Bray | Tim Bray | |||
Textuality Services | Textuality Services | |||
Email: tbray@textuality.com | Email: tbray@textuality.com | |||
Paul Hoffman | Paul Hoffman | |||
ICANN | ICANN | |||
End of changes. 49 change blocks. | ||||
135 lines changed or deleted | 137 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. |