rfc9839.original.xml   rfc9839.xml 
<?xml version='1.0' encoding='utf-8'?> <?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE rfc []>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" <!DOCTYPE rfc [
ipr="trust200902" <!ENTITY nbsp "&#160;">
docName="draft-bray-unichars-15" <!ENTITY zwsp "&#8203;">
category="std" consensus="true" <!ENTITY nbhy "&#8209;">
submissionType="IETF" tocInclude="true" <!ENTITY wj "&#8288;">
sortRefs="true" ]>
symRefs="true"
version="3">
<front> <rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft
<title abbrev="Unicode Subsets">Unicode Character Repertoire Subsets</title> -bray-unichars-15" number="9839" updates="" obsoletes="" xml:lang="en" category=
"std" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" s
ymRefs="true" version="3">
<author initials="T." surname="Bray" fullname="Tim Bray"> <front>
<organization>Textuality Services</organization> <title abbrev="Unicode Subsets">Unicode Character Repertoire Subsets</title>
<address> <seriesInfo name="RFC" value="9839"/>
<email>tbray@textuality.com</email> <author initials="T." surname="Bray" fullname="Tim Bray">
</address> <organization>Textuality Services</organization>
</author> <address>
<email>tbray@textuality.com</email>
</address>
</author>
<author initials="P." surname="Hoffman" fullname="Paul Hoffman"> <author initials="P." surname="Hoffman" fullname="Paul Hoffman">
<organization>ICANN</organization> <organization>ICANN</organization>
<address> <address>
<email>paul.hoffman@icann.org</email> <email>paul.hoffman@icann.org</email>
</address> </address>
</author> </author>
<date/> <date month="August" year="2025"/>
<keyword>Internet-Draft</keyword>
<area>ART</area>
<!-- [rfced] Please insert any keywords (beyond those that appear in
the title) for use on https://www.rfc-editor.org/search. -->
<keyword>example</keyword>
<abstract> <abstract>
<t>This document discusses subsets of the Unicode character repertoire for use in protocols and data formats, and specifies three subsets recommended for use in IETF specifications.</t> <t>This document discusses subsets of the Unicode character repertoire for use in protocols and data formats and specifies three subsets recommended for use in IETF specifications.</t>
</abstract> </abstract>
</front> </front>
<middle> <middle>
<section anchor="intro" title="Introduction"> <section anchor="intro">
<name>Introduction</name>
<t>Protocols and data formats frequently contain or are made up of textual data. <t>Protocols and data formats frequently contain or are made up of textual data.
Such text is normally composed of Unicode <xref target="UNICODE"/> characters, to support use by speakers of many languages. Such text is normally composed of Unicode <xref target="UNICODE"/> characters, to support use by speakers of many languages.
Unicode characters are represented by numeric code points, and the "set of all Unicode code points" is generally not a good choice for use in text fields. Unicode characters are represented by numeric code points, and the "set of all Unicode code points" is generally not a good choice for use in text fields.
Unicode recognizes different types of code points, not all of which are appropri Unicode recognizes different types of code points, not all of which are appropri
ate in protocols, or even associated with characters. ate in protocols or even associated with characters.
Therefore, even if the desire is to support "all Unicode characters" a subset of Therefore, even if the desire is to support "all Unicode characters", a subset o
the Unicode code point repertoire should be specified. f the Unicode code point repertoire should be specified.
Subsets such as those discussed in this document are appropriate choices when more-specific limitations do not apply.</t> Subsets such as those discussed in this document are appropriate choices when more-specific limitations do not apply.</t>
<t>In this document, "subset" means a subset of the Unicode character repertoire. <t>In this document, "subset" means a subset of the Unicode character repertoire.
This document specifies subsets that exclude some or all of the code points that are "problematic" as defined in <xref target="problematic"/>. This document specifies subsets that exclude some or all of the code points that are "problematic" as defined in <xref target="problematic"/>.
Authors should have a way to concisely and exactly reference a stable specification that identifies which subset a protocol or data format accepts.</t> Authors should have a way to concisely and exactly reference a stable specification that identifies which subset a protocol or data format accepts.</t>
<t>This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset. <t>This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset.
The intended use is to serve as a convenient target for cross-reference from other specifications whose authors wish to exclude problematic code points from the data format or protocol being specified.</t> The intended use is to serve as a convenient target for cross-reference from other specifications whose authors wish to exclude problematic code points from the data format or protocol being specified.</t>
<t>Note that this document only provides guidance on avoiding the use of code points which cannot be used for interoperable interchange of Unicode textual data. <t>Note that this document only provides guidance on avoiding the use of code points that cannot be used for interoperable interchange of Unicode textual data.
Dealing with strings, particularly in the context of user interfaces, requires addressing language, text rendering direction, alternate representations of the same abstract character, and so on. Dealing with strings, particularly in the context of user interfaces, requires addressing language, text rendering direction, alternate representations of the same abstract character, and so on.
These issues, among many others, led to many efforts by the Unicode Consortium, These issues, among many others, led to many efforts by the Unicode Consortium,
IETF efforts like <xref target="IDN"/> and <xref target="PRECIS"/>, efforts by the IETF such as <xref target="IDN"/> and <xref target="PRECIS"/>,
and W3C internationalization efforts such as <xref target="W3C-CHAR"/>. and internationalization efforts by W3C such as <xref target="W3C-CHAR"/>.
The results of these efforts should be consulted by anyone engaging in such work.</t> The results of these efforts should be consulted by anyone engaging in such work.</t>
<section anchor="notation" title="Notation"> <section anchor="notation">
<name>Notation</name>
<t>In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. <t>In this document, the numeric values assigned to Unicode characters are provided in hexadecimal.
This document uses Unicode's standard notation of "U+" followed by four or more hexadecimal digits. This document uses Unicode's standard notation of "U+" followed by four or more hexadecimal digits.
For example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, is U+1F5A4.</t> For example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, is U+1F5A4.</t>
<t>Groups of numeric values described in <xref target="subsets"/> are given in ABNF <xref target="RFC5234"/>. <t>Groups of numeric values described in <xref target="subsets"/> are given in ABNF <xref target="RFC5234"/>.
In ABNF, hexadecimal values are preceded by "%x" rather than "U+".</t> In ABNF, hexadecimal values are preceded by "%x" rather than "U+".</t>
<t>All the numeric ranges in this document are inclusive.</t> <t>All the numeric ranges in this document are inclusive.</t>
<t>The subsets are described in ABNF.</t> <t>The subsets are described in ABNF.</t>
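<t>The notation above can be illustrated with a minimal Python sketch. The helper names u_plus and abnf_hex are hypothetical and are not defined by this document; they only show how a code point's numeric value maps to the "U+" and "%x" forms.</t>

```python
def u_plus(ch: str) -> str:
    """Format one character in U+ notation, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


def abnf_hex(cp: int) -> str:
    """Format a code point value as an ABNF hexadecimal terminal."""
    return f"%x{cp:X}"


# "A" is decimal 65; BLACK HEART is decimal 128420.
print(u_plus("A"), ord("A"))
print(u_plus("\U0001F5A4"), ord("\U0001F5A4"))
print(abnf_hex(0x10FFFF))
```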
</section> </section>
</section> </section>
<section anchor="char-concepts" title="Characters and Code Points"> <section anchor="char-concepts">
<name>Characters and Code Points</name>
<t>Definition D9 in section 3.4 of <xref target="UNICODE"/> defines "Unicode codespace" as "a range of integers from 0 to 10FFFF<sub>16</sub>". <t>Definition D9 in Section 3.4 of <xref target="UNICODE"/> defines "Unicode codespace" as "a range of integers from 0 to 10FFFF<sub>16</sub>".
Definition D10 defines "code point" as "Any value in the Unicode codespace".</t> Definition D10 defines "code point" as "Any value in the Unicode codespace".</t>
<t>The Unicode Standard's definition of "Unicode character" is conceptual. <t>The Unicode Standard's definition of "Unicode character" is conceptual.
However, each Unicode character is assigned a code point, used to represent the However, each Unicode character is assigned a code point, used to represent the
characters in computer memory and storage systems and, in specifications, to spe characters in computer memory and storage systems and to specify allowed subsets
cify allowed subsets.</t> in specifications.</t>
<!--[rfced] FYI: In Section 2, we updated the enlarged "⨉" to "*"
for multiplication as the enlarged "⨉" is not used in the RFC
Series. If "x" (lowercase) is preferred instead of "*", please
let us know.
Original:
There are 1,114,112 (17 ⨉ 2^16) code points;...
Current:
There are 1,114,112 (17 * 2^16) code points;...
-->
<t>There are 1,114,112 (17 * 2<sup>16</sup>) code points; as of Unicode 16.0 (20
24), about 155,000 have been assigned to characters.
<t>There are 1,114,112 (17 ⨉ 2<sup>16</sup>) code points; as of Unicode 16.0 (20 24), about 155,000 have been assigned to characters.
Since unassigned code points regularly become assigned when new characters are added to Unicode, it is usually not a good practice to specify that unassigned code points should be avoided.</t> Since unassigned code points regularly become assigned when new characters are added to Unicode, it is usually not a good practice to specify that unassigned code points should be avoided.</t>
<section anchor="encoding" title="Encoding forms"> <section anchor="encoding">
<name>Encoding Forms</name>
<!--[rfced] Please clarify this sentence - does option A or B capture
the intended meaning, or do you prefer otherwise?
Original:
Unicode describes a variety of encoding forms, ways to marshal code
points into byte sequences.
Perhaps A:
Unicode describes a variety of encoding forms and ways to
marshal code points into byte sequences.
Perhaps B:
Unicode describes a variety of encoding forms that can be used to
marshal code points into byte sequences.
-->
<t>Unicode describes a variety of encoding forms, ways to marshal code points into byte sequences. <t>Unicode describes a variety of encoding forms, ways to marshal code points into byte sequences.
A survey of these is beyond the scope of this document. A survey of these is beyond the scope of this document.
However, it is useful to note that "UTF-16" represents each code point with one or two 16-bit chunks, while "UTF-8" uses variable-length byte sequences <xref target="RFC3629"/>.</t> However, it is useful to note that "UTF-16" represents each code point with one or two 16-bit chunks, while "UTF-8" uses variable-length byte sequences <xref target="RFC3629"/>.</t>
<t>The "IETF Policy on Character Sets and Languages", BCP 18 <xref target="RFC2277"/>, says "Protocols MUST be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single encoding form. <t>The "IETF Policy on Character Sets and Languages", BCP 18 <xref target="RFC2277"/>, says "Protocols <bcp14>MUST</bcp14> be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single encoding form.
UTF-8 is widely used for interoperable data formats such as JSON, YAML, CBOR, and XML.</t> UTF-8 is widely used for interoperable data formats such as JSON, YAML, CBOR, and XML.</t>
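<t>As a rough illustration of the two encoding forms mentioned above, the following Python sketch counts 16-bit units and bytes per code point. The function names are hypothetical; the sketch relies only on Python's built-in "utf-16-le" and "utf-8" codecs.</t>

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 uses for one code point."""
    return len(ch.encode("utf-16-le")) // 2


def utf8_bytes(ch: str) -> int:
    """Number of bytes UTF-8 uses for one code point."""
    return len(ch.encode("utf-8"))


for ch in ("A", "\u00E9", "\u20AC", "\U0001F5A4"):
    print(f"U+{ord(ch):04X}", utf16_units(ch), utf8_bytes(ch))
```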
</section> </section>
<section anchor="problematic" title="Problematic Code Points"> <section anchor="problematic">
<name>Problematic Code Points</name>
<t>This section classifies as "problematic" all the code points which can never <t>This section classifies all the code points that can never represent useful t
represent useful text and in some cases can lead to software misbehavior. ext and, in some cases, can lead to software misbehavior as "problematic".
This is a low bar; the PRECIS <xref target="RFC8264"/> framework's "IdentifierCl This is a low bar; the PRECIS <xref target="RFC8264"/> framework's "IdentifierCl
ass" and "FreeformClass" exclude many more code points which can cause problems ass" and "FreeformClass" exclude many more code points that can cause problems w
when displayed to humans, in some cases presenting security risks. hen displayed to humans, in some cases presenting security risks.
Specifications of fields in protocols and data formats whose contents are designed for display to and interactions with humans would benefit from careful consideration of the issues described by PRECIS; its more-restrictive subsets might be better choices than those specified in this document.</t> Specifications of fields in protocols and data formats whose contents are designed for display to and interactions with humans would benefit from careful consideration of the issues described by PRECIS; its more-restrictive subsets might be better choices than those specified in this document.</t>
<t>Definition D10a in section 3.4 of <xref target="UNICODE"/> defines seven code <t>Definition D10a in Section 3.4 of <xref target="UNICODE"/> defines seven code
point types. point types.
Three types of code points are assigned to entities which are not actually chara Three types of code points are assigned to entities that are not actually charac
cters or whose value as Unicode characters in text fields is questionable: "Surr ters or whose value as Unicode characters in text fields is questionable: "Surro
ogate", "Control", and "Noncharacter". gate", "Control", and "Noncharacter".
In this document, "problematic" refers to code points whose type is "Surrogate" In this document, "problematic" refers to code points whose type is "Surrogate"
or "Noncharacter", and to "legacy controls" as defined in <xref target="legacy-c or "Noncharacter" and to "legacy controls" as defined in <xref target="legacy-co
ontrols"/> below.</t> ntrols"/> below.</t>
<t>Unicode's definition D49 concerns the "private-use" type and section 3.5.10 states that they "are considered to be assigned characters". <t>Definition D49 in <xref target="UNICODE"/> concerns the "private-use" type, and Section 3.5.10 states that they "are considered to be assigned characters".
Section 23.5 further states that these characters' "use may be determined by private agreement among cooperating users". Section 23.5 further states that these characters' "use may be determined by private agreement among cooperating users".
Because private-use code points may have uses based on private agreements, this document does not classify them as "problematic".</t> Because private-use code points may have uses based on private agreements, this document does not classify them as "problematic".</t>
<section anchor="surrogates" title="Surrogates"> <section anchor="surrogates">
<name>Surrogates</name>
<t>A total of 2,048 code points, the range U+D800-U+DFFF, is divided into two bl <t>A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into tw
ocks called "high surrogates" and "low surrogates"; collectively the 2,048 code o blocks called "high surrogates" and "low surrogates"; collectively, the 2,048
points are referred to as "surrogates". code points are referred to as "surrogates".
<xref target="UNICODE"/> section 23.6 specifies how surrogates may be used in Un Section 23.6 of <xref target="UNICODE"/> specifies how surrogates may be used in
icode texts encoded in UTF-16, Unicode texts encoded in UTF-16,
where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.</t> where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.</t>
<t>A surrogate which occurs in text encoded in any encoding form other than UTF- <t>A surrogate that occurs in text encoded in any encoding form other than UTF-1
16 has no meaning. 6 has no meaning.
In particular, <xref target="UNICODE"/> section 3.9.3 forbids representing a sur In particular, Section 3.9.3 of <xref target="UNICODE"/> forbids representing a
rogate in UTF-8.</t> surrogate in UTF-8.</t>
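<t>The surrogate-pair mechanism can be sketched in Python as follows. The function names are hypothetical, and the arithmetic is the standard UTF-16 mapping rather than anything defined by this document. The last helper relies on the fact that Python's strict "utf-8" codec refuses to encode a surrogate.</t>

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into the code point they represent."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_accepts(text: str) -> bool:
    """True if Python's strict UTF-8 encoder accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print([hex(u) for u in to_surrogate_pair(0x1F5A4)])
print(utf8_accepts("\ud800"))
```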
</section> </section>
<section anchor="controls" title="Control Codes"> <section anchor="controls">
<name>Control Codes</name>
<t>Section 23.1 of <xref target="UNICODE"/> introduces the control codes for compatibility with legacy pre-Unicode standards. <t>Section 23.1 of <xref target="UNICODE"/> introduces the control codes for compatibility with legacy pre-Unicode standards.
They comprise 65 code points in the ranges U+0000-U+001F ("C0 controls") and U+0080-U+009F ("C1 controls"), plus U+007F, "DEL".</t> They comprise 65 code points in the ranges U+0000-U+001F ("C0 controls") and U+0080-U+009F ("C1 controls"), plus U+007F, "DEL".</t>
<section anchor="useful-controls" title="Useful Controls"> <section anchor="useful-controls">
<name>Useful Controls</name>
<t>The C0 controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".</t> <t>The C0 controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".</t>
</section> </section>
<section anchor="legacy-controls" title="Legacy Controls"> <section anchor="legacy-controls">
<name>Legacy Controls</name>
<t>Aside from the useful controls, both the C0 and C1 control codes are mostly obsolete and generally lack interoperable semantics. <t>Aside from the useful controls, both the C0 and C1 control codes are mostly obsolete and generally lack interoperable semantics.
This document uses the phrase "legacy controls" to describe control codes that are not useful controls.</t> This document uses the phrase "legacy controls" to describe control codes that are not useful controls.</t>
<t>Because the code points for C0 controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.</t> <t>Because the code points for C0 controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.</t>
</section> </section>
</section> </section>
<section anchor="noncharacters" title="Noncharacters"> <section anchor="noncharacters">
<name>Noncharacters</name>
<t>Certain code points are classified as "noncharacters", and <xref target="UNICODE"/> asserts repeatedly that they are not designed or used for open interchange.</t> <t>Certain code points are classified as "noncharacters", and <xref target="UNICODE"/> asserts repeatedly that they are not designed or used for open interchange.</t>
<t>Code points are organized into 17 "planes", each containing 2<sup>16</sup> code points. <t>Code points are organized into 17 "planes", each containing 2<sup>16</sup> code points.
The last two code points in each plane are noncharacters: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to U+10FFFE, U+10FFFF.</t> The last two code points in each plane are noncharacters: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to U+10FFFE, U+10FFFF.</t>
<t>The code points in the range U+FDD0-U+FDEF are noncharacters.</t> <t>The code points in the range U+FDD0-U+FDEF are noncharacters.</t>
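<t>The two noncharacter rules above can be combined into a single predicate. The following Python sketch uses a hypothetical function name; it tests the U+FDD0-U+FDEF range and the last two code points of each plane, and counting over the whole codespace gives 32 + 34 = 66 noncharacters.</t>

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF exactly when (cp & 0xFFFE) == 0xFFFE.
    return (cp & 0xFFFE) == 0xFFFE


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```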
</section> </section>
</section> </section>
</section> </section>
<section anchor="dealing" title="Dealing With Problematic Code Points"> <section anchor="dealing">
<name>Dealing with Problematic Code Points</name>
<t><xref target="RFC9413"/>, "Maintaining Robust Protocols", provides a thorough discussion of strategies for dealing with issues in input data.</t> <t><xref target="RFC9413"/>, "Maintaining Robust Protocols", provides a thorough discussion of strategies for dealing with issues in input data.</t>
<t>Different types of problematic code points cause different issues. <t>Different types of problematic code points cause different issues.
Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and can be used in attacks based on attempting to display text that includes them.</t> Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and they can be used in attacks based on attempting to display text that includes them.</t>
<t>The behavior of software which encounters surrogates is unpredictable and differs among programming-language implementations, even between different API calls in the same language.</t> <t>The behavior of software that encounters surrogates is unpredictable and differs among programming-language implementations, even between different API calls in the same language.</t>
<t>Section 3.9 of <xref target="UNICODE"/> makes it clear that a UTF-8 byte sequence which would map to a surrogate is ill-formed. <t>Section 3.9 of <xref target="UNICODE"/> makes it clear that a UTF-8 byte sequence that would map to a surrogate is ill-formed.
If a specification requires that input data be encoded with UTF-8, and if all input were well-formed, implementors would never have to concern themselves with surrogates.</t> If a specification requires that input data be encoded with UTF-8, and if all input were well-formed, implementors would never have to concern themselves with surrogates.</t>
<t>Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor. <t>Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor.
In particular, the specification of JSON allows any code point to appear in object member names and string values <xref target="RFC8259"/>.</t> In particular, the specification of JSON allows any code point to appear in object member names and string values <xref target="RFC8259"/>.</t>
<t>For example, the following is a conforming JSON text:</t> <t>For example, the following is a conforming JSON text:</t>
<sourcecode>{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}</sourcecode> <sourcecode type="json"><![CDATA[{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}]]></sourcecode>
<t>The value of the "example" field contains the C0 control NUL, the C1 control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points as described in <xref target="RFC8259"/> section 7. <t>The value of the "example" field contains the C0 control NUL, the C1 control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points as described in <xref target="RFC8259" section="7"/>.
It is unlikely to be useful as the value of a text field. It is unlikely to be useful as the value of a text field.
That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.</t> That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.</t>
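<t>As one data point, the following sketch shows what CPython's standard json module does with the sample text: it accepts the text, keeps the unpaired surrogate as a lone code point in the resulting string, and combines the escaped pair into U+7FFFF. This describes only that one library; as noted above, other parsers may behave differently.</t>

```python
import json

# The JSON text from the example above, kept verbatim in a raw string.
text = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

value = json.loads(text)["example"]
code_points = [f"U+{ord(ch):04X}" for ch in value]
print(code_points)

# The parsed value contains a lone surrogate, so strict UTF-8 encoding fails.
try:
    value.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print("serializable as UTF-8:", utf8_ok)
```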
<t>Two reasonable options for dealing with problematic input are either rejecting text containing problematic code points, or replacing the problematic code points with placeholders.</t> <t>Two reasonable options for dealing with problematic input are either rejecting text containing problematic code points or replacing the problematic code points with placeholders.</t>
<t>Silently deleting an ill-formed part of a string is a known security risk. <t>Silently deleting an ill-formed part of a string is a known security risk.
Responding to that risk, <xref target="UNICODE"/> section 3.2 recommends dealing with ill-formed byte sequences by signaling an error, or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).</t> Responding to that risk, Section 3.2 of <xref target="UNICODE"/> recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).</t>
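<t>The two options, rejecting or replacing, might look like the following Python sketch when applied to already-decoded text. The function names are hypothetical, and the predicate simply restates this document's definition of "problematic" (surrogates, legacy controls, and noncharacters); it is an illustration, not a normative algorithm.</t>

```python
def is_problematic(cp: int) -> bool:
    """Surrogates, legacy controls, and noncharacters, per the text above."""
    if 0xD800 <= cp <= 0xDFFF:                      # surrogates
        return True
    if cp <= 0x1F and cp not in (0x9, 0xA, 0xD):    # legacy C0 controls
        return True
    if 0x7F <= cp <= 0x9F:                          # DEL and C1 controls
        return True
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return True
    return False


def reject_problematic(text: str) -> str:
    """Option 1: signal an error if any problematic code point is present."""
    for ch in text:
        if is_problematic(ord(ch)):
            raise ValueError(f"problematic code point U+{ord(ch):04X}")
    return text


def replace_problematic(text: str) -> str:
    """Option 2: substitute U+FFFD for each problematic code point."""
    return "".join("\uFFFD" if is_problematic(ord(ch)) else ch for ch in text)


print(replace_problematic("a\x00b\ud800c\ufffe") == "a\ufffdb\ufffdc\ufffd")
```

<t>Neither function silently deletes anything: the first fails loudly, and the second keeps one placeholder per replaced code point.</t>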
</section> </section>
<section anchor="subsets" title="Subsets"> <section anchor="subsets">
<name>Subsets</name>
<t>This section describes three increasingly restrictive subsets that can be used in specifying acceptable content for text fields in protocols and data types. <t>This section describes three increasingly restrictive subsets that can be used in specifying acceptable content for text fields in protocols and data types.
Specifications can refer to these subsets by the names "Unicode Scalars", "XML Characters", and "Unicode Assignables".</t> Specifications can refer to these subsets by the names "Unicode Scalars", "XML Characters", and "Unicode Assignables".</t>
<section anchor="scalars" title="Unicode Scalars"> <section anchor="scalars">
<name>Unicode Scalars</name>
<t>Definition D76 in section 3.9 of <xref target="UNICODE"/> defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points."</t> <t>Definition D76 in Section 3.9 of <xref target="UNICODE"/> defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points".</t>
<t>The "Unicode Scalars" subset can be expressed as an ABNF production:</t> <t>The "Unicode Scalars" subset can be expressed as an ABNF production:</t>
<sourcecode> <sourcecode type="abnf"><![CDATA[
unicode-scalar = unicode-scalar =
%x0-D7FF / ; exclude surrogates %x0-D7FF / ; exclude surrogates
%xE000-10FFFF %xE000-10FFFF
</sourcecode> ]]></sourcecode>
<t>This subset is the default for CBOR <xref target="RFC8949"/>, and has the advantage of excluding surrogates. <t>This subset is the default for Concise Binary Object Representation (CBOR) <xref target="RFC8949"/> and has the advantage of excluding surrogates.
However, it includes legacy controls and noncharacters.</t> However, it includes legacy controls and noncharacters.</t>
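<t>The unicode-scalar production translates directly into a range check. The following Python sketch uses a hypothetical function name; the count of 1,112,064 is the full codespace minus the 2,048 surrogates.</t>

```python
def is_unicode_scalar(cp: int) -> bool:
    """Mirror of the unicode-scalar ABNF: %x0-D7FF / %xE000-10FFFF."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


scalar_count = sum(is_unicode_scalar(cp) for cp in range(0x110000))
print(scalar_count)
```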
</section> </section>
<section anchor="xml" title="XML Characters"> <section anchor="xml">
<name>XML Characters</name>
<t>The XML 1.0 Specification <xref target="XML"/>, in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 controls, and the noncharacters U+FFFE and U+FFFF.</t> <t>The XML 1.0 Specification <xref target="XML"/>, in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 controls, and the noncharacters U+FFFE and U+FFFF.</t>
<t>The "XML Characters" subset can be expressed as an ABNF production:</t> <t>The "XML Characters" subset can be expressed as an ABNF production:</t>
<!-- <!--
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
--> -->
<sourcecode>
<!--[rfced] AD: As requested by the authors, we made the following
update within the sourcecode in Section 2 (see the last
line). Please review and provide your approval of this change.
Original:
xml-character =
%x9 / %xA / %xD / ; useful controls
%x20-D7FF / ; exclude surrogates
%xE000-FFFD / ; exclude FFFE and FFFF nonchars
%x100000-10FFFF
Current:
xml-character =
%x9 / %xA / %xD / ; useful controls
%x20-D7FF / ; exclude surrogates
%xE000-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-10FFFF
-->
<sourcecode type="abnf"><![CDATA[
xml-character = xml-character =
%x9 / %xA / %xD / ; useful controls %x9 / %xA / %xD / ; useful controls
%x20-D7FF / ; exclude surrogates %x20-D7FF / ; exclude surrogates
%xE000-FFFD / ; exclude FFFE and FFFF nonchars %xE000-FFFD / ; exclude FFFE and FFFF nonchars
%x100000-10FFFF %x10000-10FFFF
</sourcecode> ]]></sourcecode>
<t>While this subset does not exclude all the problematic code points, the C1 controls are less likely than the C0 controls to appear erroneously in data, and have not been observed to be a frequent source of problems. <t>While this subset does not exclude all the problematic code points, the C1 controls are less likely than the C0 controls to appear erroneously in data and have not been observed to be a frequent source of problems.
Also, the noncharacters greater in value than U+FFFF are rarely encountered.</t> Also, the noncharacters greater in value than U+FFFF are rarely encountered.</t>
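<t>A minimal Python sketch of the xml-character production follows, using the corrected final range %x10000-10FFFF. The function name is hypothetical. The checks at the end illustrate the point made above: C1 controls and noncharacters above U+FFFF remain inside this subset.</t>

```python
def is_xml_character(cp: int) -> bool:
    """Mirror of the xml-character ABNF production."""
    return (
        cp in (0x9, 0xA, 0xD)          # useful controls
        or 0x20 <= cp <= 0xD7FF        # excludes surrogates
        or 0xE000 <= cp <= 0xFFFD      # excludes FFFE and FFFF nonchars
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml_character(0x85))     # a C1 control, still allowed
print(is_xml_character(0x1FFFE))  # a noncharacter above U+FFFF, still allowed
print(is_xml_character(0x0))      # NUL, excluded
```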
</section> </section>
<section anchor="unicode-assignables" title="Unicode Assignables"> <section anchor="unicode-assignables">
<name>Unicode Assignables</name>
<t>This document defines the "Unicode Assignables" subset as all the Unicode code points that are not problematic. <t>This document defines the "Unicode Assignables" subset as all the Unicode code points that are not problematic.
This, a proper subset of each of the others, comprises all code points that are currently assigned, excluding legacy control codes, or that might in future be assigned.</t> This, a proper subset of each of the others, comprises all code points that are currently assigned, excluding legacy control codes, or that might be assigned in the future.</t>
<t>Unicode Assignables can be expressed as an ABNF production:</t> <t>Unicode Assignables can be expressed as an ABNF production:</t>
<sourcecode> <sourcecode type="abnf"><![CDATA[
unicode-assignable = unicode-assignable =
%x9 / %xA / %xD / ; useful controls %x9 / %xA / %xD / ; useful controls
%x20-7E / ; exclude C1 controls and DEL %x20-7E / ; exclude C1 controls and DEL
%xA0-D7FF / ; exclude surrogates %xA0-D7FF / ; exclude surrogates
%xE000-FDCF / ; exclude FDD0 nonchars %xE000-FDCF / ; exclude FDD0 nonchars
%xFDF0-FFFD / ; exclude FFFE and FFFF nonchars %xFDF0-FFFD / ; exclude FFFE and FFFF nonchars
%x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane) %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
%x30000-3FFFD / %x40000-4FFFD / %x30000-3FFFD / %x40000-4FFFD /
%x50000-5FFFD / %x60000-6FFFD / %x50000-5FFFD / %x60000-6FFFD /
%x70000-7FFFD / %x80000-8FFFD / %x70000-7FFFD / %x80000-8FFFD /
%x90000-9FFFD / %xA0000-AFFFD / %x90000-9FFFD / %xA0000-AFFFD /
%xB0000-BFFFD / %xC0000-CFFFD / %xB0000-BFFFD / %xC0000-CFFFD /
%xD0000-DFFFD / %xE0000-EFFFD / %xD0000-DFFFD / %xE0000-EFFFD /
%xF0000-FFFFD / %x100000-10FFFD %xF0000-FFFFD / %x100000-10FFFD
</sourcecode> ]]></sourcecode>
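<t>The unicode-assignable production can be transcribed as a table of inclusive ranges. In the following Python sketch, the names ASSIGNABLE_RANGES and is_unicode_assignable are hypothetical. The total of 1,111,936 code points equals the full codespace (1,114,112) minus 2,048 surrogates, 66 noncharacters, and 62 legacy controls.</t>

```python
# Inclusive ranges from the unicode-assignable ABNF, plane 0 first.
_BMP_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0x7E),                        # excludes C1 controls and DEL
    (0xA0, 0xD7FF),                      # excludes surrogates
    (0xE000, 0xFDCF),                    # excludes FDD0 nonchars
    (0xFDF0, 0xFFFD),                    # excludes FFFE and FFFF nonchars
]
# Planes 1 through 16: everything except the last two code points.
_PLANE_RANGES = [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 17)]

ASSIGNABLE_RANGES = _BMP_RANGES + _PLANE_RANGES


def is_unicode_assignable(cp: int) -> bool:
    """True if cp falls in any range of the unicode-assignable production."""
    return any(lo <= cp <= hi for lo, hi in ASSIGNABLE_RANGES)


assignable_count = sum(hi - lo + 1 for lo, hi in ASSIGNABLE_RANGES)
print(assignable_count)
```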
</section> </section>
</section> </section>
<section anchor="restricting" title="Using Subsets"> <section anchor="restricting">
<name>Using Subsets</name>
<t>Many IETF specifications rely on well-known data formats such as JSON, I-JSON, CBOR, YAML, and XML. <t>Many IETF specifications rely on well-known data formats such as JSON, Internet JSON (I-JSON), CBOR, YAML, and XML.
These formats specify default subsets. These formats specify default subsets.
For example, JSON allows object member names and string values to include any Unicode code point, including all the problematic types.</t> For example, JSON allows object member names and string values to include any Unicode code point, including all the problematic types.</t>
<t>A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to one of the subsets described in <xref target="subsets"/>. <t>A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to one of the subsets described in <xref target="subsets"/>.
Equivalent restrictions are possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.</t> Equivalent restrictions are possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.</t>
<t>Note that escaping techniques such as those in the JSON example in <xref target="dealing"/> cannot be used to circumvent this sort of restriction, which applies to data content, not textual representation in packaging formats. <t>Note that escaping techniques such as those in the JSON example in <xref target="dealing"/> cannot be used to circumvent this sort of restriction, which applies to data content, not textual representation in packaging formats.
If a specification restricted a JSON field value to the Unicode Assignables, the
example would remain a conforming JSON Text but the data it represents would not <!--[rfced] FYI: As requested by the authors, we made the following
constitute Unicode Assignable code points.</t> update in Section 6:
Original:
...the example would remain a conforming JSON Text but...
Current:
...the example would remain a conforming JSON text but...
-->
If a specification restricted a JSON field value to the Unicode Assignables, the
example would remain a conforming JSON text but the data it represents would not
constitute Unicode Assignable code points.</t>
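The point that the restriction applies to data content rather than to the textual representation can be sketched in Python. This is an illustration only: the helper names `is_unicode_assignable` and `find_violations`, the path notation, and the sample JSON text are all assumptions made for this sketch. It relies on the standard `json` module, which decodes a lone `\uDEAD` escape into a surrogate code point rather than rejecting it.

```python
import json


def is_unicode_assignable(cp):
    # Hypothetical helper: compact form of the unicode-assignable ranges.
    if cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0x7E:
        return True
    if 0xA0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFDCF or 0xFDF0 <= cp <= 0xFFFD:
        return True
    return 0x10000 <= cp <= 0x10FFFF and (cp & 0xFFFF) <= 0xFFFD


def find_violations(value, path="$"):
    """Yield (path, code point) for each code point outside the subset.

    Checks both object member names and string values of parsed JSON data.
    """
    if isinstance(value, str):
        for ch in value:
            if not is_unicode_assignable(ord(ch)):
                yield path, ord(ch)
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from find_violations(item, f"{path}[{i}]")
    elif isinstance(value, dict):
        for name, member in value.items():
            yield from find_violations(name, f"{path}.<name>")
            yield from find_violations(member, f"{path}.{name}")


# The JSON text itself is pure ASCII; the escapes hide the actual data.
text = '{"note": "caf\\u00e9", "bad": "\\uDEAD"}'
data = json.loads(text)
violations = list(find_violations(data))
```

Here the text contains only ASCII characters, yet the decoded value of `bad` is a lone surrogate, so the check run on the parsed data reports it. Running the check on the unparsed text would miss it.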
</section> </section>
<section anchor="iana-considerations" title="IANA Considerations"> <section anchor="iana-considerations">
<name>IANA Considerations</name>
<t>This document has no actions for IANA.</t> <t>This document has no IANA actions.</t>
</section> </section>
<section anchor="security-considerations" title="Security Considerations"> <section anchor="security-considerations">
<name>Security Considerations</name>
<t><xref target="dealing"/> of this document discusses security issues.</t> <t><xref target="dealing"/> of this document discusses security issues.</t>
<t>Unicode Security Considerations <xref target="TR36"/> is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text. <t>Unicode Security Considerations <xref target="TR36"/> is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text.
Unicode Source Code Handling <xref target="TR55"/> discusses use of Unicode in programming languages, with a focus on security issues. Unicode Source Code Handling <xref target="TR55"/> discusses use of Unicode in programming languages, with a focus on security issues.
Many of the attacks they discuss are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered, and in fact can contribute to human-deceiving exploits.</t> Many of the attacks they discuss are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered and, in fact, can contribute to human-deceiving exploits.</t>
<t>The Security Considerations in Section 12 of <xref target="RFC8264"/> generally applies to this document as well.</t> <t>The security considerations in <xref target="RFC8264" section="12"/> generally apply to this document as well.</t>
<t>Note that the Unicode-character subsets specified in this document are increasingly restrictive, omitting more and more problematic code points, and thus should be less and less susceptible to many of these exploits. <t>Note that the Unicode-character subsets specified in this document are increasingly restrictive, omitting more and more problematic code points, and thus should be less and less susceptible to many of these exploits.
The <xref target="unicode-assignables"/> subset, "Unicode Assignables", excludes all of these code points.</t> The subset in <xref target="unicode-assignables"/>, "Unicode Assignables", excludes all of these code points.</t>
</section> </section>
</middle> </middle>
<back> <back>
<references title="Normative References"> <references>
<name>References</name>
<references>
<name>Normative References</name>
<reference anchor="UNICODE" target="http://www.unicode.org/versions/latest/"> <reference anchor="UNICODE" target="http://www.unicode.org/versions/latest/">
<front> <front>
<title abbrev="Unicode">The Unicode Standard</title> <title abbrev="Unicode">The Unicode Standard</title>
<author><organization>The Unicode Consortium</organization><address /></author> <author><organization>The Unicode Consortium</organization><address /></author>
</front> </front>
<annotation>Note that this reference is to the latest version of <annotation>Note that this reference is to the latest version of
Unicode, rather than to a specific release. It is not expected that Unicode, rather than to a specific release. It is not expected that
future changes in the Unicode Standard will affect the referenced future changes in the Unicode Standard will affect the referenced
definitions.</annotation> definitions.</annotation>
</reference> </reference>
<reference anchor="TR36" target="https://www.unicode.org/reports/tr36/"> <reference anchor="TR36" target="https://www.unicode.org/reports/tr36/">
<front> <front>
<title abbrev="Unicode Security Considerations">Unicode Security Considerations</title> <title abbrev="Unicode Security Considerations">Unicode Security Considerations</title>
<author><organization>The Unicode Consortium</organization><address /></author> <author fullname="Mark Davis" role="editor"/>
<author fullname="Michel Suignard" role="editor"/>
</front> </front>
<annotation>Note that this reference is to the latest version of <annotation>Note that this reference is to the latest version of
this document, rather than to a specific release. It is not expected that this document, rather than to a specific release. It is not expected that
future updates will affect the referenced discussions.</annotation> future updates will affect the referenced discussions.</annotation>
</reference> </reference>
<reference anchor="TR55" target="https://www.unicode.org/reports/tr55/"> <reference anchor="TR55" target="https://www.unicode.org/reports/tr55/">
<front> <front>
<title abbrev="Unicode Source Code Handling">Unicode Source Code Handling</title> <title abbrev="Unicode Source Code Handling">Unicode Source Code Handling</title>
<author><organization>The Unicode Consortium</organization><address /></author> <author fullname="Robin Leroy" role="editor"/>
<author fullname="Mark Davis" role="editor"/>
</front> </front>
<annotation>Note that this reference is to the latest version of <annotation>Note that this reference is to the latest version of
this document, rather than to a specific release. It is not expected that this document, rather than to a specific release. It is not expected that
future updates will affect the referenced discussions.</annotation> future updates will affect the referenced discussions.</annotation>
</reference> </reference>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5234.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5234.xml" />
</references> </references>
<references title="Informative References"> <references>
<name>Informative References</name>
<reference anchor="IDN" target="https://datatracker.ietf.org/group/idn/"> <reference anchor="IDN" target="https://datatracker.ietf.org/group/idn/">
<front> <front>
<title>Internationalized Domain Name Working Group</title> <title>Internationalized Domain Name Working Group</title>
<author><organization></organization></author><date/> <author><organization></organization></author><date/>
</front> </front>
</reference> </reference>
<reference anchor="PRECIS" target="https://datatracker.ietf.org/group/precis/"> <reference anchor="PRECIS" target="https://datatracker.ietf.org/group/precis/">
<front> <front>
skipping to change at line 361 skipping to change at line 448
<reference anchor="W3C-CHAR" target="https://www.w3.org/International/articles/definitions-characters/"> <reference anchor="W3C-CHAR" target="https://www.w3.org/International/articles/definitions-characters/">
<front> <front>
<title>Character encodings: Essential concepts</title> <title>Character encodings: Essential concepts</title>
<author><organization>W3C</organization></author><date/> <author><organization>W3C</organization></author><date/>
</front> </front>
</reference> </reference>
<reference anchor="XML" target="http://www.w3.org/TR/2008/REC-xml-20081126/"> <reference anchor="XML" target="http://www.w3.org/TR/2008/REC-xml-20081126/">
<front> <front>
<title abbrev="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth Edition)</title> <title abbrev="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth Edition)</title>
<author fullname="Tim Bray" surname="Bray"><organization>Textuality and Netscape</organization></author> <author fullname="Tim Bray" surname="Bray" role="editor"><organization>Textuality and Netscape</organization></author>
<author fullname="Jean Paoli" surname="Paoli"><organization>Microsoft</organization></author> <author fullname="Jean Paoli" surname="Paoli" role="editor"><organization>Microsoft</organization></author>
<author fullname="C.M. Sperberg-McQueen" initials="C.M." surname="McQueen"><organization>W3C</organization></author> <author fullname="C.M. Sperberg-McQueen" initials="C.M." surname="McQueen" role="editor"><organization>W3C</organization></author>
<author fullname="Eve Maler" surname="Maler"><organization>Sun Microsystems, Inc.</organization></author> <author fullname="Eve Maler" surname="Maler" role="editor"><organization>Sun Microsystems, Inc.</organization></author>
<author fullname="François Yergeau" surname="Yergeau"></author> <author fullname="François Yergeau" surname="Yergeau" role="editor"></author>
<date year='2008' month='November' day='26'/> <date year='2008' month='November' day='26'/>
</front> </front>
<refcontent>W3C Recommendation</refcontent>
<annotation>Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.</annotation> <annotation>Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.</annotation>
</reference> </reference>
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2277.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2277.xml" />
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3629.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3629.xml" />
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8259.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8259.xml" />
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8264.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8264.xml" />
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8949.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8949.xml" />
<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9413.xml" /> <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9413.xml" />
</references> </references>
</references>
<section numbered="false" anchor="acknowledgements" title="Acknowledgements"> <section numbered="false" anchor="acknowledgements">
<name>Acknowledgements</name>
<t>Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata Report against RFC 8259, The JavaScript Object Notation, noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode Code Points.</t> <t>Thanks are due to <contact fullname="Guillaume Fortin-Debigaré"/>, who filed an errata report against RFC 8259, "The JavaScript Object Notation (JSON) Data Interchange Format", noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode code points.</t>
<t>Thanks also to Asmus Freytag for careful review and many constructive suggestions aimed at making the language more consistent with the structure of the Unicode Standard.</t> <t>Thanks also to <contact fullname="Asmus Freytag"/> for careful review and many constructive suggestions aimed at making the language more consistent with the structure of the Unicode Standard.</t>
<t>Thanks also to James Manger for the correctness of the ABNF and JSON samples.</t> <t>Thanks also to <contact fullname="James Manger"/> for the correctness of the ABNF and JSON samples.</t>
<t>Thanks also to Addison Phillips and the W3C Internationalization Working Group for helpful suggestions on language and references.</t> <t>Thanks also to <contact fullname="Addison Phillips"/> and the W3C Internationalization Working Group for helpful suggestions on language and references.</t>
<t>Thoughtful comments during the many iterations of this draft, which helped tighten up wording and make difficult points clearer, were contributed by Harald Alvestrand, Martin J Dürst, Donald E. Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint-Andre, and Rob Sayre.</t> <t>Thoughtful comments during the many draft versions of this document, which helped tighten up wording and make difficult points clearer, were contributed by <contact fullname="Harald Alvestrand"/>, <contact fullname="Martin J. Dürst"/>, <contact fullname="Donald E. Eastlake"/>, <contact fullname="John Klensin"/>, <contact fullname="Barry Leiba"/>, <contact fullname="Glyn Normington"/>, <contact fullname="Peter Saint-Andre"/>, and <contact fullname="Rob Sayre"/>.</t>
</section> </section>
</back> </back>
<!-- [rfced] Please review the "type" attribute of each sourcecode element
in the XML file to ensure correctness. If the current list of preferred
values for "type"
(https://www.rfc-editor.org/rpc/wiki/doku.php?id=sourcecode-types)
does not contain an applicable type, then feel free to let us know.
Also, it is acceptable to leave the "type" attribute not set.
-->
<!-- [rfced] Some author comments are present in the XML. Please confirm that
no updates related to these comments are outstanding. Note that the
comments will be deleted prior to publication.
-->
<!-- [rfced] FYI - We have added expansions for the following abbreviations
per Section 3.6 of RFC 7322 ("RFC Style Guide"). Please review each
expansion in the document carefully to ensure correctness.
Concise Binary Object Representation (CBOR)
Internet JSON (I-JSON)
-->
<!-- [rfced] Please review the "Inclusive Language" portion of the online
Style Guide <https://www.rfc-editor.org/styleguide/part2/#inclusive_language>
and let us know if any changes are needed. Updates of this nature typically
result in more precise language, which is helpful for readers.
Note that our script did not flag any words in particular, but this should
still be reviewed as a best practice.
-->
</rfc> </rfc>
 End of changes. 70 change blocks. 
138 lines changed or deleted 269 lines changed or added
