Macaulay Essay Invalid Characters In Xml

This article describes and classifies the Unicode characters that may validly appear in XML.

XML 1.0

Unicode code points in the following ranges are valid in XML 1.0 documents:[1]

  • U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
  • U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
  • U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

The preceding code point ranges contain the following controls, which are valid in XML 1.0 documents but whose use is highly discouraged:

  • U+007F–U+0084, U+0086–U+009F: this includes DEL (U+007F) and all C1 controls except U+0085.
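The XML 1.0 ranges listed above can be expressed as a small predicate. This is only a sketch: the function names `is_xml10_char` and `is_xml10_discouraged_control` are made up for illustration and are not part of any library.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp may appear in an XML 1.0 document."""
    return (
        cp in (0x09, 0x0A, 0x0D)        # the only C0 controls accepted
        or 0x20 <= cp <= 0xD7FF         # BMP below the surrogates
        or 0xE000 <= cp <= 0xFFFD       # BMP above the surrogates, minus U+FFFE/U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # all supplementary planes
    )


def is_xml10_discouraged_control(cp: int) -> bool:
    """Return True for the valid but discouraged controls (U+0085 is excluded)."""
    return 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F


if __name__ == "__main__":
    for cp in (0x09, 0x0B, 0x7F, 0x85, 0xD800, 0xFFFE, 0x1F600):
        print(f"U+{cp:04X}", is_xml10_char(cp), is_xml10_discouraged_control(cp))
```

A checker built this way would typically be run over `ord(ch)` for each character of a string before the string is written into a document.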

XML 1.1

Unicode code points in the following code point ranges are always valid in XML 1.1 documents:[2]

  • U+0001–U+D7FF, U+E000–U+FFFD: this includes most C0 and C1 control characters, but excludes some (not all) non-characters in the BMP (surrogates, U+FFFE and U+FFFF are forbidden);
  • U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

The preceding code point ranges contain the following controls, which may appear in XML 1.1 documents only as character references, and whose use is highly discouraged:

  • U+0001–U+0008, U+000B–U+000C, U+000E–U+001F: this includes most (not all) C0 control characters;
  • U+007F–U+0084, U+0086–U+009F: this includes DEL (U+007F) and all C1 controls except U+0085.
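The XML 1.1 ranges can be sketched in the same way. The helper names below are hypothetical, and `escape_restricted` assumes that a restricted control is written as a hexadecimal character reference; it does not escape `<` or `&`.

```python
def is_xml11_char(cp: int) -> bool:
    """Return True if code point cp is valid somewhere in an XML 1.1 document."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """Return True for the controls whose use is restricted in XML 1.1."""
    return (
        0x01 <= cp <= 0x08
        or 0x0B <= cp <= 0x0C
        or 0x0E <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def escape_restricted(text: str) -> str:
    """Rewrite restricted controls as hexadecimal character references."""
    return "".join(
        f"&#x{ord(ch):X};" if is_xml11_restricted(ord(ch)) else ch
        for ch in text
    )


if __name__ == "__main__":
    print(escape_restricted("a\x01b\x7fc"))
```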

Characters allowed but discouraged

In addition, the following code points, although valid in all XML 1.0 and XML 1.1 documents, are discouraged in both versions of XML because they are permanently assigned to non-characters in Unicode and ISO/IEC 10646. Some XML parsers may even reject them in their character set decoder, and XML documents containing them may not pass through some restricted interfaces or may not be interchangeable. These non-characters can still be encoded in the standard UTFs (such as UTF-8), because the UTFs exclude only the surrogate code points:

  • U+FDD0–U+FDEF
  • U+1FFFE–U+1FFFF, U+2FFFE–U+2FFFF, U+3FFFE–U+3FFFF, U+4FFFE–U+4FFFF, U+5FFFE–U+5FFFF, U+6FFFE–U+6FFFF, U+7FFFE–U+7FFFF, U+8FFFE–U+8FFFF, U+9FFFE–U+9FFFF, U+AFFFE–U+AFFFF, U+BFFFE–U+BFFFF, U+CFFFE–U+CFFFF, U+DFFFE–U+DFFFF, U+EFFFE–U+EFFFF, U+FFFFE–U+FFFFF, U+10FFFE–U+10FFFF.
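The non-character ranges follow a regular pattern: the block U+FDD0–U+FDEF plus the last two code points of every plane. A minimal sketch of that test follows, with hypothetical function names. The UTF-8 calls rely on Python's default strict codec, which rejects only surrogates.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for every Unicode non-character (66 code points in total)."""
    # (cp & 0xFFFE) == 0xFFFE matches xxFFFE and xxFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_discouraged_noncharacter(cp: int) -> bool:
    """Non-characters that XML still accepts (U+FFFE and U+FFFF are forbidden)."""
    return is_noncharacter(cp) and cp not in (0xFFFE, 0xFFFF)


def utf8_encodable(cp: int) -> bool:
    """Return True if Python's strict UTF-8 codec can encode the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(utf8_encodable(0x1FFFE))  # a non-character in plane 1
    print(utf8_encodable(0xD800))   # a surrogate
```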

Note that the code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.

Conversely, the code point U+0085 is a valid control character in Unicode and ISO/IEC 10646, as well as in XML 1.0 and XML 1.1 documents (in all contexts), and its use is not discouraged (it is treated as whitespace in many XML contexts, or as a line-break control similar to U+000D and U+000A in preformatted text in some XML applications).

Non-restricted characters

For these reasons, the non-restricted repertoire, which can be used in all versions of XML and in all contexts permitted by the XML syntax, contains only code points that are permanently assigned to characters or reserved for possible future encoding in Unicode and ISO/IEC 10646. For better interoperability, it excludes the non-characters and the restricted repertoire described above. The non-restricted code points are:

  • U+0009, U+000A, U+000D: these are the only C0 control characters accepted in both XML 1.0 and XML 1.1 (they are treated as whitespace or line-breaks in many contexts);
  • U+0020–U+007E: these are all the non-control characters in the Basic Latin block (the "graphic" subset of US-ASCII); the range excludes the control U+007F;
  • U+0085: this is the only C1 control character accepted without restriction in both XML 1.0 and XML 1.1 (it is treated as whitespace or line-break in many contexts);
  • U+00A0–U+D7FF, U+E000–U+FDCF, U+FDF0–U+FFFD: this includes all the other characters in the BMP, excluding surrogates and all non-characters;
  • U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, U+40000–U+4FFFD, U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, U+80000–U+8FFFD, U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, U+C0000–U+CFFFD, U+D0000–U+DFFFD, U+E0000–U+EFFFD, U+F0000–U+FFFFD, U+100000–U+10FFFD: this excludes all non-characters in supplementary planes.
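The non-restricted repertoire can be written as a table of ranges. The sketch below uses hypothetical names and assumes that the desired behaviour is to drop any other code point silently; a real application might prefer to raise an error or substitute U+FFFD instead.

```python
# Inclusive (first, last) pairs taken from the list above.
NON_RESTRICTED_RANGES = [
    (0x09, 0x0A), (0x0D, 0x0D),
    (0x20, 0x7E),
    (0x85, 0x85),
    (0xA0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
] + [
    # Planes 1 to 16, each without its last two code points.
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def is_non_restricted(cp: int) -> bool:
    """Return True if cp is safe in both XML 1.0 and XML 1.1, in all contexts."""
    return any(first <= cp <= last for first, last in NON_RESTRICTED_RANGES)


def strip_restricted(text: str) -> str:
    """Drop every character outside the non-restricted repertoire."""
    return "".join(ch for ch in text if is_non_restricted(ord(ch)))


if __name__ == "__main__":
    print(repr(strip_restricted("ok\x00\x7f\ufdd0\u00e9")))
```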

For normal text (not markup), there are no special characters except < and &: just make sure your XML Declaration refers to the correct encoding scheme for the language and/or writing system you want to use, and that your computer correctly stores the file using that encoding scheme. See the question on non-Latin characters for a longer explanation.

Apart from the invisible ASCII control characters (the ones you can't type), all other characters are just normal text. Currency signs (€, £, $, ƒ, ₨, Ƀ, and others), all the punctuation (except < and &), and all other letters, signs, and symbols in any language or writing system are just text (assuming you have the correct character encoding).

If your keyboard will not allow you to type the characters you want, or if you want to use characters outside the limits of the encoding scheme you have chosen, you can use a symbolic notation called ‘entity referencing’. Entity references can be numeric, using the decimal or hexadecimal Unicode code point for the character: for example, if your keyboard has no Euro symbol (€) you can type &#8364; (or &#x20AC; in hexadecimal). Or they can be character entity references, using an established set of names which you declare in your DTD: for example, <!ENTITY euro "&#8364;"> lets you use the name &euro; in your document. If you are using a Schema, you must use the numeric form for all except the five below because Schemas have no way to make character entity declarations.
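As a rough illustration of both kinds of reference, the following sketch parses two small documents with Python's standard `xml.etree.ElementTree` module. The element name `price` is invented for the example, and the second document declares the `euro` entity in an internal DTD subset.

```python
import xml.etree.ElementTree as ET

# Numeric references: decimal and hexadecimal forms of U+20AC.
numeric = ET.fromstring("<price>&#8364;10 or &#x20AC;10</price>")

# A named reference, declared in the internal DTD subset of the document.
named = ET.fromstring(
    '<!DOCTYPE price [<!ENTITY euro "&#8364;">]>'
    "<price>&euro;10</price>"
)

if __name__ == "__main__":
    print(numeric.text)
    print(named.text)
```

In both cases the parser hands the application the character itself, so the choice of notation does not affect the parsed text.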

If you use XML with no DTD, then the five character entities listed below are assumed to be predeclared, and you can use them without declaring them separately (indeed, most software prevents you from redeclaring them):

&lt;

The less-than character (<) starts element markup (the first character of a start-tag or an end-tag).

&amp;

The ampersand character (&) starts entity markup (the first character of a character entity reference).

&gt;

The greater-than character (>) ends a start-tag or an end-tag.

&quot;

The double-quote character (") can be symbolised with this character entity reference when you need to embed a double-quote inside a string which is already double-quoted.

&apos;

The apostrophe or single-quote character (') can be symbolised with this character entity reference when you need to embed a single-quote or apostrophe inside a string which is already single-quoted.
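In practice the five predeclared entities are usually produced by a library rather than typed by hand. A minimal sketch using `escape` and `unescape` from Python's standard `xml.sax.saxutils` module follows; the `QUOTES` table and the `note` element are assumptions made for the example. `escape` handles `&`, `<` and `>` by default, and the extra table adds the two quote characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Extra replacements so that the text is also safe inside a quoted attribute.
QUOTES = {'"': "&quot;", "'": "&apos;"}

raw = 'Tom & "Jerry" <3'
escaped = escape(raw, QUOTES)

# The escaped text can be used both as element content and as an attribute value.
root = ET.fromstring(f'<note title="{escaped}">{escaped}</note>')

# unescape reverses the process; it needs the inverse table for the quotes.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})

if __name__ == "__main__":
    print(escaped)
    print(root.get("title"), root.text, restored, sep=" | ")
```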

If you are using a DTD then you must declare all character entities you need to use, so it would be good practice also to declare any of the five above that you plan on using.

There are circumstances where you can use special characters as themselves, such as in CDATA Sections. Most control characters are prohibited in XML: see the Specification for exact details.
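Both points can be checked with a parser. The sketch below uses Python's standard `xml.etree.ElementTree`, which follows XML 1.0 rules; the helper `is_well_formed` and the element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & stand for themselves.
cdata = ET.fromstring("<code><![CDATA[if (a < b && b < c) {}]]></code>")


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(cdata.text)
    print(is_well_formed("<t>\t</t>"))     # tab is an allowed control
    print(is_well_formed("<t>\x01</t>"))   # a literal U+0001
    print(is_well_formed("<t>&#1;</t>"))   # the same control as a numeric reference
```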

There are also no reserved words as such in the user namespace of XML: you can call an element element and an attribute attribute and so on as in the following (perverse) example:

<?xml version="1.0"?>
<!DOCTYPE DOCTYPE SYSTEM "SYSTEM" [
<!ELEMENT DOCTYPE (ELEMENT+)>
<!ATTLIST ELEMENT ATTLIST ENTITY #IMPLIED>
<!NOTATION DOCTYPE SYSTEM "ENTITY">
<!ENTITY NOTATION SYSTEM "ENTITY" NDATA DOCTYPE>
]>
<DOCTYPE>
  <ELEMENT ATTLIST="NOTATION">foo</ELEMENT>
</DOCTYPE>

where the file SYSTEM contains the declaration: <!ELEMENT ELEMENT (#PCDATA)> and the file ENTITY does not even exist ☺

There are keywords like DOCTYPE and IMPLIED which are reserved Names, but they are prefixed by a flag character (the Markup Declaration Open character or the Reserved Name Indicator) so that they cannot be confused with user-specified Names.
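A simpler variant of the same idea, without a DTD, can be checked with Python's standard `xml.etree.ElementTree`. The document below is invented for the illustration and only shows that a parser accepts keyword-like names for elements and attributes.

```python
import xml.etree.ElementTree as ET

# Element and attribute names that look like XML keywords are ordinary Names.
doc = (
    '<element attribute="attribute">'
    '<DOCTYPE ENTITY="IMPLIED">foo</DOCTYPE>'
    "</element>"
)

root = ET.fromstring(doc)
child = root[0]

if __name__ == "__main__":
    print(root.tag, root.attrib)
    print(child.tag, child.attrib, child.text)
```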
