Special characters: ordinary characters requiring special treatment

entity delimiter escaping punctuation special character
tag

Many characters which appear in printed texts (e.g. dashes, digraphs, and accented characters) cannot be typed on a standard keyboard and hence must be treated specially, using an entity reference or a numeric character reference. These are described in separate entries. In addition, however, there are characters which can be typed on a standard keyboard but which still need special treatment because of their special use in markup. Such a character cannot be entered as it normally would be, i.e. as itself, because it would then be interpreted as markup. Instead, it must be entered with a special notation that signals to the processor "treat this as a character, not as markup". The various types of special notation are referred to as escape mechanisms or escapes, and a character entered in this way is said to have been escaped.

For our purposes, there are three cases in which particular characters must be escaped, as follows.

In text content or attribute values

In general, the less-than character ( < ) is used to indicate the start of an XML tag (or comment, or processing instruction, or CDATA marked section), and the ampersand character ( & ) is used to indicate the start of an entity reference or a numeric character reference. Therefore, these two characters must be escaped in text content or attribute values, lest XML software not be able to properly parse the file. Typically they are escaped with entity references (it is also possible to use numeric character references). The entity reference that represents a less-than is &lt;; the entity reference that represents an ampersand is &amp;.
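As an illustration of the rule above, the following fragment shows both characters escaped in text content and in an attribute value. It is only a sketch: the element names, the reg attribute, and the sample wording are invented for the example and are not taken from any particular text or schema.

```xml
<!-- Text content: the printed text reads "Beaumont & Fletcher" -->
<p>Plays by Beaumont &amp; Fletcher</p>

<!-- Text content: the printed text reads "if x < y" -->
<p>if x &lt; y</p>

<!-- Attribute value: the same two characters are escaped in the same way -->
<name reg="Beaumont &amp; Fletcher">B. &amp; F.</name>

<!-- Numeric character references are an equivalent alternative:
     &#38; is the ampersand and &#60; is the less-than character -->
<p>Plays by Beaumont &#38; Fletcher, if x &#60; y</p>
```

In every case the processor delivers the plain character ( & or < ) to whatever reads the text; the escape exists only in the encoded file.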

In an attribute value

Attribute values must be delimited by either single ( ' ) or double ( " ) straight quotation marks. Whichever character is used to delimit a particular attribute value cannot appear as itself within that value, lest the XML software think it is the end of the attribute value; it must therefore be escaped inside that value. Typically these characters are escaped with entity references (it is also possible to use numeric character references). The entity reference that represents a single straight quotation mark, also called an apostrophe, is &apos;; the entity reference that represents a double straight quotation mark is &quot;.

Do not confuse these characters with the left and right double and single quotation marks (i.e., the “, ”, ‘, and ’ characters, which themselves may be, but do not have to be, escaped with the &ldquo;, &rdquo;, &lsquo;, and &rsquo; entity references). Also note that the straight quote characters that need to be escaped in attribute values do not need to be escaped in text content. Thus in the possessive Orinda's, the apostrophe may be typed using the standard keyboard apostrophe.
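The fragment below sketches how the choice of delimiter determines which quotation mark must be escaped. The name element, the reg attribute, and the sample values are assumptions made for the example only.

```xml
<!-- Double-quote delimiters: a double quote inside the value is escaped -->
<name reg="Katherine &quot;Orinda&quot; Philips">Orinda</name>

<!-- Single-quote delimiters: an apostrophe inside the value is escaped -->
<name reg='Orinda&apos;s'>Orinda's</name>

<!-- The quote character that is not the delimiter needs no escaping,
     and neither does the apostrophe in the text content -->
<name reg="Orinda's">Orinda's</name>
```

Where a value contains only one kind of straight quotation mark, choosing the other kind as the delimiter, as in the last line, avoids the need for an escape altogether.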

In a rendition ladder

Characters occurring within rendition ladders which could be confused with the characters used to demarcate the syntax of the rendition ladder must be escaped. In this case, a non-XML escaping mechanism is used, as follows:

  • (, the left parenthesis, should be preceded by a backslash; thus it is encoded within rendition ladders as \(
  • ), the right parenthesis, should be preceded by a backslash; thus it is encoded within rendition ladders as \)
  • \, the backslash, should also be preceded by a backslash; thus it is encoded within rendition ladders as \\

Thus for a stage direction surrounded by parentheses, such as (Enter, stage left), the correct encoding would be:

<stage rend="pre(\() post(\))">Enter, stage left</stage>

Note that these characters do not need to be treated specially when they occur in the body of the text.
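A few further cases, sketched under assumptions: the pre() and post() keywords are taken from the stage direction example above, while the ab and q elements and the sample text are invented for illustration. Because a rendition ladder is itself an attribute value, the XML escapes described earlier presumably still apply inside it alongside the backslash escapes.

```xml
<!-- A backslash that belongs to the rendered text is doubled inside the ladder -->
<ab rend="post(\\)">first line</ab>

<!-- The ladder sits inside an attribute value delimited by double quotes,
     so a straight double quotation mark in the ladder takes the XML escape -->
<q rend="pre(&quot;) post(&quot;)">Enter, stage left</q>

<!-- In the body of the text, parentheses are typed as themselves -->
<stage>(Enter, stage left)</stage>
```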