4 Phases of Serialization

For the XML, HTML, XHTML and Text output methods, serialization comprises five phases of processing (preceded by the sequence normalization process described in 2 Sequence Normalization). For the JSON and Adaptive output methods, serialization is described in 9 JSON Output Method and 10 Adaptive Output Method respectively.

For an implementation-defined output method, any of these phases MAY be skipped or MAY be performed in a different order than is specified here. For the output methods defined in this specification, these phases are carried out sequentially as follows:

  1. A meta element is added to the sequence, and any existing meta element is discarded, as controlled by the include-content-type parameter for the XHTML and HTML output methods. This step is skipped for the other output methods defined by this specification.

  2. Markup generation produces the character representation of those parts of the serialized result that describe the structure of the sequence. In the cases of the XML, HTML and XHTML output methods, this phase produces the character representations of the following:

    • the document type declaration;

    • start tags and end tags (except for attribute values, whose representation is produced by the character expansion phase);

    • processing instructions; and

    • comments.

    In the cases of the XML and XHTML output methods, this phase also produces the following:

    • the XML or text declaration; and

    • empty element tags (except for the attribute values).

    In the case of the text output method, this phase replaces the single document node produced by sequence normalization with a new document node that has exactly one child, which is a text node. The string value of the new text node is the string value of the document node that was produced by sequence normalization.

  3. Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the sequence. For each text and attribute node, the following rules are applied in sequence.

    1. If the node is an attribute that is a URI attribute value and the escape-uri-attributes parameter is set to require escaping of URI attributes, apply URI escaping as defined below, and skip rules 2-5. Otherwise, continue with rule 2.

      [Definition: URI escaping consists of the following three steps applied in sequence to the content of URI attribute values:]

      1. normalize to NFC using the method defined in Section 5.4.6 fn:normalize-unicode FO31;

      2. percent-encode any special characters in the URI using the method defined in Section 6.4 fn:escape-html-uri FO31;

      3. escape according to the rules of the XML or HTML output method, whichever is applicable, any characters that require escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt; (see also section 7.3 Writing Character Data).

      [Definition: The values of attributes listed in D List of URI Attributes are URI attribute values. Attributes are not considered to be URI attributes simply because they are namespace declaration attributes or have the type annotation xs:anyURI.]
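The three URI escaping steps can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the function names are hypothetical, the percent-encoding step assumes that fn:escape-html-uri escapes every character outside printable US-ASCII (code points 32 to 126) as UTF-8 octets, and the final step uses simplified XML-style escaping for a double-quoted attribute value (the HTML output method has different rules).

```python
import unicodedata


def escape_html_uri(value: str) -> str:
    """Percent-encode every character outside printable US-ASCII (32..126).

    Assumption: this mirrors the behaviour of fn:escape-html-uri.
    """
    out = []
    for ch in value:
        if 32 <= ord(ch) <= 126:
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 octet of the character.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def escape_attribute_markup(value: str) -> str:
    """Simplified XML escaping for a double-quoted attribute value."""
    return (
        value.replace("&", "&amp;")  # must run first so later entities survive
        .replace("<", "&lt;")
        .replace('"', "&quot;")
    )


def uri_escape(value: str) -> str:
    """Apply the three URI escaping steps in sequence (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", value)  # step 1
    percent_encoded = escape_html_uri(normalized)     # step 2
    return escape_attribute_markup(percent_encoded)   # step 3


print(uri_escape("cafe\u0301?a=1&b=<2>"))
```

Because step 2 leaves only printable ASCII characters, step 3 in this sketch never meets a character that the selected encoding cannot represent; a real serializer would still need to handle that case for encodings that are not ASCII-compatible.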

    2. If the node is a text node whose parent element is selected by the rules of the cdata-section-elements parameter for the applicable output method, create CDATA sections as described below, and skip rules 3-5. Otherwise, continue with rule 3.

      Apply the following two processes in sequence to create CDATA sections:

      1. Unicode Normalization if requested by the normalization-form parameter.

      2. The application of changes as detailed in the description of the cdata-section-elements parameter for the applicable output method.
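As a rough illustration of the two CDATA processes, the Python sketch below normalizes the text if requested and then wraps it in a CDATA section. The function name and its parameters are hypothetical. It assumes that the only change needed is splitting any occurrence of `]]>` across two sections; it deliberately omits the handling of characters that the selected encoding cannot represent, which the cdata-section-elements rules also cover.

```python
import unicodedata
from typing import Optional


def to_cdata(text: str, normalization_form: Optional[str] = None) -> str:
    """Wrap a text node's content in CDATA sections (partial sketch).

    normalization_form uses Python's form names ("NFC", "NFD", "NFKC",
    "NFKD"); None means no normalization was requested.
    """
    if normalization_form is not None:
        text = unicodedata.normalize(normalization_form, text)
    # Close the open section after "]]" and reopen a new one before ">".
    split = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + split + "]]>"


print(to_cdata("if (a[b[0]]>1) x = '<&>';"))
```

Note that characters such as `<` and `&` pass through unescaped, which is the point of selecting an element with cdata-section-elements.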

    3. Apply character mapping as determined by the use-character-maps parameter for the applicable output method. For characters that were substituted by this process, skip rules 4 and 5. For the remaining characters that were not modified by character mapping, continue with rule 4.

    4. Apply Unicode Normalization if requested by the normalization-form parameter.

      [Definition: Unicode Normalization is the process of removing alternative representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence, as specified in [UAX #15: Unicode Normalization Forms]. For specific recommendations for character normalization on the World Wide Web, see [Character Model for the World Wide Web 1.0: Normalization].]

      The meanings associated with the possible values of the normalization-form parameter are defined in section 5.1.9 XML Output Method: the normalization-form Parameter.

      Continue with rule 5.

    5. Escape according to the rules of the XML or HTML output method, whichever is applicable, any characters (such as < and &) where XML or HTML requires escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt; (see also section 7.3 Writing Character Data). For characters such as > where XML defines a built-in entity but does not require its use in all circumstances, it is implementation-dependent whether the character is escaped.
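The last three character expansion rules (character mapping, Unicode Normalization, escaping) can be sketched for a single text node as shown below. This is an illustrative Python sketch under stated assumptions: `expand_text` and its parameters are hypothetical names, the character map is modelled as a plain dictionary, normalization forms use Python's names rather than the values of the normalization-form parameter, and normalization is applied separately to each run of unmapped characters. The sketch leaves `>` unescaped, which the text above permits in most circumstances, and it does not handle the `]]>` sequence in content.

```python
import unicodedata
from typing import Dict, List, Optional


def expand_text(
    text: str,
    char_map: Dict[str, str],
    normalization_form: Optional[str] = None,
    encoding: str = "utf-8",
) -> str:
    """Apply character mapping, normalization and escaping to a text node."""
    out: List[str] = []
    run: List[str] = []  # consecutive characters untouched by the character map

    def flush_run() -> None:
        if not run:
            return
        chunk = "".join(run)
        run.clear()
        if normalization_form is not None:
            chunk = unicodedata.normalize(normalization_form, chunk)
        for ch in chunk:
            if ch == "<":
                out.append("&lt;")
            elif ch == "&":
                out.append("&amp;")
            else:
                try:
                    ch.encode(encoding)
                    out.append(ch)
                except UnicodeEncodeError:
                    # Not representable in the selected encoding.
                    out.append(f"&#x{ord(ch):X};")

    for ch in text:
        if ch in char_map:
            flush_run()
            # Substituted characters skip normalization and escaping.
            out.append(char_map[ch])
        else:
            run.append(ch)
    flush_run()
    return "".join(out)


print(expand_text("a<b & \u00a0\u20ac", {"\u00a0": "&nbsp;"}, "NFC", "iso-8859-1"))
```

In the example call, the replacement string `&nbsp;` is emitted verbatim because mapped characters bypass escaping, while the euro sign becomes a character reference because ISO-8859-1 cannot represent it.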

  4. Indentation, as controlled by the indent parameter and the suppress-indentation parameter, MAY add or remove whitespace according to the rules defined by the applicable output method.

  5. Encoding, as controlled by the encoding parameter, converts the character stream produced by the previous phases into an octet stream.

    Note:

    Serialization is only defined in terms of encoding the result as a stream of octets. However, a serializer MAY provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not required to support such an option.
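The encoding phase, together with the optional skip described in the note, might look like the following Python sketch. The function name and the `skip_encoding` flag are hypothetical; the flag stands in for whatever implementation-defined option a serializer chooses to offer. The sketch assumes that character expansion has already replaced every character the encoding cannot represent, so strict encoding raises an error otherwise.

```python
from typing import Union


def encode_phase(
    characters: str,
    encoding: str = "UTF-8",
    skip_encoding: bool = False,
) -> Union[str, bytes]:
    """Convert the character stream into an octet stream.

    When skip_encoding is True, the stream of Unicode characters is
    returned unchanged (an implementation-defined option, not required).
    """
    if skip_encoding:
        return characters
    # Strict mode: unrepresentable characters should already have been
    # replaced by character references in the character expansion phase.
    return characters.encode(encoding)


print(encode_phase("caf\u00e9"))
print(encode_phase("caf\u00e9", "ISO-8859-1"))
```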