For the XML, HTML, XHTML and Text output methods, serialization comprises five phases of processing (preceded by the sequence normalization process described in 2 Sequence Normalization). For the JSON and Adaptive output methods, serialization is described in 9 JSON Output Method and 10 Adaptive Output Method respectively.
For an implementation-defined output method, any of these phases MAY be skipped or MAY be performed in a different order than is specified here. For the output methods defined in this specification, these phases are carried out sequentially as follows:
For the XHTML and HTML output methods, a meta element is added to the sequence and an existing meta element is discarded, as controlled by the include-content-type parameter. This step is skipped for the other output methods defined by this specification.
Markup generation produces the character representation of those parts of the serialized result that describe the structure of the sequence. In the cases of the XML, HTML and XHTML output methods, this phase produces the character representations of the following:
the document type declaration;
start tags and end tags (except for attribute values, whose representation is produced by the character expansion phase);
processing instructions; and
comments.
In the cases of the XML and XHTML output methods, this phase also produces the following:
the XML or text declaration; and
empty element tags (except for the attribute values).
In the case of the text output method, this phase replaces the single document node produced by sequence normalization with a new document node that has exactly one child, which is a text node. The string value of the new text node is the string value of the document node that was produced by sequence normalization.
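The text output method case above can be sketched as follows. This is a minimal illustration, not part of the specification: it assumes the normalized document node is available as an XML string, and the function name `text_method_string_value` is hypothetical. It uses Python's standard `xml.etree.ElementTree` to concatenate all descendant text nodes, which is the string value of the document node.

```python
import xml.etree.ElementTree as ET


def text_method_string_value(xml_source: str) -> str:
    """Return the string value of a document given as XML source.

    Hypothetical helper: models the single text node that the text
    output method substitutes for the normalized document node.
    """
    root = ET.fromstring(xml_source)
    # itertext() yields every descendant text node in document order;
    # joining them gives the string value of the document.
    return "".join(root.itertext())


if __name__ == "__main__":
    print(text_method_string_value("<doc><p>Hello, <b>world</b>!</p><p> Bye</p></doc>"))
```

All element structure disappears in the result; only the character content survives into the later phases.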
Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the sequence. For each text and attribute node, the following rules are applied in sequence.
a. If the node is an attribute that is a URI attribute value and the escape-uri-attributes parameter is set to require escaping of URI attributes, apply URI escaping as defined below, and skip rules b-e. Otherwise, continue with rule b.
[Definition: URI escaping consists of the following three steps applied in sequence to the content of URI attribute values:]
normalize to NFC using the method defined in Section 5.4.6 fn:normalize-unicode FO31
percent-encode any special characters in the URI using the method defined in Section 6.4 fn:escape-html-uri FO31
escape according to the rules of the XML or HTML output method, whichever is applicable, any characters that require escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt;. (See also section 7.3 Writing Character Data.)
[Definition: The values of attributes listed in D List of URI Attributes are URI attribute values. Attributes are not considered to be URI attributes simply because they are namespace declaration attributes or have the type annotation xs:anyURI.]
b. If the node is a text node whose parent element is selected by the rules of the cdata-section-elements parameter for the applicable output method, create CDATA sections as described below, and skip rules c-e. Otherwise, continue with rule c.
Apply the following two processes in sequence to create CDATA sections:
Unicode Normalization if requested by the normalization-form parameter.
The application of changes as detailed in the description of the cdata-section-elements parameter for the applicable output method.
c. Apply character mapping as determined by the use-character-maps parameter for the applicable output method. For characters that were substituted by this process, skip rules d and e. For the remaining characters that were not modified by character mapping, continue with rule d.
d. Apply Unicode Normalization if requested by the normalization-form parameter.
[Definition: Unicode Normalization is the process of removing alternative representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence, as specified in [UAX #15: Unicode Normalization Forms]. For specific recommendations for character normalization on the World Wide Web, see [Character Model for the World Wide Web 1.0: Normalization].]
The meanings associated with the possible values of the normalization-form parameter are defined in section 5.1.9 XML Output Method: the normalization-form Parameter.
Continue with rule e.
e. Escape according to the rules of the XML or HTML output method, whichever is applicable, any characters (such as < and &) where XML or HTML requires escaping, and any characters that cannot be represented in the selected encoding. For example, replace < with &lt;. (See also section 7.3 Writing Character Data.)
For characters such as > where XML defines a built-in entity but does not require its use in all circumstances, it is implementation-dependent whether the character is escaped.
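The character expansion rules above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not a conforming serializer: it covers only the XML escaping rules, it chooses to escape > (which the text above leaves implementation-dependent), it normalizes each run of unmapped characters separately, and it omits the CDATA rule. The function names `escape_xml`, `escape_uri_attribute` and `expand_text`, and the representation of a character map as a Python `dict`, are hypothetical.

```python
import unicodedata


def escape_xml(text: str, encoding: str = "utf-8", in_attribute: bool = False) -> str:
    """Final escaping rule: escape markup-significant and unencodable characters."""
    named = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}  # escaping > is a choice
    if in_attribute:
        named['"'] = "&quot;"
    out = []
    for ch in text:
        if ch in named:
            out.append(named[ch])
            continue
        try:
            ch.encode(encoding)  # representable in the selected encoding?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")  # fall back to a character reference
    return "".join(out)


def escape_uri_attribute(value: str, encoding: str = "utf-8") -> str:
    """URI escaping: NFC-normalize, percent-encode, then escape."""
    nfc = unicodedata.normalize("NFC", value)
    # Modelled on fn:escape-html-uri: characters outside printable ASCII
    # (code points 32 to 126) become %HH sequences of their UTF-8 octets.
    percent = "".join(
        ch if 32 <= ord(ch) <= 126
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in nfc
    )
    return escape_xml(percent, encoding, in_attribute=True)


def expand_text(text, char_map=None, normalization_form=None, encoding="utf-8"):
    """Character mapping, normalization and escaping for one text node."""
    char_map = char_map or {}
    out, run = [], []

    def flush():
        # Unmapped characters continue through normalization and escaping.
        if run:
            chunk = "".join(run)
            if normalization_form:
                chunk = unicodedata.normalize(normalization_form, chunk)
            out.append(escape_xml(chunk, encoding))
            run.clear()

    for ch in text:
        if ch in char_map:
            flush()
            out.append(char_map[ch])  # substituted text skips the later rules
        else:
            run.append(ch)
    flush()
    return "".join(out)
```

For instance, `expand_text("a<b\u00a0c", char_map={"\u00a0": "&nbsp;"})` emits the mapped string `&nbsp;` verbatim while still escaping the `<` in the surrounding text.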
Indentation, as controlled by the indent parameter and the suppress-indentation parameter, MAY add or remove whitespace according to the rules defined by the applicable output method.
Encoding, as controlled by the encoding parameter, converts the character stream produced by the previous phases into an octet stream.
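As a rough illustration of the encoding phase, the following Python sketch converts a character stream into octets with the built-in `str.encode`. The function name `encode_phase` is hypothetical, and the sketch assumes that character expansion has already replaced every character the selected encoding cannot represent; otherwise `str.encode` raises `UnicodeEncodeError`.

```python
def encode_phase(characters: str, encoding: str = "utf-8") -> bytes:
    """Convert the character stream from the earlier phases into octets.

    Hypothetical helper: relies on earlier escaping having removed any
    character that the selected encoding cannot represent.
    """
    return characters.encode(encoding)


if __name__ == "__main__":
    # A character reference produced during escaping is plain ASCII,
    # so it encodes without error even under a restrictive encoding.
    print(encode_phase("caf&#xE9;", "ascii"))
```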
Note:
Serialization is only defined in terms of encoding the result as a stream of octets. However, a serializer MAY provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not required to support such an option.