The grammar of XPath 4.0 uses the same simple Extended Backus-Naur Form (EBNF) notation as [XML 1.0] with the following minor differences.
All named symbols have a name that begins with an uppercase letter.
It adds a notation for referring to productions in external specifications.
Comments or extra-grammatical constraints on grammar productions are between '/*' and '*/' symbols.
An 'xgc:' prefix is an extra-grammatical constraint, the details of which are explained in A.1.2 Extra-grammatical Constraints.
A 'ws:' prefix explains the whitespace rules for the production, the details of which are explained in A.2.4 Whitespace Rules.
A 'gn:' prefix means a 'Grammar Note', and is meant as a clarification for parsing rules, and is explained in A.1.3 Grammar Notes. These notes are not normative.
The terminal symbols for this grammar include the quoted strings used in the production rules below, and the terminal symbols defined in section A.2.1 Terminal Symbols.
The EBNF notation is described in more detail in A.1.1 Notation.
[Definition: Each rule in the grammar defines one symbol, using the following format:
symbol ::= expression
]
[Definition: A terminal is a symbol or string or pattern that can appear in the right-hand side of a rule, but never appears on the left-hand side in the main grammar, although it may appear on the left-hand side of a rule in the grammar for terminals.] The following constructs are used to match strings of one or more characters in a terminal:
[a-zA-Z] matches any Char with a value in the range(s) indicated (inclusive).
[abc] matches any Char with a value among the characters enumerated.
[^abc] matches any Char with a value not among the characters given.
"string" matches the sequence of characters that appear inside the double quotes.
'string' matches the sequence of characters that appear inside the single quotes.
[http://www.w3.org/TR/REC-example/#NT-Example] matches any string matched by the production defined in the external specification as per the provided reference.
Patterns (including the above constructs) can be combined with grammatical operators to form more complex patterns, matching more complex sets of character strings. In the examples that follow, A and B represent (sub-)patterns.
(A) means that A is treated as a unit and may be combined as described in this list.
A? matches A or nothing; optional A.
A B matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B matches A or B but not both.
A - B matches any string that matches A but does not match B.
A+ matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A* matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
This section contains constraints on the EBNF productions, which are required to parse syntactically valid sentences. The notes below are referenced from the right side of the production, with the notation: /* xgc: <id> */.
Constraint: leading-lone-slash
A single slash may appear either as a complete path expression or as the first part of a path expression in which it is followed by a RelativePathExpr. In some cases, the next token after the slash is insufficient to allow a parser to distinguish these two possibilities: the * token and keywords like union could be either an operator or a NameTest. For example, without lookahead the first part of the expression / * 5 is easily taken to be a complete expression, / *, which has a very different interpretation (the child nodes of /).
If the token immediately following a slash can form the start of a RelativePathExpr, then the slash must be the beginning of a PathExpr, not the entirety of it.
A single slash may be used as the left-hand argument of an operator by parenthesizing it: (/) * 5. The expression 5 * /, on the other hand, is syntactically valid without parentheses.
Constraint: xml-version
The version of XML and XML Names (e.g. [XML 1.0] and [XML Names], or [XML 1.1] and [XML Names 1.1]) is implementation-defined. It is recommended that the latest applicable version be used (even if it is published later than this specification). The EBNF in this specification links only to the 1.0 versions. Note also that these external productions follow the whitespace rules of their respective specifications, and not the rules of this specification, in particular A.2.4.1 Default Whitespace Handling. Thus prefix : localname is not a syntactically valid lexical QName for purposes of this specification, just as it is not permitted in an XML document. Also, comments are not permissible on either side of the colon. Extra-grammatical constraints such as well-formedness constraints must also be taken into account.
Constraint: reserved-function-names
Unprefixed function names spelled the same way as language keywords could make the language impossible to parse. For instance, element(foo) could be taken either as a FunctionCall or as an ElementTest. Therefore, an unprefixed function name must not be any of the names in A.3 Reserved Function Names.
A function named "if" can be called by binding its namespace to a prefix and using the prefixed form: "library:if(foo)" instead of "if(foo)".
Constraint: occurrence-indicators
As written, the grammar in A XPath 4.0 Grammar is ambiguous for some forms using the '+' and '*' occurrence indicators. The ambiguity is resolved as follows: these operators are tightly bound to the SequenceType expression, and have higher precedence than other uses of these symbols. Any occurrence of '+' and '*', as well as '?', following a sequence type is assumed to be an occurrence indicator, which binds to the last ItemType in the SequenceType.
Thus, 4 treat as item() + - 5 must be interpreted as (4 treat as item()+) - 5, taking the '+' as an OccurrenceIndicator and the '-' as a subtraction operator. To force the interpretation of "+" as an addition operator (and the corresponding interpretation of the "-" as a unary minus), parentheses may be used: the form (4 treat as item()) + -5 surrounds the SequenceType expression with parentheses and leads to the desired interpretation.
function () as xs:string * is interpreted as function () as (xs:string *), not as (function () as xs:string) *. Parentheses can be used as shown to force the latter interpretation.
This rule has as a consequence that certain forms which would otherwise be syntactically valid and unambiguous are not recognized: in "4 treat as item() + 5", the "+" is taken as an OccurrenceIndicator, and not as an operator, which means this is not a syntactically valid expression.
This section contains general notes on the EBNF productions, which may be helpful in understanding how to interpret and implement the EBNF. These notes are not normative. The notes below are referenced from the right side of the production, with the notation: /* gn: <id> */.
Note:
Look-ahead is required to distinguish a FunctionCall from an EQName or keyword followed by a Comment. For example: address (: this may be empty :) may be mistaken for a call to a function named "address" unless this lookahead is employed. Another example is for (: whom the bell :) $tolls in 3 return $tolls, where the keyword "for" must not be mistaken for a function name.
Comments are allowed everywhere that ignorable whitespace is allowed, and the Comment symbol does not explicitly appear on the right-hand side of the grammar (except in its own production). See A.2.4.1 Default Whitespace Handling.
A comment can contain nested comments, as long as all "(:" and ":)" patterns are balanced, no matter where they occur within the outer comment.
Note:
Lexical analysis may typically handle nested comments by incrementing a counter for each "(:" pattern, and decrementing the counter for each ":)" pattern. The comment does not terminate until the counter is back to zero.
Some illustrative examples:
(: commenting out a (: comment :) may be confusing, but often helpful :) is a syntactically valid Comment, since balanced nesting of comments is allowed.
"this is just a string :)" is a syntactically valid expression. However, (: "this is just a string :)" :) will cause a syntax error. Likewise, "this is another string (:" is a syntactically valid expression, but (: "this is another string (:" :) will cause a syntax error. It is a limitation of nested comments that literal content can cause unbalanced nesting of comments.
for (: set up loop :) $i in $x return $i is syntactically valid, ignoring the comment.
5 instance (: strange place for a comment :) of xs:integer is also syntactically valid.
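The counter technique described in the note above can be sketched as follows. This is a minimal illustration in Python, not a normative algorithm; the function name `scan_comment` and its signature are assumptions made for this example. It deliberately ignores string literals inside the comment, which is why the examples above that embed an unbalanced "(:" or ":)" in a quoted string fail.

```python
def scan_comment(text: str, start: int = 0) -> int:
    """Return the index just past the Comment that begins at text[start].

    Hypothetical helper: raises ValueError if no comment starts at
    `start` or if the comment is never closed.
    """
    if not text.startswith("(:", start):
        raise ValueError("no comment starts at this position")
    depth = 0
    i = start
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "(:":
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif pair == ":)":
            depth -= 1          # leaving one nesting level
            i += 2
            if depth == 0:
                return i        # outermost comment is closed
        else:
            i += 1
    raise ValueError("unterminated comment")
```

For the input `(: "this is just a string :)" :)` the function returns a position before the end of the text, because the counter reaches zero at the ":)" inside the quoted string; the remaining characters are what a parser would then reject.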
The terminal symbols assumed by the grammar above are described in this section.
Quoted strings appearing in production rules are terminal symbols.
Other terminal symbols are defined in A.2.1 Terminal Symbols.
Some productions are defined by reference to the XML and XML Names specifications (e.g. [XML 1.0] and [XML Names], or [XML 1.1] and [XML Names 1.1]). A host language may choose which version of these specifications is used; it is recommended that the latest applicable version be used (even if it is published later than this specification).
A host language may choose whether the lexical rules of [XML 1.0] and [XML Names] are followed, or alternatively, the lexical rules of [XML 1.1] and [XML Names 1.1] are followed.
When tokenizing, the longest possible match that is consistent with the EBNF is used.
All keywords are case sensitive. Keywords are not reserved—that is, any lexical QName may duplicate a keyword except as noted in A.3 Reserved Function Names.
[136] IntegerLiteral ::= Digits
[137] DecimalLiteral ::= ("." Digits) | (Digits "." [0-9]*) /* ws: explicit */
[138] DoubleLiteral ::= (("." Digits) | (Digits ("." [0-9]*)?)) [eE] [+-]? Digits /* ws: explicit */
[139] StringLiteral ::= ('"' (EscapeQuot | [^"])* '"') | ("'" (EscapeApos | [^'])* "'") /* ws: explicit */
[140] URIQualifiedName ::= BracedURILiteral NCName /* ws: explicit */
[141] BracedURILiteral ::= "Q" "{" [^{}]* "}" /* ws: explicit */
[142] EscapeQuot ::= '""'
[143] EscapeApos ::= "''"
[144] Comment ::= "(:" (CommentContents | Comment)* ":)" /* ws: explicit */ /* gn: comments */
[145] QName ::= [http://www.w3.org/TR/REC-xml-names/#NT-QName]Names /* xgc: xml-version */
[146] NCName ::= [http://www.w3.org/TR/REC-xml-names/#NT-NCName]Names /* xgc: xml-version */
[147] Char ::= [http://www.w3.org/TR/REC-xml#NT-Char]XML /* xgc: xml-version */
The following symbols are used only in the definition of terminal symbols; they are not terminal symbols in the grammar of A.1 EBNF.
[148] Digits ::= [0-9]+
[149] CommentContents ::= (Char+ - (Char* ('(:' | ':)') Char*))
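As an illustration only, the numeric and string literal productions above can be transcribed into Python regular expressions, and the longest-match rule applied by trying each numeric pattern at the same position and keeping the longest result. The names `longest_numeric`, `NUMERIC_PATTERNS` and `STRING_LITERAL` are assumptions of this sketch and do not appear in the specification.

```python
import re

# Direct transcriptions of the productions; Digits ::= [0-9]+
DIGITS = r"[0-9]+"
INTEGER = DIGITS
DECIMAL = rf"(?:\.{DIGITS}|{DIGITS}\.[0-9]*)"
DOUBLE = rf"(?:\.{DIGITS}|{DIGITS}(?:\.[0-9]*)?)[eE][+-]?{DIGITS}"

# StringLiteral, with EscapeQuot ("") and EscapeApos ('') tried first.
STRING_LITERAL = re.compile(r'''"(?:""|[^"])*"|'(?:''|[^'])*\'''')

NUMERIC_PATTERNS = [
    ("DoubleLiteral", re.compile(DOUBLE)),
    ("DecimalLiteral", re.compile(DECIMAL)),
    ("IntegerLiteral", re.compile(INTEGER)),
]


def longest_numeric(text, pos=0):
    """Return (production name, end offset) of the longest numeric
    literal starting at `pos`, or None if no literal starts there."""
    best = None
    for name, pattern in NUMERIC_PATTERNS:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best
```

A call such as `longest_numeric("1.5e3 + 2")` selects the DoubleLiteral reading because it consumes more characters than the DecimalLiteral `1.5` or the IntegerLiteral `1`.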
XPath 4.0 expressions consist of terminal symbols and symbol separators.
Terminal symbols that are not used exclusively in /* ws: explicit */ productions are of two kinds: delimiting and non-delimiting.
[Definition: The delimiting terminal symbols are: "!", "!!", "!=", StringLiteral, "#", "$", "(", ")", "*", "*:", "+", (comma), "-", "->", (dot), "..", "/", "//", (colon), ":*", "::", ":=", "<", "<<", "<=", "=", "=>", ">", ">=", ">>", "?", "??", "@", BracedURILiteral, "[", "]", "{", "|", "||", "}" ]
[Definition: The non-delimiting terminal symbols are: IntegerLiteral, URIQualifiedName, NCName, DecimalLiteral, DoubleLiteral, QName, "ancestor", "ancestor-or-self", "and", "array", "as", "attribute", "cast", "castable", "child", "comment", "descendant", "descendant-or-self", "div", "document-node", "element", "else", "empty-sequence", "enum", "eq", "every", "except", "following", "following-sibling", "for", "function", "ge", "gt", "idiv", "if", "in", "instance", "intersect", "is", "item", "le", "let", "lt", "map", "member", "mod", "namespace", "namespace-node", "ne", "node", "of", "or", "otherwise", "parent", "preceding", "preceding-sibling", "processing-instruction", "record", "return", "satisfies", "schema-attribute", "schema-element", "self", "some", "text", "then", "to", "treat", "union", "with" ]
[Definition: Whitespace and Comments function as symbol separators. For the most part, they are not mentioned in the grammar, and may occur between any two terminal symbols mentioned in the grammar, except where that is forbidden by the /* ws: explicit */ annotation in the EBNF, or by the /* xgc: xml-version */ annotation.]
One or more symbol separators are required between two consecutive terminal symbols T and U (where T precedes U) when any of the following is true:
T and U are both non-delimiting terminal symbols.
T is a QName or an NCName and U is "." or "-".
T is a numeric literal and U is ".", or vice versa.
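The three conditions above can be expressed as a small predicate. The following Python sketch assumes a tokenizer that labels each terminal with a kind; the `Terminal` type and the kind labels ("keyword", "delimiter", and the production names) are hypothetical conventions chosen for this example.

```python
from typing import NamedTuple


class Terminal(NamedTuple):
    kind: str   # assumed label, e.g. "NCName", "IntegerLiteral", "keyword", "delimiter"
    text: str   # the characters of the terminal


NUMERIC = {"IntegerLiteral", "DecimalLiteral", "DoubleLiteral"}
NAMES = {"QName", "NCName"}
NON_DELIMITING = NUMERIC | NAMES | {"URIQualifiedName", "keyword"}


def separator_required(t: Terminal, u: Terminal) -> bool:
    """True if whitespace or a comment must appear between T and U."""
    # T and U are both non-delimiting terminal symbols.
    if t.kind in NON_DELIMITING and u.kind in NON_DELIMITING:
        return True
    # T is a QName or an NCName and U is "." or "-".
    if t.kind in NAMES and u.text in {".", "-"}:
        return True
    # T is a numeric literal and U is ".", or vice versa.
    if (t.kind in NUMERIC and u.text == ".") or (t.text == "." and u.kind in NUMERIC):
        return True
    return False
```

For example, `separator_required(Terminal("IntegerLiteral", "10"), Terminal("keyword", "div"))` is true, which corresponds to the rule that 10div 3 is a syntax error.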
The host language must specify whether the XPath 4.0 processor normalizes all line breaks on input, before parsing, and if it does so, whether it uses the rules of [XML 1.0] or [XML 1.1].
For [XML 1.0] processing, all of the following must be translated to a single #xA character:
the two-character sequence #xD #xA
any #xD character that is not immediately followed by #xA.
For [XML 1.1] processing, all of the following must be translated to a single #xA character:
the two-character sequence #xD #xA
the two-character sequence #xD #x85
the single character #x85
the single character #x2028
any #xD character that is not immediately followed by #xA or #x85.
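The two normalization rule sets above can be sketched with regular-expression substitution. This Python example is an illustration under the assumption that the host language has chosen to normalize line breaks; the function name `normalize_line_breaks` and its `xml_version` parameter are invented for this sketch.

```python
import re

# Two-character sequences are listed first so that they are replaced
# as a unit before a lone #xD can match.
_XML10_EOL = re.compile(r"\r\n|\r")
_XML11_EOL = re.compile(r"\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_breaks(text: str, xml_version: str = "1.0") -> str:
    """Translate line-break sequences to a single #xA character."""
    if xml_version == "1.0":
        return _XML10_EOL.sub("\n", text)
    if xml_version == "1.1":
        return _XML11_EOL.sub("\n", text)
    raise ValueError("xml_version must be '1.0' or '1.1'")
```

Under the 1.0 rules the characters #x85 and #x2028 are left unchanged, so the same input can normalize differently depending on the version the host language selects.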
[Definition: A whitespace character is any of the characters defined by [http://www.w3.org/TR/REC-xml/#NT-S].]
[Definition: Ignorable whitespace consists of any whitespace characters that may occur between terminals, unless these characters occur in the context of a production marked with a ws:explicit annotation, in which case they can occur only where explicitly specified (see A.2.4.2 Explicit Whitespace Handling).] Ignorable whitespace characters are not significant to the semantics of an expression. Whitespace is allowed before the first terminal and after the last terminal of an XPath expression. Whitespace is allowed between any two terminals. Comments may also act as "whitespace" to prevent two adjacent terminals from being recognized as one. Some illustrative examples are as follows:
foo- foo results in a syntax error. "foo-" would be recognized as a QName.
foo -foo is syntactically equivalent to foo - foo, two QNames separated by a subtraction operator.
foo(: This is a comment :)- foo is syntactically equivalent to foo - foo. This is because the comment prevents the two adjacent terminals from being recognized as one.
foo-foo is syntactically equivalent to a single QName. This is because "-" is a valid character in a QName. When used as an operator after the characters of a name, the "-" must be separated from the name, e.g. by using whitespace or parentheses.
10div 3 results in a syntax error.
10 div3 also results in a syntax error.
10div3 also results in a syntax error.
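The name-related examples above follow from longest-match tokenization. The Python sketch below is a deliberately reduced tokenizer that knows only whitespace, non-nested comments, ASCII-only names and the "-" symbol; those restrictions, the name pattern, and the function name `tokens` are simplifying assumptions of this example and fall well short of the real NCName and Comment productions.

```python
import re

_TOKEN = re.compile(r"""
    (?P<ws>\s+)                          # ignorable whitespace
  | (?P<comment>\(:.*?:\))               # comment (nesting not handled here)
  | (?P<name>[A-Za-z_][A-Za-z0-9_.-]*)   # simplified, ASCII-only name
  | (?P<minus>-)                         # the "-" terminal
""", re.VERBOSE | re.DOTALL)


def tokens(expr: str) -> list:
    """Return the name and "-" terminals of expr, dropping separators."""
    out = []
    pos = 0
    while pos < len(expr):
        m = _TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup in ("name", "minus"):
            out.append(m.group())
        pos = m.end()
    return out
```

With this sketch, `tokens("foo-foo")` yields a single name, while `tokens("foo -foo")` and `tokens("foo(: This is a comment :)- foo")` both yield a name, a "-", and a name. For `foo- foo` the result is the two names `foo-` and `foo`, which a parser would then reject.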
Explicit whitespace notation is specified with the EBNF productions, when it is different from the default rules, using the notation shown below. This notation is not inherited. In other words, if an EBNF rule is marked as /* ws: explicit */, the notation does not automatically apply to all the 'child' EBNF productions of that rule.
/* ws: explicit */ means that the EBNF notation explicitly notates, with S or otherwise, where whitespace characters are allowed. In productions with the /* ws: explicit */ annotation, A.2.4.1 Default Whitespace Handling does not apply. Comments are not allowed in these productions except where the Comment non-terminal appears.
The following names are not allowed as function names in an unprefixed form because expression syntax takes precedence.
array
attribute
comment
document-node
element
empty-sequence
function
if
item
map
namespace-node
node
processing-instruction
schema-attribute
schema-element
switch
text
tuple
typeswitch
union
Note:
Although the keywords switch and typeswitch are not used in XPath, they are considered reserved function names for compatibility with XQuery.
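A processor could enforce the reserved-function-names constraint with a simple lookup. The following Python sketch copies the list above into a set; the function name `function_name_allowed` and its parameters are assumptions for this example, and URI-qualified names are left out of scope.

```python
from typing import Optional

# The reserved names exactly as listed in this section.
RESERVED_FUNCTION_NAMES = frozenset({
    "array", "attribute", "comment", "document-node", "element",
    "empty-sequence", "function", "if", "item", "map",
    "namespace-node", "node", "processing-instruction",
    "schema-attribute", "schema-element", "switch", "text",
    "tuple", "typeswitch", "union",
})


def function_name_allowed(local_name: str, prefix: Optional[str] = None) -> bool:
    """True if the name may be used as a function name in a call.

    Only unprefixed names are restricted; a prefixed name such as
    library:if is always acceptable as far as this constraint goes.
    """
    return prefix is not None or local_name not in RESERVED_FUNCTION_NAMES
```

Here `function_name_allowed("if")` is false, while `function_name_allowed("if", prefix="library")` is true.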
The grammar in A.1 EBNF normatively defines built-in precedence among the operators of XPath. These operators are summarized here to make clear the order of their precedence from lowest to highest. The associativity column indicates the order in which operators of equal precedence in an expression are applied.
# | Operator | Associativity |
---|---|---|
1 | , (comma) | either |
2 | for, let, some, every, if | NA |
3 | or | either |
4 | and | either |
5 | eq, ne, lt, le, gt, ge, =, !=, <, <=, >, >=, is, <<, >> | NA |
6 | \|\| | left-to-right |
7 | to | NA |
8 | +, - (binary) | left-to-right |
9 | *, div, idiv, mod | left-to-right |
10 | union, \| | either |
11 | intersect, except | left-to-right |
12 | instance of | NA |
13 | treat as | NA |
14 | castable as | NA |
15 | cast as | NA |
16 | => | left-to-right |
17 | -, + (unary) | right-to-left |
18 | ! | left-to-right |
19 | /, // | left-to-right |
20 | [ ], ? | left-to-right |
21 | ? (unary) | NA |
In the "Associativity" column, "either" indicates that all the operators at that level have the associative property (i.e., (A op B) op C is equivalent to A op (B op C)), so their associativity is inconsequential. "NA" (not applicable) indicates that the EBNF does not allow an expression that directly contains multiple operators from that precedence level, so the question of their associativity does not arise.
Note:
Parentheses can be used to override the operator precedence in the usual way. Square brackets in an expression such as A[B] serve two roles: they act as an operator causing B to be evaluated once for each item in the value of A, and they act as parentheses enclosing the expression B.