dxml.parser

This implements a range-based StAX parser for XML 1.0 (which will work with XML 1.1 documents assuming that they don't use any 1.1-specific features). For the sake of simplicity, sanity, and efficiency, the DTD section is not supported beyond what is required to parse past it.

Start tags, end tags, comments, CDATA sections, and processing instructions are all supported and reported to the application. Anything in the DTD is skipped, though it is parsed enough to get past it correctly, which can result in an XMLParsingException if the DTD isn't valid enough to be skipped correctly. The XML declaration at the top is also skipped if present (XML 1.1 requires that it be there, but XML 1.0 does not).
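As a minimal sketch of the skipping described above, the first entity reported for a document with a prolog is its root element. The document text and the helper name firstReportedName are made up for illustration.

```d
import dxml.parser;

// Hypothetical helper: name of the first entity that the parser reports.
string firstReportedName(string xml)
{
    auto range = parseXML(xml);
    // Neither the XML declaration nor the DTD shows up as an entity.
    assert(range.front.type == EntityType.elementStart);
    return range.front.name;
}

void main()
{
    auto xml = "<?xml version=\"1.0\"?>\n" ~
               "<!DOCTYPE root [\n" ~
               "    <!ENTITY foo \"bar\">\n" ~
               "]>\n" ~
               "<root>text</root>";

    assert(firstReportedName(xml) == "root");
}
```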

Regardless of what the XML declaration says (if present), any range of char will be treated as being encoded in UTF-8, any range of wchar will be treated as being encoded in UTF-16, and any range of dchar will be treated as having been encoded in UTF-32. Strings will be treated as ranges of their code units, not code points.
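The sketch below illustrates that the character type of the input selects the encoding. It parses the same tiny document as a string, a wstring, and a dstring, and then parses a wstring whose declaration claims UTF-8. The helper rootText is hypothetical, not part of dxml.

```d
import dxml.parser;

// Hypothetical helper: the text directly inside the root element.
// Works for any string type, since parseXML is templated on the range type.
auto rootText(S)(S xml)
{
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.elementStart);
    range.popFront();
    assert(range.front.type == EntityType.text);
    return range.front.text;
}

void main()
{
    // char code units: treated as UTF-8.
    assert(rootText("<root>caf\u00E9</root>") == "caf\u00E9");
    // wchar code units: treated as UTF-16.
    assert(rootText("<root>caf\u00E9</root>"w) == "caf\u00E9"w);
    // dchar code units: treated as UTF-32.
    assert(rootText("<root>caf\u00E9</root>"d) == "caf\u00E9"d);

    // The encoding named in the declaration is not consulted: this is a
    // range of wchar, so it is read as UTF-16 despite what it claims.
    auto xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>hi</root>"w;
    assert(rootText(xml) == "hi"w);
}
```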

Since the DTD is skipped, entity references other than the five which are predefined by the XML spec cannot be fully processed (since wherever they were used in the document would be replaced by what they referred to, which could be arbitrarily complex XML). As such, by default, if any entity references which are not predefined are encountered outside of the DTD, an XMLParsingException will be thrown (see Config.throwOnEntityRef for how that can be configured). The predefined entity references and any character references encountered will be checked to verify that they're valid, but they will not be replaced (since that does not work with returning slices of the original input).

However, decodeXML or parseStdEntityRef from dxml.util can be used to convert the predefined entity references to what they refer to, and decodeXML or parseCharRef from dxml.util can be used to convert character references to what they refer to.
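The following sketch ties the last two paragraphs together. The parser leaves references in the text untouched, decodeXML converts them afterwards, and an entity reference that is not predefined is rejected unless the Config says otherwise. The helpers parseAll and decodedRootText are hypothetical names used only for this illustration.

```d
import dxml.parser;
import dxml.util : decodeXML;
import std.exception : assertThrown;

// Hypothetical helper: iterate the whole document so that every entity
// gets parsed (and therefore validated).
void parseAll(Config config = Config.init)(string xml)
{
    foreach(entity; parseXML!config(xml)) {}
}

// Hypothetical helper: the decoded text directly inside the root element.
string decodedRootText(string xml)
{
    auto range = parseXML(xml);
    range.popFront();
    assert(range.front.type == EntityType.text);
    return decodeXML(range.front.text);
}

void main()
{
    auto xml = "<root>R&amp;D &lt;&#65;&gt;</root>";

    // The parser returns a slice of the input, references included.
    auto range = parseXML(xml);
    range.popFront();
    assert(range.front.text == "R&amp;D &lt;&#65;&gt;");

    // decodeXML replaces predefined entity and character references.
    assert(decodedRootText(xml) == "R&D <A>");

    // &foo; is not one of the five predefined references.
    assertThrown!XMLParsingException(parseAll("<root>&foo;</root>"));

    // With ThrowOnEntityRef.no it is passed through as ordinary text.
    enum lenient = makeConfig(ThrowOnEntityRef.no);
    auto other = parseXML!lenient("<root>&foo;</root>");
    other.popFront();
    assert(other.front.text == "&foo;");
}
```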

Primary Symbols

Symbol	Description
parseXML	The function used to initiate the parsing of an XML document.
EntityRange	The range returned by parseXML.
EntityRange.Entity	The element type of EntityRange.

Parser Configuration Helpers

Symbol	Description
Config	Used to configure how EntityRange parses the XML.
simpleXML	A user-friendly configuration for when the application just wants the element tags and the data in between them.
makeConfig	A convenience function for constructing a custom Config.
SkipComments	A std.typecons.Flag used with Config to tell the parser to skip comments.
SkipPI	A std.typecons.Flag used with Config to tell the parser to skip processing instructions.
SplitEmpty	A std.typecons.Flag used with Config to configure how the parser deals with empty element tags.
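As a rough sketch of how these helpers fit together, the code below builds a custom Config with makeConfig and checks which flags ended up set. The name myConfig and the sample document are illustrative. The flag values asserted for simpleXML reflect my understanding of its definition, which matches the description of it on this page.

```d
import dxml.parser;

// Only the flags passed to makeConfig change; the rest keep their defaults.
enum myConfig = makeConfig(SkipComments.yes, SplitEmpty.yes);

static assert(myConfig.skipComments == SkipComments.yes);
static assert(myConfig.skipPI == SkipPI.no);
static assert(myConfig.splitEmpty == SplitEmpty.yes);

// simpleXML skips comments and processing instructions and splits empty tags.
static assert(simpleXML.skipComments == SkipComments.yes);
static assert(simpleXML.skipPI == SkipPI.yes);
static assert(simpleXML.splitEmpty == SplitEmpty.yes);

void main()
{
    auto xml = "<root><!-- note --><?pi data?><leaf/></root>";
    auto range = parseXML!myConfig(xml);
    range.popFront(); // move past <root>

    // The comment was skipped; the processing instruction was not.
    assert(range.front.type == EntityType.pi);
    assert(range.front.name == "pi");
    range.popFront();

    // <leaf/> is reported as a start tag followed by an end tag.
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "leaf");
    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "leaf");
}
```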

Helper Types Used When Parsing

Symbol	Description
EntityType	The type of an entity in the XML (e.g. a start tag, EntityType.elementStart, or a comment, EntityType.comment).
TextPos	Gives the line and column number in the XML document.
XMLParsingException	Thrown by EntityRange when it encounters invalid XML.
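A small sketch of how TextPos and XMLParsingException can be used together: each entity carries its position, and a caught exception reports where parsing failed. The helper isWellFormed is a hypothetical name, and the sample documents are made up.

```d
import dxml.parser;

// Hypothetical helper: parses the whole document and, on failure, reports
// the position stored in the exception.
bool isWellFormed(string xml, out TextPos errorPos)
{
    try
    {
        foreach(entity; parseXML(xml)) {}
        return true;
    }
    catch(XMLParsingException e)
    {
        errorPos = e.pos;
        return false;
    }
}

void main()
{
    auto xml = "<root>\n" ~
               "    <foo>text</foo>\n" ~
               "</root>";

    // Entities report the line and column where they start.
    auto range = parseXML(xml);
    assert(range.front.pos == TextPos(1, 1));
    range.popFront();
    assert(range.front.name == "foo");
    assert(range.front.pos == TextPos(2, 5));

    TextPos pos;
    assert(isWellFormed(xml, pos));

    // The end tag on line 2 does not match its start tag.
    auto broken = "<root>\n" ~
                  "    <foo>text</bar>\n" ~
                  "</root>";
    assert(!isWellFormed(broken, pos));
    assert(pos.line == 2);
}
```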

Helper Functions Used When Parsing

Symbol	Description
getAttrs	A function similar to std.getopt.getopt which allows for the easy processing of start tag attributes.
skipContents	Iterates an EntityRange from a start tag to its matching end tag.
skipToPath	Used to navigate from one start tag to another as if the start tag names formed a file path.
skipToEntityType	Skips to the next entity of the given type in the range.
skipToParentEndTag	Iterates an EntityRange until it reaches the end tag that matches the start tag which is the parent of the current entity.
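To give a feel for getAttrs, here is a minimal sketch that reads the attributes of an empty element tag into typed variables, in the same name-then-pointer style as getopt. The Server struct, the readServer helper, and the attribute names are all hypothetical.

```d
import dxml.parser;

// Hypothetical target type for the attribute values.
struct Server
{
    string host;
    int port;
    bool secure;
}

// Hypothetical helper: fills a Server from the attributes of the first tag.
Server readServer(string xml)
{
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.elementEmpty);

    Server s;
    // Each attribute name is followed by a pointer to the variable that
    // receives its (converted) value.
    getAttrs(range.front.attributes,
             "host", &s.host,
             "port", &s.port,
             "secure", &s.secure);
    return s;
}

void main()
{
    auto xml = `<server host="example.org" port="8080" secure="true"/>`;

    // The attributes can also be iterated directly.
    string[] names;
    foreach(attr; parseXML(xml).front.attributes)
        names ~= attr.name;
    assert(names == ["host", "port", "secure"]);

    assert(readServer(xml) == Server("example.org", 8080, true));
}
```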

Helper Traits

Symbol	Description
isAttrRange	Whether the given range is a range of attributes.

Members

Aliases

SkipComments alias SkipComments = Flag!"SkipComments"
SkipPI alias SkipPI = Flag!"SkipPI"
SplitEmpty alias SplitEmpty = Flag!"SplitEmpty"
ThrowOnEntityRef alias ThrowOnEntityRef = Flag!"ThrowOnEntityRef"

Classes

XMLParsingException class XMLParsingException: The exception type thrown when the XML parser encounters invalid XML.

Enums

EntityType enum EntityType: Represents the type of an XML entity. Used by EntityRange.Entity.

Functions

getAttrs void getAttrs(R attrRange, Args args)
void getAttrs(R attrRange, OR unmatched, Args args): A helper function for processing start tag attributes.
makeConfig Config makeConfig(Args args): Helper function for creating a custom config. It makes it easy to set one or more of the member variables to something other than the default without having to worry about explicitly setting them individually or setting them all at once via a constructor.
parseXML EntityRange!(config, R) parseXML(R xmlText): Lazily parses the given range of characters as an XML document.
skipContents R skipContents(R entityRange): Takes an EntityRange which is at a start tag and iterates it until it is at its corresponding end tag. It is an error to call skipContents when the current entity is not EntityType.elementStart.
skipToEntityType R skipToEntityType(R entityRange, EntityType[] entityTypes): Skips entities until the given EntityType is reached.
skipToParentEndTag R skipToParentEndTag(R entityRange): Skips entities until the end tag is reached that corresponds to the start tag that is the parent of the current entity.
skipToPath R skipToPath(R entityRange, string path): Treats the given string like a file path except that each directory corresponds to the name of a start tag. Note that this does not try to implement XPath as that would be quite complicated, and it really doesn't fit with a StAX parser.
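The navigation helpers listed above can be combined roughly as follows. This is a sketch on a made-up document, and textAtPath is a hypothetical convenience wrapper rather than anything dxml provides.

```d
import dxml.parser;

// Hypothetical helper: the text directly inside the element found at `path`
// (relative to the first start tag), or null if there is none.
string textAtPath(string xml, string path)
{
    auto range = parseXML(xml).skipToPath(path);
    if(range.empty || range.front.type != EntityType.elementStart)
        return null;
    range.popFront();
    return range.front.type == EntityType.text ? range.front.text : null;
}

void main()
{
    auto xml = "<root>\n" ~
               "    <!-- header -->\n" ~
               "    <foo>\n" ~
               "        <bar>some text</bar>\n" ~
               "    </foo>\n" ~
               "    <baz/>\n" ~
               "</root>";

    // skipToEntityType moves past the current entity and stops at the next
    // entity of a requested type, so the comment is passed over.
    auto range = parseXML(xml); // at <root>
    range = range.skipToEntityType(EntityType.elementStart);
    assert(range.front.name == "foo");

    // skipContents jumps from a start tag to its matching end tag.
    range = range.skipContents();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "foo");

    // skipToPath walks child start tags as if they were directories.
    range = parseXML(xml).skipToPath("foo/bar");
    assert(!range.empty);
    assert(range.front.name == "bar");

    // From the text inside <bar>, skipToParentEndTag lands on </bar>.
    range.popFront();
    assert(range.front.type == EntityType.text);
    range = range.skipToParentEndTag();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "bar");

    assert(textAtPath(xml, "foo/bar") == "some text");
}
```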

Manifest constants

simpleXML enum simpleXML: This Config is intended for making it easy to parse XML by skipping everything that isn't the actual data as well as making it simpler to deal with empty element tags by treating them the same as a start tag and end tag with nothing but whitespace between them.

Structs

Config struct Config: Used to configure how the parser works.
EntityRange struct EntityRange(Config cfg, R): Lazily parses the given range of characters as an XML document.
EntityRangeCompileTests struct EntityRangeCompileTests: Undocumented in source.
TextPos struct TextPos: Where in the XML document an entity is.

Templates

isAttrRange template isAttrRange(R): Whether the given type is a forward range of attributes.

Examples

import dxml.parser;

void main()
{
    auto xml = "<!-- comment -->\n" ~
               "<root>\n" ~
               "    <foo>some text<whatever/></foo>\n" ~
               "    <bar/>\n" ~
               "    <baz></baz>\n" ~
               "</root>";
    // Default Config: every entity is reported as it appears in the input.
    {
        auto range = parseXML(xml);
        assert(range.front.type == EntityType.comment);
        assert(range.front.text == " comment ");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "root");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "foo");
        range.popFront();

        assert(range.front.type == EntityType.text);
        assert(range.front.text == "some text");
        range.popFront();

        assert(range.front.type == EntityType.elementEmpty);
        assert(range.front.name == "whatever");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "foo");
        range.popFront();

        assert(range.front.type == EntityType.elementEmpty);
        assert(range.front.name == "bar");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "baz");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "baz");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "root");
        range.popFront();

        assert(range.empty);
    }
    {
        auto range = parseXML!simpleXML(xml);

        // simpleXML skips comments

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "root");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "foo");
        range.popFront();

        assert(range.front.type == EntityType.text);
        assert(range.front.text == "some text");
        range.popFront();

        // simpleXML splits empty element tags into a start tag and end tag
        // so that the code doesn't have to care whether a start tag with no
        // content is an empty tag or a start tag and end tag with nothing but
        // whitespace in between.
        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "whatever");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "whatever");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "foo");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "bar");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "bar");
        range.popFront();

        assert(range.front.type == EntityType.elementStart);
        assert(range.front.name == "baz");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "baz");
        range.popFront();

        assert(range.front.type == EntityType.elementEnd);
        assert(range.front.name == "root");
        range.popFront();

        assert(range.empty);
    }
}