dxml.parser

This implements a range-based StAX parser for XML 1.0 (which will work with XML 1.1 documents assuming that they don't use any 1.1-specific features). For the sake of simplicity, sanity, and efficiency, the DTD section is not supported beyond what is required to parse past it.

Start tags, end tags, comments, CDATA sections, and processing instructions are all supported and reported to the application. Anything in the DTD is skipped, though it is parsed enough to get past it correctly, which can result in an XMLParsingException if the DTD is not valid enough to be skipped correctly. The XML declaration at the top is also skipped if present (XML 1.1 requires that it be there, but XML 1.0 does not).

Regardless of what the XML declaration says (if present), any range of char will be treated as being encoded in UTF-8, any range of wchar will be treated as being encoded in UTF-16, and any range of dchar will be treated as being encoded in UTF-32. Strings will be treated as ranges of their code units, not code points. Note that, as Phobos typically does when processing strings, the code assumes that BOMs have already been removed, so if the range of characters comes from a file that uses a BOM, the calling code needs to strip it out before calling parseXML, or parsing will fail due to invalid characters.
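As a minimal sketch of the BOM handling described above, the following strips a leading UTF-8 BOM before handing the text to parseXML. The helper stripUTF8BOM is hypothetical (it is not part of dxml), and the string literal stands in for text read from a file.

```d
import dxml.parser;
import std.algorithm.searching : startsWith;

// Hypothetical helper (not part of dxml): drops a leading UTF-8 BOM, if any.
string stripUTF8BOM(string text)
{
    enum bom = "\uFEFF"; // stored as the three code units EF BB BF in a string
    return text.startsWith(bom) ? text[bom.length .. $] : text;
}

unittest
{
    auto fromFile = "\uFEFF<root/>"; // stand-in for the contents of a file
    auto range = parseXML(stripUTF8BOM(fromFile));
    assert(range.front.type == EntityType.elementEmpty);
    assert(range.front.name == "root");
}
```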

Since the DTD is skipped, entity references other than the five which are predefined by the XML spec cannot be fully processed (since wherever they were used in the document would be replaced by what they referred to, which could be arbitrarily complex XML). As such, by default, if any entity references which are not predefined are encountered outside of the DTD, an XMLParsingException will be thrown (see Config.throwOnEntityRef for how that can be configured). The predefined entity references and any character references encountered will be checked to verify that they're valid, but they will not be replaced (since that does not work with returning slices of the original input).
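The difference between the default behavior and ThrowOnEntityRef.no can be sketched as follows. The entity name foobar is only an illustration of a reference that is syntactically valid but not one of the five predefined ones.

```d
import dxml.parser;
import std.exception : assertThrown;

unittest
{
    auto xml = "<root>&foobar;</root>";

    // Default config (ThrowOnEntityRef.yes): parsing the text throws.
    {
        auto range = parseXML(xml);
        assert(range.front.name == "root");
        assertThrown!XMLParsingException(range.popFront());
    }

    // ThrowOnEntityRef.no: the reference is left in the text as-is.
    {
        auto range = parseXML!(makeConfig(ThrowOnEntityRef.no))(xml);
        range.popFront();
        assert(range.front.type == EntityType.text);
        assert(range.front.text == "&foobar;");
    }
}
```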

However, decodeXML or parseStdEntityRef from dxml.util can be used to convert the predefined entity references to what they refer to, and decodeXML or parseCharRef from dxml.util can be used to convert character references to what they refer to.
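A short sketch of that division of labor: the parser returns the text untouched, and decodeXML is applied afterwards by the calling code. The sample text is made up for illustration.

```d
import dxml.parser;
import dxml.util : decodeXML;

unittest
{
    auto range = parseXML("<msg>&lt;b&gt; &#x41;</msg>");
    range.popFront();

    // The parser validates the references but leaves them in place.
    assert(range.front.type == EntityType.text);
    assert(range.front.text == "&lt;b&gt; &#x41;");

    // decodeXML replaces predefined entity and character references.
    assert(range.front.text.decodeXML() == "<b> A");
}
```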

Primary Symbols

Symbol              Description
parseXML            The function used to initiate the parsing of an XML document.
EntityRange         The range returned by parseXML.
EntityRange.Entity  The element type of EntityRange.

Parser Configuration Helpers

Symbol        Description
Config        Used to configure how EntityRange parses the XML.
simpleXML     A user-friendly configuration for when the application just wants the element tags and the data in between them.
makeConfig    A convenience function for constructing a custom Config.
SkipComments  A std.typecons.Flag used with Config to tell the parser to skip comments.
SkipPI        A std.typecons.Flag used with Config to tell the parser to skip processing instructions.
SplitEmpty    A std.typecons.Flag used with Config to configure how the parser deals with empty element tags.

Helper Types Used When Parsing

Symbol               Description
EntityType           The type of an entity in the XML (e.g. a start tag, EntityType.elementStart, or a comment, EntityType.comment).
TextPos              Gives the line and column number in the XML document.
XMLParsingException  Thrown by EntityRange when it encounters invalid XML.

Helper Functions Used When Parsing

Symbol              Description
getAttrs            A function similar to std.getopt.getopt which allows for the easy processing of start tag attributes.
skipContents        Iterates an EntityRange from a start tag to its matching end tag.
skipToPath          Used to navigate from one start tag to another as if the start tag names formed a file path.
skipToEntityType    Skips to the next entity of the given type in the range.
skipToParentEndTag  Iterates an EntityRange until it reaches the end tag that matches the start tag which is the parent of the current entity.

Helper Traits

Symbol       Description
isAttrRange  Whether the given range is a range of attributes.

Members

Aliases

SkipComments
alias SkipComments = Flag!"SkipComments"
SkipPI
alias SkipPI = Flag!"SkipPI"
SplitEmpty
alias SplitEmpty = Flag!"SplitEmpty"
ThrowOnEntityRef
alias ThrowOnEntityRef = Flag!"ThrowOnEntityRef"

Classes

XMLParsingException
class XMLParsingException

The exception type thrown when the XML parser encounters invalid XML.
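One way to use the exception is to catch it and report where the problem was found. The sketch below assumes that XMLParsingException exposes the error location as a TextPos member named pos; the helper firstError is hypothetical and not part of dxml.

```d
import dxml.parser;
import std.typecons : Nullable, nullable;

// Hypothetical helper: returns the position of the first parsing error,
// or a null value if the whole document parses without throwing.
Nullable!TextPos firstError(string xml)
{
    try
    {
        // Iterating the range is what drives the lazy parsing.
        foreach(entity; parseXML(xml)) {}
    }
    catch(XMLParsingException e)
    {
        return nullable(e.pos); // assumes the pos member described above
    }
    return Nullable!TextPos.init;
}

unittest
{
    assert(firstError("<root/>").isNull);
    assert(!firstError("<root><foo></root>").isNull); // mismatched end tag
}
```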

Enums

EntityType
enum EntityType

Represents the type of an XML entity. Used by EntityRange.Entity.
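Code that walks an EntityRange typically branches on the type of the current entity. As a small sketch, the hypothetical helper below (not part of dxml) counts elements by looking only at start tags and empty element tags.

```d
import dxml.parser;

// Hypothetical helper: counts the elements in a document.
size_t countElements(string xml)
{
    size_t count;
    foreach(entity; parseXML(xml))
    {
        switch(entity.type)
        {
            case EntityType.elementStart:
            case EntityType.elementEmpty:
                ++count;
                break;
            default:
                break; // end tags, text, comments, etc. are ignored
        }
    }
    return count;
}

unittest
{
    assert(countElements("<root><a/><b>text</b></root>") == 3);
}
```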

Functions

getAttrs
void getAttrs(R attrRange, Args args)
void getAttrs(R attrRange, OR unmatched, Args args)

A helper function for processing start tag attributes.
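A minimal sketch of the first overload, in the style of std.getopt: each attribute name is followed by a pointer to the variable that should receive its value. The element and attribute names are made up for illustration.

```d
import dxml.parser;

unittest
{
    auto xml = `<point x="3" y="4" label="near origin"/>`;
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.elementEmpty);

    int x, y;
    string label;

    // Values are converted to the type of the variable being pointed to.
    getAttrs(range.front.attributes, "x", &x, "y", &y, "label", &label);
    assert(x == 3);
    assert(y == 4);
    assert(label == "near origin");
}
```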

makeConfig
Config makeConfig(Args args)

Helper function for creating a custom config. It makes it easy to set one or more of the member variables to something other than the default without having to worry about explicitly setting them individually or setting them all at once via a constructor.
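For instance, a config that skips comments and splits empty element tags, while leaving everything else at its default, might be built like this (a sketch; the XML is illustrative):

```d
import dxml.parser;

unittest
{
    // Only the flags that are passed differ from Config.init.
    enum config = makeConfig(SkipComments.yes, SplitEmpty.yes);
    assert(config.skipComments == SkipComments.yes);
    assert(config.splitEmpty == SplitEmpty.yes);
    assert(config.skipPI == Config.init.skipPI);

    auto range = parseXML!config("<!-- note --><root/>");

    // The comment is skipped, and <root/> is reported as a start tag
    // followed by an end tag.
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "root");
    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    range.popFront();
    assert(range.empty);
}
```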

parseXML
EntityRange!(config, R) parseXML(R xmlText)

Lazily parses the given range of characters as an XML document.

skipContents
R skipContents(R entityRange)

Takes an EntityRange which is at a start tag and iterates it until it is at its corresponding end tag. It is an error to call skipContents when the current entity is not EntityType.elementStart.
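A brief sketch of skipping over an element whose contents are not of interest; the tag names are illustrative.

```d
import dxml.parser;

unittest
{
    auto xml = "<root>\n" ~
               "    <foo>\n" ~
               "        <bar>Some text</bar>\n" ~
               "    </foo>\n" ~
               "    <!-- no comment -->\n" ~
               "</root>";

    auto range = parseXML(xml);
    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");

    // Jump from <foo> straight to </foo>, skipping <bar> and its text.
    range = range.skipContents();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "foo");

    range.popFront();
    assert(range.front.type == EntityType.comment);
}
```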

skipToEntityType
R skipToEntityType(R entityRange, EntityType[] entityTypes)

Skips entities until the given EntityType is reached.
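The sketch below shows one way this might be used. It relies on two behaviors that are assumptions here rather than statements from the summary above: that more than one EntityType can be passed, and that the current entity is always skipped even if it matches.

```d
import dxml.parser;

unittest
{
    auto xml = "<root>\n" ~
               "    <!-- blah blah blah -->\n" ~
               "    <foo>nothing to say</foo>\n" ~
               "</root>";

    auto range = parseXML(xml);
    assert(range.front.name == "root");

    // Moves past <root> and the comment to the next start or empty tag.
    range = range.skipToEntityType(EntityType.elementStart,
                                   EntityType.elementEmpty);
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");

    // No comment follows <foo>, so the result is an empty range.
    assert(range.skipToEntityType(EntityType.comment).empty);
}
```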

skipToParentEndTag
R skipToParentEndTag(R entityRange)

Skips entities until the end tag is reached that corresponds to the start tag that is the parent of the current entity.
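As a sketch, starting from a comment nested inside <foo>, repeated calls walk outwards one level at a time; once there is no parent left, the returned range is empty.

```d
import dxml.parser;

unittest
{
    auto xml = "<root>\n" ~
               "    <foo>\n" ~
               "        <!-- comment -->\n" ~
               "        <bar>exam</bar>\n" ~
               "    </foo>\n" ~
               "</root>";

    auto range = parseXML(xml);
    range.popFront(); // <foo>
    range.popFront(); // the comment inside <foo>
    assert(range.front.type == EntityType.comment);

    range = range.skipToParentEndTag();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "foo");

    range = range.skipToParentEndTag();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "root");

    // </root> has no parent, so nothing is left.
    assert(range.skipToParentEndTag().empty);
}
```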

skipToPath
R skipToPath(R entityRange, string path)

Treats the given string like a file path except that each directory corresponds to the name of a start tag. Note that this does not try to implement XPath as that would be quite complicated, and it really doesn't fit with a StAX parser.
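A short sketch of path-style navigation. It assumes that the path is interpreted relative to the current start tag, so the first name in the path refers to a child element; the tag names are illustrative.

```d
import dxml.parser;

unittest
{
    auto xml = "<carrot>\n" ~
               "    <foo>\n" ~
               "        <bar>\n" ~
               "            <baz/>\n" ~
               "            <other/>\n" ~
               "        </bar>\n" ~
               "    </foo>\n" ~
               "</carrot>";

    auto range = parseXML(xml);
    assert(range.front.name == "carrot");

    // From <carrot>, descend into <foo> and then <bar>.
    range = range.skipToPath("foo/bar");
    assert(!range.empty);
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "bar");

    // From <bar>, move to its child <baz/>.
    range = range.skipToPath("baz");
    assert(!range.empty);
    assert(range.front.type == EntityType.elementEmpty);
    assert(range.front.name == "baz");
}
```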

Manifest constants

simpleXML
enum simpleXML;

This Config is intended for making it easy to parse XML by skipping everything that isn't the actual data as well as making it simpler to deal with empty element tags by treating them the same as a start tag and end tag with nothing but whitespace between them.

Structs

Config
struct Config

Used to configure how the parser works.

EntityRange
struct EntityRange(Config cfg, R)

Lazily parses the given range of characters as an XML document.

TextPos
struct TextPos

Where in the XML document an entity is.
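The following sketch reads the position of each entity through its pos property. It assumes that lines and columns both start at 1 and that pos refers to the first character of the entity.

```d
import dxml.parser;

unittest
{
    auto xml = "<root>\n" ~
               "    <foo/>\n" ~
               "</root>";

    auto range = parseXML(xml);
    assert(range.front.name == "root");
    assert(range.front.pos == TextPos(1, 1));

    range.popFront();
    assert(range.front.name == "foo");
    // <foo/> starts on the second line after four spaces of indentation.
    assert(range.front.pos.line == 2);
    assert(range.front.pos.col == 5);
}
```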

Templates

isAttrRange
template isAttrRange(R)

Whether the given type is a forward range of attributes.
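As a sketch, the trait can be checked at compile time. The Tuple layout below (name, value, pos) is an assumption about what counts as an attribute, based on the members that the parser's own attributes expose.

```d
import dxml.parser;
import std.typecons : Tuple;

unittest
{
    // The attribute range produced by the parser itself qualifies.
    alias R = typeof(parseXML("<root/>").front.attributes);
    static assert(isAttrRange!R);

    // So does an array of elements with name, value, and pos members.
    alias T = Tuple!(string, "name", string, "value", TextPos, "pos");
    static assert(isAttrRange!(T[]));

    // A plain string is a range of characters, not of attributes.
    static assert(!isAttrRange!string);
}
```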

Variables

_entityRangeTests
EntityRange!(Config.init, EntityRangeCompileTests) _entityRangeTests;
Undocumented in source.

Examples

auto xml = "<!-- comment -->\n" ~
           "<root>\n" ~
           "    <foo>some text<whatever/></foo>\n" ~
           "    <bar/>\n" ~
           "    <baz></baz>\n" ~
           "</root>";
{
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.comment);
    assert(range.front.text == " comment ");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "root");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");
    range.popFront();

    assert(range.front.type == EntityType.text);
    assert(range.front.text == "some text");
    range.popFront();

    assert(range.front.type == EntityType.elementEmpty);
    assert(range.front.name == "whatever");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "foo");
    range.popFront();

    assert(range.front.type == EntityType.elementEmpty);
    assert(range.front.name == "bar");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "baz");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "baz");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "root");
    range.popFront();

    assert(range.empty);
}
{
    auto range = parseXML!simpleXML(xml);

    // simpleXML skips comments

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "root");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");
    range.popFront();

    assert(range.front.type == EntityType.text);
    assert(range.front.text == "some text");
    range.popFront();

    // simpleXML splits empty element tags into a start tag and end tag
    // so that the code doesn't have to care whether a start tag with no
    // content is an empty tag or a start tag and end tag with nothing but
    // whitespace in between.
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "whatever");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "whatever");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "foo");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "bar");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "bar");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "baz");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "baz");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "root");
    range.popFront();

    assert(range.empty);
}

Meta

Source

dxml/parser.d