parseXML

Lazily parses the given range of characters as an XML document.

EntityRange is essentially a StAX parser, though it evolved into that rather than being based on what Java did, and it's range-based rather than iterator-based, so its API is likely to differ from other implementations. The basic concept should be the same though.

One of the core design goals of this parser is to slice the original input rather than having to allocate strings for the output or wrap it in a lazy range that produces a mutated version of the data. So, all of the text that the parser provides is either a slice of the input or the result of calling std.range.takeExactly on it. However, in some cases, for the parser to be fully compliant with the XML spec, dxml.util.decodeXML must be called on the text to mutate certain constructs (e.g. removing any '\r' in the text or converting "&lt;" to '<'). But that's left up to the application.
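As a rough illustration of the slicing described above, the sketch below assumes the input is a plain string. It checks that the text of an entity lies inside the original buffer and then calls dxml.util.decodeXML to obtain the decoded form. The document content and variable names are made up for the example.

```d
import dxml.parser : EntityType, parseXML;
import dxml.util : decodeXML;

void main()
{
    string xml = "<root>1 &lt; 2</root>";

    auto range = parseXML(xml);
    assert(range.front.type == EntityType.elementStart);

    range.popFront();
    assert(range.front.type == EntityType.text);

    // The parser hands back the raw text, entity reference included.
    auto text = range.front.text;
    assert(text == "1 &lt; 2");

    // For a string input, the text is a slice of the original buffer.
    assert(text.ptr >= xml.ptr &&
           text.ptr + text.length <= xml.ptr + xml.length);

    // Decoding is an explicit step taken by the application.
    assert(decodeXML(text) == "1 < 2");
}
```

Keeping the decoding separate means an application that only inspects names or structure never pays for it.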

The parser is not @nogc, but it allocates memory very minimally. It allocates some of its state on the heap so it can validate attributes and end tags. However, that state is shared among all the ranges that came from the same call to parseXML (only the range farthest along in parsing validates attributes or end tags), so save does not allocate memory unless save on the underlying range allocates memory. The shared state currently uses a couple of dynamic arrays to validate the tags and attributes, and if the document has a particularly deep tag depth or has a lot of attributes on a start tag, then some reallocations may occur until the maximum is reached, but enough is reserved that for most documents, no reallocations will occur. The only other times that the parser would allocate would be if an exception were thrown or if the range that was passed to parseXML allocates for any reason when calling any of the range primitives.
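The sketch below shows how ranges obtained through save can be advanced independently even though they come from the same call to parseXML. It only demonstrates the observable behavior; it does not measure allocations, and the sample document is invented for the example.

```d
import dxml.parser : EntityType, parseXML;

void main()
{
    auto range = parseXML("<root><a/><b/></root>");
    auto saved = range.save;

    // Advance the original range past <root> and <a/>.
    range.popFront();
    range.popFront();
    assert(range.front.name == "b");

    // The saved copy still sits at the first entity.
    assert(saved.front.type == EntityType.elementStart);
    assert(saved.front.name == "root");

    // It can be advanced on its own.
    saved.popFront();
    assert(saved.front.name == "a");
}
```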

If invalid XML is encountered at any point during the parsing process, an XMLParsingException will be thrown. If an exception has been thrown, then the parser is in an invalid state, and it is an error to call any functions on it.
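A minimal sketch of handling the exception follows. Because parsing is lazy, the error may only surface once the range reaches the offending entity, so the whole traversal sits inside the try block. The helper name parsesCleanly is hypothetical and not part of dxml.

```d
import dxml.parser : XMLParsingException, parseXML;

// Hypothetical helper: walks the whole document and reports whether
// the parser got through it without throwing.
bool parsesCleanly(string xml)
{
    try
    {
        foreach (entity; parseXML(xml))
        {
            // Nothing to do; iterating is what drives the parser.
        }
        return true;
    }
    catch (XMLParsingException e)
    {
        // The range must not be used after this point.
        return false;
    }
}

void main()
{
    assert(parsesCleanly("<root><a></a></root>"));

    // The end tag </b> does not match the start tag <a>.
    assert(!parsesCleanly("<root><a></b></root>"));
}
```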

However, note that XML validation is reduced for any entities that are skipped (e.g. for anything in the DTD, validation is reduced to what is required to correctly parse past it, and when Config.skipPI == SkipPI.yes, processing instructions are only validated enough to correctly skip past them).
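To make the skipping concrete, the sketch below parses the same text once with the default configuration and once with a configuration built by makeConfig(SkipPI.yes). The sample document is an assumption for the example; in the second case the processing instruction never appears in the range.

```d
import dxml.parser : EntityType, SkipPI, makeConfig, parseXML;

void main()
{
    enum xml = "<?instruction start?><root/>";

    {
        // Default configuration: the processing instruction is an entity.
        auto range = parseXML(xml);
        assert(range.front.type == EntityType.pi);
        assert(range.front.name == "instruction");
    }
    {
        // With SkipPI.yes, the parser only validates the processing
        // instruction enough to get past it.
        enum cfg = makeConfig(SkipPI.yes);
        auto range = parseXML!cfg(xml);
        assert(range.front.type == EntityType.elementEmpty);
        assert(range.front.name == "root");
    }
}
```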

As the module documentation says, this parser does not provide any DTD support. It is not possible to properly support the DTD while returning slices of the original input, and the DTD portion of the spec makes parsing XML far, far more complicated.

A quick note about carriage returns: per the XML spec, they are all supposed to either be stripped out or replaced with newlines or spaces before the XML parser even processes the text. That doesn't work when the parser is slicing the original text and not mutating it at all. So, for the purposes of parsing, this parser treats all carriage returns as if they were newlines or spaces (though they won't count as newlines when counting the lines for TextPos). However, they will appear in any text fields or attribute values if they are in the document (since the text fields and attribute values are slices of the original text). dxml.util.decodeXML can be used to strip them along with converting any character references in the text. Alternatively, the application can remove them all before calling parseXML, but it's not necessary.
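The carriage return handling can be sketched as follows, assuming a string input with Windows-style line endings. The text entity still contains the '\r' because it is a slice of the input, and dxml.util.decodeXML removes it.

```d
import dxml.parser : EntityType, parseXML;
import dxml.util : decodeXML;

void main()
{
    auto range = parseXML("<root>foo\r\nbar</root>");

    range.popFront();
    assert(range.front.type == EntityType.text);

    // The slice is untouched, so the carriage return is still there.
    assert(range.front.text == "foo\r\nbar");

    // decodeXML strips it.
    assert(decodeXML(range.front.text) == "foo\nbar");
}
```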

  1. struct EntityRange(Config cfg, R)
  2. EntityRange!(config, R) parseXML(R xmlText)
     if (isForwardRange!R && isSomeChar!(ElementType!R))

Examples

// Needed when the example is compiled outside of the dxml.parser module.
import dxml.parser : EntityType, parseXML, simpleXML;
import std.range.primitives : walkLength;

auto xml = "<?xml version='1.0'?>\n" ~
           "<?instruction start?>\n" ~
           "<foo attr='42'>\n" ~
           "    <bar/>\n" ~
           "    <!-- no comment -->\n" ~
           "    <baz hello='world'>\n" ~
           "    nothing to say.\n" ~
           "    nothing at all...\n" ~
           "    </baz>\n" ~
           "</foo>\n" ~
           "<?some foo?>";

{
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.pi);
    assert(range.front.name == "instruction");
    assert(range.front.text == "start");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");

    {
        auto attrs = range.front.attributes;
        assert(walkLength(attrs.save) == 1);
        assert(attrs.front.name == "attr");
        assert(attrs.front.value == "42");
    }

    range.popFront();
    assert(range.front.type == EntityType.elementEmpty);
    assert(range.front.name == "bar");

    range.popFront();
    assert(range.front.type == EntityType.comment);
    assert(range.front.text == " no comment ");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "baz");

    {
        auto attrs = range.front.attributes;
        assert(walkLength(attrs.save) == 1);
        assert(attrs.front.name == "hello");
        assert(attrs.front.value == "world");
    }

    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text ==
           "\n    nothing to say.\n    nothing at all...\n    ");

    range.popFront();
    assert(range.front.type == EntityType.elementEnd); // </baz>
    range.popFront();
    assert(range.front.type == EntityType.elementEnd); // </foo>

    range.popFront();
    assert(range.front.type == EntityType.pi);
    assert(range.front.name == "some");
    assert(range.front.text == "foo");

    range.popFront();
    assert(range.empty);
}
{
    auto range = parseXML!simpleXML(xml);

    // simpleXML is set to skip processing instructions.

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "foo");

    {
        auto attrs = range.front.attributes;
        assert(walkLength(attrs.save) == 1);
        assert(attrs.front.name == "attr");
        assert(attrs.front.value == "42");
    }

    // simpleXML is set to split empty tags so that <bar/> is treated
    // as the same as <bar></bar> so that code does not have to
    // explicitly handle empty tags.
    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "bar");
    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "bar");

    // simpleXML is set to skip comments.

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "baz");

    {
        auto attrs = range.front.attributes;
        assert(walkLength(attrs.save) == 1);
        assert(attrs.front.name == "hello");
        assert(attrs.front.value == "world");
    }

    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text ==
           "\n    nothing to say.\n    nothing at all...\n    ");

    range.popFront();
    assert(range.front.type == EntityType.elementEnd); // </baz>
    range.popFront();
    assert(range.front.type == EntityType.elementEnd); // </foo>
    range.popFront();
    assert(range.empty);
}
