decodeXML

Decodes any XML character references and standard XML entity references in the text and removes any carriage returns. It's intended to be used on the text fields of element tags and on the values of start tag attributes.

There are a number of characters that either can't be directly represented in the text fields or attribute values in XML or which can sometimes be directly represented but not always (e.g. an attribute value can contain either a single quote or a double quote, but it can't contain both at the same time, because one of them would match the opening quote). So, those characters have alternate representations in order to be allowed (e.g. "&lt;" for '<', because '<' would normally be the beginning of an entity). Technically, they're entity references, but the ones handled by decodeXML are the ones explicitly defined in the XML standard and which don't require a DTD section.

Ideally, the parser would transform all such alternate representations to what they represent when providing the text to the application, but that would make it impossible to return slices of the original text from the properties of an Entity. So, instead of having those properties do the transformation themselves, decodeXML and asDecodedXML do that so that the application can choose to do it or not (in many cases, there is nothing to decode, making the calls unnecessary).

Similarly, an application can choose to encode a character as a character reference (e.g. "&#65;" or "&#x41;" for 'A'). decodeXML will decode such character references to their corresponding characters.
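As a small illustration of the paragraph above, the following sketch decodes the decimal and hexadecimal character references for 'A' (U+0041). It assumes that dxml is available as a dependency and that the file is compiled with -unittest so that the unittest block runs.

```d
import dxml.util : decodeXML;

unittest
{
    // 'A' is U+0041: 65 in decimal, 41 in hexadecimal.
    assert(decodeXML("&#65;") == "A");
    assert(decodeXML("&#x41;") == "A");

    // Character references can be mixed freely with ordinary text.
    assert(decodeXML("&#x41;BC") == "ABC");
}
```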

However, decodeXML does not handle any entity references beyond the five predefined ones listed below. All others are left unprocessed. Processing them properly would require handling the DTD section, which dxml does not support. The parser considers any entity references other than the predefined ones to be invalid XML, so text that comes from dxml's parser can't contain any entity references other than the predefined ones; only text from some other source can. Similarly, invalid character references are left unprocessed, as is any character that is not valid in an XML document. decodeXML never throws on invalid XML.

Also, '\r' is not supposed to appear in an XML document except as a character reference unless it's in a CDATA section. So, it really should be stripped out before being handed off to the application, but again, that doesn't work with slices. So, decodeXML also handles that.
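The carriage return handling described above can be sketched as follows. The input strings are made up for illustration, and the block assumes that dxml is available and that the file is compiled with -unittest.

```d
import dxml.util : decodeXML;

unittest
{
    // Windows-style line endings are reduced to a bare '\n'.
    assert(decodeXML("line one\r\nline two") == "line one\nline two");

    // A lone '\r' is dropped; it is not converted to '\n'.
    assert(decodeXML("a\rb") == "ab");
}
```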

Specifically, what decodeXML and asDecodedXML do is

convert &amp; to &
convert &gt; to >
convert &lt; to <
convert &apos; to '
convert &quot; to "
remove all instances of \r
convert all character references (e.g. &#xA;) to the characters that they represent

All other entity references are left untouched. Likewise, any '&' which is not part of one of the constructs listed above and any malformed construct (e.g. "&Amp;" or "&#xGGA2;") is left untouched.
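A short sketch of what "left untouched" means in practice is given here. The entity names "&copy;" and "&bogus;" are hypothetical examples of references that are not among the five predefined ones. The block assumes that dxml is available and that the file is compiled with -unittest.

```d
import dxml.util : decodeXML;

unittest
{
    // "&copy;" would need a DTD to define it, so it passes through as-is.
    assert(decodeXML("&copy; 2018") == "&copy; 2018");

    // Entity names are case-sensitive, so "&Amp;" is not "&amp;".
    assert(decodeXML("&Amp;") == "&Amp;");

    // "GGA2" is not a hexadecimal number, so this reference is malformed.
    assert(decodeXML("&#xGGA2;") == "&#xGGA2;");

    // A bare '&' that does not start a recognized construct is kept.
    assert(decodeXML("fish & chips") == "fish & chips");

    // Valid constructs next to unrecognized ones are still decoded.
    assert(decodeXML("&lt;&bogus;&gt;") == "<&bogus;>");
}
```

Because nothing here throws, code that needs to reject such input has to check for leftover '&' constructs itself.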

The difference between decodeXML and asDecodedXML is that decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units. When a string is passed to decodeXML, it simply returns the original string if there is no text to decode (whereas in other cases, decodeXML and asDecodedXML are forced to return new ranges even if there is no text to decode).
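The following sketch contrasts the two functions. The identity check with `is` relies on the behavior described in the paragraph above (the original string is returned when there is nothing to decode) and has not been verified against every dxml version. The block assumes that dxml is available and that the file is compiled with -unittest.

```d
import std.algorithm.comparison : equal;
import dxml.util : asDecodedXML, decodeXML;

unittest
{
    // Nothing to decode: per the description above, the very same string
    // is handed back rather than a copy.
    string plain = "nothing to decode";
    assert(decodeXML(plain) is plain);

    // Something to decode: a new string is built.
    assert(decodeXML("a &lt; b") == "a < b");

    // asDecodedXML does the same decoding lazily and yields a range of
    // code units instead of a string.
    auto lazyRange = asDecodedXML("a &lt; b");
    assert(equal(lazyRange, "a < b"));
}
```

decodeXML is the convenient choice when a string is wanted anyway, while asDecodedXML fits into a range pipeline without building an intermediate string.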

  1. string decodeXML(R)(R range)
     if (isForwardRange!R && isSomeChar!(ElementType!R))
  2. auto asDecodedXML(R range)

Parameters

range R

The range of characters to decode.

Return Value

Type: string

The decoded text. decodeXML returns a string, whereas asDecodedXML returns a lazy range of code units (so it could be a range of char or wchar and not just dchar; which it is depends on the code units of the range being passed in).

Examples

assert(decodeXML("hello world &amp;&gt;&lt;&apos;&quot; \r\r\r\r\r foo") ==
       `hello world &><'"  foo`);

assert(decodeXML("if(foo &amp;&amp; bar)\r\n" ~
                 "    left = right;") ==
       "if(foo && bar)\n" ~
       "    left = right;");

assert(decodeXML("&#12487;&#12451;&#12521;&#12531;") == "ディラン");
assert(decodeXML("foo") == "foo");
assert(decodeXML("&#   ;") == "&#   ;");

{
    import std.algorithm.comparison : equal;
    auto range = asDecodedXML("hello world &amp;&gt;&lt;&apos;&quot; " ~
                              "\r\r\r\r\r foo");
    assert(equal(range, `hello world &><'"  foo`));
}

{
    import dxml.parser;
    auto xml = "<root>\n" ~
               "    <function return='vector&lt;int&gt;' name='foo'>\r\n" ~
               "        <doc_comment>This function does something really\r\n" ~
               "                 fancy, and you will love it.</doc_comment>\r\n" ~
               "        <param type='int' name='i'>\r\n" ~
               "        <param type='const std::string&amp;' name='s'>\r\n" ~
               "    </function>\n" ~
               "</root>";
    auto range = parseXML!simpleXML(xml);
    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "function");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "return");
        assert(attrs.front.value == "vector&lt;int&gt;");
        assert(decodeXML(attrs.front.value) == "vector<int>");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "foo");
        assert(decodeXML(attrs.front.value) == "foo");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.text ==
           "This function does something really\r\n" ~
           "                 fancy, and you will love it.");
    assert(decodeXML(range.front.text) ==
           "This function does something really\n" ~
           "                 fancy, and you will love it.");
    range.popFront();

    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "doc_comment");
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "int");
        assert(decodeXML(attrs.front.value) == "int");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "i");
        assert(decodeXML(attrs.front.value) == "i");
    }
    range.popFront();

    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "param");
    {
        auto attrs = range.front.attributes;
        assert(attrs.front.name == "type");
        assert(attrs.front.value == "const std::string&amp;");
        assert(decodeXML(attrs.front.value) == "const std::string&");
        attrs.popFront();
        assert(attrs.front.name == "name");
        assert(attrs.front.value == "s");
        assert(decodeXML(attrs.front.value) == "s");
    }
}
