import std.exception : assertThrown;
import dxml.parser; // parseXML, makeConfig, EntityType, ThrowOnEntityRef, XMLParsingException
import dxml.util : decodeXML;

auto xml = "<root>\n" ~
           "    <std>&amp;&apos;&gt;&lt;&quot;</std>\n" ~
           "    <other>&foobar;</other>\n" ~
           "    <invalid>&--;</invalid>\n" ~
           "</root>";

// ThrowOnEntityRef.yes
{
    auto range = parseXML(xml);
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "root");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "std");

    // The five predefined entity references pass through as normal text.
    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text == "&amp;&apos;&gt;&lt;&quot;");
    assert(range.front.text.decodeXML() == `&'><"`);

    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "std");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "other");

    // Attempted to parse past "&foobar;", which is syntactically
    // valid, but it's not one of the five predefined entity references.
    assertThrown!XMLParsingException(range.popFront());
}

// ThrowOnEntityRef.no
{
    auto range = parseXML!(makeConfig(ThrowOnEntityRef.no))(xml);
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "root");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "std");

    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text == "&amp;&apos;&gt;&lt;&quot;");
    assert(range.front.text.decodeXML() == `&'><"`);

    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "std");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "other");

    // Doesn't throw, because "&foobar;" is syntactically valid.
    range.popFront();
    assert(range.front.type == EntityType.text);
    assert(range.front.text == "&foobar;");

    // decodeXML has no effect on non-standard entity references.
    assert(range.front.text.decodeXML() == "&foobar;");

    range.popFront();
    assert(range.front.type == EntityType.elementEnd);
    assert(range.front.name == "other");

    range.popFront();
    assert(range.front.type == EntityType.elementStart);
    assert(range.front.name == "invalid");

    // Attempted to parse past "&--;", which is not syntactically valid,
    // because -- is not a valid name for an entity reference.
    assertThrown!XMLParsingException(range.popFront());
}
Whether the parser should throw when it encounters any entity references other than the five entity references defined in the XML standard.
Any other entity references would have to be defined in the DTD in order to be valid. And in order to know what XML they represent (which could be arbitrarily complex, even effectively inserting entire XML documents into the middle of the XML), the DTD would have to be parsed. However, dxml does not support parsing the DTD beyond what is required to correctly parse past it, and replacing entity references with what they represent would not work with the slicing semantics that EntityRange provides. As such, it is not possible for dxml to correctly handle any entity references other than the five which are defined in the XML standard, and even those are only decoded when the program calls dxml.util.decodeXML or dxml.util.parseStdEntityRef. By default, EntityRange validates that each entity reference is one of the five predefined entity references, but otherwise, it lets them pass through as normal text. It does not replace them with what they represent.
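As a minimal sketch of that division of labor, the following shows text being decoded after the fact with dxml.util.decodeXML and dxml.util.parseStdEntityRef. The helper leadingStdEntity is hypothetical (it is not part of dxml), and the sketch assumes that parseStdEntityRef returns a Nullable!dchar and consumes the reference from the front of the range when it matches.

```d
import dxml.util : decodeXML, parseStdEntityRef;

// Hypothetical helper (not part of dxml): returns the character for a
// predefined entity reference at the front of `text`, or '\0' if the text
// does not start with one.
dchar leadingStdEntity(ref string text)
{
    auto parsed = text.parseStdEntityRef(); // assumed to be Nullable!dchar
    return parsed.isNull ? '\0' : parsed.get;
}

unittest
{
    // decodeXML replaces the predefined references in text that
    // EntityRange handed over unchanged.
    assert("a &lt; b &amp;&amp; c".decodeXML() == "a < b && c");

    // A matching reference is consumed from the front of the string.
    string text = "&quot;quoted";
    assert(leadingStdEntity(text) == '"');
    assert(text == "quoted");

    // "&foobar;" is not one of the five, so nothing is decoded.
    string other = "&foobar;";
    assert(leadingStdEntity(other) == '\0');
}
```

Decoding stays a separate, explicit step, which is what lets EntityRange keep returning slices of the original text.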
As such, the default behavior of EntityRange is to throw an XMLParsingException when it encounters an entity reference which is not one of the five defined by the XML standard. With that behavior, there is no risk of processing an XML document as if it had no entity references and ending up with what the program using the parser would probably consider incorrect results. However, there are cases where a program may find it acceptable to treat entity references as normal text and ignore them. As such, if a program wishes to take that approach, it can set throwOnEntityRef to ThrowOnEntityRef.no.
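One way a program might choose between the two behaviors is to check whether a document survives a full pass under each configuration. The sketch below is only an illustration: parsesWith and lenient are hypothetical names, and a false result only means that an XMLParsingException was thrown somewhere, which can also happen for XML that is malformed for unrelated reasons.

```d
import dxml.parser;

// Hypothetical name for the non-throwing configuration.
enum lenient = makeConfig(ThrowOnEntityRef.no);

// Hypothetical helper (not part of dxml): returns true if the whole
// document can be walked with the given configuration without an
// XMLParsingException being thrown.
bool parsesWith(Config config)(string xml)
{
    try
    {
        // Walking the range forces every entity to be parsed.
        foreach (entity; parseXML!config(xml)) {}
        return true;
    }
    catch (XMLParsingException)
    {
        return false;
    }
}

unittest
{
    // Only predefined references: fine with the default configuration.
    assert(parsesWith!(Config.init)("<root>&amp;</root>"));

    // A non-standard reference is rejected by default but tolerated
    // with ThrowOnEntityRef.no.
    assert(!parsesWith!(Config.init)("<root>&foobar;</root>"));
    assert(parsesWith!lenient("<root>&foobar;</root>"));

    // A syntactically invalid reference is rejected either way.
    assert(!parsesWith!lenient("<root>&--;</root>"));
}
```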
If throwOnEntityRef == ThrowOnEntityRef.no, then any entity reference that EntityRange encounters will be validated to ensure that it is syntactically valid (i.e. that the characters it contains form what could be a valid entity reference assuming that the DTD declared it properly), but otherwise, EntityRange will treat it as normal text, just like it treats the five predefined entity references as normal text.
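Because the references arrive as normal text, a program that ignores them may still want to know which ones it skipped. The following is a rough sketch of that idea, not something dxml provides: nonStandardRefs and unresolvedEntityRefs are hypothetical helpers, the scan looks only at text entities (attribute values are not examined), and character references such as &#65; are skipped.

```d
import dxml.parser;

// Hypothetical helper (not part of dxml): names of the entity references
// in `text` that are not among the five predefined ones.
string[] nonStandardRefs(string text)
{
    static bool isPredefined(string name)
    {
        switch (name)
        {
            case "amp", "apos", "gt", "lt", "quot": return true;
            default: return false;
        }
    }

    string[] found;
    size_t i = 0;
    while (i < text.length)
    {
        if (text[i] != '&') { ++i; continue; }

        // Look for the ';' that terminates the reference.
        size_t j = i + 1;
        while (j < text.length && text[j] != ';') ++j;
        if (j == text.length) break; // unterminated, nothing more to find

        auto name = text[i + 1 .. j];
        if (name.length > 0 && name[0] != '#' && !isPredefined(name))
            found ~= name;
        i = j + 1;
    }
    return found;
}

// Hypothetical helper: walks the document with ThrowOnEntityRef.no and
// gathers the non-standard reference names found in text entities.
string[] unresolvedEntityRefs(string xml)
{
    string[] names;
    foreach (entity; parseXML!(makeConfig(ThrowOnEntityRef.no))(xml))
    {
        if (entity.type == EntityType.text)
            names ~= nonStandardRefs(entity.text);
    }
    return names;
}

unittest
{
    assert(nonStandardRefs("a &amp; &foobar; &#65; &baz;") == ["foobar", "baz"]);
    assert(unresolvedEntityRefs("<root><a>&foobar;</a><b>&lt;&baz;</b></root>")
           == ["foobar", "baz"]);
}
```

A caller could log the returned names or reject the document when the list is not empty, depending on how much the missing data matters to it.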
Note that any valid XML entity reference which contains start or end tags must contain the matching tags as well, and entity references cannot contain incomplete fragments of XML (e.g. the start or end of a comment). So, missing entity references should only affect the data in the XML document and not its overall structure (if that were not true, attempting to ignore entity references as ThrowOnEntityRef.no does would be a disaster in the making). However, how reasonable it is to miss that data depends entirely on the application and on what the XML documents it's parsing contain, which is why the behavior is configurable.