Having a validating parser in place can greatly reduce the code required to parse XML, because you know exactly what you get. As mentioned in my last post about RELAX NG & trang, I prefer RELAX NG over W3C XML Schema, though that doesn’t matter much here because Apple’s suggested XML parser doesn’t validate at all.
So we have to go one level deeper and have a look at libxml2.
Apple’s example “XmlPerformance” helped to get started, but didn’t do the trick because libxml2 allows validation for xmlDocPtr or xmlTextReader but not for SAX parsers as used in the example.
The libxml2 examples didn’t help me much either, but luckily xmllint is available in source (OSS just rocks) and does almost what we want. It first parses the XML into an xmlDocPtr and validates afterwards, and it does so for a reason:
You can have a validating xmlTextReader (via xmlTextReaderRelaxNGSetSchema), but it won’t detect IDREFs whose referenced ID is missing, and the error messages lack the name of the failing item. BTW, when validating against a W3C schema this ID/IDREF check isn’t available yet.
I finally gave up streaming XML parsing in favour of full validation plus “push” parsing (nice for data coming in over the wire) and did the following:
- load the RELAX NG regular form schema (watch out for the assignment of relaxngschemas), similar to xmllint schema loading,
- push the raw XML data into a xmlDocPtr (xmlCreatePushParserCtxt) exactly like xmllint,
- validate the in-memory document (xmlRelaxNGValidateDoc),
- turn it into a xmlTextReader,
- process the reader.
Wrap up:
- if you want full RELAX NG validation with libxml2 v2.7.3, forget about streamed parsing,
- wrap the document into a xmlTextReader if you want a SAXish programming model.
I may prepare and publish a MroLibxml2Parser inheriting from NSXMLParser and firing its callbacks, so that validating and non-validating parser implementations can be swapped easily, but this has to wait a bit. Stay tuned.