Subject: [xsl] Re: Turning escaped mixed content back to XML
From: Martin Holmes <mholmes@xxxxxxx>
Date: Tue, 01 Apr 2014 08:44:33 -0700
Cheers, Martin
On 28-3-2014 22:49, Martin Holmes wrote:
On 14-03-28 02:18 PM, David Carlisle wrote:
On 28/03/2014 21:06, Martin Holmes wrote:
I spoke too soon. Passing this:
contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a &lt;gi&gt;teiCorpus&lt;/gi&gt; element.
into parse-xml-fragment() gets this fatal error:
FODC0006: First argument to parse-xml-fragment() is not a well-formed and namespace-well-formed XML fragment. XML parser reported: I/O error reported by XML parser processing file:/home/mholmes/Documents/tei/council/translation/new_translations_into_specs.xsl:
404 Not Found for: http://www.saxonica.com/parse-xml-fragment/actual.xml
I've tried that, but it seems to make no difference. But my reading of the spec suggests that it will accept a mixed-content fragment without a root element, though I may be misunderstanding it.
Your assumption on fn:parse-xml-fragment() is correct.
I tried your text fragment with fn:parse-xml-fragment() on both Saxon and Exselt, and it simply works. That you get a 404 Not Found error suggests that something is off elsewhere in your stylesheet. A more complete input/output/stylesheet example might help track this one down.
If the input XML is crappy, you can use a home-grown approach to translating the escaped XML. The following is not foolproof, but it creates XML or almost-XML, depending on your input, and it will _not_ raise an error if the resulting XML is not fully compliant. However, this code does not take entities, escaped quotes/apostrophes/ampersands, CDATA sections, comments etc. into account. It is not that hard to add them if your source contains them, but be aware that it may quickly turn into a "regex parser for XML", which many on this list will (correctly) frown upon.
But then again, if your input cannot be relied upon for fn:parse-xml-fragment(), and/or you need to see what it looks like without all the escapes for fault analysis, this may well help.
The DTD declarations at the beginning are not required, but I use them for readability. The characters chosen to force the processor to output angle brackets where the content is not XML come from the Private Use Area of Unicode.
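For reference, a minimal sketch of the kind of test meant here. The variable name, the desc wrapper element and the use of the default initial template are assumptions made for illustration; only the fragment text comes from the report above. It passes the unescaped string straight to fn:parse-xml-fragment() and copies the resulting mixed content.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>

  <!-- Hypothetical test string: the fragment from the report.
       The &lt; and &gt; escapes below become literal angle brackets
       in the string value, so the function sees "<gi>teiCorpus</gi>". -->
  <xsl:variable name="escaped" as="xs:string"
      select="'contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a &lt;gi&gt;teiCorpus&lt;/gi&gt; element.'"/>

  <!-- Entry point: the XSLT 3.0 default initial template (assumed invocation). -->
  <xsl:template name="xsl:initial-template">
    <!-- "desc" is an illustrative wrapper, not part of the original data -->
    <desc>
      <!-- parse-xml-fragment() returns a document node; copying it here
           copies its children: text, a gi element, and more text. -->
      <xsl:copy-of select="parse-xml-fragment($escaped)"/>
    </desc>
  </xsl:template>

</xsl:stylesheet>
```

If this runs cleanly in your setup while the same string fails inside your real stylesheet, the problem is more likely in how the string reaches the function than in the fragment itself.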
Solution 1
--------------
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
  <!-- The original placeholder characters were lost in the archive.
       U+E000 and U+E001 are assumed here; any two distinct
       Private Use Area characters will do. -->
  <!ENTITY less "&#xE000;">
  <!ENTITY great "&#xE001;">
]>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:data="http://exselt.net/data"
    xmlns:text="http://example.com/text"
    exclude-result-prefixes="xsl data text"
    version="3.0">

  <xsl:output indent="yes" use-character-maps="angle-brackets" />

  <!-- on serialization, turn the placeholders into real angle brackets -->
  <xsl:character-map name="angle-brackets">
    <xsl:output-character character="&less;" string="&lt;"/>
    <xsl:output-character character="&great;" string="&gt;"/>
  </xsl:character-map>

  <data:escaped>
    <text:p>indicates the amount by which this zone has been rotated clockwise, with respect to the normal orientation of the parent &lt;gi&gt;surface&lt;/gi&gt; element as implied by the dimensions given in the &lt;gi&gt;msDesc&lt;/gi&gt; element or by the coordinates of the &lt;gi&gt;surface&lt;/gi&gt; itself. The orientation is expressed in arc degrees.</text:p>
    <text:p>a start-tag, with delimiters &lt; and &gt; is intended</text:p>
    <text:p>contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a &lt;gi&gt;teiCorpus&lt;/gi&gt; element.</text:p>
  </data:escaped>

  <!-- doc('') reads the stylesheet itself, so the test data can live inline -->
  <xsl:variable name="data" select="doc('')/*/data:escaped" />

  <xsl:template match="/">
    <xsl:apply-templates select="$data/text:p" />
  </xsl:template>

  <xsl:template match="text:p">
    <xsl:copy copy-namespaces="no">
      <xsl:apply-templates />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- find an opening '<' not followed by a space, until the first closing '>' -->
    <xsl:analyze-string select="." regex="&lt;([^ >][^>]+)>">
      <xsl:matching-substring>
        <xsl:value-of select="'&less;' || regex-group(1) || '&great;'" />
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
Solution 2
--------------
The following uses fn:parse-xml-fragment() and the new xsl:try/xsl:catch to fix the fragment if an error occurs. Again, this is not foolproof, but as a fallback, it simply dumps the string as it is when it cannot be processed.
Note that I also deliberately changed the 3rd text to be invalid, but with only one error so that it can be fixed by the "fixup" part. Note too that the recursive nature of this solution is currently very limited, but once better errors are available in try/catch (with line-number and column-number), you might use this as a starting point for an XML cleanup function ;).
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:data="http://example.com/data"
    xmlns:text="http://exselt.net/text"
    xmlns:err="http://www.w3.org/2005/xqt-errors"
    exclude-result-prefixes="xs xsl data text err"
    version="3.0">

  <xsl:output indent="yes"/>

  <data:escaped>
    <text:p> indicates the amount by which this zone has been rotated clockwise, with respect to the normal orientation of the parent &lt;gi&gt;surface&lt;/gi&gt; element as implied by the dimensions given in the &lt;gi&gt;msDesc&lt;/gi&gt; element or by the coordinates of the &lt;gi&gt;surface&lt;/gi&gt; itself. The orientation is expressed in arc degrees. </text:p>
    <text:p> a start-tag, with delimiters &lt; and &gt; is intended </text:p>
    <text:p> contains a single &lt;TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a &lt;giteiCorpus&lt;/gi&gt; element. </text:p>
  </data:escaped>

  <xsl:variable name="data" select="doc('')/*/data:escaped" />

  <xsl:template match="/">
    <xsl:apply-templates select="$data/text:p" />
  </xsl:template>

  <xsl:template match="text:p">
    <xsl:copy copy-namespaces="no">
      <xsl:apply-templates mode="parse" />
    </xsl:copy>
  </xsl:template>

  <!-- matches the text nodes on the first pass and the fixed-up string on the retry -->
  <xsl:template match="." mode="parse">
    <xsl:param name="recur" as="xs:boolean" select="true()" />
    <xsl:try>
      <xsl:copy-of select="parse-xml-fragment(.)" />

      <!-- when parsing fails, this is the error -->
      <xsl:catch errors="err:FODC0006">

        <!-- recursively apply templates until we are fixed;
             currently max one level deep, should use
             $err:line/col-number once that is available -->
        <xsl:variable name="pos"
            select="string-length(substring-before(., '&lt;'))" />

        <!-- poor man's error fixing: escape the first '<' in the string -->
        <xsl:variable name="fixed" select="
            substring(., 1, $pos)
            || substring(., $pos + 1, 1)!replace(., '&lt;', '&amp;lt;')
            || substring(., $pos + 2)" />

        <!-- using Dimitre's style ifs for recursion ;) -->
        <xsl:apply-templates select="$fixed[$recur]" mode="#current">
          <!-- the parameter name must match the xsl:param above
               ("recur"), otherwise the recursion never stops -->
          <xsl:with-param name="recur" select="false()" />
        </xsl:apply-templates>
        <!-- second failure: give up and dump the string as it is -->
        <xsl:copy-of select="$fixed[not($recur)]" />
      </xsl:catch>
    </xsl:try>
  </xsl:template>

</xsl:stylesheet>
Both stylesheets should work cross-processor. I tried them with Exselt and Saxon.
Not sure whether any of this is of use for your current use case, but it was a nice exercise to play around with, and it made me find some issues in either processor related to error handling and applying predicates to strings (both of which I will report appropriately).
Cheers,
Abel Braaksma
Exselt XSLT 3.0 processor
http://exselt.net