Subject: Re: [xsl] Issue processing heterogeneous DITA files
From: "Darcy Parker" <darcyparker@xxxxxxxxx>
Date: Thu, 25 Sep 2008 13:50:52 -0400

Hi Doug,

I have run into similar challenges with DITA and other XML files. Here's what I have found:

Figuring out the doctype of the input document is challenging because the XML parser that feeds input to Saxon doesn't pass on the doctype information. The XML parser also resolves entities. XSLT (Saxon) is doing an identity transformation, but it is an identity based on what it sees from the XML parser, not on what you see when you look at the file as text. (I am not sure whether entity declarations are an issue for you, but if you do use them, you may not want them resolved for the purposes of the identity transformation.)

I and others have asked about an alternate XML parser that encodes the doctype and entity declarations into the DOM. (There was an article written about this a while ago: http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html) When I asked, Michael Kay said he thought this solution/article was proposed before SAX2, and he and others suggested that with SAX2 an alternate parser could be written. Unfortunately I am not advanced enough to figure this out yet.

(Everyone: I would love it if someone could write an open tool that transforms the XML text file into markup as suggested by http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html. Then users could write XSLT that has full information about what was in the text file, instead of the representation presented by the XML parser. This tool could be used as a step before parsing the XML, or as a replacement for the actual XML parser used by the XSLT processor.)

Here's what I have come up with as a workaround: use unparsed-text() and xsl:analyze-string to look up the doctype. (It doesn't prevent the entity declarations from being resolved when you later read the file in to be processed by XSLT, but it does let you get the doctype by just parsing the file as plain text.) The following function is not perfect, but it works for me (for now).

(Note: I defined a namespace fn for this function.)

<xsl:function name="fn:getdoctype" as="xs:string">
  <xsl:param name="file" as="xs:string"/>
  <xsl:choose>
    <xsl:when test="unparsed-text-available($file)">
      <xsl:variable name="FileContent" select="unparsed-text($file)"/>
      <xsl:variable name="Result">
        <!--**** In future need to consider !DOCTYPE being inside a comment...-->
        <xsl:analyze-string select="$FileContent"
          regex="&lt;!DOCTYPE\s+(\S*)\s+(PUBLIC\s+[&quot;']{{1}}([^&quot;']*)[&quot;']{{1}}\s+[&quot;']{{1}}([^&quot;']*)[&quot;']{{1}}|SYSTEM\s+[&quot;']{{1}}([^&quot;']*)[&quot;']{{1}})">
          <!--See http://www.xml.com/pub/a/2003/06/04/tr.html for info on regex in xslt 2.0-->
          <!-- regex is a bit complicated...
               - note double { and double } such as {{1}} are necessary in order to
                 tell xsl processor not to treat the curly braces as attribute value
                 template expression delimiters
               - note the " has to be escaped as &quot;
               group 1 is the doctype name
               group 2 contains (group 3 and 4) or group 5
               group 3 is the PUBLIC ID
               group 4 is the SYSTEM ID if the PUBLIC ID is present
               group 5 is the SYSTEM ID if no PUBLIC ID is present -->
          <xsl:matching-substring>
            <xsl:choose>
              <xsl:when test="not(regex-group(3)='')">
                <xsl:value-of select="regex-group(3)"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="regex-group(5)"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:variable>
      <xsl:value-of select="$Result"/>
      <!--Need to create the Result variable because result may be '' and
          xsl:function would ignore '' otherwise.-->
    </xsl:when>
    <xsl:otherwise>
      <xsl:message terminate="yes">ERROR: fn:getdoctype($file). $file="<xsl:value-of select="$file"/>" does not exist.</xsl:message>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>

Once you have the doctype, you will have to use xsl:result-document, as you found.

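One possible way to put the two pieces together, offered only as a rough sketch and not as something tested against a DITA set: in XSLT 2.0 the doctype-public and doctype-system attributes of xsl:result-document are attribute value templates (unlike those on xsl:output), so the IDs read from the raw text can be passed straight through without one named xsl:output per document type. The function name fn:doctype-ids, the namespace URI bound to fn, and the someOutputURI parameter are all hypothetical. The sketch assumes the source document has a usable document URI and a PUBLIC doctype; SYSTEM-only doctypes are not handled. It also strips comments before matching, which is one way to deal with a commented-out !DOCTYPE.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://example.com/ns/local-functions"
    exclude-result-prefixes="xs fn">

  <!-- Hypothetical parameter: where the result file should be written. -->
  <xsl:param name="someOutputURI" as="xs:string" required="yes"/>

  <!-- Hypothetical helper: returns (public ID, system ID) of the first
       PUBLIC doctype found in the raw file text, or an empty sequence. -->
  <xsl:function name="fn:doctype-ids" as="xs:string*">
    <xsl:param name="file" as="xs:string"/>
    <!-- Remove comments first so a commented-out DOCTYPE is not matched.
         The 's' flag lets '.' match newlines; '.*?' is reluctant. -->
    <xsl:variable name="text" as="xs:string"
        select="replace(unparsed-text($file), '&lt;!--.*?-->', '', 's')"/>
    <xsl:variable name="ids" as="xs:string*">
      <!-- No curly braces are used here, so nothing needs doubling. -->
      <xsl:analyze-string select="$text"
          regex="&lt;!DOCTYPE\s+\S+\s+PUBLIC\s+[&quot;']([^&quot;']*)[&quot;']\s+[&quot;']([^&quot;']*)[&quot;']">
        <xsl:matching-substring>
          <xsl:sequence select="regex-group(1), regex-group(2)"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Keep only the first match. -->
    <xsl:sequence select="subsequence($ids, 1, 2)"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- Assumption: the source was loaded from a URI that unparsed-text()
         can read again as plain text. -->
    <xsl:variable name="ids" as="xs:string*"
        select="fn:doctype-ids(string(document-uri(/)))"/>
    <xsl:if test="count($ids) ne 2">
      <xsl:message terminate="yes">ERROR: no PUBLIC doctype found in the source file.</xsl:message>
    </xsl:if>
    <!-- These serialization attributes are attribute value templates. -->
    <xsl:result-document href="{$someOutputURI}" method="xml" encoding="UTF-8"
        indent="yes" doctype-public="{$ids[1]}" doctype-system="{$ids[2]}">
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:template>

  <!-- Identity template; more specific templates would follow. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that the output doctype is whatever the input declared, so this only suits transformations that leave the document type unchanged.
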
Darcy

On Thu, Sep 25, 2008 at 1:24 PM, Burgess, Doug <doug.burgess@xxxxxxx> wrote:
>
> Hi there,
>
> I've encountered a number of use cases recently in which I need to apply
> identity transformations to directories full of heterogeneous DITA
> files, i.e. dirs full of topic, task, concept, reference, or various
> specializations of them. The use cases range from tweaking href/conref
> paths to account for new/different CMS locations, to cleaning up
> unreferenced IDs, to modifying the structure of certain elements to
> support new domain specializations, etc. I'm using Ant version 1.7.1 to
> pass the source files to Saxon 9.
>
> The problem I've run into involves figuring out the required document
> type of the result document at runtime. I had assumed I could use
> something like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     version="2.0">
>
>   <xsl:variable name="publicID">
>     <xsl:choose>
>       <xsl:when test="/topic">
>         <xsl:value-of select="'-//OASIS//DTD DITA Topic//EN'"/>
>       </xsl:when>
>
>       <!-- ....etc. for all DITA document types -->
>
>       <xsl:otherwise>
>         <xsl:value-of select="'-//OASIS//DTD DITA Composite//EN'"/>
>       </xsl:otherwise>
>     </xsl:choose>
>   </xsl:variable>
>
>
>   <xsl:variable name="systemID">
>     <xsl:choose>
>       <xsl:when test="/topic">
>         <xsl:value-of select="'topic.dtd'"/>
>       </xsl:when>
>
>       <!-- ....etc. for all DITA document types -->
>
>       <xsl:otherwise>
>         <xsl:value-of select="'ditabase.dtd'"/>
>       </xsl:otherwise>
>     </xsl:choose>
>   </xsl:variable>
>
>   <xsl:output method="xml" encoding="UTF-8" indent="yes"
>       doctype-public="{$publicID}" doctype-system="{$systemID}"/>
>
>   <xsl:template match="/">
>     <xsl:apply-templates/>
>   </xsl:template>
>
>   <xsl:template match="@* | node()">
>     <xsl:copy>
>       <xsl:copy-of select="@*"/>
>       <xsl:apply-templates/>
>     </xsl:copy>
>   </xsl:template>
>
>   <!-- More specific templates follow here -->
>
> </xsl:stylesheet>
>
>
> However, once I saw the result...
>
> <!DOCTYPE topic PUBLIC "{$publicID}" "{$systemID}">
> ...etc.
>
> ...I consulted the XSLT 2.0 Programmers reference and noted the
> restrictions on attribute-value templates in top-level declarations.
>
> The solution is a slightly kludgy use of named xsl:outputs, referenced
> by an <xsl:result-document>, as follows:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>     version="2.0">
>
>   <xsl:output name="ditaTopic" method="xml" encoding="UTF-8"
>       indent="yes" doctype-public="-//OASIS//DTD DITA Topic//EN"
>       doctype-system="topic.dtd"/>
>
>   <!-- ....etc. for all DITA document types -->
>
>   <xsl:output name="ditaComposite" method="xml" encoding="UTF-8"
>       indent="yes" doctype-public="-//OASIS//DTD DITA Composite//EN"
>       doctype-system="ditabase.dtd"/>
>
>   <xsl:param name="someOutputURI"/>
>
>   <xsl:variable name="theDoctype">
>     <xsl:choose>
>       <xsl:when test="/topic">
>         <xsl:value-of select="'ditaTopic'"/>
>       </xsl:when>
>
>       <!-- ....etc. for all DITA document types -->
>
>       <xsl:otherwise>
>         <xsl:value-of select="'ditaComposite'"/>
>       </xsl:otherwise>
>     </xsl:choose>
>   </xsl:variable>
>
>   <xsl:template match="/">
>     <xsl:result-document format="{$theDoctype}"
>         href="{$someOutputURI}">
>       <xsl:apply-templates/>
>     </xsl:result-document>
>   </xsl:template>
>
>   <xsl:template match="@* | node()">
>     <xsl:copy>
>       <xsl:copy-of select="@*"/>
>       <xsl:apply-templates/>
>     </xsl:copy>
>   </xsl:template>
>
>   <!-- More specific templates follow here -->
>
> </xsl:stylesheet>
>
>
> I find this kludgy, because it seems counterintuitive to me to use
> <xsl:result-document> when executing a one-to-one transformation. So I'm
> left wondering why the use of attribute-value templates has been
> restricted in this way, at least on public ID and system ID strings. The
> XSLT 2.0 Programmers Reference says this ensures that declaration values
> are constant for each run of the stylesheet, however that's exactly the
> result I don't want.
>
> Is there some inherent danger in allowing for attribute-value templates
> on @doctype-public and @doctype-system that I'm not seeing? Or was there
> an assumption made by the XSLT 2.0 WG that transformations will only be
> applied to instances of a uniform document type. Processing DITA in this
> fashion seems to provide a counterexample to that....
>
> Thanks for any comments,
> Doug
>
> =============================
> Doug Burgess
> Content Management Lead, UBI
> doug.burgess@xxxxxxx
> Work: 604-974-2334 Mobile: 778-840-8004
> Business Objects, an SAP Company
> Vancouver BC
> Canada