Re: [xsl] Issue processing heterogeneous DITA files

Subject: Re: [xsl] Issue processing heterogeneous DITA files
From: "Darcy Parker" <darcyparker@xxxxxxxxx>
Date: Thu, 25 Sep 2008 13:50:52 -0400
Hi Doug,

I have run into similar challenges with DITA and other xml files.

Here's what I have found:

Figuring out the doctype of the input doc is challenging because the
XML parser that feeds input to Saxon doesn't pass on the doctype info.
As well, the XML parser expands entity references.  XSLT (Saxon) is
doing an identity transformation... but it is an identity based on
what it sees from the XML parser, and not on what you see when you look
at the file as text.  (I am not sure if entity declarations are an
issue for you or not... but if you do use them, then you may not want
them resolved for the purposes of the identity transformation.)
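
A small made-up illustration of that point (the entity name and the
topic content are invented for the example): a file like the one below
reaches the stylesheet as a tree in which the title is already the
plain text "About Widget Pro", with no record of the DOCTYPE line or
of the &product; reference.  An identity transformation can only write
back what it was given.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input file, for illustration only -->
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd" [
  <!ENTITY product "Widget Pro">
]>
<topic id="intro">
  <title>About &product;</title>
</topic>
```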

I and others have asked about an alternate XML parser that encodes the
doctype and entity declarations into the DOM... (There was an article
written about this a while ago:
http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html)  When I asked,
Michael Kay said he thought this solution/article was proposed before
SAX2... and he and others suggested that with SAX2, an alternate
parser could be written.  Unfortunately I am not advanced enough to
figure this out yet.

(Everyone: I would love it if someone could write an open tool that
transforms the XML text file into markup as suggested by
http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html.  Then users could
write XSLT that has full information about what was in the text file -
instead of the representation presented by the XML parser.  This tool
could be used as a step before parsing the XML.  Or a replacement for
the actual XML parser used by the XSLT processor.)

Here's what I have come up with as a workaround - use unparsed-text()
and xsl:analyze-string to look up the doctype...  (It doesn't prevent
the entity references from being resolved when you later read the file
in to be processed by XSLT.  But it does let you get the doctype by
just reading the file as plain text.)

The following function is not perfect... but it works for me (for
now).  It returns the PUBLIC ID, or the SYSTEM ID if there is no
PUBLIC ID.  (Note: I defined a namespace, bound to the prefix fn, for
this function...)

<xsl:function name="fn:getdoctype" as="xs:string">
    <xsl:param name="file" as="xs:string"/>
    <xsl:choose>
        <xsl:when test="unparsed-text-available($file)">
            <xsl:variable name="FileContent" select="unparsed-text($file)"/>
            <xsl:variable name="Result">
                <!--**** In future need to consider !DOCTYPE being
inside a comment...-->
                <xsl:analyze-string select="$FileContent"
regex="&lt;!DOCTYPE\s+(\S*)\s+(PUBLIC\s+[&#34;']{{1}}([^&#34;']*)[&#34;']{{1}}\s+[&#34;']{{1}}([^&#34;']*)[&#34;']{{1}}|SYSTEM\s+[&#34;']{{1}}([^&#34;']*)[&#34;']{{1}})">

                <!--See http://www.xml.com/pub/a/2003/06/04/tr.html
for info on regex in xslt 2.0-->

                <!--    regex is a bit complicated...
                    - note double { and double } such as {{1}} are
                      necessary because the regex attribute is an
                      attribute value template; single curly braces
                      would be treated as expression delimiters
                    - note the " has to be escaped as &#34;

                    group 1 is the doctype name
                    group 2 contains (group 3 and 4) or group 5
                    group 3 is the PUBLIC ID
                    group 4 is the SYSTEM ID if the PUBLIC ID is present
                    group 5 is the SYSTEM ID if no PUBLIC ID is present-->

                    <xsl:matching-substring>
                        <xsl:choose>
                            <xsl:when test="not(regex-group(3)='')">
                                <xsl:value-of select="regex-group(3)"/>
                            </xsl:when>
                            <xsl:otherwise>
                                <xsl:value-of select="regex-group(5)"/>
                            </xsl:otherwise>
                        </xsl:choose>
                    </xsl:matching-substring>
                </xsl:analyze-string>
            </xsl:variable>
            <xsl:value-of select="$Result"/>  <!--Need to go through the
Result variable because there may be no match; the function would then
return an empty sequence instead of '', which as="xs:string" rejects.-->
        </xsl:when>
        <xsl:otherwise>
            <xsl:message terminate="yes">ERROR: fn:getdoctype($file).
$file="<xsl:value-of select="$file"/>" does not exist.</xsl:message>
        </xsl:otherwise>
    </xsl:choose>

</xsl:function>

Once you have the doctype, you will have to use xsl:result-document -
as you found.
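
Since xsl:result-document, unlike xsl:output, accepts attribute value
templates for doctype-public and doctype-system, the looked-up ID can
be fed in directly rather than through one named xsl:output per
document type.  A rough sketch, not tested against your files, with
these assumptions: the fn prefix is bound to a made-up namespace URI
(urn:example:fn; it must match whatever the function is declared in),
getdoctype.xsl is a hypothetical module containing the function above,
someOutputURI is the parameter from your stylesheet, and the processor
knows the source document's URI so that document-uri(/) returns
something.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="urn:example:fn"
    exclude-result-prefixes="xs fn">

    <!-- Assumption: hypothetical module holding fn:getdoctype,
         declared in the same (made-up) namespace as above. -->
    <xsl:include href="getdoctype.xsl"/>

    <xsl:param name="someOutputURI" as="xs:string"/>

    <!-- PUBLIC ID exactly as written in the source file. -->
    <xsl:variable name="publicID" as="xs:string"
        select="fn:getdoctype(document-uri(/))"/>

    <!-- fn:getdoctype does not return the SYSTEM ID when a PUBLIC ID
         is present, so it is looked up from the PUBLIC ID here. -->
    <xsl:variable name="systemID" as="xs:string">
        <xsl:choose>
            <xsl:when test="$publicID = '-//OASIS//DTD DITA Topic//EN'">topic.dtd</xsl:when>
            <!-- ...one xsl:when per document type in use... -->
            <xsl:otherwise>ditabase.dtd</xsl:otherwise>
        </xsl:choose>
    </xsl:variable>

    <xsl:template match="/">
        <!-- These attributes are attribute value templates on
             xsl:result-document, so they are evaluated at run time. -->
        <xsl:result-document href="{$someOutputURI}" method="xml"
            encoding="UTF-8" indent="yes"
            doctype-public="{$publicID}" doctype-system="{$systemID}">
            <xsl:apply-templates/>
        </xsl:result-document>
    </xsl:template>

    <!-- Identity template; more specific templates would follow. -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```

If a source file has no DOCTYPE at all, fn:getdoctype returns '' and
that case probably needs its own branch.  Extending the function to
also hand back group 4 would remove the need for the lookup.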

Darcy

On Thu, Sep 25, 2008 at 1:24 PM, Burgess, Doug <doug.burgess@xxxxxxx> wrote:
>
> Hi there,
>
> I've encountered a number of use cases recently in which I need to apply
> identity transformations to directories full of heterogeneous DITA
> files, i.e. dirs full of topic, task, concept, reference, or various
> specializations of them. The use cases range from tweaking href/conref
> paths to account for new/different CMS locations, to cleaning up
> unreferenced IDs, to modifying the structure of certain elements to
> support new domain specializations, etc. I'm using Ant version 1.7.1 to
> pass the source files to Saxon 9.
>
> The problem I've run into involves figuring out the required document
> type of the result document at runtime. I had assumed I could use
> something like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="2.0">
>
>  <xsl:variable name="publicID">
>      <xsl:choose>
>          <xsl:when test="/topic">
>            <xsl:value-of select="'-//OASIS//DTD DITA Topic//EN'"/>
>          </xsl:when>
>
>          <!-- ....etc. for all DITA document types -->
>
>          <xsl:otherwise>
>            <xsl:value-of select="'-//OASIS//DTD DITA Composite//EN'"/>
>          </xsl:otherwise>
>      </xsl:choose>
>  </xsl:variable>
>
>
>  <xsl:variable name="systemID">
>      <xsl:choose>
>          <xsl:when test="/topic">
>              <xsl:value-of select="'topic.dtd'"/>
>          </xsl:when>
>
>          <!-- ....etc. for all DITA document types -->
>
>          <xsl:otherwise>
>              <xsl:value-of select="'ditabase.dtd'"/>
>          </xsl:otherwise>
>      </xsl:choose>
>  </xsl:variable>
>
>  <xsl:output method="xml" encoding="UTF-8" indent="yes"
> doctype-public="{$publicID}" doctype-system="{$systemID}"/>
>
>  <xsl:template match="/">
>    <xsl:apply-templates/>
>  </xsl:template>
>
>  <xsl:template match="@* | node()">
>    <xsl:copy>
>      <xsl:copy-of select="@*"/>
>      <xsl:apply-templates/>
>    </xsl:copy>
>  </xsl:template>
>
> <!-- More specific templates follow here -->
>
> </xsl:stylesheet>
>
>
> However, once I saw the result...
>
> <!DOCTYPE topic PUBLIC "{$publicID}" "{$systemID}">
> ...etc.
>
> ...I consulted the XSLT 2.0 Programmer's Reference and noted the
> restrictions on attribute value templates in top-level declarations.
>
> The solution is a slightly kludgy use of named xsl:outputs, referenced
> by an <xsl:result-document>, as follows:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="2.0">
>
>    <xsl:output name="ditaTopic" method="xml" encoding="UTF-8"
> indent="yes" doctype-public="-//OASIS//DTD DITA Topic//EN"
> doctype-system="topic.dtd"/>
>
>    <!-- ....etc. for all DITA document types -->
>
>    <xsl:output name="ditaComposite" method="xml" encoding="UTF-8"
> indent="yes" doctype-public="-//OASIS//DTD DITA Composite//EN"
> doctype-system="ditabase.dtd"/>
>
>    <xsl:param name="someOutputURI"/>
>
>    <xsl:variable name="theDoctype">
>        <xsl:choose>
>            <xsl:when test="/topic">
>                <xsl:value-of select="'ditaTopic'"/>
>            </xsl:when>
>
>            <!-- ....etc. for all DITA document types -->
>
>            <xsl:otherwise>
>                <xsl:value-of select="'ditaComposite'"/>
>            </xsl:otherwise>
>        </xsl:choose>
>    </xsl:variable>
>
>    <xsl:template match="/">
>        <xsl:result-document format="{$theDoctype}"
> href="{$someOutputURI}">
>            <xsl:apply-templates/>
>        </xsl:result-document>
>    </xsl:template>
>
>    <xsl:template match="@* | node()">
>        <xsl:copy>
>            <xsl:copy-of select="@*"/>
>            <xsl:apply-templates/>
>        </xsl:copy>
>    </xsl:template>
>
>    <!-- More specific templates follow here -->
>
> </xsl:stylesheet>
>
>
> I find this kludgy, because it seems counterintuitive to me to use
> <xsl:result-document> when executing a one-to-one transformation. So I'm
> left wondering why the use of attribute-value templates has been
> restricted in this way, at least on public ID and system ID strings. The
> XSLT 2.0 Programmer's Reference says this ensures that declaration values
> are constant for each run of the stylesheet; however, that's exactly the
> result I don't want.
>
> Is there some inherent danger in allowing for attribute-value templates
> on @doctype-public and @doctype-system that I'm not seeing? Or was there
> an assumption made by the XSLT 2.0 WG that transformations will only be
> applied to instances of a uniform document type? Processing DITA in this
> fashion seems to provide a counterexample to that....
>
> Thanks for any comments,
> Doug
>
> =============================
> Doug Burgess
> Content Management Lead, UBI
> doug.burgess@xxxxxxx
> Work: 604-974-2334 Mobile: 778-840-8004
> Business Objects, an SAP Company
> Vancouver BC
> Canada
