Re: [xsl] accessing the input XML's doctype

Subject: Re: [xsl] accessing the input XML's doctype
From: "Darcy Parker" <darcyparker@xxxxxxxxx>
Date: Wed, 16 Jul 2008 16:59:21 -0400

Hi James,

I have run into similar problems with Arbortext content.  I think
there are a couple of problems.

1) The declarations, including entity declarations, are resolved by
the parser, so XSLT only sees the resolved values.  You therefore
don't know what the output's doctype should be, and you can't undo
the entity resolution.

2) You need to be able to strip out the <atict:del></atict:del>
elements and their descendants, and remove the
<atict:add></atict:add> wrappers while copying their descendants.
This is fairly straightforward and I suspect you have this working.
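
A minimal sketch of the stripping step in point 2, as an XSLT 2.0
identity transform.  The namespace URI bound to the atict prefix is an
assumption here; check it against the declaration in your own
documents.  Any other change-tracking elements Arbortext may emit are
not handled.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atict="http://www.arbortext.com/namespace/atict">
  <!-- The atict namespace URI above is an assumption; use the one
       declared in your input documents. -->

  <!-- Identity template: copy everything not matched below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Deleted content: drop the element and all of its descendants. -->
  <xsl:template match="atict:del"/>

  <!-- Added content: drop the wrapper, keep processing its children. -->
  <xsl:template match="atict:add">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Note that xsl:copy also copies in-scope namespace nodes, so the atict
namespace declaration may still appear in the output even though no
atict elements remain.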

But the first problem, as you identified, is painful.  I have partially
solved it with an xsl:function similar to yours, using
xsl:analyze-string on unparsed-text().  This gets the doctype so
that you can set it in xsl:result-document, but it doesn't solve the
problem of the entity references being resolved, which you don't want
if you're just trying to strip the tracked changes.
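
A sketch of that partial approach, under stated assumptions: the f
namespace URI and the function name are placeholders, and the regex
only handles a PUBLIC declaration with double-quoted identifiers.  It
does not cover SYSTEM-only declarations or single-quoted identifiers,
and it has no fallback when no doctype is found.  It does nothing
about entity resolution.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="xs f">
  <!-- urn:example:functions is a placeholder namespace. -->

  <!-- Returns (public-id, system-id) from the first matching
       DOCTYPE ... PUBLIC declaration in the raw file text,
       or an empty sequence if nothing matches. -->
  <xsl:function name="f:input-doctype" as="xs:string*">
    <xsl:param name="uri" as="xs:string?"/>
    <!-- The "x" flag strips the spaces used here for readability. -->
    <xsl:variable name="regex">
      <xsl:text>&lt;!DOCTYPE \s+ [^\s>\[]+ \s+ PUBLIC \s+ "([^"]*)" \s+ "([^"]*)"</xsl:text>
    </xsl:variable>
    <xsl:variable name="ids" as="xs:string*">
      <xsl:analyze-string select="unparsed-text($uri)"
                          regex="{$regex}" flags="x">
        <xsl:matching-substring>
          <xsl:sequence select="regex-group(1), regex-group(2)"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Keep only the first match in case the pattern occurs again. -->
    <xsl:sequence select="subsequence($ids, 1, 2)"/>
  </xsl:function>

  <xsl:variable name="doctype" select="f:input-doctype(document-uri(/))"/>

  <!-- Write the principal output with the identifiers read above. -->
  <xsl:template match="/">
    <xsl:result-document doctype-public="{$doctype[1]}"
                         doctype-system="{$doctype[2]}">
      <xsl:apply-templates/>
    </xsl:result-document>
  </xsl:template>

  <!-- Identity template so the sketch produces a full copy. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Using [^"]* instead of .+ for the identifiers keeps each group inside
its own pair of quotes, so a later quoted string in the file cannot be
swallowed into the match.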

I don't have a solution yet because I was able to work around the
problem.  But I came across this article:
http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html
It recommends a modified XML parser that wraps the doctype and entity
declarations in new markup so that your XSLT processor has this
information to work with.  It seems to me that this will be your best
solution.  Unfortunately, I haven't figured out how to get a modified
parser like this or how to use it with Saxon.

If you figure it out, or if others can comment, please let me know.

Thanks
Darcy

On Wed, Jul 16, 2008 at 3:40 PM, James Sulak <jsulak@xxxxxxxxxxxxxxxx> wrote:
> Hello All,
>
> I'm trying to write a transform that gives the output XML file the same
> document type as the input XML file.  (Specifically, it's a transform to
> remove Arbortext Editor's change-tracking markup).  I'm not happy with
> the method I'm using now, namely regexing the input XML as an unparsed
> document to extract the public and system identifiers from the doctype
> declaration.
>
> I have a fairly limited knowledge of how an XSLT processor (we're using
> Saxon) interacts with the XML parser.  But my understanding is that the
> parser reads in the XML, resolves any default attribute values, and then
> passes the document tree to the XSLT processor.  The XSLT processor
> itself doesn't know or care about the doctype information.  Is this
> correct?
>
> If it is, that would seem to imply that what I'm asking is impossible
> without writing an extension function.  Is this the case?  Since our
> implementation is already dependent on several Saxon extension
> functions, that's an acceptable solution.  Has anyone attempted anything
> like this, or have any suggestions on how to proceed?  Could I call
> Xerces (or another parser) from an extension function and get the public
> and system identifiers?
>
> Here's the relevant part of my current method:
>
>   <xsl:param name="doctype.public"
> select="f:input-doctype(document-uri(.))[1]"/>
>   <xsl:param name="doctype.system"
> select="f:input-doctype(document-uri(.))[2]"/>
>
>   <xsl:function name="f:input-doctype">
>      <xsl:param name="document-uri"/>
>      <xsl:variable name="unparsed-document"
> select="unparsed-text($document-uri)"/>
>      <xsl:variable name="regex">
>         <xsl:text>DOCTYPE
>                                 [\s]*
>                                 ([a-zA-Z0-9]+)
>                                 [\s]*
>                                 PUBLIC
>                                 [\s]*
>                                 "([^"]+)"
>                                 [\s]*
>                                 "([0-9a-zA-Z/]+\.dtd)"
>         </xsl:text>
>      </xsl:variable>
>      <xsl:analyze-string select="$unparsed-document" regex="{$regex}"
> flags="msx">
>         <xsl:matching-substring>
>            <xsl:sequence select="regex-group(2), regex-group(3)"/>
>         </xsl:matching-substring>
>      </xsl:analyze-string>
>   </xsl:function>
>
>   <xsl:output method="xml" version="1.0" encoding="utf-8"/>
>
>   <xsl:template match="/">
>      <xsl:result-document doctype-public="{$doctype.public}"
> doctype-system="{$doctype.system}">
>         <xsl:apply-templates/>
>      </xsl:result-document>
>   </xsl:template>
>
>
> Thanks,
>
> -James
>
>
> -----
> James Sulak
> Electronic Publishing Developer
> Jones McClure Publishing
