RE: [xsl] accessing the input XML's doctype

Subject: RE: [xsl] accessing the input XML's doctype
From: "James Sulak" <jsulak@xxxxxxxxxxxxxxxx>
Date: Thu, 17 Jul 2008 09:50:56 -0500
Thanks, everyone, for your responses.

Darcy - Fortunately, I have the meat of the transform working (accepting
splits and joins, too).  The article looks interesting.

David - I like the idea of default attributes, but ideally I want the
transform to be truly universal.  Maybe the transform could first check
for those attributes, and if they don't exist, fall back to my current
plain-text parsing method.

Michael - Writing a custom SAX filter is a bit beyond my current
abilities, but it would be a good learning project when I have time.

If I ever get anything more sophisticated or elegant working, I'll post
it to the list.

Thanks,

-James





-----Original Message-----
From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
Sent: Wednesday, July 16, 2008 6:08 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: RE: [xsl] accessing the input XML's doctype

One thing you could try doing - I've had it in mind for years - is to
write a filter between the XML parser and the XSLT processor, using SAX
interfaces, that gets notification of the DTD events from the parser and
translates them into things the XSLT processor understands, like
elements and attributes in some special namespace.

This seems much cleaner architecturally than reading the document as
unparsed text and trying to parse it yourself.
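
A minimal sketch of such a filter, in Java, might look like the
following. It is only one possible reading of the idea, not a tested
production component. The class name DoctypeFilter, the namespace URI
urn:example:doctype, the "dt" prefix and the demo() helper are all
hypothetical names chosen for illustration. The filter listens for the
parser's startDTD event through the standard SAX LexicalHandler
interface and copies the public and system identifiers onto the root
element as attributes in that namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.*;

/** Copies the DOCTYPE identifiers onto the root element as namespaced attributes. */
class DoctypeFilter extends XMLFilterImpl implements LexicalHandler {
    static final String NS = "urn:example:doctype"; // hypothetical namespace
    static final String LEXICAL = "http://xml.org/sax/properties/lexical-handler";

    private LexicalHandler downstream; // the consumer's own lexical handler, if it set one
    private String publicId, systemId;
    private int depth;
    private boolean annotated;

    DoctypeFilter(XMLReader parent) { super(parent); }

    // Remember the consumer's lexical handler instead of letting it replace ours.
    @Override public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (LEXICAL.equals(name)) downstream = (LexicalHandler) value;
        else super.setProperty(name, value);
    }

    // Register this filter for DTD events on the real parser just before parsing.
    @Override public void parse(InputSource input) throws SAXException, IOException {
        getParent().setProperty(LEXICAL, this);
        super.parse(input);
    }

    @Override public void startDocument() throws SAXException {
        publicId = systemId = null; depth = 0; annotated = false;
        super.startDocument();
    }

    // The parser reports the DOCTYPE here, before the root element starts.
    @Override public void startDTD(String name, String pub, String sys) throws SAXException {
        publicId = pub; systemId = sys;
        if (downstream != null) downstream.startDTD(name, pub, sys);
    }

    @Override public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth++ == 0 && (publicId != null || systemId != null)) {
            AttributesImpl copy = new AttributesImpl(atts);
            if (publicId != null) copy.addAttribute(NS, "public", "dt:public", "CDATA", publicId);
            if (systemId != null) copy.addAttribute(NS, "system", "dt:system", "CDATA", systemId);
            annotated = true;
            super.startPrefixMapping("dt", NS); // assumes the document does not use "dt" itself
            super.startElement(uri, local, qName, copy);
        } else {
            super.startElement(uri, local, qName, atts);
        }
    }

    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        super.endElement(uri, local, qName);
        if (--depth == 0 && annotated) super.endPrefixMapping("dt");
    }

    // Pass the remaining lexical events through untouched.
    @Override public void endDTD() throws SAXException { if (downstream != null) downstream.endDTD(); }
    @Override public void startEntity(String n) throws SAXException { if (downstream != null) downstream.startEntity(n); }
    @Override public void endEntity(String n) throws SAXException { if (downstream != null) downstream.endEntity(n); }
    @Override public void startCDATA() throws SAXException { if (downstream != null) downstream.startCDATA(); }
    @Override public void endCDATA() throws SAXException { if (downstream != null) downstream.endCDATA(); }
    @Override public void comment(char[] ch, int s, int l) throws SAXException { if (downstream != null) downstream.comment(ch, s, l); }

    /** Hypothetical demo helper: returns the attributes the filter added to the root element. */
    static Map<String, String> demo(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        DoctypeFilter filter = new DoctypeFilter(factory.newSAXParser().getXMLReader());
        // Demo only: supply an empty DTD so the example runs without the real DTD being available.
        filter.setEntityResolver((pub, sys) -> new InputSource(new StringReader("")));
        Map<String, String> seen = new LinkedHashMap<>();
        filter.setContentHandler(new DefaultHandler() {
            boolean first = true;
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                if (!first) return;
                first = false;
                for (int i = 0; i < a.getLength(); i++)
                    if (NS.equals(a.getURI(i))) seen.put(a.getLocalName(i), a.getValue(i));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("<!DOCTYPE book PUBLIC \"-//EXAMPLE//DTD Book//EN\" "
                + "\"http://example.org/book.dtd\"><book/>"));
    }
}
```

To feed a transformation rather than the demo handler, the filter can
be wrapped in a javax.xml.transform.sax.SAXSource and passed to the
processor as its source. The stylesheet would then read the two
attributes from the root element and must avoid copying them to the
output. The empty-DTD entity resolver is only there to keep the demo
self-contained. In real use you would drop it, since it also
suppresses default attribute values.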

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: James Sulak [mailto:jsulak@xxxxxxxxxxxxxxxx]
> Sent: 16 July 2008 20:40
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] accessing the input XML's doctype
>
> Hello All,
>
> I'm trying to write a transform that gives the output XML
> file the same document type as the input XML file.
> (Specifically, it's a transform to remove Arbortext Editor's
> change-tracking markup).  I'm not happy with the method I'm
> using now, namely regexing the input XML as an unparsed
> document to extract the public and system identifiers from
> the doctype declaration.
>
> I have a fairly limited knowledge of how an XSLT processor (we're using
> Saxon) interacts with the XML parser.  But my understanding
> is that the parser reads in the XML, resolves any default
> attribute values, and then passes the document tree to the
> XSLT processor.  The XSLT processor itself doesn't know or
> care about the doctype information.  Is this correct?
>
> If it is, that would seem to imply that what I'm asking is
> impossible without writing an extension function.  Is this
> the case?  Since our implementation is already dependent on
> several Saxon extension functions, that's an acceptable
> solution.  Has anyone attempted anything like this, or have
> any suggestions on how to proceed?  Could I call Xerces (or
> another parser) from an extension function and get the public
> and system identifiers?
>
> Here's the relevant part of my current method:
>
>    <xsl:param name="doctype.public"
>               select="f:input-doctype(document-uri(.))[1]"/>
>    <xsl:param name="doctype.system"
>               select="f:input-doctype(document-uri(.))[2]"/>
>
>    <xsl:function name="f:input-doctype">
>       <xsl:param name="document-uri"/>
>       <xsl:variable name="unparsed-document"
>                     select="unparsed-text($document-uri)"/>
>       <xsl:variable name="regex">
>          <xsl:text>DOCTYPE
>                                  [\s]*
>                                  ([a-zA-Z0-9]+)
>                                  [\s]*
>                                  PUBLIC
>                                  [\s]*
>                                  "([^"]+)"
>                                  [\s]*
>                                  "([0-9a-zA-Z/]+\.dtd)"
>          </xsl:text>
>       </xsl:variable>
>       <xsl:analyze-string select="$unparsed-document" regex="{$regex}"
>                           flags="msx">
>          <xsl:matching-substring>
>             <xsl:sequence select="regex-group(2), regex-group(3)"/>
>          </xsl:matching-substring>
>       </xsl:analyze-string>
>    </xsl:function>
>
>    <xsl:output method="xml" version="1.0" encoding="utf-8"/>
>
>    <xsl:template match="/">
>       <xsl:result-document doctype-public="{$doctype.public}"
>                            doctype-system="{$doctype.system}">
>          <xsl:apply-templates/>
>       </xsl:result-document>
>    </xsl:template>
>
>
> Thanks,
>
> -James
>
>
> -----
> James Sulak
> Electronic Publishing Developer
> Jones McClure Publishing
