Re: [xsl] accessing the input XML's doctype

Subject: Re: [xsl] accessing the input XML's doctype
From: "Darcy Parker" <darcyparker@xxxxxxxxx>
Date: Thu, 17 Jul 2008 11:07:14 -0400

Can anyone point to a modified XML parser that works with Saxon and
is similar to the one in the article?
http://www.xml.com/pub/a/2000/08/09/xslt/xslt.html

It seems like the modified XML parser would be a good solution and
of general interest to a wide audience, so I am hoping that someone
has already created one, compiled it, and shared it freely on the
Internet with instructions on how to use it with Saxon.  Unfortunately,
I can't find one like the one the article mentions.

Or is a custom SAX filter, as Michael suggested, a better approach?
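
For reference, here is a rough sketch of what such a filter might look
like, as far as I understand the idea. It has not been tried with Saxon.
The class name DoctypeFilter, the namespace urn:example:doctype, and the
dt:doctype element are all made up for illustration. The filter listens
for the parser's startDTD event through the standard SAX LexicalHandler
interface and re-emits the DOCTYPE name and identifiers as an empty
element placed first inside the root element, so a stylesheet could
read them with ordinary XPath.

```java
import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: republishes the DOCTYPE as a dt:doctype element. */
public class DoctypeFilter extends XMLFilterImpl implements LexicalHandler {
    /** Illustrative namespace, not a standard one. */
    public static final String NS = "urn:example:doctype";
    private static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    private String dtName, dtPublic, dtSystem;
    private int depth = 0;

    public DoctypeFilter(XMLReader parent) {
        super(parent);
    }

    public String getPublicId() { return dtPublic; }
    public String getSystemId() { return dtSystem; }

    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        // XMLFilterImpl wires up content, DTD, entity and error handlers,
        // but not the lexical handler, so register it on the parent here.
        getParent().setProperty(LEXICAL_HANDLER, this);
        super.parse(input);
    }

    // LexicalHandler: only startDTD matters for this sketch.
    @Override
    public void startDTD(String name, String publicId, String systemId) {
        dtName = name;
        dtPublic = publicId;   // null if the DOCTYPE has no PUBLIC id
        dtSystem = systemId;   // as declared, not resolved to an absolute URI
    }
    @Override public void endDTD() {}
    @Override public void startEntity(String name) {}
    @Override public void endEntity(String name) {}
    @Override public void startCDATA() {}
    @Override public void endCDATA() {}
    @Override public void comment(char[] ch, int start, int length) {}

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local, qName, atts);
        // Right after the root element opens, inject the doctype element.
        if (depth++ == 0 && dtName != null) {
            AttributesImpl a = new AttributesImpl();
            a.addAttribute("", "name", "name", "CDATA", dtName);
            if (dtPublic != null) {
                a.addAttribute("", "public", "public", "CDATA", dtPublic);
            }
            if (dtSystem != null) {
                a.addAttribute("", "system", "system", "CDATA", dtSystem);
            }
            super.startPrefixMapping("dt", NS);
            super.startElement(NS, "doctype", "dt:doctype", a);
            super.endElement(NS, "doctype", "dt:doctype");
            super.endPrefixMapping("dt");
        }
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        depth--;
        super.endElement(uri, local, qName);
    }
}
```

In plain JAXP, the filter would wrap a namespace-aware XMLReader obtained
from SAXParserFactory and be handed to the transformer as
new SAXSource(filter, inputSource). The stylesheet would then need to
read the injected element and drop it from the output. I have not worked
out how best to plug this into a Saxon command-line run.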

Darcy

On Thu, Jul 17, 2008 at 10:50 AM, James Sulak <jsulak@xxxxxxxxxxxxxxxx> wrote:
>
> Thanks everyone for your responses.
>
> Darcy - Fortunately, I have the meat of the transform working (accepting
> splits and joins, too).  The article looks interesting.
>
> David - I like the idea of default attributes, but ideally I want the
> transform to be truly universal.  Maybe the transform could first check
> for those attributes, and if they don't exist, use my current
> plain-text parsing method.
>
> Michael - Writing a custom SAX filter is a bit beyond my current
> abilities, but it would be a good learning project when I have time.
>
> If I ever get anything more sophisticated or elegant working, I'll post
> it to the list.
>
> Thanks,
>
> -James
>
> -----Original Message-----
> From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
> Sent: Wednesday, July 16, 2008 6:08 PM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [xsl] accessing the input XML's doctype
>
> One thing you could try doing - I've had it in mind for years - is to
> write a filter between the XML parser and the XSLT processor, using SAX
> interfaces, that gets notification of the DTD events from the parser and
> translates them into things the XSLT processor understands, like
> elements and attributes in some special namespace.
>
> This seems much cleaner architecturally than reading the document as
> unparsed text and trying to parse it yourself.
>
> Michael Kay
> http://www.saxonica.com/
>
> > -----Original Message-----
> > From: James Sulak [mailto:jsulak@xxxxxxxxxxxxxxxx]
> > Sent: 16 July 2008 20:40
> > To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> > Subject: [xsl] accessing the input XML's doctype
> >
> > Hello All,
> >
> > I'm trying to write a transform that gives the output XML
> > file the same document type as the input XML file.
> > (Specifically, it's a transform to remove Arbortext Editor's
> > change-tracking markup).  I'm not happy with the method I'm
> > using now, namely regexing the input XML as an unparsed
> > document to extract the public and system identifiers from
> > the doctype declaration.
> >
> > I have a fairly limited knowledge of how an XSLT processor (we're using
> > Saxon) interacts with the XML parser.  But my understanding
> > is that the parser reads in the XML, resolves any default
> > attribute values, and then passes the document tree to the
> > XSLT processor.  The XSLT processor itself doesn't know or
> > care about the doctype information.  Is this correct?
> >
> > If it is, that would seem to imply that what I'm asking is
> > impossible without writing an extension function.  Is this
> > the case?  Since our implementation is already dependent on
> > several Saxon extension functions, that's an acceptable
> > solution.  Has anyone attempted anything like this, or have
> > any suggestions on how to proceed?  Could I call Xerces (or
> > another parser) from an extension function and get the public
> > and system identifiers?
> >
> > Here's the relevant part of my current method:
> >
> >    <xsl:param name="doctype.public" select="f:input-doctype(document-uri(.))[1]"/>
> >    <xsl:param name="doctype.system" select="f:input-doctype(document-uri(.))[2]"/>
> >
> >    <xsl:function name="f:input-doctype">
> >       <xsl:param name="document-uri"/>
> >       <xsl:variable name="unparsed-document" select="unparsed-text($document-uri)"/>
> >       <xsl:variable name="regex">
> >          <xsl:text>DOCTYPE
> >                                  [\s]*
> >                                  ([a-zA-Z0-9]+)
> >                                  [\s]*
> >                                  PUBLIC
> >                                  [\s]*
> >                                  "(.+)"
> >                                  [\s]*
> >                                  "([0-9a-zA-Z/]+\.dtd)"
> >          </xsl:text>
> >       </xsl:variable>
> >       <xsl:analyze-string select="$unparsed-document" regex="{$regex}" flags="msx">
> >          <xsl:matching-substring>
> >             <xsl:sequence select="regex-group(2), regex-group(3)"/>
> >          </xsl:matching-substring>
> >       </xsl:analyze-string>
> >    </xsl:function>
> >
> >    <xsl:output method="xml" version="1.0" encoding="utf-8"/>
> >
> >    <xsl:template match="/">
> >       <xsl:result-document doctype-public="{$doctype.public}" doctype-system="{$doctype.system}">
> >          <xsl:apply-templates/>
> >       </xsl:result-document>
> >    </xsl:template>
> >
> >
> > Thanks,
> >
> > -James
> >
> >
> > -----
> > James Sulak
> > Electronic Publishing Developer
> > Jones McClure Publishing
