Re: [xsl] Changing a from unstructured HTML to XML

Subject: Re: [xsl] Changing a from unstructured HTML to XML
From: Martin Honnen <Martin.Honnen@xxxxxx>
Date: Tue, 21 Sep 2010 15:29:44 +0200
Evan Leibovitch wrote:

I am working with an HTML input file, and I'd like to group things
better by sections (ultimately, with the intent of using
xsl:result-document to create a new file for each section).

What I have is not uncommon:

<h1 class="section">Section Name</h1>
<h1 class="headline">Headline name</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 2</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 3</h1>
[... assorted HTML marked up text ...]
<h1 class="section">Section 2</h1>
<h1 class="headline">Headline 4</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 5</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 6</h1>
[... assorted HTML marked up text ...]

and so on.

What I'd like to end up with, if possible, is:

<section id="Section Name">
  <headline id="Headline name">
     [...marked up text...]
  </headline>
  <headline id="Headline 2">
     [...marked up text...]
  </headline>
  <headline id="Headline 3">
     [...marked up text...]
  </headline>
</section>

XSLT 2.0 with xsl:for-each-group and group-starting-with can do that, e.g.


<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

<xsl:output method="xml" indent="yes" version="1.0"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

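  <xsl:template match="body">
    <xsl:copy>
      <!-- outer grouping: each section heading starts a new group -->
      <xsl:for-each-group select="node()"
                          group-starting-with="h1[@class = 'section']">
        <!-- skip a leading group (e.g. white space) that does not start with a section heading -->
        <xsl:if test="self::h1[@class = 'section']">
          <section id="{.}">
            <!-- inner grouping: each headline inside the section starts a new group -->
            <xsl:for-each-group select="current-group() except ."
                                group-starting-with="h1[@class = 'headline']">
              <!-- nodes between the section heading and its first headline are dropped -->
              <xsl:if test="self::h1[@class = 'headline']">
                <headline id="{.}">
                  <xsl:apply-templates select="current-group() except ."/>
                </headline>
              </xsl:if>
            </xsl:for-each-group>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>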


</xsl:stylesheet>

will turn

<body>
<h1 class="section">Section Name</h1>
<h1 class="headline">Headline name</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 2</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 3</h1>
[... assorted HTML marked up text ...]
<h1 class="section">Section 2</h1>
<h1 class="headline">Headline 4</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 5</h1>
[... assorted HTML marked up text ...]
<h1 class="headline">Headline 6</h1>
[... assorted HTML marked up text ...]
</body>

into

<body>
   <section id="Section Name">
      <headline id="Headline name">
[... assorted HTML marked up text ...]
</headline>
      <headline id="Headline 2">
[... assorted HTML marked up text ...]
</headline>
      <headline id="Headline 3">
[... assorted HTML marked up text ...]
</headline>
   </section>
   <section id="Section 2">
      <headline id="Headline 4">
[... assorted HTML marked up text ...]
</headline>
      <headline id="Headline 5">
[... assorted HTML marked up text ...]
</headline>
      <headline id="Headline 6">
[... assorted HTML marked up text ...]
</headline>
   </section>
</body>
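
If the aim is still one file per section, as mentioned in the question, the same grouping could be wrapped in xsl:result-document. The following is only a minimal sketch: the file names section1.xml, section2.xml and so on are an assumption made for illustration, and with this variant the body template writes nothing to the main result, so the main output would contain only whatever surrounds body in the input.

```xml
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- identity template, same as in the stylesheet above -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="body">
    <xsl:for-each-group select="node()"
                        group-starting-with="h1[@class = 'section']">
      <xsl:if test="self::h1[@class = 'section']">
        <!-- hypothetical naming scheme: number each file by counting the
             section headings that precede the current one -->
        <xsl:result-document
            href="section{count(preceding-sibling::h1[@class = 'section']) + 1}.xml">
          <section id="{.}">
            <xsl:for-each-group select="current-group() except ."
                                group-starting-with="h1[@class = 'headline']">
              <xsl:if test="self::h1[@class = 'headline']">
                <headline id="{.}">
                  <xsl:apply-templates select="current-group() except ."/>
                </headline>
              </xsl:if>
            </xsl:for-each-group>
          </section>
        </xsl:result-document>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

The files are numbered by counting preceding section headings rather than with position(), because position() would also count a possible leading group of white space before the first section heading.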


--


	Martin Honnen
	http://msmvps.com/blogs/martin_honnen/
