Re: [xsl] From WordprocessingML inline styles to nested inline elements

From: Yves Forkl <Y.Forkl@xxxxxx>
Date: Fri, 22 Jun 2007 16:57:13 +0200
Hello all,

Several months ago I asked for help with the following task:

Reading WordprocessingML (from Word 2003), I obtain text runs whose inline styles are attached as leaf nodes, like this:

<w:r>
  <w:rPr>
    <w:i/>
    <w:b/>
  </w:rPr>
  <w:t>This is text in bold and italic.</w:t>
</w:r>

In my output, however, the inline styles should nest, and moreover, nest in a particular order:

<run><b><i>This is text in bold and italic.</i></b></run>

With the valuable help from the list, especially from David and Wendell, I managed to craft an XSLT 2.0 stylesheet module that does the job quite well; I have attached a demo version of it below (I can post the full, richly commented version if someone is interested), together with a sample input file.


Now I need to enhance it a little to cater for cases where one inline style may have several different "indicators", which currently interfere with each other. An example is the superscript style, which manifests itself in the presence of (at least) one of these three child element sequences within w:rPr:

A) <w:vertAlign w:val="superscript"/>

B) <w:position w:val="6"/>

C) <w:position w:val="6"/><w:vertAlign w:val="superscript"/>

While A) works well, B) and C) receive multiple <sup></sup> containers; see the result of applying the sample XSL to the sample XML. I know I need to stop considering each child of w:rPr separately and instead look for sequences (or rather something like unordered node sets?) among them.

So I am facing two questions:

1) Which is the best way to replace the one-by-one comparison in

<xsl:when test="some $style_repr in w:rPr/*
            satisfies
              deep-equal($style_repr, $current_style_wordml_repr)">

with an algorithm that is capable of comparing node sets?

2) I suppose I will have to be able to delete the child elements of w:rPr that have already been matched, to prevent cases A and B above from matching when case C has already matched. (Currently I do without this, because I assumed the matching patterns to be mutually exclusive.) I think it would be easiest to turn the w:r instance, which is now accessed as the context node, into a parameter so that it can be modified. Is there a more elegant way?
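To make both questions concrete, here is a minimal sketch of one possible direction, not a tested solution. It assumes that lookup entries sharing the same r_equiv are meant as alternative indicators for one style, and that an entry with several children should match only if every one of them occurs in w:rPr, in any order. The names lookup:matches, $entries, $style_names and $remaining are made up for this illustration, and the <run> wrapper follows the target structure shown above rather than the demo stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:lookup="http://xmlns.srz.de/yforkl/xslt/lookup"
  exclude-result-prefixes="lookup w xs">

  <!-- Entries sharing r_equiv are treated as alternatives for one style
       (an assumption of this sketch) -->
  <lookup:wordml_styles_table>
    <lookup:wordml_phys_style_repr r_equiv="b"><w:b/></lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="i"><w:i/></lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <w:vertAlign w:val="superscript"/>
    </lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <w:position w:val="6"/>
    </lookup:wordml_phys_style_repr>
  </lookup:wordml_styles_table>

  <xsl:variable name="entries"
    select="document('')/xsl:stylesheet/lookup:wordml_styles_table/
            lookup:wordml_phys_style_repr"/>

  <!-- Each style name once, in table order (first occurrence wins), so the
       table order still defines the nesting hierarchy -->
  <xsl:variable name="style_names" as="xs:string*"
    select="$entries[not(@r_equiv = preceding-sibling::*/@r_equiv)]
            /string(@r_equiv)"/>

  <!-- True if every indicator child of $entry has a deep-equal counterpart
       among the children of $rPr, regardless of order -->
  <xsl:function name="lookup:matches" as="xs:boolean">
    <xsl:param name="rPr" as="element()?"/>
    <xsl:param name="entry" as="element()"/>
    <xsl:sequence select="
      exists($entry/*) and
      (every $indicator in $entry/*
       satisfies (some $prop in $rPr/*
                  satisfies deep-equal($prop, $indicator)))"/>
  </xsl:function>

  <xsl:template match="w:p">
    <sample>
      <xsl:apply-templates select="w:r"/>
    </sample>
  </xsl:template>

  <xsl:template match="w:r">
    <run>
      <xsl:call-template name="add_style">
        <xsl:with-param name="remaining" select="$style_names"/>
      </xsl:call-template>
    </run>
  </xsl:template>

  <!-- Recurse over style names; the context node stays the current w:r -->
  <xsl:template name="add_style">
    <xsl:param name="remaining" as="xs:string*"/>
    <xsl:variable name="name" select="$remaining[1]"/>
    <xsl:choose>
      <xsl:when test="empty($remaining)">
        <xsl:apply-templates select="w:t"/>
      </xsl:when>
      <!-- One element per style if ANY of its alternative entries matches -->
      <xsl:when test="some $entry in $entries[@r_equiv = $name]
                      satisfies lookup:matches(w:rPr, $entry)">
        <xsl:element name="{$name}">
          <xsl:call-template name="add_style">
            <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
          </xsl:call-template>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <xsl:call-template name="add_style">
          <xsl:with-param name="remaining" select="remove($remaining, 1)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Because the recursion runs over style names rather than over individual lookup entries, a run that satisfies several alternatives of the same style (case C) still gets only one wrapper element, so nothing has to be deleted from w:r. This would not be enough if a combined indicator were supposed to mean something different from its parts; in that case some form of "consuming" matched children would still be needed.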

Yves

===== wordml_phys_run_styles.xsl =====

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:lookup="http://xmlns.srz.de/yforkl/xslt/lookup"
  exclude-result-prefixes="lookup w"
  version="2.0">

  <lookup:wordml_styles_table>
    <lookup:wordml_phys_style_repr r_equiv="b">
      <w:b/>
    </lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="i">
      <w:i/>
    </lookup:wordml_phys_style_repr>
<!-- Uncomment this to try a naive approach to recognize superscript by two
     child elements at once -->
<!--
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <w:position w:val="6"/>
      <w:vertAlign w:val="superscript"/>
    </lookup:wordml_phys_style_repr>
-->
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <w:vertAlign w:val="superscript"/>
    </lookup:wordml_phys_style_repr>
    <lookup:wordml_phys_style_repr r_equiv="sup">
      <w:position w:val="6"/>
    </lookup:wordml_phys_style_repr>
  </lookup:wordml_styles_table>

  <xsl:template match="w:p">
    <sample>
      <xsl:apply-templates/>
    </sample>
  </xsl:template>

  <xsl:template match="w:r">
    <xsl:call-template name="convert_phys_run_styles"/>
  </xsl:template>

  <!-- Convert physical style runs and text of w:r as context node into nested
       inline elements -->
  <xsl:template name="convert_phys_run_styles">
    <xsl:call-template name="add_style">
      <!-- pass a sequence of the representations of all possible physical run
           styles; the order of its items reflects the style nesting hierarchy
           defined by the target structure; context node is the w:r which
           may have physical styles applied -->
      <xsl:with-param name="available_styles_sequence"
        select="
          document('')/
            xsl:stylesheet/
              lookup:wordml_styles_table/
                lookup:wordml_phys_style_repr"/>
    </xsl:call-template>
  </xsl:template>


  <!-- Add element for current style from hierarchy, if it is active -->
  <xsl:template name="add_style">
    <xsl:param name="available_styles_sequence"/>
    <xsl:choose>
      <xsl:when test="empty($available_styles_sequence)">
        <!-- (Children: usually either w:t or w:sym, holding text only) -->
        <xsl:apply-templates select="w:t"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="current_style_wordml"
          select="$available_styles_sequence[1]"/>
        <xsl:variable name="current_style_wordml_repr"
          select="$current_style_wordml/*[1]"/>
        <xsl:choose>
          <!-- try to find the child of w:rPr that is the WordML representation
               of the run style currently being looked for (using an
               existentially quantified comparison because the order of w:rPr's
               children is free and "deep-equal" can only compare single
               nodes); if the current style matches, add the style's element
               equivalent around the inner styles and text -->
          <!-- ### limitation: does not support indicators composed of several
               nodes with particular relationships (e.g. a specific order or
               configurations of nodes to be interpreted specially), i.e. only
               atomic recognition of the nodes is possible -->
          <xsl:when test="some $style_repr in w:rPr/*
                          satisfies
                            deep-equal($style_repr, $current_style_wordml_repr)">
            <xsl:element name="{$current_style_wordml/@r_equiv}">
              <xsl:call-template name="add_style">
                <xsl:with-param name="available_styles_sequence"
                  select="remove($available_styles_sequence, 1)"/>
              </xsl:call-template>
            </xsl:element>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="add_style">
              <xsl:with-param name="available_styles_sequence"
                select="remove($available_styles_sequence, 1)"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>


===== wordml_phys_run_styles.xml =====

<?xml version="1.0" encoding="ISO-8859-1"?>
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:r>
    <w:rPr>
      <w:i/>
      <w:vertAlign w:val="superscript"/>
    </w:rPr>
    <w:t>This italic + superscript is always fine</w:t>
  </w:r>
  <w:r>
    <w:rPr/>
    <w:t> but </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:i/>
      <w:position w:val="6"/>
    </w:rPr>
    <w:t>that italic + superscript has maybe one "sup" container too much</w:t>
  </w:r>
  <w:r>
    <w:rPr/>
    <w:t> while </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:i/>
      <w:position w:val="6"/>
      <w:vertAlign w:val="superscript"/>
    </w:rPr>
    <w:t>this italic + superscript has either a double or triple "sup" container!</w:t>
  </w:r>
</w:p>

