Re: [xsl] Seeking a smarter tokenize for augmented text

Subject: Re: [xsl] Seeking a smarter tokenize for augmented text
From: "Trevor Nicholls trevor@xxxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 7 May 2021 09:41:15 -0000
I have made some progress on this. It isn't working yet, but I'm more
confident than I was, so thanks to all for the helpful suggestions. I also
found some hints in a Stack Overflow answer by Martin Honnen, which
reinforced the advice to tackle this by adding a line marker element and
using grouping.

The original statement of the requirement was a bit vague, and the content
model currently in use is a bit too flexible. So I think I can stipulate that
inline elements will not run across line breaks (and if they do I should be
able to run a pre-fix which splits them), nor will the content include any
nested inline elements.
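
For the pre-fix, I imagine something along these lines. This is only a
minimal sketch, assuming XSLT 2.0 and inline elements that contain nothing
but text; the mode name "presplit" is made up for illustration. Any inline
element whose text contains a line break is cut into one copy of the element
per line segment, with the line breaks left as plain text between the copies.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Entry point: run the whole document through the pre-fix mode -->
  <xsl:template match="/">
    <xsl:apply-templates mode="presplit"/>
  </xsl:template>

  <!-- Inline element (child of textlines) whose text spans a line break -->
  <xsl:template match="textlines/*[matches(., '[\r\n]')]" mode="presplit">
    <xsl:variable name="inline" select="."/>
    <xsl:analyze-string select="string(.)" regex="\r\n?|\n\r?">
      <xsl:matching-substring>
        <!-- keep the line break as plain text, outside any inline element -->
        <xsl:value-of select="."/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- re-wrap each line segment in a copy of the original element -->
        <xsl:element name="{name($inline)}"
                     namespace="{namespace-uri($inline)}">
          <xsl:copy-of select="$inline/@*"/>
          <xsl:value-of select="."/>
        </xsl:element>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Everything else is copied through unchanged -->
  <xsl:template match="@*|node()" mode="presplit">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="presplit"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Note that attributes are repeated on every copy, which would be wrong for
anything that has to stay unique, such as an id.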

At the moment I'm assuming that in the step where I insert line marker
elements, I also have to use modal templates to insert inline element markers,
then run another pass to restore the inline elements. Something like this,
correct?

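  <xsl:variable name="brokenlines">
    <xsl:element name="textlines">
      <xsl:element name="linemarker"/>
      <!-- context item is assumed to be the source textlines element -->
      <xsl:for-each select="node()">
        <xsl:choose>
          <xsl:when test="self::text()">
            <!-- top-level text: replace each line ending with a marker -->
            <xsl:analyze-string select="." regex="(\r\n?|\n\r?)">
              <xsl:matching-substring>
                <xsl:element name="linemarker"/>
              </xsl:matching-substring>
              <xsl:non-matching-substring>
                <xsl:value-of select="."/>
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </xsl:when>
          <xsl:otherwise>
            <!-- inline element: flatten to [[{name}content]] -->
            <xsl:apply-templates select="." mode="break"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:element>
  </xsl:variable>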
  <xsl:variable name="textlines">
    <xsl:call-template name="rebuild">
      <xsl:with-param name="lines" select="$brokenlines"/>
    </xsl:call-template>
  </xsl:variable>
  <!-- $textlines/textlines is now the original textlines with line children -->
  ...

  <xsl:template match="textlines/*/text()" mode="break">
    <xsl:value-of select="concat('[[{', name(..), '}', ., ']]')" />
  </xsl:template>

  <xsl:template name="rebuild">
    <xsl:param name="lines" as="document-node()" />
    <xsl:element name="textlines">
      <xsl:for-each select="$lines/textlines">
        <!-- each linemarker starts a new group, i.e. a new line -->
        <xsl:for-each-group select="node()" group-starting-with="linemarker">
          <xsl:element name="line">
            <xsl:apply-templates select="current-group()[not(self::linemarker)]" mode="rebuild" />
          </xsl:element>
        </xsl:for-each-group>
      </xsl:for-each>
    </xsl:element>
  </xsl:template>

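  <xsl:template match="text()" mode="rebuild">
    <!-- matches [[{name}content]]; the regex attribute is an attribute value
         template, so literal braces are doubled. Assumes the content itself
         never contains "]]". -->
    <xsl:analyze-string select="." regex="\[\[\{{([^}}]+)\}}(.*?)\]\]">
      <xsl:matching-substring>
        <!-- group 1 is the element name, group 2 its text content -->
        <xsl:element name="{regex-group(1)}">
          <xsl:value-of select="regex-group(2)" />
        </xsl:element>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>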
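
To make the intent concrete, this is what I would expect at each of the
three stages. The sample is hypothetical: the element name b and the text are
made up for illustration, and the output is the intended result rather than
the output of a tested run.

```xml
<!-- 1. source (hypothetical sample) -->
<textlines>first <b>bold</b> line
second line</textlines>

<!-- 2. $brokenlines: line markers inserted, inline element flattened -->
<textlines><linemarker/>first [[{b}bold]] line<linemarker/>second line</textlines>

<!-- 3. $textlines: grouped into lines, inline element restored -->
<textlines><line>first <b>bold</b> line</line><line>second line</line></textlines>
```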

Am I going along the right lines? I'd prefer to be set straight sooner rather
than later!

Cheers
T

-----Original Message-----
From: Michael Müller-Hillebrand mmh@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Friday, 7 May 2021 20:26
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Seeking a smarter tokenize for augmented text

Hi,
