Re: [xsl] Breaking paragraphs one linebreaks

Subject: Re: [xsl] Breaking paragraphs one linebreaks
From: "Manuel Souto Pico terminolator@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 May 2019 23:40:46 -0000
Thank you so much for your suggestions, Martin.

In fact, looking at the results of the three stylesheets, I think the first
one serves my purposes best. Being able to match an expression rather than
a specific HTML or XML tag seems especially convenient.

The only thing I would need to change is to handle punctuation
differently from tags. Tags (br, li, etc.) used as delimiters for splitting
can be eaten by the tokenizer, that's fine, but I would like to keep the
punctuation. I'm trying to pre-process the text before applying the
tokenizer, with something like:

<xsl:value-of select="replace(current(), '([.?!]\s)', '$1&lt;br&gt;')"/>

That would replace final punctuation (plus the whitespace after it) with
itself ($1) followed by a linebreak tag, which the tokenizer would then use
as a breaking point. I'm not sure where that would go, though.
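
One possible placement, as a minimal sketch only: do the replace() in a
variable inside the template that splits each seg, and tokenize that variable
instead of the raw text. This assumes the first stylesheet tokenizes seg with
the $lb parameter; the variable name "marked" is made up for illustration, and
the regrouping of the resulting segs into tu elements is left out.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Escaped tags that act as delimiters (same expression as in the thread) -->
  <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>

  <!-- Copy everything that is not handled explicitly -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg">
    <!-- Step 1 (hypothetical variable name): keep the punctuation and turn
         the whitespace after it into an escaped <br> marker -->
    <xsl:variable name="marked" as="xs:string"
        select="replace(., '([.?!])\s+', '$1&lt;br&gt;')"/>
    <!-- Step 2: split on the escaped tags, dropping empty tokens; only the
         marker is eaten, the punctuation stays in the token before it -->
    <xsl:for-each select="tokenize($marked, $lb)[normalize-space()]">
      <seg><xsl:value-of select="normalize-space(.)"/></seg>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Here the capture group holds only the punctuation, so the whitespace after it
is consumed by the marker. A known limitation: any period followed by
whitespace is treated as a sentence end, so abbreviations such as "e.g. "
would also trigger a split.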

I have also looked at analyze-string but I think that would be more
complicated.
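
For comparison, a rough analyze-string sketch of the sentence split alone. It
does not touch the escaped tags, so the tokenizer pass would still be needed
on top of it, which is why it looks more complicated overall. The regex and
the flags are my own assumptions, not something taken from the stylesheets in
this thread.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything that is not handled explicitly -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg">
    <!-- Group 1: shortest run of text ending in . ? or !
         Group 2: the whitespace after it, or the end of the string.
         flags="s" lets the dot match newlines inside a sentence. -->
    <xsl:analyze-string select="string(.)"
        regex="(.*?[.?!])(\s+|$)" flags="s">
      <xsl:matching-substring>
        <!-- One sentence, punctuation included -->
        <seg><xsl:value-of select="normalize-space(regex-group(1))"/></seg>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Trailing text without final punctuation -->
        <xsl:if test="normalize-space()">
          <seg><xsl:value-of select="normalize-space()"/></seg>
        </xsl:if>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```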

Some feedback about the other two options (using my full text):

The HTML parser sounds like a good idea in principle, but the source
document contains some escaped angle brackets that are not tags
(&lt;entities&gt;, displayed as <entities>), and these just disappear, e.g.
"Dear &lt;school administrator&gt;" becomes just "Dear ":
https://xsltfiddle.liberty-development.net/ej9EGcD/6

parse-xml-fragment fails with this error: Error executing XSLT at line 25 :
First argument to parse-xml-fragment() is not a well-formed and
namespace-well-formed XML fragment. XML parser reported:
org.xml.sax.SAXParseException; systemId:
file:///C:/WINDOWS/SysWOW64/inetsrv/; lineNumber: 1; columnNumber: 27;
Attribute name "administrator" associated with an element type "school"
must be followed by the ' = ' character.
https://xsltfiddle.liberty-development.net/ej9EGcD/5

Thanks!
Cheers, Manuel


Martin Honnen martin.honnen@xxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
wrote on Thursday, 9/05/2019 at 23:07:

> On 09.05.2019 at 22:16, Martin Honnen martin.honnen@xxxxxx wrote:
> > On 09.05.2019 at 21:55, Martin Honnen martin.honnen@xxxxxx wrote:
> >> On 09.05.2019 at 21:42, Manuel Souto Pico terminolator@xxxxxxxxx wrote:
> >>>
> >>>
> >>> @Martin, your example works really well. I had to edit the expression,
> >>> as in my real files sometimes they have used lists instead of
> >>> linebreaks:
> >>>
> >>> <xsl:param name="lb"
> >>> as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
> >>>
> >>> However, I can see that I would also need to split at the end of
> >>> sentences when there's no escaped tag but just final punctuation. To
> >>> avoid the transformation eating the punctuation, I have tried a
> >>> lookbehind assertion, but it seems it's not supported:
> >>>
> >>> <xsl:param name="lb"
> >>> as="xs:string">(?<=[.!?])\s|&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
> >>>
> >>> Any ideas?
> >>>
> >>
> >> In general, if there is markup, it might be better to try to parse it.
> >> In your initial sample you seemed to have simple HTML empty element
> >> syntax with <br> elements; now, with the adapted regular expression, it
> >> seems you expect opening and closing tags.
> >>
> >> If you know the escaped markup is an XML fragment, then I would try to
> >> parse it with the "parse-xml-fragment" function. If it is HTML, then I
> >> would look into using David Carlisle's HTML parser implementation, done
> >> in pure XSLT 2, or use an extension function like the ones the
> >> commercial editions of Saxon offer.
> >>
>
> For HTML parsing with the XSLT based HTML parser
> (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl)
> it would look like
>
>
>    <xsl:import
>        href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
>
>    <xsl:template match="tu">
>        <xsl:variable name="split">
>            <xsl:apply-templates mode="split"/>
>        </xsl:variable>
>        <xsl:for-each-group select="$split/tuv/seg"
>            group-by="position() mod count($split/tuv[1]/seg)">
>            <tu tuid="{position()}">
>                <xsl:apply-templates select="current-group()/snapshot()/.."/>
>            </tu>
>        </xsl:for-each-group>
>    </xsl:template>
>
>    <xsl:mode name="split" on-no-match="shallow-copy"/>
>
>    <xsl:template match="seg" expand-text="yes" mode="split">
>        <xsl:for-each-group select="d:htmlparse(., '', true())/node()"
>            group-ending-with="br">
>            <xsl:if test=". instance of text()">
>              <seg>{.}</seg>
>            </xsl:if>
>        </xsl:for-each-group>
>    </xsl:template>
>
>
> https://xsltfiddle.liberty-development.net/ej9EGcD/6
