Re: [xsl] Breaking paragraphs one linebreaks

Subject: Re: [xsl] Breaking paragraphs one linebreaks
From: "Manuel Souto Pico terminolator@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 May 2019 23:53:24 -0000
It seems this works:

      <xsl:for-each select="tokenize(replace(., '([.?!]\s)', '$1&lt;br&gt;'),
                                     '(' || $lb || ')+')">
          <seg>{.}</seg>
      </xsl:for-each>

Although perhaps there's a better way of writing it, to make it more
readable... (using a variable, perhaps).
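
A minimal sketch of what the variable-based version might look like. The
names $marked and $parts are hypothetical, the $lb parameter is the one from
earlier in the thread, and the surrounding stylesheet is only an assumption
so that the fragment is complete (XSLT 3.0, expand-text enabled):

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    expand-text="yes">

  <!-- Escaped tags used as delimiters (pattern quoted from the thread) -->
  <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>

  <!-- Assumption: everything that is not a seg is copied unchanged -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg">
    <!-- Step 1: append an escaped <br> after sentence-final punctuation
         plus the whitespace character that follows it -->
    <xsl:variable name="marked" as="xs:string"
        select="replace(., '([.?!]\s)', '$1&lt;br&gt;')"/>
    <!-- Step 2: split on one or more consecutive delimiters -->
    <xsl:variable name="parts" as="xs:string*"
        select="tokenize($marked, '(' || $lb || ')+')"/>
    <xsl:for-each select="$parts">
      <seg>{.}</seg>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If the text starts or ends with a delimiter, tokenize() returns an empty
string there, so selecting $parts[normalize-space()] instead of $parts may be
worth considering.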

https://xsltfiddle.liberty-development.net/ej9EGcD/9
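
For comparison, the analyze-string route mentioned in the quoted message
below could look roughly like this. It is only a sketch under assumptions:
the nested two-pass structure and the inner regular expression are my own
illustration, not something taken from the fiddle, and $lb is again the
parameter from earlier in the thread.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    expand-text="yes">

  <!-- Escaped tags used as delimiters (pattern quoted from the thread) -->
  <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>

  <!-- Assumption: everything that is not a seg is copied unchanged -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg">
    <!-- Outer pass: escaped tags match $lb and are dropped, because there
         is no xsl:matching-substring for them -->
    <xsl:analyze-string select="." regex="{$lb}">
      <xsl:non-matching-substring>
        <!-- Inner pass: each match is either the shortest run that ends in
             punctuation plus one whitespace character, or the remainder.
             The "s" flag lets "." match newlines as well. -->
        <xsl:analyze-string select="." regex=".+?[.?!]\s|.+" flags="s">
          <xsl:matching-substring>
            <!-- Skip whitespace-only pieces -->
            <xsl:if test="normalize-space()">
              <seg>{.}</seg>
            </xsl:if>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

This keeps the punctuation without inserting a temporary marker, at the cost
of a second pass, so the replace/tokenize version above may still be the
simpler one.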

Cheers, Manuel

Manuel Souto Pico terminolator@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote on Friday, 10/05/2019
at 01:41:

> Thank you so much for your suggestions, Martin.
>
> In fact, looking at the result of the three stylesheets, I think the first
> one is the one that serves my purposes best. Being able to match an
> expression rather than a specific HTML or XML tag seems especially
> convenient.
>
> The only thing that I would need to change is to handle punctuation
> differently from tags. Tags (br, li, etc.) used as delimiters for splitting
> can be eaten by the tokenizer, that's fine, but I would like to keep the
> punctuation. I'm trying to pre-process the text before applying the
> tokenizer, with something like:
>
> <xsl:value-of select="replace(current(), '([.?!]\s)', '$1&lt;br&gt;')"/>
>
> That would replace sentence-final punctuation and the whitespace after it
> with itself ($1) followed by a linebreak tag, which the tokenizer can then
> use as a breaking point. Not sure where that would go, though.
>
> I have also looked at analyze-string but I think that would be more
> complicated.
>
> Some feedback about the other two options (using my full text):
>
> The HTML parser sounds like a good idea in principle, but the source
> document contains some &lt; entities &gt; (that appear like <entities> in
> the display) that just disappear, e.g. "Dear &lt;school administrator&gt;"
> becomes just "Dear "
> https://xsltfiddle.liberty-development.net/ej9EGcD/6
>
> parse-xml-fragment fails with this error: Error executing XSLT at line 25
> : First argument to parse-xml-fragment() is not a well-formed and
> namespace-well-formed XML fragment. XML parser reported:
> org.xml.sax.SAXParseException; systemId:
> file:///C:/WINDOWS/SysWOW64/inetsrv/; lineNumber: 1; columnNumber: 27;
> Attribute name "administrator" associated with an element type "school"
> must be followed by the ' = ' character.
> https://xsltfiddle.liberty-development.net/ej9EGcD/5
>
> Thanks!
> Cheers, Manuel
>
>
> Martin Honnen martin.honnen@xxxxxx <
> xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote on Thursday,
> 9/05/2019 at 23:07:
>
>> On 09.05.2019 at 22:16, Martin Honnen martin.honnen@xxxxxx wrote:
>> > On 09.05.2019 at 21:55, Martin Honnen martin.honnen@xxxxxx wrote:
>> >> On 09.05.2019 at 21:42, Manuel Souto Pico terminolator@xxxxxxxxx wrote:
>> >>>
>> >>>
>> >>> @Martin, your example works really well. I had to edit the expression,
>> >>> as in my real files sometimes they have used lists instead of
>> >>> linebreaks:
>> >>>
>> >>> <xsl:param name="lb"
>> >>> as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
>> >>>
>> >>> However, I can see that I would also need to split at the end of
>> >>> sentences when there's no escaped tag but just final punctuation. To
>> >>> keep the transformation from eating the punctuation, I tried a
>> >>> lookbehind assertion, but it seems it's not supported:
>> >>>
>> >>> <xsl:param name="lb"
>> >>> as="xs:string">(?<=[.!?])\s|&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
>> >>>
>> >>> Any ideas?
>> >>>
>> >>
>> >> In general, if there is markup, it might be better to try to parse it.
>> >> In your initial sample you seemed to have simple HTML empty element
>> >> syntax with <br> elements; now, with the adapted regular expression, it
>> >> seems you expect opening and closing tags.
>> >>
>> >> If you know the escaped markup is an XML fragment, then I would try to
>> >> parse it with the "parse-xml-fragment" function. If it is HTML, then I
>> >> would look into using David Carlisle's HTML parser implementation done
>> >> in pure XSLT 2, or use an extension function like the ones the
>> >> commercial editions of Saxon offer.
>> >>
>>
>> For HTML parsing with the XSLT-based HTML parser
>> (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl)
>> it would look like
>>
>>
>>    <xsl:import
>>        href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
>>
>>    <xsl:template match="tu">
>>        <xsl:variable name="split">
>>            <xsl:apply-templates mode="split"/>
>>        </xsl:variable>
>>        <xsl:for-each-group select="$split/tuv/seg"
>>            group-by="position() mod count($split/tuv[1]/seg)">
>>            <tu tuid="{position()}">
>>                <xsl:apply-templates
>>                    select="current-group()/snapshot()/.."/>
>>            </tu>
>>        </xsl:for-each-group>
>>    </xsl:template>
>>
>>    <xsl:mode name="split" on-no-match="shallow-copy"/>
>>
>>    <xsl:template match="seg" expand-text="yes" mode="split">
>>        <xsl:for-each-group select="d:htmlparse(., '', true())/node()"
>>            group-ending-with="br">
>>            <xsl:if test=". instance of text()">
>>              <seg>{.}</seg>
>>            </xsl:if>
>>        </xsl:for-each-group>
>>    </xsl:template>
>>
>>
>> https://xsltfiddle.liberty-development.net/ej9EGcD/6
>>
>>
>>