Subject: Re: [xsl] Breaking paragraphs one linebreaks
From: "Manuel Souto Pico terminolator@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 May 2019 23:53:24 -0000

It seems this works:

<xsl:for-each select="tokenize(replace(., '([.?!]\s)', '$1&lt;br&gt;'), '(' || $lb || ')+')">
  <seg>{.}</seg>
</xsl:for-each>

Although perhaps there's a better way of writing it to make it more readable (using a variable, perhaps).

https://xsltfiddle.liberty-development.net/ej9EGcD/9

Cheers, Manuel

Manuel Souto Pico terminolator@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote on Friday, 10/05/2019 at 01:41:

> Thank you so much for your suggestions, Martin.
>
> In fact, looking at the results of the three stylesheets, I think the first
> one serves my purposes best. Being able to match an expression rather than
> a specific HTML or XML tag seems especially convenient.
>
> The only thing that I would need to change is to handle punctuation
> differently than tags. Tags (br, li, etc.) used as delimiters for splitting
> can be eaten by the tokenizer, that's fine, but I would like to keep
> punctuation. I'm trying to pre-process the text before applying the
> tokenizer, with something like:
>
> <xsl:value-of select="replace(current(), '([.?!]\s)', '$1&lt;br&gt;')"/>
>
> That would replace final punctuation with itself ($1) plus a linebreak tag
> that the tokenizer will use as a breaking point. Not sure where that would
> go, though.
>
> I have also looked at analyze-string but I think that would be more
> complicated.
>
> Some feedback about the other two options (using my full text):
>
> The HTML parser sounds like a good idea in principle, but the source
> document contains some &lt; entities &gt; (that appear like <entities> in
> the display) that just disappear, e.g. "Dear <school administrator>"
> becomes just "Dear "
> https://xsltfiddle.liberty-development.net/ej9EGcD/6
>
> parse-xml-fragment fails with this error: Error executing XSLT at line 25
> : First argument to parse-xml-fragment() is not a well-formed and
> namespace-well-formed XML fragment.
> XML parser reported:
> org.xml.sax.SAXParseException; systemId:
> file:///C:/WINDOWS/SysWOW64/inetsrv/; lineNumber: 1; columnNumber: 27;
> Attribute name "administrator" associated with an element type "school"
> must be followed by the ' = ' character.
> https://xsltfiddle.liberty-development.net/ej9EGcD/5
>
> Thanks!
> Cheers, Manuel
>
>
> Martin Honnen martin.honnen@xxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
> wrote on Thursday, 9/05/2019 at 23:07:
>
>> On 09.05.2019 at 22:16, Martin Honnen martin.honnen@xxxxxx wrote:
>> > On 09.05.2019 at 21:55, Martin Honnen martin.honnen@xxxxxx wrote:
>> >> On 09.05.2019 at 21:42, Manuel Souto Pico terminolator@xxxxxxxxx wrote:
>> >>>
>> >>> @Martin, your example works really well. I had to edit the expression,
>> >>> as in my real files sometimes they have used lists instead of
>> >>> linebreaks:
>> >>>
>> >>> <xsl:param name="lb"
>> >>> as="xs:string">&lt;/?(li|ul|br)\s*/?></xsl:param>
>> >>>
>> >>> However, I can see that I would also need to split at the end of
>> >>> sentences when there's no escaped tag but just final punctuation. To
>> >>> avoid the transformation eating the punctuation, I have tried with a
>> >>> lookbehind assertion but it seems it's not supported:
>> >>>
>> >>> <xsl:param name="lb"
>> >>> as="xs:string">(?&lt;=[.!?])\s|&lt;/?(li|ul|br)\s*/?></xsl:param>
>> >>>
>> >>> Any ideas?
>> >>>
>> >>
>> >> In general, if there is markup, it might be better to try to parse it,
>> >> in your initial sample you seemed to have simple HTML empty element
>> >> syntax with <br> elements, now with the adapted regular expression it
>> >> seems you expect opening and closing tags.
>> >>
>> >> If you know the escaped markup is an XML fragment then I would try to
>> >> parse it with the "parse-xml-fragment" function, if it is HTML, then I
>> >> would look into using David Carlisle's HTML parser implementation done
>> >> in pure XSLT 2 or use an extension function like the commercial editions
>> >> of Saxon offer.
>>
>> For HTML parsing with the XSLT based HTML parser
>> (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl)
>> it would look like
>>
>> <xsl:import
>>     href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
>>
>> <xsl:template match="tu">
>>   <xsl:variable name="split">
>>     <xsl:apply-templates mode="split"/>
>>   </xsl:variable>
>>   <xsl:for-each-group select="$split/tuv/seg"
>>       group-by="position() mod count($split/tuv[1]/seg)">
>>     <tu tuid="{position()}">
>>       <xsl:apply-templates select="current-group()/snapshot()/.."/>
>>     </tu>
>>   </xsl:for-each-group>
>> </xsl:template>
>>
>> <xsl:mode name="split" on-no-match="shallow-copy"/>
>>
>> <xsl:template match="seg" expand-text="yes" mode="split">
>>   <xsl:for-each-group select="d:htmlparse(., '', true())/node()"
>>       group-ending-with="br">
>>     <xsl:if test=". instance of text()">
>>       <seg>{.}</seg>
>>     </xsl:if>
>>   </xsl:for-each-group>
>> </xsl:template>
>>
>> https://xsltfiddle.liberty-development.net/ej9EGcD/6
>>
>> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
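
Manuel's reply at the top of this message asks whether the replace-then-tokenize expression could be made more readable with a variable. The following is only a minimal sketch of that idea, not code from the thread: the variable name `marked`, the `[normalize-space()]` filter that drops empty tokens, and the `normalize-space(.)` trimming are assumptions added for illustration, and it presumes an XSLT 3.0 processor and `seg` elements whose text contains escaped tags. Note that `<` has to be written as `&lt;` inside the stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Escaped tags that act as split points (expression taken from the thread). -->
  <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>

  <!-- Copy everything that no template below handles. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg" expand-text="yes">
    <!-- Step 1 (hypothetical variable name): append an escaped <br> after each
         sentence-final punctuation mark plus whitespace, keeping the match via $1. -->
    <xsl:variable name="marked" as="xs:string"
        select="replace(., '([.?!]\s)', '$1&lt;br&gt;')"/>
    <!-- Step 2: split on one or more consecutive delimiters.
         Assumption: whitespace-only tokens are dropped and results are trimmed. -->
    <xsl:for-each select="tokenize($marked, '(' || $lb || ')+')[normalize-space()]">
      <seg>{normalize-space(.)}</seg>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With an input such as `<tuv><seg>First sentence. Second one!&lt;br&gt;Third</seg></tuv>`, this should produce three `seg` elements ("First sentence.", "Second one!", "Third"). The pattern `[.?!]\s` also fires after abbreviations like "e.g. ", so real text may need a stricter expression.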