Subject: Re: [xsl] Breaking paragraphs one linebreaks
From: "Manuel Souto Pico terminolator@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 9 May 2019 23:40:46 -0000

Thank you so much for your suggestions, Martin.

Looking at the results of the three stylesheets, I think the first one is the one that serves my purposes best. Being able to match an expression rather than a specific HTML or XML tag seems especially convenient.

The only thing I would need to change is to handle punctuation differently from tags. Tags (br, li, etc.) used as delimiters for splitting can be eaten by the tokenizer, that's fine, but I would like to keep the punctuation. I'm trying to pre-process the text before applying the tokenizer, with something like:

<xsl:value-of select="replace(current(), '([.?!]\s)', '$1&lt;br&gt;')"/>

That would replace sentence-final punctuation and the whitespace after it with itself ($1) plus a linebreak tag, which the tokenizer would then use as a breaking point. I'm not sure where that would go, though. I have also looked at analyze-string, but I think that would be more complicated.

Some feedback about the other two options (using my full text):

The HTML parser sounds like a good idea in principle, but the source document contains some &lt; entities &gt; (which appear as <entities> in the display) that just disappear, e.g. "Dear <school administrator>" becomes just "Dear ":
https://xsltfiddle.liberty-development.net/ej9EGcD/6

parse-xml-fragment fails with this error:

Error executing XSLT at line 25 : First argument to parse-xml-fragment() is not a well-formed and namespace-well-formed XML fragment. XML parser reported: org.xml.sax.SAXParseException; systemId: file:///C:/WINDOWS/SysWOW64/inetsrv/; lineNumber: 1; columnNumber: 27; Attribute name "administrator" associated with an element type "school" must be followed by the ' = ' character.
https://xsltfiddle.liberty-development.net/ej9EGcD/5

Thanks!
Cheers, Manuel

Martin Honnen martin.honnen@xxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote on Thursday, 9/05/2019 at 23:07:

> On 09.05.2019 at 22:16, Martin Honnen martin.honnen@xxxxxx wrote:
> > On 09.05.2019 at 21:55, Martin Honnen martin.honnen@xxxxxx wrote:
> >> On 09.05.2019 at 21:42, Manuel Souto Pico terminolator@xxxxxxxxx wrote:
> >>>
> >>> @Martin, your example works really well. I had to edit the expression,
> >>> as in my real files sometimes they have used lists instead of
> >>> linebreaks:
> >>>
> >>> <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
> >>>
> >>> However, I can see that I would also need to split at the end of
> >>> sentences when there's no escaped tag but just final punctuation. To
> >>> avoid the transformation eating the punctuation, I have tried a
> >>> lookbehind assertion, but it seems it's not supported:
> >>>
> >>> <xsl:param name="lb" as="xs:string">(?&lt;=[.!?])\s|&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>
> >>>
> >>> Any ideas?
> >>>
> >>
> >> In general, if there is markup, it might be better to try to parse it.
> >> In your initial sample you seemed to have simple HTML empty element
> >> syntax with <br> elements; now, with the adapted regular expression, it
> >> seems you expect opening and closing tags.
> >>
> >> If you know the escaped markup is an XML fragment, then I would try to
> >> parse it with the "parse-xml-fragment" function. If it is HTML, then I
> >> would look into using David Carlisle's HTML parser implementation done
> >> in pure XSLT 2, or use an extension function like the ones the
> >> commercial editions of Saxon offer.
> >>
> >
> For HTML parsing with the XSLT based HTML parser
> (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl)
> it would look like
>
>   <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
>
>   <xsl:template match="tu">
>     <xsl:variable name="split">
>       <xsl:apply-templates mode="split"/>
>     </xsl:variable>
>     <xsl:for-each-group select="$split/tuv/seg"
>         group-by="position() mod count($split/tuv[1]/seg)">
>       <tu tuid="{position()}">
>         <xsl:apply-templates select="current-group()/snapshot()/.."/>
>       </tu>
>     </xsl:for-each-group>
>   </xsl:template>
>
>   <xsl:mode name="split" on-no-match="shallow-copy"/>
>
>   <xsl:template match="seg" expand-text="yes" mode="split">
>     <xsl:for-each-group select="d:htmlparse(., '', true())/node()"
>         group-ending-with="br">
>       <xsl:if test=". instance of text()">
>         <seg>{.}</seg>
>       </xsl:if>
>     </xsl:for-each-group>
>   </xsl:template>
>
> https://xsltfiddle.liberty-development.net/ej9EGcD/6
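
As for where the replace() pre-processing step from the top of this message could go, here is a minimal sketch, not the stylesheet from the thread. It assumes the text to split is the string value of each seg element and that the splitting happens in a plain (unnamed-mode) template; the variable name "marked" is made up for illustration. The $lb parameter is the delimiter expression quoted above. Unlike the expression at the top, the capture group here holds only the punctuation, so the whitespace after it is dropped instead of being carried in front of the inserted tag.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Delimiter expression from the thread: escaped li/ul/br tags. -->
  <xsl:param name="lb" as="xs:string">&lt;/?(li|ul|br)\s*/?&gt;</xsl:param>

  <!-- Copy everything that is not handled explicitly. -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="seg">
    <!-- Step 1 (assumed placement): keep the punctuation via $1 and
         replace the whitespace after it with an escaped <br> marker. -->
    <xsl:variable name="marked" as="xs:string"
        select="replace(., '([.?!])\s+', '$1&lt;br&gt;')"/>
    <!-- Step 2: split on the delimiters; the markers are consumed here.
         Empty or whitespace-only tokens are skipped. -->
    <xsl:for-each select="tokenize($marked, $lb)[normalize-space(.) != '']">
      <seg><xsl:value-of select="normalize-space(.)"/></seg>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With this, a seg whose text is "One. Two!<br>Three" (with the tag escaped in the source) should come out as three seg elements: "One.", "Two!" and "Three". One caveat of this simple rule is that abbreviations followed by a space, such as "e.g. ", would also trigger a split.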