Subject: Re: [xsl] finding and removing duplicate string
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 2 Dec 2011 18:22:02 +0100
Unless your <p> paragraphs are very short, you should not use pattern
matching like this, because a back-reference pattern of this kind
exhibits quadratic performance in the length of the string.
I ran a quick test comparing Java's regex engine to the substring
comparison proposed earlier in this thread.
The "hit" case (2 x "the quick brown..."):
pattern: 0.000003061s - substr: 0.000000134s, a factor of 22
The "fail" case ("the quick brown..." vs "okkokoko...", equal lengths)
pattern: 0.000004452s - substr: 0.000000026s, a factor of 171
Some XSLT regex engines might do better, but their execution time is
still bound to grow as O(n^2).
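
For illustration, the two approaches compared above might look roughly
like this in Java. This is a minimal sketch, not the code used for the
timings: the class and method names are hypothetical, and it assumes
(unlike the quoted stylesheet) that leading and trailing whitespace is
trimmed before runs of whitespace are collapsed to single spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateCheck {
    // Same back-reference pattern as in the quoted stylesheet.
    private static final Pattern DUP = Pattern.compile("^(.+)\\s+\\1$");

    // Assumption: trim first, then collapse whitespace runs to one space.
    private static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /** Regex variant: returns X if the text is "X X", else the normalized text. */
    public static String dedupeRegex(String text) {
        String s = normalize(text);
        Matcher m = DUP.matcher(s);
        return m.matches() ? m.group(1) : s;
    }

    /** Substring variant: a single comparison of the two halves. */
    public static String dedupeSubstring(String text) {
        String s = normalize(text);
        int n = s.length();
        // "X X" has odd length 2*|X| + 1 with one space exactly in the middle.
        if (n >= 3 && n % 2 == 1) {
            int mid = n / 2;
            if (s.charAt(mid) == ' ' && s.regionMatches(0, s, mid + 1, mid)) {
                return s.substring(0, mid);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String input = "Bradley Cooper named Peoples Sexiest man alive 2011 Bradley\n"
                     + "Cooper named Peoples Sexiest man alive 2011";
        System.out.println(dedupeRegex(input));
        System.out.println(dedupeSubstring(input));
    }
}
```

After normalization the only possible split point is the middle of the
string, so the substring variant needs one pass over the text, whereas
the regex engine may have to try many candidate lengths for group 1
before it succeeds or gives up.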
-W
On 2 December 2011 17:29, Imsieke, Gerrit, le-tex
<gerrit.imsieke@xxxxxxxxx> wrote:
> <xsl:template match="p">
>   <xsl:copy>
>     <xsl:copy-of select="@*" />
>     <!-- use replace() for normalizing the input first, i.e., replace the
>          newline with a space: -->
>     <xsl:analyze-string select="replace(., '\s+', ' ')"
>                         regex="^(.+)\s+\1$">
>       <!-- \1 is a back-reference to the first match, which is allowed according
>            to http://www.w3.org/TR/xpath-functions/#regex-syntax -->
>       <xsl:matching-substring>
>         <xsl:value-of select="regex-group(1)"/>
>       </xsl:matching-substring>
>       <xsl:non-matching-substring>
>         <!-- output the whole string if above regex doesn't match: -->
>         <xsl:value-of select="."/>
>       </xsl:non-matching-substring>
>     </xsl:analyze-string>
>   </xsl:copy>
> </xsl:template>
>
>
> On 2011-12-02 16:32, Jacob L wrote:
>>
>> All,
>>
>>
>> I am using <xsl:stylesheet version="2.0". If, in the input XML file,
>> the text in the <p> tag repeats itself, such as
>>
>>
>>
>> <text>
>>
>> <p>Bradley Cooper named Peoples Sexiest man alive 2011 Bradley
>> Cooper named Peoples Sexiest man alive 2011</p>
>>
>> </text>
>>
>>
>>
>> I want to write code that checks for the duplicate and omits it.
>>
>> After adding the check to the XSL and deleting the duplicate string,
>> the output should be:
>>
>>
>>
>> <text>
>> <p>Bradley Cooper named Peoples Sexiest man alive 2011</p>
>> </text>
>>
>>
>> Thanks for the help!
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler