Re: [xsl] finding and removing duplicate string

Subject: Re: [xsl] finding and removing duplicate string
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 2 Dec 2011 18:22:02 +0100
Unless your <p> paragraphs are very short, you should not use pattern
matching like this, because a back-reference pattern of this kind
exhibits quadratic performance in the string length.

I ran a quick test comparing Java's regex engine to the substring
comparison proposed here earlier on.

The "hit" case (2 x "the quick brown..."):
   pattern:  0.000003061s - substr:  0.000000134s, a factor of 22

The "fail" case ("the quick brown..." vs "okkokoko...", equal lengths):
   pattern:  0.000004452s - substr:  0.000000026s, a factor of 171

Some XSLT regex engines might do better, but their execution time is
still bound to grow as O(n^2).
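
To make the comparison concrete, here is a minimal Java sketch of the two
approaches. It is not the code used for the timings above, and the substring
variant is only an assumption about what was proposed earlier in the thread:
it normalizes whitespace and then checks whether the two halves around the
middle space are equal. The class and method names are made up for
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not taken from the thread.
final class DuplicateHalf {
    // Same pattern as in the quoted stylesheet, written as a Java string.
    private static final Pattern REPEATED = Pattern.compile("^(.+)\\s+\\1$");

    /** Trim the ends and collapse each whitespace run to a single space. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /** Regex variant: the engine backtracks over candidate split points. */
    static String dedupRegex(String text) {
        String s = normalize(text);
        Matcher m = REPEATED.matcher(s);
        return m.matches() ? m.group(1) : s;
    }

    /**
     * Substring variant: after normalization "X X" has odd length 2k+1,
     * so the only possible separator is the character at index k.
     */
    static String dedupSubstring(String text) {
        String s = normalize(text);
        int n = s.length();
        if (n >= 3 && n % 2 == 1) {
            int mid = n / 2;
            if (s.charAt(mid) == ' ' && s.regionMatches(0, s, mid + 1, mid)) {
                return s.substring(0, mid);
            }
        }
        return s;
    }
}
```

On whitespace-normalized input both methods should return the same result,
since a match of the pattern can only split the string at its midpoint. The
substring variant does one comparison of the two halves instead of trying
every split point.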

-W


On 2 December 2011 17:29, Imsieke, Gerrit, le-tex
<gerrit.imsieke@xxxxxxxxx> wrote:
>  <xsl:template match="p">
>    <xsl:copy>
>      <xsl:copy-of select="@*" />
> <!-- use replace() for normalizing the input first, i.e., replace the
> newline with a space: -->
>      <xsl:analyze-string select="replace(., '\s+', ' ')"
> regex="^(.+)\s+\1$">
> <!-- \1 is a back-reference to the first match, which is allowed according
> to http://www.w3.org/TR/xpath-functions/#regex-syntax -->
>        <xsl:matching-substring>
>          <xsl:value-of select="regex-group(1)"/>
>        </xsl:matching-substring>
>        <xsl:non-matching-substring>
> <!-- output the whole string if above regex doesn't match: -->
>          <xsl:value-of select="."/>
>        </xsl:non-matching-substring>
>      </xsl:analyze-string>
>    </xsl:copy>
>  </xsl:template>
>
>
> On 2011-12-02 16:32, Jacob L wrote:
>>
>> All,
>>
>>
>> I am using <xsl:stylesheet version="2.0">. If, in the input XML file,
>> the text in the <p> tag repeats itself, such as
>>
>>
>>
>> <text>
>>
>> <p>Bradley Cooper named Peoples Sexiest man alive 2011  Bradley
>> Cooper named Peoples Sexiest man alive 2011</p>
>>
>> </text>
>>
>>
>>
>> I want to write code to detect the repetition and omit it. After adding
>> that check to the XSL and deleting the duplicate string, the output
>> should be:
>>
>>
>>
>>  <text>
>>         <p>Bradley Cooper named Peoples Sexiest man alive 2011</p>
>>    </text>
>>
>>
>> Thanks for the help!
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
