RE: [xsl] text() word lists

Subject: RE: [xsl] text() word lists
From: "McNally, David" <David.McNally@xxxxxxxxxx>
Date: Fri, 6 Feb 2004 18:27:53 -0500
You can do this in XSLT 1.0 with the node-set() extension function, by making
multiple passes over the file.  In the first pass, you throw away all the
element markup and turn every word in the document into an empty element
named after that word.  You can then manipulate the words as elements, and
there are probably any number of ways to get to your final result.  Here, I'm
doing a second pass just to add a count attribute to each element, and then a
final pass to output the results, first sorted alphabetically, then by
count.

It seems to work, though I haven't done enough testing to be sure that it
doesn't quietly mess things up.  Also, I'm not sure how well it's going to
work on big files, since the counting pass compares every word element
against every other one - in that situation you should probably think about
using Perl or something.
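
A note on the node-set function: it is a processor extension rather than part of XSLT 1.0 itself, so its namespace depends on the processor. The stylesheet below binds the MSXML namespace. As a minimal sketch, assuming a processor that supports the EXSLT common module instead (libxslt/xsltproc, for example), the same multi-pass idiom looks like this. The variable name pass1 and the hard-coded bag of words are placeholders for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- "First pass": build a result tree fragment (hard-coded here) -->
    <xsl:variable name="pass1">
      <bagOfElements><this/><is/><a/><test/><a/></bagOfElements>
    </xsl:variable>

    <!-- A result tree fragment cannot be navigated with XPath directly in
         XSLT 1.0; exsl:node-set() converts it into a node-set that can -->
    <xsl:value-of select="count(exsl:node-set($pass1)/bagOfElements/a)"/>
  </xsl:template>
</xsl:stylesheet>
```

Run against any input document, this should print 2, the number of a elements in the intermediate tree.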

Hope this helps,
David.



Text_frequency_count.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ext="urn:schemas-microsoft-com:xslt" xmlns:rep="http://whatever.com"
xmlns:saxon="http://icl.com/saxon">
	<xsl:output method="text" encoding="UTF-8"/>

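<!-- Characters treated as word separators.  Anything not listed here (or in
the letter and digit variables below) stays in the word and so ends up in an
element name. -->
<xsl:variable name="nonwordchars"><xsl:text>.,:;!?"'()[]&lt;>{}@#$%^*-_+=|\~</xsl:text></xsl:variable>
<xsl:variable name="lletters" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="Uletters" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="numbers" select="'1234567890'"/>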


<xsl:template match="/">

	<xsl:variable name="document">
		<bagOfElements>
		<xsl:apply-templates select="*" mode="firstrun"/>
		</bagOfElements>
	</xsl:variable>

	<xsl:variable name="document2">
		<bagOfElements>
		<xsl:apply-templates select="ext:node-set($document)"
mode="secondrun"/>
		</bagOfElements>
	</xsl:variable>

	<xsl:apply-templates select="ext:node-set($document2)/*"
mode="finalrun"/>

</xsl:template>


<!-- FIRST RUN - get rid of all the elements, and turn words into elements
-->

<xsl:template match="*" mode="firstrun">
	<xsl:apply-templates mode="firstrun"/>	
</xsl:template>

<xsl:template match="text()" mode="firstrun">

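	<!-- Lowercase the letters and turn everything else (digits,
		punctuation, tabs, newlines) into spaces.  The run of spaces in the
		final concat must be at least as long as the list of characters that
		follows $Uletters in the first concat: translate() deletes any
		character that has no counterpart in the replacement string. -->
	<xsl:variable name="text"
select="normalize-space(translate(.,
concat($Uletters,$numbers,$nonwordchars,'&#9;','&#10;'),
concat($lletters,'                                                            ')))"/>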

	<xsl:call-template name="elementify">
		<xsl:with-param name="text" select="$text"/>
	</xsl:call-template>

</xsl:template>

<xsl:template name="elementify">
	<xsl:param name="text"/>

	<xsl:choose>
		<xsl:when test="contains($text,' ')">
			<xsl:element name="{substring-before($text,' ')}"/>
			<xsl:call-template name="elementify">
				<xsl:with-param name="text"
select="substring-after($text, ' ')"/>
			</xsl:call-template>
		</xsl:when>
		<xsl:otherwise>
			<xsl:if test="string-length($text) > 0">
				<xsl:element name="{$text}"/>
			</xsl:if>
		</xsl:otherwise>
	</xsl:choose>

</xsl:template>

<!-- SECOND RUN - just adding count attributes to make sorting easier in
final run. -->

<xsl:template match="bagOfElements" mode="secondrun">
	<xsl:for-each select="*">
		<xsl:element name="{name()}">
			<xsl:attribute name="count">
				<xsl:value-of
select="count(/bagOfElements/*[name() = name(current())])"/>
			</xsl:attribute>
		</xsl:element>
	</xsl:for-each>
</xsl:template>


<!-- THE FINAL RUN -->

<xsl:template match="bagOfElements" mode="finalrun">
	<xsl:text>
Ordered By Name

</xsl:text>
	<xsl:apply-templates  select="*" mode="finalrun">
		<xsl:sort select="name()"></xsl:sort>
	</xsl:apply-templates>
	<xsl:text>
Ordered By Count

</xsl:text>
	<xsl:apply-templates  select="*" mode="finalrun">
		<!-- compare counts as numbers; the default text sort puts 10 before 2 -->
		<xsl:sort select="@count" data-type="number"></xsl:sort>
	</xsl:apply-templates>

</xsl:template>


<xsl:template match="*" mode="finalrun">
	<xsl:variable name="currentname" select="name()"/>
	<xsl:if test="not(preceding-sibling::*[name() = $currentname])">
		<xsl:value-of select="name()"/>
		<xsl:text>: </xsl:text>
		<xsl:value-of select="@count"/>
		<xsl:text>&#10;</xsl:text>
	</xsl:if>
</xsl:template>

</xsl:stylesheet>
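
If the counting pass turns out to be the slow part on larger inputs, one possible alternative is to group the word elements with xsl:key (the Muenchian method) instead of rescanning the whole bag for every word. The following is only a sketch, not tested against the stylesheet above. It assumes the first-pass result has been written out as its own document with bagOfElements as the root element; the key name word is made up for the example. It also sorts by descending count and then alphabetically, which is closer to the ordering asked for in the original question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every word element by its own name -->
  <xsl:key name="word" match="bagOfElements/*" use="name()"/>

  <xsl:template match="/bagOfElements">
    <!-- Visit only the first element of each name (one per distinct word) -->
    <xsl:for-each select="*[generate-id() = generate-id(key('word', name())[1])]">
      <!-- Most frequent first, then alphabetical within equal counts -->
      <xsl:sort select="count(key('word', name()))" data-type="number"
                order="descending"/>
      <xsl:sort select="name()"/>
      <xsl:value-of select="name()"/>
      <xsl:text>: </xsl:text>
      <xsl:value-of select="count(key('word', name()))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The design point is that the key lookup replaces the count(/bagOfElements/*[name() = ...]) scan in the second pass, so the separate count-attribute pass is no longer needed.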


File:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"
href="C:\Work\xsl\text_frequency_count.xslt"?>
<foo>
	<blort> This is a <wibble>Test</wibble>, only a test!</blort>
	<blort> This really is a <wibble>great big test</wibble>, 
 only a test!</blort>
</foo>


Output:


Ordered By Name

a: 4
big: 1
great: 1
is: 2
only: 2
really: 1
test: 4
this: 2

Ordered By Count

really: 1
great: 1
big: 1
this: 2
is: 2
only: 2
a: 4
test: 4
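
For comparison, since the original question mentions XSLT 2.0: a minimal sketch, assuming an XSLT 2.0 processor, that uses tokenize() and xsl:for-each-group instead of the multi-pass approach. It is an illustration rather than a tested replacement. One difference to be aware of is that it works on the string value of the whole document, so two words that meet at an element boundary with no whitespace between them are counted as one word, and any character outside a-z is treated as a separator.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- All text in the document, lowercased and split on runs of
         non-letters; the predicate drops empty tokens at either end -->
    <xsl:variable name="words"
        select="tokenize(lower-case(string(.)), '[^a-z]+')[. != '']"/>

    <!-- One group per distinct word -->
    <xsl:for-each-group select="$words" group-by=".">
      <!-- Most frequent first, then alphabetical within equal counts -->
      <xsl:sort select="count(current-group())" order="descending"/>
      <xsl:sort select="current-grouping-key()"/>
      <xsl:value-of select="concat(current-grouping-key(), ': ',
                                   count(current-group()), '&#10;')"/>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

Dropping the first xsl:sort gives the plain alphabetical list.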

> -----Original Message-----
> From: James Cummings [mailto:James.Cummings@xxxxxxxxxxxxxx] 
> Sent: Friday, February 06, 2004 10:35 AM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] text() word lists
> 
> 
> 
> Hi there,
> 
> I'm sure this is a faq, and I've checked the faq and archive.
> I swear I remember someone asking about it, but I couldn't
> find it, so here goes.
> 
> I want to take an XML file of unknown elements and create
> a word frequency list / word list.  Now, an entry on sorting
> in the xslt faq says this is just what xslt is bad at.  (And 
> I'm sure there are some that would say 'just go use perl', 
> but let's say I want to do it in xslt 1 or 2.)
> 
> XSLT2 makes the tokenization of strings much easier, so 
> assuming I'm using that, if I have:
> 
> <foo>
> <blort> This is a <wibble>Test</wibble>, only a test!</blort> 
> <blort> This really is a <wibble>great big test</wibble>, 
> only a test!</blort> </foo>
> 
> I don't know that foo|wibble|blort  will be the element names.
> 
> But I want to produce both:
> 
> a  -- 4
> test  -- 4
> only -- 2
> is  -- 2
> this  -- 2
> big -- 1
> great -- 1
> really -- 1
> 
> Which (unless I've missed something) should be
> a case-insensitive list grouped by frequency
> sorted alphabetically within this, and ignoring
> punctuation.
> 
> But also:
> 
> a  -- 4
> big -- 1
> great -- 1
> is  -- 2
> only -- 2
> test  -- 4
> this  -- 2
> really -- 1
> 
> Which is the same list but not grouped
> by frequency.
> 
> Suggestions? Solutions?
> 
> Many thanks for any help,
> -James
> ---
> Dr James Cummings, Oxford Text Archive, University of Oxford 
> James.Cummings at ota.ahds.ac.uk http://users.ox.ac.uk/~jamesc/
> 
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
> 

