[xsl] white space in xml should not be interpreted as text nodes

Subject: [xsl] white space in xml should not be interpreted as text nodes
From: "Markus Hanel" <markus.hanel@xxxxxx>
Date: Thu, 11 Mar 2004 10:06:50 +0100 (MET)
Hello,

I hope this is the right mailing list for this kind of topic. If not, do not
hesitate to ignore this posting or direct me to another mailing list.

Here is the problem:
My application is a web-server-centered program that uses mod_python and
has to process XML files. Most of the time these files contain ignorable
whitespace such as \n, \r and \t between the tags. The problem is that
minidom seems to interpret this whitespace as text nodes, and I cannot know
in advance how many of these "text nodes" sit between the real data nodes.
This seems to disturb the real structure of the DOM tree: the child nodes
are no longer the nodes I expect, etc. That makes it hard to write a
reliable XML application, since I cannot know how much whitespace the
writer/editor of the XML file has put between the tags. So I tried to get
rid of these unwanted text nodes with the following piece of code, but that
did not help either:


################################################################################
#
################################################################################
def cleanUpNodes( nodes ):
    """Removes all TEXT_NODES in parameter nodes that contain only characters
    that are defined as whitespace in the string library"""

    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE:
            node.data = string.strip(node.data)
   
    nodes.normalize()

################################################################################
#
################################################################################
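For comparison, here is a minimal sketch of a recursive variant. It is not the code from the application above, just one possible approach: instead of blanking the data of the direct children only, it removes whitespace-only text nodes at every level of the tree. The function name `remove_whitespace_nodes` and the `SAMPLE` string are made up for illustration, and the sketch is written for Python 3, where `string.strip` no longer exists and `str.strip` is used instead.

```python
from xml.dom import minidom, Node


def remove_whitespace_nodes(node):
    """Recursively delete text nodes that contain only whitespace."""
    # Iterate over a copy, because removeChild() mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            remove_whitespace_nodes(child)


# Hypothetical sample input, not the file from this post.
SAMPLE = "<root>\n  <a>\n    <b x='1'/>\n  </a>\n  <c>text</c>\n</root>"

dom = minidom.parseString(SAMPLE)
remove_whitespace_nodes(dom.documentElement)
child_names = [n.nodeName for n in dom.documentElement.childNodes]
print(child_names)
```

Note that this also drops whitespace-only text inside mixed content, where it may be significant, so it is only appropriate for data-style XML like the example file.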


I also tried pulldom, but it reports the whitespace as "CHARACTERS" events
and not as "IGNORABLE_WHITESPACE" events. Another thing is that pulldom
never seems to generate an "END_DOCUMENT" event?!
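As far as I understand, a non-validating parser with no DTD has no way of knowing which whitespace is ignorable, which would explain why it all arrives as character data. A minimal sketch of filtering such events by hand, assuming Python 3 and a made-up `SAMPLE` document (the helper name `significant_events` is also hypothetical):

```python
from xml.dom import pulldom

# Hypothetical sample input, not the file from this post.
SAMPLE = "<root>\n  <child a='1'/>\n  <leaf>data</leaf>\n</root>"


def significant_events(xml_text):
    """Yield (event, node) pairs, skipping whitespace-only CHARACTERS events."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.CHARACTERS and not node.data.strip():
            continue
        yield event, node


events = [(event, node.nodeName) for event, node in significant_events(SAMPLE)]
for event, name in events:
    print(event, name)
```

The filter only looks at the event type and the node's data, so real text such as the content of `<leaf>` still comes through.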

The big question is:
Does anybody know a way around this problem?
Am I missing something?
How can I get rid of these unwanted whitespace text nodes?

Here is an example that shows what the same code interprets as child nodes
when processing the same XML file with and without whitespace between the
tags:


<############### XML File with white spaces #################>
<root>

    <child_1>
        <child_11>
            <child_111 path="/qpers_data/" proto="file" />
        </child_11>
    </child_1>

    <child_2 type="admin" status="active" label="root">
        <child_21 path="/qnodes/admin/admin_root.xml" proto="file" />
    </child_2>

</root>

<############################# Code #############################>

#!/usr/bin/python

from xml.dom import minidom
from xml.dom import Node
import string

################################################################################
def cleanUpNodes( nodes ):
    """Removes all TEXT_NODES in parameter nodes that contain only characters
    that are defined as whitespace in the string library"""
    for node in nodes.childNodes:
        if node.nodeType == Node.TEXT_NODE:
            node.data = string.strip(node.data)
   
    nodes.normalize()

###############################################################################
def dumpTree( xmlFileIn, xmlFileOut ):
    
    try:
        dom = minidom.parse( xmlFileIn )
        file = open( xmlFileOut, "w" )
    except IOError, (errno, strerror):
        print "I/O error(%s): %s" % (errno, strerror )
        return
    
    cleanUpNodes( dom.documentElement )
    for node in dom.documentElement.childNodes:
        
        while ( node ):
            file.write( "\n node ->" + node.nodeName )
            file.write( node.toxml('ISO-8859-1') )
            node = node.firstChild

    file.close()
        
    return 1

###############################################################################
dumpTree( "index_wos.xml", "without_space.xml" )




<################ Output with XML with whitespace ################>

  node ->child_1<child_1>
        <child_11>
            <child_111 path="/qpers_data/" proto="file"/>
        </child_11>
    </child_1>
  node ->#text
        
  node ->child_2<child_2 label="root" status="active" type="admin">
        <child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>
    </child_2>
  node ->#text


<############### Output with XML without whitespace ##############>

  node ->child_1<child_1><child_11><child_111 path="/qpers_data/" proto="file"/></child_11></child_1>
  node ->child_11<child_11><child_111 path="/qpers_data/" proto="file"/></child_11>
  node ->child_111<child_111 path="/qpers_data/" proto="file"/>
  node ->child_2<child_2 label="root" status="active" type="admin"><child_21 path="/qnodes/admin/admin_root.xml" proto="file"/></child_2>
  node ->child_21<child_21 path="/qnodes/admin/admin_root.xml" proto="file"/>
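
The difference between the two outputs seems to come down to what `firstChild` returns: in the indented file, the first child of an element is the whitespace text node, so the `while` loop stops descending there. A small sketch of this, using hypothetical miniature versions of the two input files and a made-up helper `first_element_child` that skips non-element children (Python 3 assumed):

```python
from xml.dom import minidom, Node

# Hypothetical miniature versions of the two input files.
WITH_SPACE = "<root>\n  <child_1>\n    <child_11/>\n  </child_1>\n</root>"
WITHOUT_SPACE = "<root><child_1><child_11/></child_1></root>"


def first_element_child(node):
    """Return the first child that is an element, or None if there is none."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            return child
    return None


for label, text in (("with", WITH_SPACE), ("without", WITHOUT_SPACE)):
    root = minidom.parseString(text).documentElement
    print(label, root.firstChild.nodeName, first_element_child(root).nodeName)
# prints:
# with #text child_1
# without child_1 child_1
```

Walking the tree through a helper like this leaves the document untouched, which may be preferable when the whitespace has to survive a later `toxml()` call.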



regards,


markus


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

