RE: [xsl] How to strip all optional but empty elements from a XML doc?

Subject: RE: [xsl] How to strip all optional but empty elements from a XML doc?
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 4 Aug 2009 08:32:34 +0100
> Assume I have an XML doc and an XSD schema file which defines 
> mandatory and optional element fields.
> 
> How can I now go recursively through the whole XML document 
> and find all elements which are optional (e.g. minOccurs="0") 
> but currently EMPTY? These empty, optional elements should be 
> stripped off so that the resulting XML does not contain them any more.
> 
> How can I achieve this with an XSLT script?
> 

Not easy, but here's a suggestion: it relies on Saxon, but avoids requiring
any new Java extension functions.

Firstly, generate an SCM document containing the schema component model:

java com.saxonica.Validate -xsd:my-schema.xsd -scmout:my-schema.scm

The SCM file holds a "normalized" XML representation of the schema which is
much more amenable to XSLT processing than your original schema documents
(for example, there is no need to follow references to named model groups,
they are all expanded).

Now use a schema-aware transformation in which the input document has been
validated (and therefore type-annotated) against the schema.

When you reach an element in the instance document, call
saxon:type-annotation() to get the name of the type of the element. Use this
to find the complex type in the SCM file: it might look something like this:

<scm:complexType id="C110" base="#anyType" derivationMethod="restriction"
                 abstract="false"
                 name="sometype" targetNamespace="http://some-namespace"
                 variety="element-only">
      <scm:attributeUse required="true" ref="C152"/>
      <scm:attributeUse required="true" ref="C153"/>
      <scm:attributeUse required="false" ref="C160"/>
      <scm:attributeUse required="true" ref="C161"/>
      <scm:modelGroupParticle minOccurs="1" maxOccurs="1">
         <scm:sequence>
            <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C59"/>
            <scm:elementParticle minOccurs="0" maxOccurs="unbounded" ref="C93"/>
            <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C45"/>
            <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C67"/>
            <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C121"/>
            <scm:elementParticle minOccurs="0" maxOccurs="unbounded" ref="C57"/>
         </scm:sequence>
      </scm:modelGroupParticle>
      ...
</scm:complexType>
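
To make the lookup step concrete, here is a minimal sketch of finding a complex type by name and namespace in an SCM document. It is written in Python rather than XSLT purely to show the logic. The namespace URI, the scm:schema wrapper element and the trimmed sample fragment are assumptions for illustration; check them against your own SCM file.

```python
import xml.etree.ElementTree as ET

# Assumed SCM namespace URI; verify it against the root element of your .scm file.
SCM_NS = "http://ns.saxonica.com/schema-component-model"

# Trimmed, hypothetical SCM fragment modelled on the example above.
SCM_SAMPLE = f"""
<scm:schema xmlns:scm="{SCM_NS}">
  <scm:complexType id="C110" name="sometype"
                   targetNamespace="http://some-namespace" variety="element-only">
    <scm:modelGroupParticle minOccurs="1" maxOccurs="1">
      <scm:sequence>
        <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C59"/>
        <scm:elementParticle minOccurs="0" maxOccurs="unbounded" ref="C93"/>
      </scm:sequence>
    </scm:modelGroupParticle>
  </scm:complexType>
</scm:schema>
"""


def find_complex_type(scm_root, local_name, namespace):
    """Return the scm:complexType with the given name and namespace, or None."""
    for ctype in scm_root.iter(f"{{{SCM_NS}}}complexType"):
        if (ctype.get("name") == local_name
                and ctype.get("targetNamespace", "") == namespace):
            return ctype
    return None


scm_root = ET.fromstring(SCM_SAMPLE)
# The name and namespace would come from the element's type annotation.
ctype = find_complex_type(scm_root, "sometype", "http://some-namespace")
particle_refs = [p.get("ref") for p in ctype.iter(f"{{{SCM_NS}}}elementParticle")]
```

In XSLT the same lookup would more naturally be a key over scm:complexType, indexed on the name and targetNamespace attributes.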

The ref attribute on the elementParticle points to the element declaration,
something like this:

<scm:element id="C59" name="elementName"
                targetNamespace="http://something.com/namespace"
                type="C60"
                global="true"
                nillable="false"
                abstract="false"/>

Now when processing the children of the element in the instance, it should
be reasonably easy to match them up with the scm:elementParticle children of
this scm:complexType to see what the minOccurs value is.
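
As a rough sketch of that matching step (again in Python, only to show the logic), the following follows each elementParticle's ref to its element declaration and collects the minOccurs values per element name. The namespace URI, the wrapper element and the element names "note" and "title" are assumptions made up for the example. It only looks at the particle's own minOccurs and ignores optional enclosing groups and choices, so it errs on the side of treating elements as mandatory.

```python
import xml.etree.ElementTree as ET

# Assumed SCM namespace URI; verify it against your own .scm file.
SCM_NS = "http://ns.saxonica.com/schema-component-model"

# Hypothetical SCM fragment: one complex type plus the declarations it refers to.
SCM_SAMPLE = f"""
<scm:schema xmlns:scm="{SCM_NS}">
  <scm:complexType id="C110" name="sometype" targetNamespace="http://some-namespace">
    <scm:modelGroupParticle minOccurs="1" maxOccurs="1">
      <scm:sequence>
        <scm:elementParticle minOccurs="0" maxOccurs="1" ref="C59"/>
        <scm:elementParticle minOccurs="1" maxOccurs="1" ref="C93"/>
      </scm:sequence>
    </scm:modelGroupParticle>
  </scm:complexType>
  <scm:element id="C59" name="note" targetNamespace="http://some-namespace"/>
  <scm:element id="C93" name="title" targetNamespace="http://some-namespace"/>
</scm:schema>
"""


def min_occurs_by_name(scm_root, complex_type):
    """Map (namespace, local name) to the minOccurs of every particle with that name."""
    # Index element declarations by id so particle refs can be resolved.
    decls = {e.get("id"): e for e in scm_root.iter(f"{{{SCM_NS}}}element")}
    result = {}
    for particle in complex_type.iter(f"{{{SCM_NS}}}elementParticle"):
        decl = decls[particle.get("ref")]
        key = (decl.get("targetNamespace", ""), decl.get("name"))
        result.setdefault(key, []).append(int(particle.get("minOccurs", "1")))
    return result


scm_root = ET.fromstring(SCM_SAMPLE)
ctype = next(scm_root.iter(f"{{{SCM_NS}}}complexType"))
occurs = min_occurs_by_name(scm_root, ctype)
```

Keeping a list of values per name, rather than a single value, leaves room for content models in which the same element name appears in more than one particle.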

There are two complications you'll have to think about:

(a) If the complex type is anonymous, saxon:type-annotation() returns a
system-generated name, and this name isn't present in the SCM file. So you
might need to do an initial transformation on your schema document to ensure
that all types (simple and complex) have explicit names.

(b) It's possible to have a content model in which two element particles
have the same name but different minOccurs values: for example, it's legal
to have the content model (a, a, a?), in which the first two a children are
mandatory and the third is optional. Matching each child element to a
specific element particle when the names aren't unique is possible, but
probably more work than you want to do. A safe rule is to remove the
child element only if ALL the particles with a matching element name have
minOccurs="0".
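
The safe rule in (b) can be sketched as follows, in Python for illustration only. The mapping passed in is a hypothetical result of the SCM lookup for one parent type, and treating an element as "empty" when it has no children, no attributes and no non-whitespace text is my assumption; adjust that to whatever "empty" means for your data.

```python
import xml.etree.ElementTree as ET


def is_empty(elem):
    """True if the element has no child elements, no attributes and only whitespace text."""
    return len(elem) == 0 and not elem.attrib and not (elem.text or "").strip()


def strip_optional_empty(parent, min_occurs):
    """Remove empty children of `parent` whose matching particles are ALL optional.

    `min_occurs` maps an element tag to the minOccurs values of every
    particle with that name in the parent's content model.
    """
    for child in list(parent):  # copy the list, since we remove while iterating
        values = min_occurs.get(child.tag, [])
        if is_empty(child) and values and all(v == 0 for v in values):
            parent.remove(child)


doc = ET.fromstring("<r><a>1</a><a>2</a><a/><b/><c/></r>")
# Hypothetical lookup result for the content model (a, a, a?, b?, c).
strip_optional_empty(doc, {"a": [1, 1, 0], "b": [0], "c": [1]})
remaining = [child.tag for child in doc]
```

Here the empty third a survives because not every a particle is optional, b is removed, and c is kept because it is mandatory. A full solution would apply this to every element in the document, using the particles of that element's own type each time.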

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay 
