Re: [xsl] Using PIs to set attributes

Subject: Re: [xsl] Using PIs to set attributes
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 08 Jun 2006 14:52:29 -0400

Steven,

This is about as classic a case of overlap as one is likely to see.

At 02:31 AM 6/8/2006, you wrote:
I've got some XML that looks like this:

<p>Programmatic access to objects is determined by the objects
that are
  <ul><?Fm Condstart API_Only?>
    <li>defined in your enterprise WSDL file</li>
    <li><?Fm Condend API_Only?><?Fm Condstart OT_Only?>
          available in the EntityNames[] array in the Session3 object
          <?Fm Condend OT_Only?></li>
    <li>in your organization configuration</li>
    <li>valid with your security access  ....

The processing instructions are designed to indicate conditional text
(if API is the target, include the content between the <?Fm Condstart
API_Only?> and  <?Fm Condend API_Only?>).

I'd like to process this XML and be able to replace it with something
like this:

<p>Programmatic access to objects is determined by the objects
that are
  <ul>
    <li platform="api">defined in your enterprise WSDL file</li>
    <li><ph platform="ot">available in the EntityNames[] array
           in the Session3 object</ph></li>
    <li>in your organization configuration</li>
    <li>valid with your security access  ....

I'm really not sure how to do this.  These PIs are ill-behaved, crossing
element boundaries, can be nested, and can cross each other's boundaries
as well.  In other words, you could also see this:

<p>Programmatic access to objects is determined by the objects
that are
  <ul><?Fm Condstart API_Only?>
    <li>defined in your enterprise WSDL file</li>
    <li><?Fm Condstart OT_Only?><?Fm Condend API_Only?>
          available in the EntityNames[] array in the Session3 object
          <?Fm Condend OT_Only?></li>
    <li>in your organization configuration</li>
    <li>valid with your security access  ....

Notice how OT_Only starts before API_Only ends?  I'm stumped, so any
advice would be greatly appreciated.

Unless you can find a way to narrow down the range of your possible inputs (say, by ruling out the kind of overlapping just shown), you are really going to find this tough going, and it may well be tough even then. The problem strikes directly at XML/XSLT's Achilles' heel, namely the assumption that everything we need to work with fits nicely into the document tree. I'm not saying it's impossible to deal with ... rather, that this is an area of active research.


If I didn't have to do this at scale, I might be inclined to start with tag-writing techniques (emitting start and end tags as literal text, typically with disable-output-escaping) -- which ordinarily I would stay very far away from, as they violate the spirit of XSLT and usually make for nothing but trouble -- and brace myself for a fair amount of cleanup, by hand or otherwise.

If I did have to do this at scale (and maybe even if not), I would try very hard to specify more constraints on the input; then I'd use either tag-writing (quick, dirty and dangerous) or pipelining/grouping methods to handle the range of pseudo-tag pairs I was prepared to accept. I might use Schematron or a similar analytic validation strategy to help enforce those constraints. For example, in this case it might be possible to flatten the hierarchy first, perhaps calculating offsets to determine where ranges are coterminous, and then use grouping methods to restore the hierarchy, this time with the extra information embedded.
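The flattening step might be sketched roughly as follows. This is Python rather than XSLT, purely to keep the illustration short and checkable, and it is only a sketch under stated assumptions: the names `flatten` and `SAMPLE` are hypothetical, the sample is a trimmed and closed-off version of the list from this thread, and every PI is assumed to have the form `<?Fm Condstart NAME?>` or `<?Fm Condend NAME?>`. It walks the document in order and records each run of text together with the conditions open at that point.

```python
from xml.dom import minidom, Node

# Hypothetical sample: a shortened, well-formed version of the list in this thread.
SAMPLE = """<ul><?Fm Condstart API_Only?>
  <li>defined in your enterprise WSDL file</li>
  <li><?Fm Condstart OT_Only?>available in the EntityNames[] array
    <?Fm Condend API_Only?> in the Session3 object<?Fm Condend OT_Only?></li>
  <li>in your organization configuration</li>
</ul>"""


def flatten(xml_text):
    """Return (text, open_conditions) pairs in document order.

    Assumes PIs look like <?Fm Condstart NAME?> / <?Fm Condend NAME?>;
    anything else is ignored.
    """
    active = []  # names of conditions currently open, in order of opening
    runs = []    # flattened result

    def walk(node):
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE and node.target == "Fm":
            parts = node.data.split()
            if len(parts) == 2 and parts[0] == "Condstart":
                active.append(parts[1])
            elif len(parts) == 2 and parts[0] == "Condend" and parts[1] in active:
                active.remove(parts[1])
        elif node.nodeType == Node.TEXT_NODE:
            text = " ".join(node.data.split())  # normalize whitespace
            if text:  # skip whitespace-only text between elements
                runs.append((text, tuple(active)))
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return runs


if __name__ == "__main__":
    for text, conditions in flatten(SAMPLE):
        print(conditions, text)
```

On this sample, the run "available in the EntityNames[] array" comes out tagged with both API_Only and OT_Only, while "in the Session3 object" is tagged with OT_Only alone. A later grouping pass could then wrap adjacent runs that share the same conditions, which is where the choice between marking an existing element and introducing a new one would have to be made.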

In your examples, it might be possible to do something considerably less than this -- though I do wonder why one of your implicit ranges gets marked on the <li> element, as <li platform="api">...</li>, while another comes out on a <ph> element, as <ph platform="ot">...</ph>. But you haven't told us what you want to happen in a case such as this:

<ul><?Fm Condstart API_Only?>
    <li>defined in your enterprise WSDL file</li>
    <li><?Fm Condstart OT_Only?>
          available in the EntityNames[] array
        <?Fm Condend API_Only?> in the Session3 object
          <?Fm Condend OT_Only?></li>
    <li>in your organization configuration</li>
    <li>valid with your security access  ....

Notice that in this case the ranges really do overlap, as there is text content that belongs to both the "API_Only" and "OT_Only" ranges ... can you be sure this will never happen? (If it can't happen, maybe your problem can be simplified.)
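One way to find out whether an input set ever contains this case is to compute offsets for each range and compare them. The following is a minimal sketch in Python, not a tested tool: the function names `condition_ranges` and `truly_overlapping` and the two samples are hypothetical, the samples are shortened, closed-off versions of the lists in this thread, and offsets are counted in non-whitespace text characters so that indentation does not matter.

```python
from itertools import combinations
from xml.dom import minidom, Node

# Hypothetical samples, shortened from the examples in this thread.
CROSSING = """<ul><?Fm Condstart API_Only?>
  <li>defined in your enterprise WSDL file</li>
  <li><?Fm Condstart OT_Only?><?Fm Condend API_Only?>
    available in the EntityNames[] array in the Session3 object
    <?Fm Condend OT_Only?></li>
</ul>"""

OVERLAP = """<ul><?Fm Condstart API_Only?>
  <li>defined in your enterprise WSDL file</li>
  <li><?Fm Condstart OT_Only?>
    available in the EntityNames[] array
    <?Fm Condend API_Only?> in the Session3 object
    <?Fm Condend OT_Only?></li>
</ul>"""


def condition_ranges(xml_text):
    """Return (name, start, end) triples, with offsets counted in
    non-whitespace text characters from the start of the document."""
    offset = 0
    open_at = {}   # condition name -> offset where it was opened
    ranges = []

    def walk(node):
        nonlocal offset
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE and node.target == "Fm":
            parts = node.data.split()
            if len(parts) == 2 and parts[0] == "Condstart":
                open_at[parts[1]] = offset
            elif len(parts) == 2 and parts[0] == "Condend" and parts[1] in open_at:
                ranges.append((parts[1], open_at.pop(parts[1]), offset))
        elif node.nodeType == Node.TEXT_NODE:
            offset += len("".join(node.data.split()))
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return ranges


def truly_overlapping(ranges):
    """Return pairs of condition names whose ranges share text without nesting."""
    pairs = []
    for (a, a0, a1), (b, b0, b1) in combinations(ranges, 2):
        shares_text = max(a0, b0) < min(a1, b1)
        nested = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
        if shares_text and not nested:
            pairs.append((a, b))
    return pairs
```

With these samples, `truly_overlapping(condition_ranges(CROSSING))` is empty, because no text falls between the start of OT_Only and the end of API_Only, so the two ranges merely abut. For `OVERLAP` it reports the pair `("API_Only", "OT_Only")`. A check of this kind, or an equivalent Schematron rule, could serve to enforce a "no true overlap" constraint before any transformation is attempted.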

There's a fair amount of literature on the general topic of overlapping structures in markup, and several different approaches to dealing with it, but none so mature that anything like an off-the-shelf solution is readily available.

Given the right search terms, Google might point you to
http://mulberrytech.com/Extreme/Proceedings/html/2004/Piez01/EML2004Piez01.html
or any of a number of other papers that have been written on this topic.

Good luck,
Wendell
