At 2010-08-04 17:39 -0400, steve.majewski@xxxxxxxxx wrote:
> When our EAD/XML files are edited, we run them thru a stylesheet that
> checks that certain sections all have @id attributes, and if not, adds
> them using generate-id().
That sounds risky to me. I tell my students that one should use
generate-id() for *every* element with an @id and adjust any
@idref attributes to use the translated values. There is an
infinitesimal but real chance that an authored id attribute will
match a generated id attribute.

The uniqueness of identifiers is guaranteed only when generate-id()
is used for every identifier. This makes sense because generate-id()
has no way of knowing which of your attributes are identifiers and
which are not.

You've increased the risk by running a document with generated
identifiers through a process that again generates identifiers using
the same implementation-defined algorithm. But you haven't protected
the identifiers on the way in from the identifiers being generated
the second time.
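A minimal sketch of the kind of stylesheet being described, assuming an identity transform plus one rule for sections that lack an id. The element name `section` is borrowed from the example later in this message and is an assumption; the real stylesheet and its EAD element names are not shown in this thread.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical rule: a section without an id gets a generated one.
       Existing ids, whether authored or generated by an earlier run,
       are copied through untouched, so nothing prevents a fresh
       generate-id() value from matching one of them. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The weak spot is the second template: it only looks at elements that have no id, and never checks the new value against the ids already in the document.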
> I've recently discovered that some of those files now have duplicate
> ids. I think we've had a misconception about the uniqueness of
> generated ids. A closer reading of M. Kay's book, as well as searching
> this list's recent archives, says that it's "guaranteed to be unique
> for every node that participates in a given transform".
>
> Additionally, in one of those other threads, Florent Georges wrote:
>
>> Yes. And it is guaranteed to generate always the same ID when
>> called on the same node.
>
> I suspect that what was not explicitly stated but implied by that
> clause is that it is unique for nodes *generated* in a given
> transform, not including those ids that are passed thru and copied
> from the input to the output doc.
False. Every time a tree is created, be it from the source tree,
from a document() or doc() function, or from a temporary tree variable,
that tree will be made up of nodes. Every node across all trees in
the one transformation will have a unique identifier. Said
differently, no two nodes across all trees in the one transformation
will have the same identifier.

But that is all. Nothing is said about what the user uses for
identifiers in the authored content.
> We have generated node ids from previous transforms. Usually, these
> do seem to be unique -- I suspect because of that additional condition
> above about "same node".
>
> I think the cases where we do have duplicates were when a new element
> was inserted above another of the same kind, with a previously
> generated id. This new node -- although having entirely different
> content -- is considered "the same node" in the sense that it has the
> same xpath, for example: /ead/archdesc/dsc/c01[1]/c02[1]
> ( the previous node, being "pushed down" to //c02[2] )
The uniqueness of generated identifiers is *not* guaranteed from one
transformation to the next. When you pass a document through a second
transformation, the engine's determination of uniqueness starts from
scratch, without any knowledge of any id values in your input document.

If you follow the scheme I teach my students, then you get back to
being unique across all nodes ... the values simply change every time
a transformation is performed.
> Am I (finally!) understanding this correctly?
I'm not sure, as I didn't really follow your explanation: I could not
correlate your uses of "usually" and "the cases where" and
"these seem to".
> Does the above sound like a reasonable and likely explanation of
> what's happening?
I think so if what you are finding is that:

  <section id="x">
  <section>
  <xref idref="x"/>

... gets written out as:

  <section id="x">
  <section id="gen-e3">
  <xref idref="x"/>

... which when you then add a new section:

  <section id="x">
  <section>
  <section id="gen-e3">
  <xref idref="x"/>

... gets transformed to become:

  <section id="x">
  <section id="gen-e3">
  <section id="gen-e3">
  <xref idref="x"/>

... because the new section is again the third element in the
document ... and you have a duplicate. Note that my example values
are not ones generate-id() could actually return, because a generated
id cannot contain a "-", but I'm using them to illustrate my point.
Counting only elements is also a poor model of the algorithm, since
text nodes are also nodes with unique identifiers. But this is
just an example.
Now, if you follow my advice to students, then:

  <section id="x">
  <section>
  <xref idref="x"/>

... gets written out as:

  <section id="gen-e2">
  <section id="gen-e3">
  <xref idref="gen-e2"/>

... which when you then add a new section:

  <section id="gen-e2">
  <section>
  <section id="gen-e3">
  <xref idref="gen-e2"/>

... gets transformed to become:

  <section id="gen-e2">
  <section id="gen-e3">
  <section id="gen-e4">
  <xref idref="gen-e2"/>

... and if the first section had moved, then the idref= would have
also changed to be the new id= value for that first section. Every
node with an ID gets written out not with the authored ID but with
the generated ID ... and every IDREF gets written out with the
generated ID of the node it points to.
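One way this rule of thumb might be written, as a sketch only. It assumes the attribute names `id` and `idref` and the element name `section` from the example above, that each idref holds a single id value, and a key named `by-id` that is introduced here purely for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical key: index every element by its incoming id value,
       so references can be resolved without relying on a DTD. -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Identity template for everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Every incoming id is replaced by its element's generated id. -->
  <xsl:template match="@id">
    <xsl:attribute name="id">
      <xsl:value-of select="generate-id(..)"/>
    </xsl:attribute>
  </xsl:template>

  <!-- Every idref is replaced by the generated id of the element it
       points to. A dangling reference produces an empty value. -->
  <xsl:template match="@idref">
    <xsl:attribute name="idref">
      <xsl:value-of select="generate-id(key('by-id', .)[1])"/>
    </xsl:attribute>
  </xsl:template>

  <!-- Sections that arrive without an id get one as well. -->
  <xsl:template match="section[not(@id)]">
    <xsl:copy>
      <xsl:attribute name="id">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because every id in the output comes from generate-id() in the same run, the result cannot contain two equal ids, whatever the input held. The actual strings are implementation-defined and will generally differ from the gen-e2 style used in the example.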
This comes up also in my XSL-FO instruction, because when you are
aggregating multiple XML documents into a single XSL-FO output, and
you are dealing with user-authored id values, you cannot use them as
is because the value space for each document is independent. It
would be too easy for two documents to have the same ID, so you
cannot put that ID into the XSL-FO because that would create a conflict.
So, if you follow my rule of thumb, with *every* ID replaced by that
node's generated identifier and every corresponding IDREF replaced by
the referenced node's generated identifier, then everything is safely
identified across all documents being aggregated and there are no
ambiguous references.
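The same idea applied to aggregation might look like the sketch below. The driver document shape (`files` containing `file` elements with an `href` attribute) and the `aggregate` wrapper element are assumptions made up for illustration, and the output is plain XML rather than real XSL-FO.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical key: elements indexed by their authored id. -->
  <xsl:key name="by-id" match="*[@id]" use="@id"/>

  <!-- Hypothetical driver document:
       <files><file href="a.xml"/><file href="b.xml"/></files> -->
  <xsl:template match="/files">
    <aggregate>
      <xsl:apply-templates select="document(file/@href)/*"/>
    </aggregate>
  </xsl:template>

  <!-- Identity template for everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- generate-id() is unique across all trees in one transformation,
       so equal authored ids in two documents come out distinct. -->
  <xsl:template match="@id">
    <xsl:attribute name="id">
      <xsl:value-of select="generate-id(..)"/>
    </xsl:attribute>
  </xsl:template>

  <!-- key() searches only the document containing the context node,
       so each idref is resolved within its own source document. -->
  <xsl:template match="@idref">
    <xsl:attribute name="idref">
      <xsl:value-of select="generate-id(key('by-id', .)[1])"/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>
```

If a.xml and b.xml both contain id="x", the two elements receive different generated ids, and each document's references follow their own target.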
I hope this helps.
. . . . . . . . . . . . Ken
Crane Softwrights Ltd.
G. Ken Holman mailto:gkholman@xxxxxxxxxxxxxxxxxxxx