Subject: [xsl] Initial whitespace in PI from XSLT, main body
From: "Bauman, Syd s.bauman@xxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 7 May 2022 21:14:19 -0000
[Could not post the whole thing due to the size limitation on the list. The complete text version and separate appendices are currently available at https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/. Since that is not a permanent store (hence the "temp" in the path), I will post the appendices [A], [B], and [C], hopefully as a reply to this, shortly.]

I have discovered a discrepancy between Saxon on the one hand and xsltproc & my intuition on the other when it comes to writing a processing instruction whose string value starts with whitespace. E.g.

    <?syd This is a test. This is only a test. ?>

Reading

When reading this PI, I fully expect the string value to start with the letter "T" and end with the string "t. ". This makes sense because the XML spec, in production 16, defines a PI as

    '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

where, of course, 'S' is one or more occurrences of any of the four whitespace characters. While the value string is not really defined in the prose, it is clear from the production that the S is only required if there is a string. This implies that the purpose of the S is to separate the PITarget from the string. I am used to greedy matching, so it makes sense to me that a parser would think of any and all whitespace immediately following the PITarget as a delimiter, and thus not return it as part of the value string.

I grant that, as far as my small brain can tell, it would not be against the production for a parser to use non-greedy matching, decide only the first whitespace character matches the S, and that all following whitespace characters should match "Char*". But that is not what I expect, because it seems to violate the spirit of the production: if that were the desired result, why wouldn't the spec use "(#x20 | #x9 | #xD | #xA)" between the PITarget and the rest, rather than "(#x20 | #x9 | #xD | #xA)+"?
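
The two readings of S in production 16 described above can be sketched with ordinary regular expressions. This is only an illustration, not how any real XML parser is implemented: the NAME pattern is a simplified stand-in for PITarget, and the names GREEDY, LAZY, and pi_data are made up for this example.

```python
import re

# Simplified stand-ins (assumptions, not the spec's exact productions):
# NAME approximates PITarget; S is one of the four XML whitespace characters.
NAME = r"[A-Za-z_][A-Za-z0-9._-]*"
S = r"[ \t\r\n]"

# Greedy reading: every whitespace character after the target belongs to S.
GREEDY = re.compile(rf"<\?({NAME})(?:({S}+)(.*?))?\?>", re.DOTALL)
# Non-greedy reading: only the first whitespace character belongs to S.
LAZY = re.compile(rf"<\?({NAME})(?:({S}+?)(.*?))?\?>", re.DOTALL)


def pi_data(pattern, text):
    """Return the data part of the first PI in text, or None if it has none."""
    match = pattern.search(text)
    if match is None:
        raise ValueError("no processing instruction found")
    return match.group(3)


if __name__ == "__main__":
    sample = "<?syd   This is a test. ?>"
    print(repr(pi_data(GREEDY, sample)))        # leading spaces treated as delimiter
    print(repr(pi_data(LAZY, sample)))          # leading spaces kept in the data
    print(repr(pi_data(LAZY, "<?syd   ?>")))    # data consisting only of whitespace
```

Under the non-greedy pattern the last call yields a data string made up of nothing but whitespace, while the greedy pattern yields an empty string for the same input.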
Furthermore, if this were the parsing algorithm, it would be possible to end up with a string value of a PI that contained nothing but whitespace characters. While not utterly insane, it does seem to be the kind of complication that is likely to be more trouble than it is worth. Besides, as I said, I am used to greedy matching and expect writers of XML parsers to be like me. :-)

And, perhaps more importantly, the string value of a processing instruction node in the XDM is defined as "The data part of the source PI, not including the whitespace that separates it from the PITarget."

Writing

But what if I try to write a PI whose string value starts with one or more whitespace characters? First, we know the processor is required to write out one or more whitespace characters between the PITarget and the value string. I presume (without knowing for sure) that the processor is welcome to use whatever set of whitespace characters it wants to separate the PITarget from the rest when it serializes a PI. (I have never seen nor heard of a processor that uses anything other than a single space (U+0020) character, myself.) I further suspect that most processors would choose to not use any whitespace characters when serializing a PI that does not have a value string.

But if I am explicitly giving the processor a string to use as the value of the PI that starts with space, I sort of expect that string, including the leading space, to appear in the output after whatever space the processor normally uses to separate a PITarget from a value string. And that is the behavior I get from xsltproc.[B]<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_B_xsltproc_output.xml> But it is not the behavior I get from Saxon.[C]<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_C_Saxon_output.xml>

So is Saxon in error, or is xsltproc in error, or is the spec ambiguous and either behavior is OK, or something else?

P.S.
I have tried a few combinations of the -strip: command-line parameter to Saxon, and changing the program[A]<https://bauman.zapto.org/~syd/temp/DSG/Initial_whitespace_in_PI_from_XSLT/Appendix_A_XSLT_and_input.xslt> from an XSLT 1.0 program to an XSLT 3.0 program, with the same results.

Notes

SaxonJ-HE 11.2 run in GNU bash on an Ubuntu 20.04.4 system.

Using libxml 20910, libxslt 10134 and libexslt 820 on same system.

https://www.w3.org/TR/REC-xml/

This becomes clearer if you reduce all that 'any sequence of characters except "?>"' stuff to something simple:

    '<?' PITarget (S (StringSansQuestionPointy) )? '?>'

I have to admit, though, the fact that the spec lists the illegal PITargets as '" XML ", " xml "', putting spaces around the illegal Names, gives me pause. If there were only a space after, it would really boggle my thought process. But since there is space both before and after I suspect it is not intended, and this is just an error or editorial style I disagree with.

Kay, Michael, _XSLT 2.0 and XPath 2.0_, 4th ed. Wiley Publishing, Inc., Indianapolis, IN. p. 51.
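
For comparison with the write-then-read question raised in the Writing section, here is a minimal round-trip sketch using Python's standard xml.dom.minidom module (backed by the expat parser). It is not one of the two XSLT processors tested above, so it says nothing about which of them is right; the helper names write_pi and read_pi_data are hypothetical.

```python
from xml.dom import minidom


def write_pi(target, data):
    """Serialize a PI node built from target and data with minidom."""
    pi = minidom.Document().createProcessingInstruction(target, data)
    return pi.toxml()


def read_pi_data(serialized_pi):
    """Parse a serialized PI (wrapped in a dummy root) and return its data."""
    doc = minidom.parseString("<r>" + serialized_pi + "</r>")
    return doc.documentElement.firstChild.data


if __name__ == "__main__":
    # The data string deliberately starts with two spaces.
    written = write_pi("syd", "  two leading spaces")
    print(repr(written))                 # the serialized PI
    print(repr(read_pi_data(written)))   # the data as reported after re-parsing
```

Comparing the two printed values shows whether whitespace that was part of the data on output is still part of the data after the PI is parsed again.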