Subject: RE: [xsl] Identifying output from the (MS) xml parser
From: "MarrowSoft Support" <marrow@xxxxxxxxxxxxxx>
Date: Tue, 10 Dec 2002 09:21:01 -0000

Hi Hugh,

> I believe the output from the parser must be one of the following:
> A result tree;
> A wide (Unicode) string;
> An ASCII (8bit) string;

The choices here are slightly contradictory. The transformation will output XML (which I assume is what you are calling a result tree), HTML or text, according to the @method attribute of <xsl:output> (see also below). But that is not really related to the encoding, i.e. whether the output is UTF-16 or not.

The transformation engine (parser) will actually output either a UTF-16 string (BSTR) or a stream, where that stream might be encoded as ASCII or any of a multitude of other encodings. It will never output an 8-bit ASCII string as such; output in that encoding would have to go into an output stream.

> I believe which of these is produced will be determined by the
> <xsl:output> element.

Yes, the output is determined by the <xsl:output> element, if it is present (the element is not obligatory in the stylesheet, nor are any of its deciding attributes).

The method is determined by the @method attribute (if present) of the <xsl:output> element. If the @method attribute is omitted then the transformation engine will use defaults, which means the output will be either XML or HTML (see http://www.w3.org/TR/xslt#output). The default is to output XML unless the first output element is named <HTML> (in any case combination), in which case it assumes the output is HTML.

The encoding is determined by the @encoding attribute (if present) of the <xsl:output> element. If not specified then the default is always UTF-16.

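To make the default-method rule concrete, here is a minimal C++ sketch of the test described in section 16 of the XSLT 1.0 spec: the default is HTML only if the first element child of the result root has the local name "html" (in any case combination), is in no namespace, and is preceded by nothing but whitespace text. The struct and function names are hypothetical, nothing here is an MSXML API, and I have not checked that MSXML follows the spec rule in every detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical summary of the first element child of the result tree's root.
struct FirstResultElement {
    bool exists;                    // false if the root has no element child
    std::string localName;          // e.g. "html", "HTML", "report"
    std::string namespaceUri;       // empty string means "no namespace"
    bool onlyWhitespaceTextBefore;  // true if any preceding text is whitespace only
};

enum class OutputMethod { Xml, Html };

// ASCII case-insensitive comparison of a name against "html".
inline bool isHtmlName(const std::string& name) {
    std::string lower(name);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower == "html";
}

// Default output method when <xsl:output> has no @method (XSLT 1.0, section 16).
inline OutputMethod defaultOutputMethod(const FirstResultElement& e) {
    if (e.exists && e.namespaceUri.empty() && e.onlyWhitespaceTextBefore &&
        isHtmlName(e.localName)) {
        return OutputMethod::Html;
    }
    return OutputMethod::Xml;
}
```

Note that an <html> element in the XHTML namespace does not trigger the HTML default under this rule, so the result would be serialized as XML.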
But there is a big gotcha with the encoding (and the cause of the biggest FAQ with MSXML): even if the @encoding attribute is specified you may still end up with UTF-16 output, depending on which method you used to perform the transformation:

1) If you use the .transformNode() method then the output will always be UTF-16, because the result of that method is a BSTR - so it must, by nature, be encoded as UTF-16.

2) If you use the .transformNodeToObject() method then the output will be UTF-16 if the second parameter of that method call is a DOM object. But if the second parameter is a stream object (i.e. one that supports a .write() method) then the output will be encoded according to the @encoding attribute.

3) If you use the IXSLProcessor/IXSLTemplate interfaces to perform the transformation then whether you get UTF-16 or the encoding specified by the @encoding attribute depends on how you use them. This is because the .output property of the IXSLProcessor interface can either be set prior to the transformation or just read after it. If the .output property is assigned a stream object prior to the transformation then the stream will be written to in the specified encoding. But if the .output property is only read after the transformation then the output will always be UTF-16, because the .output property, in this case, can only contain a BSTR.

If you don't know what the user is delivering to you to be transformed then it is probably best to use the IXSLProcessor/IXSLTemplate interfaces to perform the transformation, and to set the .output property to a stream object prior to the transformation.

> Do I have to interrogate the style sheet to find this information?

This is probably unwise to do as the first step.
It may be something that you could use to clarify things, but it would have to be a matter of elimination and detection first, followed by checking what may have been specified on the <xsl:output> element.

> If so, can I assume that the <xsl:stylesheet> element is the second node
> in the xslt tree, or at least a top level element?

Not really - the <xsl:stylesheet> element, if used, must be the root element, but the stylesheet might not contain an <xsl:stylesheet> element at all (see http://www.w3.org/TR/xslt#result-element-stylesheet).

> If the output method is "xml" or "html" the output must be a result
> tree/xml document.

Not really - bear in mind that the whole point of the HTML output method is to be able to generate HTML, which may not constitute well-formed XML.

> I have also seen an attribute of <output> called media-type, but have
> not been able to find any documentation on this. Can anyone comment on
> this?

The documentation on this is in the spec, but I don't think it has any great impact on what you are doing. With MSXML the only time you will see any effect is when you use an output @method of HTML, where the media type is placed in the @content attribute of a <META> tag.

> If the <output> "method" is neither "xml" not "html", then I assume the
> output is a character stream.

It is best not to confuse the output method with the encoding, as they are, for the most part, unrelated.

> Whether this uses 16 bit or 8
> bit/multibyte characters will depend upon the "encoding" attribute. Is
> there a concise list of the "encoding" values that result in characters
> of a particular size, or some other way to determine this information?

For MSXML there isn't even a concise list of the encodings that are supported, because this will vary from machine to machine, depending on what language packs etc. are present on that machine.
But whether a particular encoding is 16-bit or 8-bit is probably going to be a distraction rather than a help. You won't want to write code that copes with every encoding when there are Windows APIs that will convert everything for you (take a look at MultiByteToWideChar() and WideCharToMultiByte()) - but in order to use these you will need to ascertain the encoding of the output.

> My user supplies both xml and xslt input.
> The input may generate a new xml document, or a flat file.
> I am trying to determine what comes out of the ms parser.
> Could someone(s) please advice me of the accuracy, or otherwise, of the
> following statements?

This would all depend on what you then want to do with the output. If you are just going to save the output to, say, a file then the encoding shouldn't matter - just save the file as is from an output stream. You may, of course, need to figure out the output type (XML, HTML or text) in order to determine the best file extension to give the output file.

You might also do well to look into BOMs (Byte Order Marks), as these, if present on the output, will give you a good indication of the encoding that was used (see also http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing and http://www.unicode.org).

Hope this helps
Marrow
http://www.marrowsoft.com - home of Xselerator (XSLT IDE and debugger)
http://www.topxml.com/Xselerator

-----Original Message-----
From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Hugh Dixon
Sent: 10 December 2002 03:34
To: XSL-List@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] Identifying output from the (MS) xml parser

I am writing some C++ code to run under windows, using the MSXML DOM implementation.

My user supplies both xml and xslt input. The input may generate a new xml document, or a flat file. I am trying to determine what comes out of the ms parser.
Could someone(s) please advice me of the accuracy, or otherwise, of the following statements?

I believe the output from the parser must be one of the following:
A result tree;
A wide (Unicode) string;
An ASCII (8bit) string;

I believe which of these is produced will be determined by the <xsl:output> element.

Do I have to interrogate the style sheet to find this information?

If so, can I assume that the <xsl:stylesheet> element is the second node in the xslt tree, or at least a top level element?

I believe the <xsl:output> element can only be a direct child (topmost element) of the <xsl:stylesheet> element. Could someone confirm this?

If the output method is "xml" or "html" the output must be a result tree/xml document.

I have also seen an attribute of <output> called media-type, but have not been able to find any documentation on this. Can anyone comment on this?

If the <output> "method" is neither "xml" not "html", then I assume the output is a character stream. Whether this uses 16 bit or 8 bit/multibyte characters will depend upon the "encoding" attribute. Is there a concise list of the "encoding" values that result in characters of a particular size, or some other way to determine this information?

Thanks!!!

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
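
As a footnote to the BOM suggestion in the reply above, here is a rough C++ sketch of sniffing the first bytes of an output stream for a byte order mark. The function name sniffBom and the string labels it returns are made up for illustration. Only the well-known Unicode BOM byte sequences are checked, so an empty result means "no BOM found", not "not Unicode".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns a label for the BOM at the start of the buffer, or "" if none is found.
// (Hypothetical helper; the labels are illustrative, not Windows code page names.)
inline std::string sniffBom(const unsigned char* data, std::size_t size) {
    // Test the 4-byte UTF-32 marks first: FF FE 00 00 also begins with the
    // 2-byte UTF-16LE mark FF FE.
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF) {
        return "UTF-32BE";
    }
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00) {
        return "UTF-32LE";
    }
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";
    }
    return "";
}
```

If no BOM is present and the output is XML, the encoding pseudo-attribute in the XML declaration is the next thing to look at, as described in the sec-guessing appendix linked in the reply. Once the encoding is known, the corresponding code page could be passed to MultiByteToWideChar().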