Subject: Re: [xsl] vcf to xml?
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 25 Jun 2022 19:48:31 -0000

Norm Tovey-Walsh ndw@xxxxxxxxxx writes:

> "Pieter Masereeuw pieter@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:
>> I wonder why nobody is mentioning invisible XML here. See
>> https://invisiblexml.org/.
>
> Because if I mention it, I'll get interested in writing the grammar and
> I don't have the time! :-)

(Sigh.) Well, I guess it just goes to show that not all of us have as
much self-discipline as Norm does. Dave Pawson and I did write the
grammar.

When an ixml processor is presented with the ixml grammar given below
and the vCard data in Eliot Kimber's earlier email, it produces the
output shown below.

To test this, interested readers can try the invisible XML processors
listed at https://invisiblexml.org/ -- perhaps the simplest approach is
to just paste the input and the grammar into the text fields of
jωiXML [1]. If you want a command-line interface, Coffeepot [2] is your
friend. The nameless parser at [3] also offers a web interface, but
does not (yet) support the 'insertions' construct recently added to the
spec and used in the grammar below. (Coming soon, I hope.)

[1] https://github.com/johnlumley/jwiXML
[2] https://coffeepot.nineml.org/
[3] https://www.cwi.nl/~steven/ixml/tutorial/run.html

The XML produced by this grammar is less compact than the output from
Eliot's stylesheet, but the grammar does handle continuation lines,
parameters, parameters with multiple semicolon-separated values,
multiple comma-separated property values, and escaped colons,
semicolons, commas, newlines, and backslashes.

And if for downstream processing one wants <ADR>...</> instead of
<property name="ADR">...</>, well, it's easy to write an XSLT
stylesheet to do that.

The language of vCards is simple enough that you don't really need a
context-free grammar for it: there is no recursion. But the rules are
complex enough that trying to do it with regular expressions would be
a challenge.
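
To make the folding and escaping rules concrete, here is a rough sketch in Python of what they amount to when handled procedurally. It does not reproduce how the ixml grammar works. It assumes the unfolding rule of RFC 6350 (a line break followed by a single space or tab is dropped) and the backslash escapes listed in the grammar's comments, plus "\N", which RFC 6350 also allows. The names FOLD, ESCAPE, unfold, and unescape are invented for illustration.

```python
import re

# Assumption: RFC 6350 unfolding. A line break followed by exactly ONE
# space or tab is invisible; any further whitespace is real content.
FOLD = re.compile(r"\r?\n[ \t]")

# Backslash escapes. "\:" is accepted because the grammar in this message
# allows it; "\N" is accepted because RFC 6350 does.
ESCAPE = re.compile(r"\\([nN,;:\\])")


def unfold(text: str) -> str:
    """Remove invisible line breaks from raw vCard text."""
    return FOLD.sub("", text)


def unescape(value: str) -> str:
    """Turn escape sequences in a single, already split value into characters."""
    return ESCAPE.sub(
        lambda m: "\n" if m.group(1) in "nN" else m.group(1),
        value,
    )


if __name__ == "__main__":
    folded = 'LABEL="United States of A\r\n merica"'
    print(unfold(folded))
    print(unescape(r"100 Waters Edge\nBaytown\, LA 30314"))
```

Splitting a value on unescaped commas and semicolons has to happen before unescape is applied, and that is the part that gets awkward with regular expressions alone.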
It's a huge help to be able to write separate rules for things like
'invisible line break followed by a tab', 'invisible line break
followed by one blank', and 'invisible line break followed by two
blanks' -- all of which require different treatment. If I were doing
this in XSLT, I would build up the regular expression bit by bit with
variables, which allow a similar separation of concerns, but are (in
my limited experience) less convenient than adding a nonterminal in an
ixml grammar.

My thanks to Dave Pawson for sanity-checking the results and helping
persuade me not to be quite as pedantic as RFC 6350 as regards the
precise amount of whitespace allowed to follow the END:VCARD signal.

-Michael Sperberg-McQueen

p.s. A fuller description of the state of play with regard to ixml
processors and other infrastructure will be given in a talk at
Balisage 2022. If you haven't registered for Balisage yet, you still
have time to do so.

-----------------------------------------------------------------
5 attachments: input grammar, example card 1, example card 2,
sample output for card 1, sample output for card 2
-----------------------------------------------------------------
----- 1 The input grammar ---------------------------------------
-----------------------------------------------------------------

{ Generic vCard syntax

  Adapted from https://datatracker.ietf.org/doc/html/rfc6350#section-3.3
  by eliminating all specific keywords and just recording their
  values.

  0 first version
  1 fix ambiguity in value for case of a single field.
    hide CRLF for legibility
    hide ALPHA etc. for legibility
  2 allow invisible line breaks within quoted values (!)
    hide the BEGIN/END literals
  3 make 'name' more complicated to ensure that 'BEGIN' and 'END'
    are not recognized as names but force the recognition of
    vcard boundaries. There ought to be a simpler way.
  4 Hide magic character sequences, invisible line breaks, and
    other apparatus, since they appear to be working correctly.
    Also hide the internals of quoted parameter values.
  5 allow colons to be escaped, too,
    make the CRLF rules at the end of a card less pedantic.

  2022-06-24/2022-06-25, Michael Sperberg-McQueen and Dave Pawson
}

{ A collection of vcards is one or more cards. }
vcards = vcard+.

{ A vcard is a BEGIN:VCARD, a sequence of content lines, and an
  END:VCARD. The RFC requires that certain fields occur, and writes
  them into the grammar, but we will accept any fields as long as
  they conform to the generic syntax. }
vcard = -"BEGIN:VCARD", CRLF, property+, -"END:VCARD", CRLF*.

{ A property (the RFC calls it a 'contentline') is represented by an
  optional group marker, a property name, optional parameters, and
  one or more comma-separated values, ending with a CRLF. Note that
  the values may contain line-continuations and escaped characters
  (\, for comma, \; for semi-colon, \n for newline, \\ for
  backslash), which should be unescaped by the parser. }
property = (@group, -".")?, name, (-";", param)*, -":", value ++ -",", CRLF.

group = (ALPHA | DIGIT | "-")+.

{ In principle name could be very simple. But we want to
  distinguish normal names from x-names, and we want to ensure that
  BEGIN and END are not recognized as names but as keywords. So we
  have a more complicated definition. }
@name = not-an-x-name
      | not-begin
      | not-end
      | normal-name
      | x-name
      .

{ not-an-x-name, though it begins with X }
-not-an-x-name = ["Xx"], (~["-"], (ALPHA | DIGIT | "-")*)?.

{ not-begin, though it begins with B... }
-not-begin = "BEGI", (~["nN"], (ALPHA | DIGIT | "-")*)?
           | "BEG", (~["iI"], (ALPHA | DIGIT | "-")*)?
           | "BE", (~["gG"], (ALPHA | DIGIT | "-")*)?
           | "B", (~["eE"], (ALPHA | DIGIT | "-")*)?
           .

{ not-end, though it begins with E or EN }
-not-end = ["Ee"], ["Nn"], (~["Dd"], (ALPHA | DIGIT | "-")*)?
         | ["Ee"], (~["Nn"], (ALPHA | DIGIT | "-")*)?
         .

{ normal-name: does not look like x-name, begin, or end at any point }
-normal-name = ~["XxBbEe"], (ALPHA | DIGIT | "-")*.

{ The spec defines a lot of names as part of the grammar, and expects
  the parser to adjust its parsing based on the name. It also
  requires case insensitivity. But since all of the names
  explicitly specified also match the general rules for
  iana-tokens, the grammar is hopelessly ambiguous. So we leave
  recognition of the known fields and their semantics to the
  application.

  name = "SOURCE" | "KIND" | "FN" | "N" | "NICKNAME"
       | "PHOTO" | "BDAY" | "ANNIVERSARY" | "GENDER" | "ADR" | "TEL"
       | "EMAIL" | "IMPP" | "LANG" | "TZ" | "GEO" | "TITLE" | "ROLE"
       | "LOGO" | "ORG" | "MEMBER" | "RELATED" | "CATEGORIES"
       | "NOTE" | "PRODID" | "REV" | "SOUND" | "UID" | "CLIENTPIDMAP"
       | "URL" | "KEY" | "FBURL" | "CALADRURI" | "CALURI" | "XML"
       | iana-token | x-name.

  iana-token = (ALPHA | DIGIT | "-")+.
  { identifier registered with IANA }

  An alternative approach would be to assume that the input data
  uses only the known field names and x-names, and does not use any
  IANA-registered tokens. The cost : benefit ratio of that approach
  seems too high, so we go generic instead.
}

x-name = ["xX"], "-", (ALPHA | DIGIT | "-")+.
{ Names that begin with "x-" or "X-" are reserved for experimental
  use, not intended for released products, or for use in bilateral
  agreements. }

param = name, -"=", param-value ++ -",".
param-value = pv-char* | quoted-pvalue.
-quoted-pvalue = -DQUOTE, qpv-char*, -DQUOTE.
-pv-char = SAFE-CHAR | magic.
-qpv-char = QSAFE-CHAR | magic.

value = -field | field, (-";", field)+.
field = data-char*.
-data-char = non-special-char
           | -visible-blank
           | tab
           | magic
           .

{ 'Magic' sequences are those requiring special handling }
-magic = INVISIBLE-1
       | INVISIBLE-2
       | INVISIBLE-3
       | esc-NL { newline }
       | esc-SEMICOLON
       | esc-COLON
       | esc-COMMA
       | esc-BS
       .

-INVISIBLE-1 = CRLF, tab.
-INVISIBLE-2 = CRLF, invisible-blank.
-INVISIBLE-3 = INVISIBLE-2, visible-blank.

-non-special-char = ~[#0D; #0A; #09; #20; ";"; ","; #5C].
{ non-special = not whitespace (meaningful for line folding)
                not semicolon or comma or backslash (need escaping) }

-esc-NL = -"\n", +#0A.
-esc-SEMICOLON = -#5C, ";".
-esc-COLON = -#5C, ":".
-esc-COMMA = -#5C, ",".
-esc-BS = -#5C, #5C.

-visible-blank = #20.
-invisible-blank = -#20.
tab = -#09.

-SAFE-CHAR = WSP | "!" | [#23-#2B; #2D-#39; #3C-#5B; #5D-#7E] | NON-ASCII .
{ Any character except CTLs, DQUOTE, ";", ":", comma, and backslash }

-NON-ASCII = UTF8-2 | UTF8-3 | UTF8-4.

-QSAFE-CHAR = visible-blank | tab | "!" | [#23-#5B; #5D-#7E] | NON-ASCII .
{ Any character except CTLs, DQUOTE, backslash }

-CRLF = -#D?, -#A.
{ strictly speaking the RFC requires the #D }

WSP = SP | HTAB.
SP = #20.
HTAB = #09.
DQUOTE = -#22.
-ALPHA = ['A'-'Z'; 'a'-'z'].
-DIGIT = ['0'-'9'].
{ Definitions (adapted) from RFC 5234 }

{ Depending on how the ixml processor reads the file, this may or may
  not work correctly. Time will tell. }
UTF8-2 = [#C2-#DF], UTF8-tail.
UTF8-3 = #E0, [#A0-#BF], UTF8-tail
       | [#E1-#EC], UTF8-tail, UTF8-tail
       | #ED, [#80-#9F], UTF8-tail
       | [#EE-#EF], UTF8-tail, UTF8-tail
       .
UTF8-4 = #F0, [#90-#BF], UTF8-tail, UTF8-tail
       | [#F1-#F3], UTF8-tail, UTF8-tail, UTF8-tail
       | #F4, [#80-#8F], UTF8-tail, UTF8-tail
       .
-UTF8-tail = [#80-#BF].

-----------------------------------------------------------------
----- 2 Sample input data (1) -----------------------------------
----- (From Eliot Kimber's mail to xsl-list) --------------------
-----------------------------------------------------------------

BEGIN:VCARD
VERSION:3.0
N:Lastname;Surname
FN:Displayname
ORG:EVenX
URL:http://www.evenx.com/
EMAIL:info@xxxxxxxxx
TEL;TYPE=voice,work,pref:+49 1234 56788
ADR;TYPE=intl,work,postal,parcel:;;Wallstr. 1;Berlin;;12345;Germany
END:VCARD

-----------------------------------------------------------------
----- 3 Sample input data (2) -----------------------------------
----- (From https://github.com/ertant/vCard) --------------------
-----------------------------------------------------------------

BEGIN:VCARD
VERSION:4.0
N:Gump;Forrest;;;
FN:Forrest Gump
ORG:Bubba Gump Shrimp Co.
TITLE:Shrimp Man
PHOTO;MEDIATYPE=image/gif:http://www.example.com/dir_photos/my_photo.gif
TEL;TYPE=work,voice;VALUE=uri:tel:+11115551212
TEL;TYPE=home,voice;VALUE=uri:tel:+14045551212
ADR;TYPE=work;LABEL="100 Waters Edge\nBaytown, LA 30314\nUnited States of A
 merica":;;100 Waters Edge;Baytown;LA;30314;United States of America
ADR;TYPE=home;LABEL="42 Plantation St.\nBaytown, LA 30314\nUnited States of
 America":;;42 Plantation St.;Baytown;LA;30314;United States of America
EMAIL:forrestgump@xxxxxxxxxxx
REV:20080424T195243Z
END:VCARD

-----------------------------------------------------------------
----- 4 Sample output XML (1) -----------------------------------
-----------------------------------------------------------------

<vcards>
   <vcard>
      <property name="VERSION">
         <value>3.0</value>
      </property>
      <property name="N">
         <value>
            <field>Lastname</field>
            <field>Surname</field>
         </value>
      </property>
      <property name="FN">
         <value>Displayname</value>
      </property>
      <property name="ORG">
         <value>EVenX</value>
      </property>
      <property name="URL">
         <value>http://www.evenx.com/</value>
      </property>
      <property name="EMAIL">
         <value>info@xxxxxxxxx</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>voice</param-value>
            <param-value>work</param-value>
            <param-value>pref</param-value>
         </param>
         <value>+49 1234 56788</value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>intl</param-value>
            <param-value>work</param-value>
            <param-value>postal</param-value>
            <param-value>parcel</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>Wallstr. 1</field>
            <field>Berlin</field>
            <field/>
            <field>12345</field>
            <field>Germany</field>
         </value>
      </property>
   </vcard>
</vcards>

-----------------------------------------------------------------
----- 5 Sample output XML (2) -----------------------------------
-----------------------------------------------------------------

<vcards>
   <vcard>
      <property name="VERSION">
         <value>4.0</value>
      </property>
      <property name="N">
         <value>
            <field>Gump</field>
            <field>Forrest</field>
            <field/>
            <field/>
            <field/>
         </value>
      </property>
      <property name="FN">
         <value>Forrest Gump</value>
      </property>
      <property name="ORG">
         <value>Bubba Gump Shrimp Co.</value>
      </property>
      <property name="TITLE">
         <value>Shrimp Man</value>
      </property>
      <property name="PHOTO">
         <param name="MEDIATYPE">
            <param-value>image/gif</param-value>
         </param>
         <value>http://www.example.com/dir_photos/my_photo.gif</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>work</param-value>
            <param-value>voice</param-value>
         </param>
         <param name="VALUE">
            <param-value>uri</param-value>
         </param>
         <value>tel:+11115551212</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>home</param-value>
            <param-value>voice</param-value>
         </param>
         <param name="VALUE">
            <param-value>uri</param-value>
         </param>
         <value>tel:+14045551212</value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>work</param-value>
         </param>
         <param name="LABEL">
            <param-value>100 Waters Edge
Baytown, LA 30314
United States of America</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>100 Waters Edge</field>
            <field>Baytown</field>
            <field>LA</field>
            <field>30314</field>
            <field>United States of America</field>
         </value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>home</param-value>
         </param>
         <param name="LABEL">
            <param-value>42 Plantation St.
Baytown, LA 30314
United States ofAmerica</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>42 Plantation St.</field>
            <field>Baytown</field>
            <field>LA</field>
            <field>30314</field>
            <field>United States of America</field>
         </value>
      </property>
      <property name="EMAIL">
         <value>forrestgump@xxxxxxxxxxx</value>
      </property>
      <property name="REV">
         <value>20080424T195243Z</value>
      </property>
   </vcard>
</vcards>

-----------------------------------------------------------------

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
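
For the downstream renaming mentioned earlier (<ADR>...</> instead of <property name="ADR">...</>), a minimal sketch follows. It is written in Python with the standard library's xml.etree.ElementTree rather than as an XSLT stylesheet, and the function name rename_properties is hypothetical. It assumes every property name is also a legal XML element name, which holds for the names in the two sample cards but is not guaranteed by the grammar.

```python
import xml.etree.ElementTree as ET


def rename_properties(root: ET.Element) -> ET.Element:
    """Rename each <property name="X"> element to <X>, in place.

    Assumption: the value of @name is a valid XML element name.
    Any other attributes (such as group) are left untouched.
    """
    # Collect the elements first so that changing tags cannot
    # interfere with the traversal.
    for prop in list(root.iter("property")):
        name = prop.attrib.pop("name", None)
        if name:
            prop.tag = name
    return root


if __name__ == "__main__":
    sample = (
        "<vcards><vcard>"
        '<property name="FN"><value>Displayname</value></property>'
        '<property name="TEL"><param name="TYPE">'
        "<param-value>voice</param-value></param>"
        "<value>+49 1234 56788</value></property>"
        "</vcard></vcards>"
    )
    root = rename_properties(ET.fromstring(sample))
    print(ET.tostring(root, encoding="unicode"))
```

Parameters are left as <param name="..."> elements here; the same approach would apply to them if shorter markup is wanted.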