Re: [xsl] vcf to xml?

Subject: Re: [xsl] vcf to xml?
From: "C. M. Sperberg-McQueen cmsmcq@xxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sat, 25 Jun 2022 19:48:31 -0000
Norm Tovey-Walsh ndw@xxxxxxxxxx writes:

> "Pieter Masereeuw pieter@xxxxxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> writes:
>> I wonder why nobody is mentioning invisible XML here. See
>> https://invisiblexml.org/.
>
> Because if I mention it, I'll get interested in writing the grammar and
> I don't have the time! :-)

(Sigh.) Well, I guess it just goes to show that not all of us have as
much self-discipline as Norm does.  Dave Pawson and I did write the
grammar.

When an ixml processor is presented with the ixml grammar given below
and the vCard data in Eliot Kimber's earlier email, it produces the
output shown below.  To test this, interested readers can try the
invisible XML processors listed at https://invisiblexml.org/ -- perhaps
the simplest approach is to just paste the input and the grammar into
the text fields of jwiXML [1].  If you want a command-line interface,
Coffeepot [2] is your friend.  The nameless parser at [3] also offers a
web interface, but does not (yet) support the 'insertions' construct
recently added to the spec and used in the grammar below.  (Coming soon,
I hope.)

[1] https://github.com/johnlumley/jwiXML
[2] https://coffeepot.nineml.org/
[3] https://www.cwi.nl/~steven/ixml/tutorial/run.html

The XML produced by this grammar is less compact than the output from
Eliot's stylesheet, but the grammar does handle continuation lines,
parameters, parameters with multiple semicolon-separated values,
multiple comma-separated property values, and escaped colons,
semicolons, commas, newlines, and backslashes.  And if for downstream
processing one wants <ADR>...</> instead of <property name="ADR">...</>,
well, it's easy to write an XSLT stylesheet to do that.
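
As a rough illustration of that downstream step, here is a minimal
sketch in Python rather than XSLT (a swapped-in technique, not the
stylesheet alluded to above).  It assumes the output vocabulary shown
in the sample output (property elements carrying a name attribute) and
assumes that every property name is also a legal XML element name; the
function name rename_properties is hypothetical.

```python
import xml.etree.ElementTree as ET


def rename_properties(vcards_xml: str) -> str:
    """Turn <property name="ADR">...</property> into <ADR>...</ADR>."""
    root = ET.fromstring(vcards_xml)
    for prop in root.iter("property"):
        # The name attribute becomes the element name; any other
        # attributes (for example a group) are left where they are.
        # Assumption: the name is a valid XML element name.
        prop.tag = prop.attrib.pop("name")
    return ET.tostring(root, encoding="unicode")


sample = (
    '<vcards><vcard><property name="FN">'
    "<value>Displayname</value></property></vcard></vcards>"
)
print(rename_properties(sample))
# prints <vcards><vcard><FN><value>Displayname</value></FN></vcard></vcards>
```

ElementTree does not check that the new tag is a well-formed XML name,
so a real conversion would want to validate or normalize names first.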

The language of vCards is simple enough that you don't really need a
context-free grammar for it: there is no recursion.  But the rules are
complex enough that trying to do it with regular expressions would be a
challenge.  It's a huge help to be able to write separate rules for
things like 'invisible line break followed by a tab', 'invisible line
break followed by one blank', and 'invisible line break followed by two
blanks' -- all of which require different treatment.  If I were doing
this in XSLT, I would build up the regular expression bit by bit with
variables, which allow a similar separation of concerns, but are (in my
limited experience) less convenient than adding a nonterminal in an ixml
grammar.
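
To make the regular-expression alternative concrete, here is a minimal
sketch in Python (not XSLT) that builds the pattern from named pieces,
one per kind of invisible line break.  The treatments are one reading
of the grammar and should be taken as assumptions: a break before a tab
drops the break and keeps the tab, a break before one blank drops both,
and a break before two blanks keeps a single blank.  The names CRLF,
FOLD, and unfold are hypothetical.

```python
import re

# Like the grammar, tolerate a bare LF where the RFC requires CRLF.
CRLF = r"\r?\n"

# One named group per kind of invisible line break.  The two-blank
# case is listed before the one-blank case so that it is tried first.
FOLD = re.compile(
    rf"(?P<tab>{CRLF}\t)"
    rf"|(?P<two>{CRLF}  )"
    rf"|(?P<one>{CRLF} )"
)


def _treat(match: re.Match) -> str:
    """Pick the replacement according to which alternative matched."""
    if match.lastgroup == "tab":
        return "\t"  # assumption: drop the break, keep the tab
    if match.lastgroup == "two":
        return " "   # assumption: drop the break and the first blank
    return ""        # one blank: drop the break and the blank


def unfold(text: str) -> str:
    """Remove vCard line folding from text."""
    return FOLD.sub(_treat, text)


print(unfold("United States of A\r\n merica"))
# prints United States of America
```

Each new case costs a named group plus a branch in the replacement
function, which is roughly the bookkeeping that a new nonterminal in
the grammar avoids.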

My thanks to Dave Pawson for sanity-checking the results and helping
persuade me not to be quite as pedantic as RFC 6350 as regards the
precise amount of whitespace allowed to follow the END:VCARD signal.

-Michael Sperberg-McQueen

p.s. A fuller description of the state of play with regard to ixml
processors and other infrastructure will be given in a talk at Balisage
2022.  If you haven't registered for Balisage yet, you still have time
to do so.

-----------------------------------------------------------------
5 attachments:  input grammar, example card 1, example card 2,
sample output for card 1, sample output for card 2
-----------------------------------------------------------------
----- 1 The input grammar ---------------------------------------
-----------------------------------------------------------------
{ Generic vCard syntax
  Adapted from https://datatracker.ietf.org/doc/html/rfc6350#section-3.3
  by eliminating all specific keywords and just recording their values.

  0 first version
  1 fix ambiguity in value for case of a single field.
    hide CRLF for legibility
    hide ALPHA etc. for legibility
  2 allow invisible line breaks within quoted values (!)
    hide the BEGIN/END literals
  3 make 'name' more complicated to ensure that 'BEGIN' and 'END'
    are not recognized as names but force the recognition of vcard
    boundaries.  There ought to be a simpler way.
  4 Hide magic character sequences, invisible line breaks, and
    other apparatus, since they appear to be working correctly.
    Also hide the internals of quoted parameter values.
  5 allow colons to be escaped, too, make the CRLF rules at the
    end of a card less pedantic.

  2022-06-24/2022-06-25, Michael Sperberg-McQueen and Dave Pawson

}

                       { A collection of vcards is one or more cards. }
              vcards = vcard+.

                       { A vcard is a BEGIN:VCARD, a sequence of
                         content lines, and an END:VCARD.  The RFC
                         requires that certain fields occur, and
                         writes them into the grammar, but we will
                         accept any fields as long as they conform to
                         the generic syntax. }
               vcard = -"BEGIN:VCARD", CRLF,
                       property+,
                       -"END:VCARD", CRLF*.

                       { A property (the RFC calls it a 'contentline')
                         is represented by an optional group marker, a
                         property name, optional parameters, and one
                         or more comma-separated values, ending with a
                         CRLF. Note that the values may contain
                         line-continuations and escaped characters (\,
                         for comma, \; for semi-colon, \n for newline,
                         \\ for backslash), which should be unescaped
                         by the parser. }

              property = (@group, -".")?, name,
                         (-";", param)*,
                         -":", value ++ -",",
                         CRLF.

                 group = (ALPHA | DIGIT | "-")+.

                         { In principle name could be very simple.
                           But we want to distinguish normal names
                           from x-names, and we want to ensure that
                           BEGIN and END are not recognized as names
                           but as keywords. So we have a more complicated
                           definition. }
                 @name = not-an-x-name
		       | not-begin
		       | not-end
		       | normal-name
                       | x-name
                       .

                         { not-an-x-name, though it begins with X }
        -not-an-x-name = ["Xx"], (~["-"], (ALPHA | DIGIT | "-")*)?.

                         { not-begin, though it begins with B... }
            -not-begin = "BEGI", (~["nN"], (ALPHA | DIGIT | "-")*)?
	               | "BEG", (~["iI"], (ALPHA | DIGIT | "-")*)?
	               | "BE", (~["gG"], (ALPHA | DIGIT | "-")*)?
	               | "B", (~["eE"], (ALPHA | DIGIT | "-")*)?
		       .

                         { not-end, though it begins with E or EN }
              -not-end = ["Ee"], ["Nn"], (~["Dd"], (ALPHA | DIGIT | "-")*)?
	               | ["Ee"], (~["Nn"], (ALPHA | DIGIT | "-")*)?
		       .

                         { normal-name: does not look like x-name,
                           begin, or end at any point }
          -normal-name = ~["XxBbEe"], (ALPHA | DIGIT | "-")*.


{ The spec defines a lot of names as part of the grammar, and expects
  the parser to adjust its parsing based on the name.  It also requires
  case insensitivity.  But since all of the names explicitly specified
  also match the general rules for iana-tokens, the grammar is
  hopelessly ambiguous.  So we leave recognition of the known fields
  and their semantics to the application.

name  = "SOURCE" | "KIND" | "FN" | "N" | "NICKNAME"
      | "PHOTO" | "BDAY" | "ANNIVERSARY" | "GENDER" | "ADR" | "TEL"
      | "EMAIL" | "IMPP" | "LANG" | "TZ" | "GEO" | "TITLE" | "ROLE"
      | "LOGO" | "ORG" | "MEMBER" | "RELATED" | "CATEGORIES"
      | "NOTE" | "PRODID" | "REV" | "SOUND" | "UID" | "CLIENTPIDMAP"
      | "URL" | "KEY" | "FBURL" | "CALADRURI" | "CALURI" | "XML"
      | iana-token | x-name.

iana-token = (ALPHA | DIGIT | "-")+.
     { identifier registered with IANA }

An alternative approach would be to assume that the input data uses
only the known field names and x-names, and does not use any
IANA-registered tokens.  The cost : benefit ratio of that approach
seems too high, so we go generic instead.
}

              x-name = ["xX"], "-", (ALPHA | DIGIT | "-")+.
                       { Names that begin with "x-" or "X-" are
                         reserved for experimental use, not intended
                         for released products, or for use in
                         bilateral agreements. }

               param = name, -"=", param-value ++ -",".
         param-value = pv-char* | quoted-pvalue.
      -quoted-pvalue = -DQUOTE, qpv-char*, -DQUOTE.
            -pv-char = SAFE-CHAR | magic.
           -qpv-char = QSAFE-CHAR | magic.

               value = -field
                     | field, (-";", field)+.
               field = data-char*.

          -data-char = non-special-char
                     | -visible-blank
                     | tab
                     | magic
                     .
		       { 'Magic' sequences are those requiring special
                         handling }
            -magic   = INVISIBLE-1
                     | INVISIBLE-2
                     | INVISIBLE-3
                     | esc-NL { newline }
                     | esc-SEMICOLON
                     | esc-COLON
                     | esc-COMMA
                     | esc-BS
                     .

        -INVISIBLE-1 = CRLF, tab.
        -INVISIBLE-2 = CRLF, invisible-blank.
        -INVISIBLE-3 = INVISIBLE-2, visible-blank.
   -non-special-char = ~[#0D; #0A; #09; #20; ";"; ","; #5C].
                       { non-special =
                         not whitespace (meaningful for line folding)
                         not semicolon or comma or backslash (need
                             escaping)
                       }
             -esc-NL = -"\n", +#0A.
      -esc-SEMICOLON = -#5C, ";".
          -esc-COLON = -#5C, ":".
          -esc-COMMA = -#5C, ",".
             -esc-BS = -#5C, #5C.
      -visible-blank = #20.
    -invisible-blank = -#20.
                 tab = -#09.


          -SAFE-CHAR = WSP
	             | "!"
		     | [#23-#2B; #2D-#39; #3C-#5B; #5D-#7E]
	             | NON-ASCII
		     .
                       { Any character except CTLs, DQUOTE, ";", ":",
                         comma, and backslash }

          -NON-ASCII = UTF8-2 | UTF8-3 | UTF8-4.

         -QSAFE-CHAR = visible-blank
	             | tab
	             | "!"
		     | [#23-#5B; #5D-#7E]
		     | NON-ASCII
		     .
                       { Any character except CTLs, DQUOTE,
                         backslash }

               -CRLF = -#D?, -#A. { strictly speaking the RFC requires the #D }
                 WSP = SP | HTAB.
                  SP = #20.
                HTAB = #09.
              DQUOTE = -#22.
              -ALPHA = ['A'-'Z'; 'a'-'z'].
              -DIGIT = ['0'-'9'].

{ Definitions (adapted) from RFC 5234 }
{ Depending on how the ixml processor reads the file,
  this may or may not work correctly.  Time will tell. }

              UTF8-2 = [#C2-#DF], UTF8-tail.

              UTF8-3 = #E0, [#A0-#BF], UTF8-tail
                     | [#E1-#EC], UTF8-tail, UTF8-tail
                     | #ED, [#80-#9F], UTF8-tail
                     | [#EE-#EF], UTF8-tail, UTF8-tail
                     .
              UTF8-4 = #F0, [#90-#BF], UTF8-tail, UTF8-tail
                     | [#F1-#F3], UTF8-tail, UTF8-tail, UTF8-tail
                     | #F4, [#80-#8F], UTF8-tail, UTF8-tail
                     .
          -UTF8-tail = [#80-#BF].
-----------------------------------------------------------------
----- 2 Sample input data (1) -----------------------------------
----- (From Eliot Kimber's mail to xsl-list) --------------------
-----------------------------------------------------------------
BEGIN:VCARD
VERSION:3.0
N:Lastname;Surname
FN:Displayname
ORG:EVenX
URL:http://www.evenx.com/
EMAIL:info@xxxxxxxxx
TEL;TYPE=voice,work,pref:+49 1234 56788
ADR;TYPE=intl,work,postal,parcel:;;Wallstr. 1;Berlin;;12345;Germany
END:VCARD
-----------------------------------------------------------------
----- 3 Sample input data (2) -----------------------------------
----- (From https://github.com/ertant/vCard) --------------------
-----------------------------------------------------------------
BEGIN:VCARD
VERSION:4.0
N:Gump;Forrest;;;
FN:Forrest Gump
ORG:Bubba Gump Shrimp Co.
TITLE:Shrimp Man
PHOTO;MEDIATYPE=image/gif:http://www.example.com/dir_photos/my_photo.gif
TEL;TYPE=work,voice;VALUE=uri:tel:+11115551212
TEL;TYPE=home,voice;VALUE=uri:tel:+14045551212
ADR;TYPE=work;LABEL="100 Waters Edge\nBaytown, LA 30314\nUnited States of A
 merica":;;100 Waters Edge;Baytown;LA;30314;United States of America
ADR;TYPE=home;LABEL="42 Plantation St.\nBaytown, LA 30314\nUnited States of
 America":;;42 Plantation St.;Baytown;LA;30314;United States of America
EMAIL:forrestgump@xxxxxxxxxxx
REV:20080424T195243Z
END:VCARD
-----------------------------------------------------------------
----- 4 Sample output XML (1) -----------------------------------
-----------------------------------------------------------------
<vcards>
   <vcard>
      <property name="VERSION">
         <value>3.0</value>
      </property>
      <property name="N">
         <value>
            <field>Lastname</field>
            <field>Surname</field>
         </value>
      </property>
      <property name="FN">
         <value>Displayname</value>
      </property>
      <property name="ORG">
         <value>EVenX</value>
      </property>
      <property name="URL">
         <value>http://www.evenx.com/</value>
      </property>
      <property name="EMAIL">
         <value>info@xxxxxxxxx</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>voice</param-value>
            <param-value>work</param-value>
            <param-value>pref</param-value>
         </param>
         <value>+49 1234 56788</value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>intl</param-value>
            <param-value>work</param-value>
            <param-value>postal</param-value>
            <param-value>parcel</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>Wallstr. 1</field>
            <field>Berlin</field>
            <field/>
            <field>12345</field>
            <field>Germany</field>
         </value>
      </property>
   </vcard>
</vcards>
-----------------------------------------------------------------
----- 5 Sample output XML (2) -----------------------------------
-----------------------------------------------------------------
<vcards>
   <vcard>
      <property name="VERSION">
         <value>4.0</value>
      </property>
      <property name="N">
         <value>
            <field>Gump</field>
            <field>Forrest</field>
            <field/>
            <field/>
            <field/>
         </value>
      </property>
      <property name="FN">
         <value>Forrest Gump</value>
      </property>
      <property name="ORG">
         <value>Bubba Gump Shrimp Co.</value>
      </property>
      <property name="TITLE">
         <value>Shrimp Man</value>
      </property>
      <property name="PHOTO">
         <param name="MEDIATYPE">
            <param-value>image/gif</param-value>
         </param>
         <value>http://www.example.com/dir_photos/my_photo.gif</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>work</param-value>
            <param-value>voice</param-value>
         </param>
         <param name="VALUE">
            <param-value>uri</param-value>
         </param>
         <value>tel:+11115551212</value>
      </property>
      <property name="TEL">
         <param name="TYPE">
            <param-value>home</param-value>
            <param-value>voice</param-value>
         </param>
         <param name="VALUE">
            <param-value>uri</param-value>
         </param>
         <value>tel:+14045551212</value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>work</param-value>
         </param>
         <param name="LABEL">
            <param-value>100 Waters Edge
Baytown, LA 30314
United States of America</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>100 Waters Edge</field>
            <field>Baytown</field>
            <field>LA</field>
            <field>30314</field>
            <field>United States of America</field>
         </value>
      </property>
      <property name="ADR">
         <param name="TYPE">
            <param-value>home</param-value>
         </param>
         <param name="LABEL">
            <param-value>42 Plantation St.
Baytown, LA 30314
United States ofAmerica</param-value>
         </param>
         <value>
            <field/>
            <field/>
            <field>42 Plantation St.</field>
            <field>Baytown</field>
            <field>LA</field>
            <field>30314</field>
            <field>United States of America</field>
         </value>
      </property>
      <property name="EMAIL">
         <value>forrestgump@xxxxxxxxxxx</value>
      </property>
      <property name="REV">
         <value>20080424T195243Z</value>
      </property>
   </vcard>
</vcards>
-----------------------------------------------------------------

--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
