Re: [xsl] How to output the characters denoted by the regex \p{P}

Subject: Re: [xsl] How to output the characters denoted by the regex \p{P}
From: "David Carlisle d.p.carlisle@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 28 Feb 2025 14:46:44 -0000
My first reply got blocked for being too long.

Unicode 16 appears to have 850 characters with category P...

<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xsl:output method="text"/>

 <xsl:variable name="u"
      select="doc('https://raw.githubusercontent.com/w3c/xml-entities/refs/heads/gh-pages/unicode.xml')"/>

 <xsl:template name="P">
  <xsl:value-of select="'Unicode', $u/*/@unicode,'&#10;'"/>
  <xsl:for-each
select="$u//character[unicodedata/@category[starts-with(.,'P')]]">
   <xsl:value-of select="
    '&#10;',
    replace(@id,'U0?','U+'),
    codepoints-to-string(@dec),
    description
    "/>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

then

saxon10 -it:P rc5.xsl

produces


Unicode 16

 U+0021 ! EXCLAMATION MARK
 U+0022 " QUOTATION MARK
 U+0023 # NUMBER SIGN
 U+0025 % PERCENT SIGN
 U+0026 & AMPERSAND
 U+0027 ' APOSTROPHE
...
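
A rough cross-check that does not need unicode.xml is to ask the processor's own regex engine which code points match \p{P}. The following is only a sketch under some assumptions: the template name count-P is made up for this example, the ranges are the XML 1.0 character ranges from U+0020 upwards (so that codepoints-to-string never sees an illegal character), and the answer depends on which Unicode version the processor's regex tables implement, so it need not agree with the count from unicode.xml.

```xslt
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <xsl:output method="text"/>

 <!-- Hypothetical template name; run it as the initial template. -->
 <xsl:template name="count-P">
  <!-- Code points that are legal XML 1.0 characters (U+0020 upwards)
       and whose one-character string matches \p{P}. -->
  <xsl:variable name="p"
      select="(32 to 55295, 57344 to 65533, 65536 to 1114111)
              [matches(codepoints-to-string(.), '^\p{P}$')]"/>
  <!-- Report how many code points the regex engine treats as punctuation. -->
  <xsl:value-of select="count($p)"/>
 </xsl:template>
</xsl:stylesheet>
```

With Saxon this can be run with -it:count-P as the initial template. If the number differs from the unicode.xml count, the likely cause is that the processor was built against an older Unicode version.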


On Fri, 28 Feb 2025 at 12:53, Roger L Costello costello@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> ChatGPT:
>
> > Please list all the characters denoted by the regular expression \p{P}
>
> The regular expression \p{P} is a Unicode property that matches any
> punctuation character. This includes a wide range of symbols used in
> writing to separate sentences and words, indicate pauses, or convey other
> structural or expressive functions. Here is a list of some common
> punctuation characters that \p{P} would match:
>
> Period: .
> Comma: ,
> Question mark: ?
> Exclamation mark: !
> Colon: :
> Semicolon: ;
> Quotation marks: " and '
> Apostrophe: '
> Hyphen: -
> Dash: — (em dash) and – (en dash)
> Parentheses: ( and )
> Brackets: [ and ]
> Braces: { and }
> Slash: / and \
> Vertical bar: |
> Ellipsis: …
> Angle brackets: < and >
>
> This list is not exhaustive, as Unicode includes many other punctuation
> marks used in various languages and scripts. The \p{P} property covers all
> such characters, including those specific to certain languages or
> typographic conventions.
>
> > How many characters are in \p{P}
>
> The exact number of characters matched by \p{P} can vary depending on the
> version of the Unicode standard being used, as new punctuation characters
> may be added in future updates. As of Unicode 15.0, which was released in
> September 2022, there are over 100 punctuation characters. This includes
> common punctuation marks like periods, commas, and quotation marks, as well
> as less common ones used in various languages and scripts.
>
> For the most precise and up-to-date count, you would need to refer to the
> Unicode Character Database for the specific version you are interested in.
> The database provides detailed information about each character, including
> its properties and classifications.
