[xsl] Re: cleanup of <div>-elements

Subject: [xsl] Re: cleanup of <div>-elements
From: "Piez, Wendell A. (Fed) wendell.piez@xxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 27 Feb 2023 17:12:57 -0000
Hi Monika,

Chris asks his question because the answer will determine how good your
solution can be.

In XSLT, what is easy to define is often easy to implement. The question
here is whether you can easily and deterministically distinguish between a
div element that should become a p and one that should stay a div. Answer
that question and the code is straightforward.

A rule to do this might be something like "any div that has a child `sub`, `a`
or untagged text becomes a p, while any other div (containing only the blocks)
stays a div".
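
A rule like this could be sketched in XSLT 1.0 roughly as follows. This is
only a minimal illustration under stated assumptions, not a tested solution
for the real data: the predicate lists exactly the children named in the rule
(`sub`, `a`, or non-whitespace text), and everything else passes through an
identity template.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy anything not matched more specifically. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A div with a child sub, a child a, or non-whitespace text becomes a p.
       Any other div falls through to the identity template and stays a div. -->
  <xsl:template match="div[sub or a or text()[normalize-space()]]">
    <p>
      <xsl:apply-templates select="@* | node()"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Note that a div mixing text with a nested div would come out as a p that
still contains a div, so this rule alone only fits data where such mixtures
do not occur.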

But how well this works depends on your data. That is one reason we validate
against schemas!

Regards, Wendell

From: Chris Papademetrious christopher.papademetrious@xxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, February 27, 2023 11:40 AM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] Re: cleanup of <div>-elements

Hi Monika,

Will the content between headings always be limited to known "block-level"
element types (p, ol, ul, etc.)?


- Chris

From: Madlik, Monika (LNG-VIE) monika.madlik@xxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, February 27, 2023 11:31 AM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] cleanup of <div>-elements

Hi,

I have a problem with an XML file that has to be converted.

The XML files I receive are semi-structured: they contain h1/h2 headings as
well as tables, lists, and so on. Paragraphs are usually tagged with <p>, but
not always. Sometimes the <p> is missing, and instead a strange construct of
<div> elements is wrapped around text and other elements.

Is there a way to unravel these div constructs without losing text or
structure? I need a <p> element around text and inline markup, e.g. strong
or italic text.

My problem is that the div elements can appear in any form and at any depth,
and several div elements may be wrapped around other div elements.

Example-XML:
<root>
  <h1>...</h1>
  <p>...</p>
  <ul>
    <li>...</li>
    <li>...</li>
  </ul>
  <div>
    <h1>...</h1>
    <h2>...</h2>
    <p>...</p>
    <h2>...</h2>
    <p>...</p>
    <h1>...</h1>
    <p>...</p>
    <h2>...</h2>
    <p>...</p>
    <div>
      <h1>...</h1>
      <div>...<sup><a href="#footnote-9" id="9" rel="footnote">[9]</a></sup></div>
    </div>
    <div>
      <br/> ... <strong>...</strong> ...<sup><a href="#footnote-10" id="10" rel="footnote">[10]</a></sup>
      <div>
        <h1>...</h1>
      </div>
    </div>
    <p>...</p>
  </div>
</root>

The highlighted part (the two inner <div> elements before the final <p>)
should look like this after my transformation:
<h1>...</h1>
<p>...<sup><a href="#footnote-9" id="9" rel="footnote">[9]</a></sup></p>
<p><br/> ... <strong>...</strong> ...<sup><a href="#footnote-10" id="10" rel="footnote">[10]</a></sup></p>
<h1>...</h1>
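
One possible way to get from the sample to this output is to unwrap every
div and group its children, wrapping each run of adjacent non-block nodes in
a p. The sketch below assumes an XSLT 2.0 processor and assumes that the
block-level elements are exactly h1, h2, p, ul, ol, table and div. That list
is a guess based on the sample and would need to match the real data. It also
assumes that attributes on the div elements can be dropped.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy anything not handled below. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap every div. Attributes on the div are dropped (assumption). -->
  <xsl:template match="div">
    <!-- Grouping key: true for assumed block-level children, else false. -->
    <xsl:for-each-group select="node()"
        group-adjacent="boolean(self::h1 or self::h2 or self::p or self::ul
                                or self::ol or self::table or self::div)">
      <xsl:choose>
        <!-- Blocks: process in place; nested divs recurse into this template. -->
        <xsl:when test="current-grouping-key()">
          <xsl:apply-templates select="current-group()"/>
        </xsl:when>
        <!-- Inline run containing an element or non-whitespace text: wrap in p. -->
        <xsl:when test="current-group()[self::* or self::text()[normalize-space()]]">
          <p>
            <xsl:apply-templates select="current-group()"/>
          </p>
        </xsl:when>
        <!-- Whitespace-only run between blocks: copy unchanged. -->
        <xsl:otherwise>
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

Because the div template calls itself through apply-templates, the nesting
depth of the div elements should not matter. Whitespace inside the generated
p elements is kept as it appears in the input.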


Thanks a lot,
Monika

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
