Subject: Re: [xsl] [Part 1] XML Design for Data Science Analysis
From: "Michael Kay michaelkay90@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 10 May 2026 22:38:54 -0000

I'm reminded of my first ever paid programming project, which was analyzing some data for an archaeologist. He had a few dozen data points and wanted to prove some conjecture by finding clusters. The first attempt didn't give the results he wanted, so he suggested applying weights to the data. That didn't work either, so he suggested different weights. At that point I suggested that if he told me what results he wanted, I could calculate the weights that would give the desired clustering.

With this suggestion the penny dropped: given a limited number of data points, you can prove anything you want.

And I learned a lesson that has guided my career ever since: the customer is often wrong.

Michael Kay
Saxonica

> On 10 May 2026, at 19:38, Roger L Costello costello@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Folks,
>
> Table 1 shows three people at your workplace. Which two would you say are "closest" to each other? In other words, which two would naturally "cluster" together?
>
> Table 1. Three people at your workplace
>
>   Person  Age  Kids  Income
>   A       36   3     $100,000
>   B       37   2     $80,000
>   C       22   0     $101,000
>
> At first glance, persons A and B seem more similar: they are both working parents in their mid-30s. Person C looks different: much younger, no kids, but with a high income.
>
> However, if the data is not scaled, the income variable can dominate most distance formulas. That can make persons A and C appear "closer" than A and B, simply because A and C have similar incomes.
>
> What does it mean to "scale" the data?
>
> Scaling means transforming variables so they are on comparable numeric ranges.
>
> You are not changing the meaning of the data. You are changing the units so one variable does not overpower the others.
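
The claim that income dominates can be checked with a few lines of arithmetic. The sketch below is in Python rather than XPath so that it runs standalone, and it assumes plain Euclidean distance, which is only one of the "distance formulas" the message could mean. The names `people`, `euclidean` and `pairwise` are illustrative and do not come from the original post.

```python
import math

# Table 1 from the quoted message: (age, kids, income in USD).
people = {
    "A": (36, 3, 100_000),
    "B": (37, 2, 80_000),
    "C": (22, 0, 101_000),
}

def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise(points):
    """Distance for every unordered pair of labels."""
    labels = sorted(points)
    return {
        (a, b): euclidean(points[a], points[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }

# Distances on the raw, unscaled values (assumption: Euclidean metric).
raw = pairwise(people)
closest_raw = min(raw, key=raw.get)

for pair, d in raw.items():
    print(pair, round(d, 2))
print("closest on raw data:", closest_raw)
```

On the raw values the A-B distance is about 20,000 and the A-C distance is about 1,000, so A and C come out as the closest pair. Almost all of each distance comes from the income difference; the age and kids columns barely register.
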
>
> Common scaling methods include:
>
>   - Min-max scaling
>   - Standardization, also called z-scores
>   - Normalization
>
> Example: min-max scaling
>
> A common approach is to map every variable into the range 0 to 1.
>
> The formula is:
>
>   scaled value = (x - minimum value) / (maximum value - minimum value)
>
> How would you represent Table 1 in XML, given the goal of scaling the data?
>
> Here is a conventional, row-oriented design:
>
> <Workplace>
>     <Staff>
>         <Person>A</Person>
>         <Age>36</Age>
>         <Kids>3</Kids>
>         <Income currency="USD">100000</Income>
>     </Staff>
>     <Staff>
>         <Person>B</Person>
>         <Age>37</Age>
>         <Kids>2</Kids>
>         <Income currency="USD">80000</Income>
>     </Staff>
>     <Staff>
>         <Person>C</Person>
>         <Age>22</Age>
>         <Kids>0</Kids>
>         <Income currency="USD">101000</Income>
>     </Staff>
> </Workplace>
>
> The XPath expressions for scaling are somewhat verbose because the values are distributed across multiple <Staff> elements instead of grouped together by feature.
>
> Still, XPath handles it quite nicely.
>
> Min-max scaling of ages:
>
> for $staff in /Workplace/Staff
> return
>     (xs:decimal($staff/Age) - min(/Workplace/Staff/Age ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Age ! xs:decimal(.)) -
>      min(/Workplace/Staff/Age ! xs:decimal(.)))
>
> Min-max scaling of kids:
>
> for $staff in /Workplace/Staff
> return
>     (xs:decimal($staff/Kids) - min(/Workplace/Staff/Kids ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Kids ! xs:decimal(.)) -
>      min(/Workplace/Staff/Kids ! xs:decimal(.)))
>
> Min-max scaling of incomes:
>
> for $staff in /Workplace/Staff
> return
>     (xs:decimal($staff/Income) - min(/Workplace/Staff/Income ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Income ! xs:decimal(.)) -
>      min(/Workplace/Staff/Income ! xs:decimal(.)))
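
As a cross-check on the three XPath expressions, here is a minimal Python sketch of the same min-max formula applied column by column, followed by the pairwise distances on the scaled rows. It is in Python only so that it runs without an XPath processor. The column-wise layout, the helper names `min_max` and `euclidean`, and the choice of Euclidean distance are assumptions made for illustration, not anything proposed in the thread.

```python
import math

# Table 1 from the quoted message, stored column-wise (feature by feature).
labels = ["A", "B", "C"]
columns = {
    "Age": [36, 37, 22],
    "Kids": [3, 2, 0],
    "Income": [100_000, 80_000, 101_000],
}

def min_max(values):
    """Map values onto [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column would mean dividing by zero.
        raise ValueError("min-max scaling is undefined for a constant column")
    return [(x - lo) / (hi - lo) for x in values]

scaled = {name: min_max(vals) for name, vals in columns.items()}

# Re-assemble one scaled row per person: (age, kids, income).
rows = {
    label: tuple(scaled[name][i] for name in columns)
    for i, label in enumerate(labels)
}

def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance for every unordered pair, now on the scaled values.
dist = {
    (a, b): euclidean(rows[a], rows[b])
    for i, a in enumerate(labels)
    for b in labels[i + 1:]
}
closest_scaled = min(dist, key=dist.get)

for label, row in rows.items():
    print(label, [round(v, 3) for v in row])
print("closest after scaling:", closest_scaled)
```

With every feature on the 0 to 1 range, A and B become the closest pair (distance about 1.01, against about 1.37 for A and C), which matches the first-glance intuition in the quoted message. With only three rows the minimum and maximum of each column are set by single individuals, so the result is quite sensitive to the data.
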