7.13 K-Means Example: Wine Dataset

We can illustrate the application of k-means to a new dataset, one that also requires normalisation. The sample dataset contains observations from the chemical analysis of wines. The dataset is available from the UCI Machine Learning Repository.

The source dataset is not in our usual csv format, so a little pre-processing is required to get it into the form we need.

First, download the data from the repository. It consists of two files: the actual data and a file describing the columns.

wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data -O wine.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names -O wine.names

The data is in csv format but without a header, and so some command line tools will help transform it. First we get the variable names from the names file.

egrep '^\s+[0-9][0-9]?\)' wine.names |
  cut -d')' -f2 |
  tr 'A-Z' 'a-z' |
  awk '{printf("%s,", $1)}' |
  sed 's|,$||' |
  awk '{print}' > wine.header

That’s a bit of command line magic, but taking it one command at a time, and in combination with the contents of the names file, it should make sense. The standard GNU/Linux commands used are egrep, cut, tr, awk, and sed; the final awk simply ensures the output ends with a newline. The resulting single-line header is saved to the file wine.header.
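To see what each stage of the pipeline contributes, we can run the same commands over a small hypothetical snippet laid out like the attribute list in wine.names (the lines below are illustrative, not the actual file contents):

```shell
# A hypothetical snippet in the same layout as the attribute list in
# wine.names (illustrative only, not the actual file contents).
cat > sample.names <<'EOF'
The attributes are:
  1) Alcohol
  2) Malic acid
  3) Ash
EOF

egrep '^\s+[0-9][0-9]?\)' sample.names |  # keep the numbered attribute lines
  cut -d')' -f2 |                         # drop the leading "N)" label
  tr 'A-Z' 'a-z' |                        # lower-case the names
  awk '{printf("%s,", $1)}' |             # first word of each, comma separated
  sed 's|,$||' |                          # remove the trailing comma
  awk '{print}' > sample.header           # ensure a final newline

cat sample.header
```

Only the first word of each attribute is kept (so "Malic acid" becomes malic), which matches the abbreviated column names seen in the header above.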

Now we can prepend that header file to the data file. We first remove the first column, which is a wine class variable, using cut, then join the header and data with cat. A tee saves the result to the file wine.csv while also passing it on to head to display the top lines on the console:

cut -d"," -f2- wine.data |
    cat wine.header - |
    tee wine.csv |
    head -4

The top 4 lines are displayed below, confirming the dataset is now in the expected csv format:

alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
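The cut | cat | tee idiom above can be illustrated on a toy file, where hypothetical values stand in for the wine data and the first field plays the role of the class variable we wish to drop:

```shell
# A hypothetical two-row file standing in for wine.data; the first
# field is the class variable to be dropped.
cat > demo.data <<'EOF'
1,14.23,1.71
2,13.20,1.78
EOF
printf 'alcohol,malic\n' > demo.header

cut -d',' -f2- demo.data |   # drop the first (class) column
    cat demo.header - |      # prepend the header line
    tee demo.csv |           # save a copy to demo.csv
    head -3                  # while displaying the top lines
```

Because tee writes to the file and to standard output simultaneously, the saved csv and the console display are guaranteed to agree.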

Notice the scale of the different variables: they vary considerably. This is problematic for a distance-based algorithm like k-means, so we will normalise the data.

cat wine.csv |
    ml normalise kmeans |
    tee norm.csv |
    head -4

Compare the resulting data ranges:

alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
1.514,-0.561,0.231,-1.166,1.909,0.807,1.032,-0.658,1.221,0.251,0.361,1.843,1.01
0.246,-0.498,-0.826,-2.484,0.018,0.567,0.732,-0.818,-0.543,-0.292,0.405,1.11,0.963
0.196,0.021,1.106,-0.268,0.088,0.807,1.212,-0.497,2.13,0.268,0.317,0.786,1.391
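The output above is consistent with a z-score transformation of each column (subtract the column mean, divide by the standard deviation). Where mlhub is not installed, a similar transformation can be sketched with a two-pass awk; note this sketch uses the population standard deviation, and whether ml normalise uses the population or sample convention is an assumption here:

```shell
# A tiny hypothetical csv to normalise (illustrative values only).
cat > tiny.csv <<'EOF'
a,b
1,10
2,20
3,30
EOF

awk -F',' '
    NR==FNR {                          # first pass: accumulate column sums
        if (FNR==1) next               # skip the header
        for (i=1; i<=NF; i++) { s[i]+=$i; sq[i]+=$i*$i; n++ }
        next
    }
    FNR==1 { print; rows=n/NF; next }  # second pass: header through unchanged
    {
        for (i=1; i<=NF; i++) {
            m  = s[i]/rows
            sd = sqrt(sq[i]/rows - m*m)  # population sd; a constant
                                         # column would need a zero guard
            printf("%s%.3f", (i>1 ? "," : ""), ($i - m)/sd)
        }
        print ""
    }' tiny.csv tiny.csv > tiny_norm.csv

cat tiny_norm.csv
```

Reading the same file twice (tiny.csv tiny.csv) gives awk one pass to compute the column statistics and a second to emit the standardised values.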

This dataset is ready for a cluster analysis.



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0