7.13 K-Means Example: Wine Dataset
20211229
We can illustrate the application of k-means to a new dataset which also requires normalisation. The dataset contains observations from the chemical analysis of wine and is available from the UCI Machine Learning Repository.
The source dataset is not in our usual csv format, so a little pre-processing is needed to get it into the form we require.
First download the data from the repository. It consists of two files: the actual data and a file describing the columns.
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data -O wine.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names -O wine.names
The data is in csv format but without a header, so some command line tools will help transform it. First we extract the variable names from the names file.
egrep '^\s+[0-9][0-9]?)' wine.names |
cut -d')' -f2 |
tr 'A-Z' 'a-z' |
awk '{printf("%s,", $1)}' |
sed 's|,$||' |
awk '{print}' > wine.header
That's a bit of command line magic, but breaking it down one command at a time it should make sense in combination with seeing the contents of the names file. For reference, the standard Linux commands used include egrep, cut, tr, awk, and sed. The resulting single line header is saved to the file wine.header.
Now we can prepend that header to the data file, after removing the first column (a wine class variable) using cut and cat, saving the result to a file wine.csv with tee while also displaying the head on the console.
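The command itself might look as follows; this is a sketch of the step rather than the book's exact command, demonstrated here on tiny stand-in files so it runs anywhere (omit the two printf lines when working with the real wine.header and wine.data):

```shell
# Stand-in sample files so the sketch is self-contained; skip these two
# lines when using the real wine.header and wine.data downloaded above.
printf 'alcohol,malic,ash\n' > wine.header
printf '1,14.23,1.71,2.43\n2,13.2,1.78,2.14\n' > wine.data

# Drop the first column (the wine class) with cut, prepend the header
# with cat, save the result to wine.csv with tee, and show the head.
cut -d',' -f2- wine.data |
cat wine.header - |
tee wine.csv |
head -4
```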
The first four lines, displayed below, confirm the dataset is now in the expected csv format:
alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
Notice the scale of the different variables: they vary considerably. This is problematic for a distance-based algorithm like k-means, so we will normalise the data.
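The normalisation command is not shown here; one way to z-score each column from the command line is a small awk script. This is a sketch only: it uses the population standard deviation, whereas tools such as R's scale() divide by n-1, so the values will differ slightly from those shown below. It is demonstrated on an inline sample; replace the printf with `cat wine.csv` for the real data.

```shell
# Z-score normalise every column, preserving the header line.
# Demonstrated on a tiny inline sample; use `cat wine.csv |` instead
# of the printf for the real dataset.
printf 'a,b\n1,10\n2,20\n3,30\n' |
awk -F',' '
  NR==1 { header=$0; next }
  # Accumulate sums and sums of squares per column, buffering rows.
  { n++; rows[n]=$0; for (i=1; i<=NF; i++) { sum[i]+=$i; sq[i]+=$i*$i } }
  END {
    print header
    # Per-column mean and (population) standard deviation.
    for (i=1; i<=NF; i++) { m[i]=sum[i]/n; s[i]=sqrt(sq[i]/n - m[i]*m[i]) }
    # Emit each row with every value standardised to (x - mean) / sd.
    for (r=1; r<=n; r++) {
      nf = split(rows[r], f, ",")
      line = ""
      for (i=1; i<=nf; i++)
        line = line (i>1 ? "," : "") sprintf("%.3f", (f[i]-m[i])/s[i])
      print line
    }
  }' > wine_std.csv
cat wine_std.csv
```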
Compare the resulting data ranges:
alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
1.514,-0.561,0.231,-1.166,1.909,0.807,1.032,-0.658,1.221,0.251,0.361,1.843,1.01
0.246,-0.498,-0.826,-2.484,0.018,0.567,0.732,-0.818,-0.543,-0.292,0.405,1.11,0.963
0.196,0.021,1.106,-0.268,0.088,0.807,1.212,-0.497,2.13,0.268,0.317,0.786,1.391
This dataset is ready for a cluster analysis.
Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0