In this section, we are going to perform k-means analysis for each of Nutrient as is, Nutrient average, Food categories as is, and Food categories average.
Before proceeding, ensure to have four directories (folders) named as:
K-means analysis results will be saved in this folder. You can change those directory names, but if you do so, you will need to modify the directory name, “res_dir_xxx_xxxx”, to match your new directory name.
Name the path to DietR directory where input files are pulled.
main_wd <- "~/GitHub/DietR"
Load necessary packages.
library(ggplot2)
library(ggfortify)
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Set your ggplot2 theme.
theme_set(theme_bw(base_size = 14))
Import source code to run the analyses to follow.
source("lib/specify_data_dir.R")
source("lib/k-means.R")
You can come back to the main directory by:
setwd(main_wd)
Specify the directory where the data is.
SpecifyDataDirectory(directory.name= "eg_data/VVKAJ/")
Load Nut_asis data.
Tot_m_QCed_Nut_asis <- read.table(file="VVKAJ_Tot_m_QCed_Nut_asis_c_rv.txt", sep="\t", header=T)
Scale your input file and name it as k-means_input.
kmeans_input <- scale(Tot_m_QCed_Nut_asis)
Ensure your input file has the correct number of rows and columns.
dim(kmeans_input)
## [1] 45 39
Specify the directory (folder) to save the results.
res_dir_nut_asis = "kmeans_Nut_asis"
Specify the prefix of filenames to be saved.
res_prefix_nut_asis = "VVKAJ_Nut_asis"
Run the elbow, silhouette, and gap methods to find an optimum K (number of clusters). Do not alter the name of the input file: kmeans_input
. This function below assumes that the input is named as kmeans_input
. You can only run those three methods for K = 1 through (number of observations - 1). The gap method output will be printed on the Console. The gap values are plotted in xxx_gapmethod.pdf.
ChooseK(out.dir= res_dir_nut_asis, out.prefix= res_prefix_nut_asis)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = kmeans_input, FUNcluster = kmeans, K.max = k.values[length(k.values)], B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..14; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 13
## logW E.logW gap SE.sim
## [1,] 4.534176 4.805370 0.2711936 0.02298366
## [2,] 4.413634 4.690320 0.2766854 0.01970103
## [3,] 4.324135 4.612156 0.2880211 0.01952644
## [4,] 4.232962 4.546859 0.3138978 0.02019184
## [5,] 4.172491 4.489167 0.3166767 0.01997891
## [6,] 4.112269 4.437340 0.3250711 0.01906811
## [7,] 4.060030 4.389245 0.3292146 0.01932907
## [8,] 4.002075 4.343304 0.3412295 0.01957126
## [9,] 3.941963 4.298034 0.3560717 0.02008874
## [10,] 3.885026 4.253908 0.3688828 0.02016421
## [11,] 3.819605 4.209691 0.3900865 0.02020015
## [12,] 3.761350 4.166198 0.4048475 0.01974978
## [13,] 3.707321 4.122736 0.4154147 0.02010712
## [14,] 3.665302 4.079125 0.4138227 0.02003152
This code will generate three output files all at once in the res_dir_xxx_xxxx:
Output file postfix | File content |
---|---|
_elbowmethod.pdf | Shows total within−clusters sum of squares for each K. |
_gapmethod.pdf | Shows the Gap statistic (k) for each K, and if the factoextra package is used, the optimal K is marked by a dotted line. |
_silhouettemethod.pdf | Shows the Average silhouette width (~ goodness of fit) for each K, and if the factoextra package is used, the optimal K is marked by a dotted line. |
Look at the three figures generated by the ChooseK function above. The elbow and gap methods did not give a distinct peak, but the silhouette method gave a peak at K=4. K=2, 3, 4, and 5 also have relatively high silhouette width (~goodness of fit).
With specific K values in mind, perform k-means analysis with one specified K. This will save Dim1 x Dim2 plot as a .pdf file in your out.dir. Change the filename as necessary.
OneK(myK= 4, out.dir= res_dir_nut_asis, out.fn = "VVKAJ_Nut_asis_K4")
oneKplot
Or try multiple Ks and print the biplots in one panel. Likewise, This will save a Dim1 x Dim2 plot for each of the chosen K as a .pdf file in your out.dir. Change the filename to be saved as a PDF as necessary. This MultipleK function uses the factoextra and gridExtra packages.
MultipleK(myKs = c(2,3,4,5), out.dir = res_dir_nut_asis, out.fn = "VVKAJ_Nut_asis_K2-5")
Load Nut_ave data.
Tot_m_QCed_Nut_ave <- read.table(file="VVKAJ_Tot_mean_m_QCed_Nut_ave_c_rv.txt", sep="\t", header=T)
Scale your input file and name it as k-means_input.
kmeans_input <- scale(Tot_m_QCed_Nut_ave)
Ensure your input file has the correct number of rows and columns.
dim(kmeans_input)
## [1] 15 29
Specify the directory (folder) to save the results.
res_dir_nut_ave = "kmeans_Nut_ave"
Specify the prefix of filenames to be saved.
res_prefix_nut_ave = "VVKAJ_Nut_ave"
Run elbow, silhouette, and gap methods to find an optimum K (number of clusters).
ChooseK(out.dir= res_dir_nut_ave, out.prefix= res_prefix_nut_ave)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = kmeans_input, FUNcluster = kmeans, K.max = k.values[length(k.values)], B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..14; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 3.2630253 3.2663851 0.003359764 0.05234145
## [2,] 3.0784342 3.0438595 -0.034574707 0.05372168
## [3,] 2.9161106 2.8695076 -0.046603006 0.05471007
## [4,] 2.7752731 2.7095985 -0.065674514 0.05519485
## [5,] 2.6142074 2.5583486 -0.055858850 0.05523991
## [6,] 2.4625300 2.4007042 -0.061825789 0.05691960
## [7,] 2.2924520 2.2365856 -0.055866370 0.06063636
## [8,] 2.0889716 2.0602011 -0.028770473 0.06452527
## [9,] 1.8802611 1.8654634 -0.014797657 0.06983161
## [10,] 1.6202210 1.6415037 0.021282670 0.07686758
## [11,] 1.3401983 1.3770350 0.036836669 0.08485334
## [12,] 0.9895242 1.0433402 0.053816022 0.09208387
## [13,] 0.5150737 0.5792387 0.064165052 0.10424095
## [14,] -0.4562883 -0.1938366 0.262451756 0.12409388
With specific K values in mind, perform k-means analysis with one specified K.
OneK(myK= 5, out.dir= res_dir_nut_ave, out.fn = "VVKAJ_Nut_ave_K5")
oneKplot
Try multiple Ks and print the biplots in one panel.
MultipleK(myKs = c(2,3,4,5), out.dir = res_dir_nut_ave, out.fn = "VVKAJ_Nut_ave_K2-5")
Load Cat_asis data.
Tot_m_QCed_Cat_asis <- read.table(file="VVKAJ_Tot_m_QCed_Cat_asis_c_rv.txt", sep="\t", header=T)
Scale your input file and name it as k-means_input.
kmeans_input <- scale(Tot_m_QCed_Cat_asis)
Ensure your input file has the correct number of rows and columns.
dim(kmeans_input)
## [1] 45 30
Specify the directory (folder) to save the results.
res_dir_cat_asis = "kmeans_Cat_asis"
Specify the prefix of filenames to be saved.
res_prefix_cat_asis = "VVKAJ_Cat_asis"
Run elbow, silhouette, and gap methods to find an optimum K (number of clusters).
ChooseK(out.dir= res_dir_cat_asis, out.prefix= res_prefix_cat_asis)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = kmeans_input, FUNcluster = kmeans, K.max = k.values[length(k.values)], B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..14; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 4.418470 4.704881 0.2864109 0.02523690
## [2,] 4.324929 4.588765 0.2638362 0.01947462
## [3,] 4.239884 4.525344 0.2854598 0.01838547
## [4,] 4.170741 4.469961 0.2992198 0.01916238
## [5,] 4.115400 4.419720 0.3043203 0.01845343
## [6,] 4.073602 4.372798 0.2991961 0.01814911
## [7,] 4.022522 4.328058 0.3055360 0.01864750
## [8,] 3.972951 4.283995 0.3110447 0.01873790
## [9,] 3.922178 4.241455 0.3192770 0.01925196
## [10,] 3.877252 4.198902 0.3216500 0.02015669
## [11,] 3.826764 4.156332 0.3295683 0.02069141
## [12,] 3.783749 4.114248 0.3304992 0.02096826
## [13,] 3.724537 4.071392 0.3468548 0.02156035
## [14,] 3.672544 4.028501 0.3559563 0.02157291
With specific K values in mind, perform k-means analysis with one specified K.
OneK(myK=3, out.dir= res_dir_cat_asis, out.fn = "VVKAJ_Cat_asis_K3")
oneKplot
Try multiple Ks and print the biplots in one panel.
MultipleK(myKs = c(3,4,5,6), out.dir = res_dir_cat_asis, out.fn = "VVKAJ_Cat_asis_K3-6")
Load Cat_ave data.
Tot_m_QCed_Cat_ave <- read.table(file="VVKAJ_Tot_mean_m_QCed_Cat_ave_c_rv.txt", sep="\t", header=T)
Scale your input file and name it as k-means_input.
kmeans_input <- scale(Tot_m_QCed_Cat_ave)
Ensure your input file has the correct number of rows and columns.
dim(kmeans_input)
## [1] 15 25
Specify the directory (folder) to save the results.
res_dir_cat_ave = "kmeans_Cat_ave"
Specify the prefix of filenames to be saved.
res_prefix_cat_ave = "VVKAJ_Cat_ave"
Run elbow, silhouette, and gap methods to find an optimum K (number of clusters).
ChooseK(out.dir= res_dir_cat_ave, out.prefix= res_prefix_cat_ave)
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = kmeans_input, FUNcluster = kmeans, K.max = k.values[length(k.values)], B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..14; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 3.19066329 3.2065563 1.589302e-02 0.04676962
## [2,] 3.00741139 3.0073421 -6.932819e-05 0.05303620
## [3,] 2.84352271 2.8345688 -8.953876e-03 0.05360053
## [4,] 2.69703510 2.6806427 -1.639236e-02 0.05241182
## [5,] 2.55196510 2.5335184 -1.844667e-02 0.05350532
## [6,] 2.42386845 2.3824293 -4.143910e-02 0.05665905
## [7,] 2.28405617 2.2215331 -6.252309e-02 0.06034179
## [8,] 2.12673548 2.0470466 -7.968886e-02 0.06266336
## [9,] 1.94250302 1.8514305 -9.107256e-02 0.06602434
## [10,] 1.72955188 1.6278869 -1.016650e-01 0.07192173
## [11,] 1.46373426 1.3661157 -9.761852e-02 0.07723756
## [12,] 1.11437462 1.0370556 -7.731898e-02 0.08325284
## [13,] 0.64687840 0.5735845 -7.329393e-02 0.09188411
## [14,] -0.08712478 -0.2004423 -1.133175e-01 0.12555598
With specific K values in mind, perform k-means analysis with one specified K.
OneK(myK= 4, out.dir= res_dir_cat_ave, out.fn = "VVKAJ_Cat_ave_K4")
oneKplot
Try multiple Ks and print the biplots in one panel.
MultipleK(myKs = c(2,3,4,6), out.dir = res_dir_cat_ave, out.fn = "VVKAJ_Cat_ave_K2-4&6")
Come back to the main directory before you start running another script.
setwd(main_wd)