Introduction

In this script, we will perform Principal Component Analysis (PCA) with the NHANES 2-day average totals, subsetted for males, 60-79 years old.

Load functions and packages

Name the path to DietR directory where input files are pulled.

main_wd <- "~/GitHub/DietR"

Load necessary packages.

library(ggplot2)
library(ggfortify)

Load the necessary functions.

source("lib/specify_data_dir.R")
source("lib/ggplot2themes.R")
source("lib/PCA.R")

You can come back to the main directory by:

setwd(main_wd)

Specify the directory where the data is.

SpecifyDataDirectory(directory.name = "eg_data/NHANES/Laboratory_data/")

Create two folders named “males60to79_Nut_PCA” and “males60to79_Cat_PCA” inside “Laboratory_data” to save PCA results of Nutrients and Food Category data, respectively.

PCA with nutrients and BMI

Your input data should be a dataframe with uncorrelated variables with non-zero variance and with no missing data.

pca_input <- read.table("QCtotal_d_ga_body_meta_glu_comp_2_males60to79_c_Nut_rv.txt", 
                        sep="\t", header=T)

Ensure your input file has the correct number of rows and columns.

dim(pca_input)

## [1] 233  38

colnames(pca_input)

##  [1] "BMXBMI" "SODI"   "CARB"   "FF"     "TFAT"   "PFAT"   "CHL"    "ATOC"  
##  [9] "ATOA"   "VARA"   "ACAR"   "BCAR"   "CRYP"   "LYCO"   "LZ"     "VB1"   
## [17] "VB2"    "NIAC"   "FDFE"   "VB12"   "B12A"   "VC"     "VD"     "CALC"  
## [25] "POTA"   "MAGN"   "ZINC"   "CAFF"   "THEO"   "ALCO"   "S140"   "S120"  
## [33] "M161"   "M201"   "M221"   "P184"   "P226"   "P225"

Perform PCA with the subset data, scaled.

scaled_pca <- prcomp(x= pca_input, scale= TRUE)

Specify the directory (folder that you created in “Laboratory_data”) to save the results.

res_dir_Nut = "males60to79_Nut_PCA"

Specify the prefix of filenames to be saved.

res_prefix_Nut = "males60to79_Nut_fn"

Save PCA output files in a specified folder (out.dir) and a prefix (out.prefix).

OutputPCA(pca.data= pca_input, pca.result= scaled_pca, 
          out.dir= res_dir_Nut, out.prefix= res_prefix_Nut)

As we saw in the PCA with ASA24 data, this code will generate several output files all at once in the res_dir: res_prefix_Nut.

Output file postfix	File content
_biplotdots.pdf PCA	biplot with the datapoints as dots and with the variables shown as arrows
_biplotlabeled.pdf	PCA biplot with the datapoints labed with numbers and with the variables shown as arrows
_biplotlabeledwoarrows.pdf	The datapoints labed with numbers; variables are hidden
_directions.pdf	Variables shown as arrows only; the datapoints are hidden
_loadings_PC1.pdf	Loadings of PC1 shown as a bar chart.
_loadings_PC2.pdf	Loadings of PC2 shown as a bar chart.
_PC_loadings.txt	Text file with loadings of each PC for the variables.
_PC_var_explained.txt	Amount of variation explained by each PC
_PCs.txt	The input file (totals/items) and the PCA results combined
_scree.pdf	PCA scree plot, showing the amount of variance explained by each PC.

Combine the input with all the variables and the PC results.

“input” is your items/totals input file with no missing data in your variables, but including the ones that are highly correlated and therefore were removed when processed into clustering input; however, these variables are going to be included here so that we will be able to take a look at all the variables and see which of them are correlated with the PCA axes and metadata etc.

SaveInputAndPCs(input="QCtotal_d_ga_body_meta_glu_comp_2_males60to79_c_Nut.txt", 
                pca.results = scaled_pca, 
                out.dir= res_dir_Nut, out.prefix= res_prefix_Nut)

[NOTE] Even though the input file has both nutrients (Nut) and food categories (Cat) data, PCA was done with only either Nut or Cat, not both.

Color-code datapoints on a biplot by a factor - GLU_index, in this case

Load the totals_males60to79 data with no missing data in your variables of interest.

totals_males60to79 <- read.table("QCtotal_d_ga_body_meta_glu_comp_2_males60to79_c_Nut.txt", 
                                 sep="\t", header=T)

Change GLU_index to a factor so that factors will be displayed in order.

totals_males60to79$GLU_index <- factor(totals_males60to79$GLU_index, 
                                       levels= c("Normal", "Prediabetic", "Diabetic"))

Use the autoplot function. Specify which PC to display in the x and y arguments.

food_Nut_PCA <- autoplot(scaled_pca, x=1, y=2,    
                         loadings=T, loadings.label=T, loadings.colour = 'grey50',  # loadings.label=T if want to see it
                         data = totals_males60to79,  size= 3 ) +     
  geom_point(size = 3, alpha = 1, na.rm = T, shape = 21, aes(fill= GLU_index)) +
  theme_bw(base_size = 12) + theme(aspect.ratio = 1) +  
  theme( panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_fill_manual( values= c("steelblue3", "yellow", "hotpink"))

food_Nut_PCA

Save as a .pdf.

ggsave("males60to79_Nut_PCA/males60to79_c_Nut_PCA_by_GLU_index_PC12.pdf", 
       food_Nut_PCA, device="pdf", width=7, height=6.5)

PCA with food categories and BMI