Introduction

Here, we will prepare ASA24 totals data for PCA and clustering analyses.
We will need to calculate average dietary data per person across all days (if desired), remove variables that have zero variance, and collapse variables by correlation (i.e., remove redundant variables that are highly correlated with others).
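
As a minimal sketch of what averaging per person across days means, the following uses a toy data frame with made-up values (the users and numbers are illustrative only and are not taken from the VVKAJ data). In this tutorial the averaged totals are loaded from a file that was prepared beforehand, so this only illustrates the idea.

```r
# Toy data: two hypothetical users with two recall days each (made-up values).
toy <- data.frame(
  UserName = c("A", "A", "B", "B"),
  PROT     = c(60, 80, 50, 70),
  TFAT     = c(40, 60, 30, 30)
)

# Average each nutrient column within each user (1 row = user afterwards).
toy_mean <- aggregate(cbind(PROT, TFAT) ~ UserName, data = toy, FUN = mean)
print(toy_mean)
```

After averaging, each user is represented by a single row, which is the layout referred to as "Averaged across days" in this section.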


File names include a short descriptor that indicates the data type and how the data were processed. The descriptors are defined as follows for each combination of the two types of dietary data (nutrients or food categories) and the two types of measurements (each day as is, or averaged across days).

Data type                             Nutrients (PROT - B12_ADD)   Food categories (F_TOTAL - A_DRINKS)
As is (1 row = user/day)              Nut_asis                     Cat_asis
Averaged across days (1 row = user)   Nut_ave                      Cat_ave


Load functions and packages

Set the path to the DietR directory, from which the input files are read.

main_wd <- "~/GitHub/DietR"

Load the necessary functions. The relative paths below assume that the working directory is main_wd.

source("lib/specify_data_dir.R")
source("lib/prep_data_for_clustering.R")

Load the color palette.

distinct100colors <- readRDS("lib/distinct100colors.rda")

You can return to the main directory with:

setwd(main_wd)


Import data and prepare them for analyses

Specify the directory where the data are located.

SpecifyDataDirectory(directory.name= "eg_data/VVKAJ/")

There may be some variables that you would like to omit before performing PCA.
Define which columns to drop.

drops <- c("FoodAmt", "KCAL", "MOIS")

Load the totals data (one row per user per day, not averaged), which include only the individuals retained in the QC-ed mean totals. Note that _m stands for “metadata added”, not “means”.

totals <- read.table("VVKAJ_Tot_m_QCed.txt", sep = "\t", header = TRUE)

Take only the columns whose names are NOT in the drops vector.

totals_2 <- totals[ , !(names(totals) %in% drops)]
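
The column-dropping idiom above can be tried on a small example. This is a sketch with a hypothetical toy data frame (made-up values). It also shows that a name listed in drops but absent from the data, such as MOIS here, is simply ignored rather than causing an error.

```r
# Toy data frame with made-up values; MOIS is deliberately absent.
toy <- data.frame(
  UserName = c("A", "B"),
  FoodAmt  = c(1500, 1800),
  KCAL     = c(2000, 2200),
  PROT     = c(70, 60)
)

drops <- c("FoodAmt", "KCAL", "MOIS")

# TRUE for columns to keep, FALSE for columns named in drops.
keep <- !(names(toy) %in% drops)

# drop = FALSE keeps the result a data frame even if one column remains.
toy_2 <- toy[, keep, drop = FALSE]
print(names(toy_2))
```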

Load the averaged and QC-ed totals, which have one row per participant with metadata.

totals_mean <- read.table("VVKAJ_Tot_mean_m_QCed.txt", sep = "\t", header = TRUE)

Take only the columns whose names are NOT in the drops vector.

totals_mean_2 <- totals_mean[ , !(names(totals_mean) %in% drops)]



NUTRIENTS: Use data as is

Obtain the column numbers for UserName, BMI, and the nutrient range from PROT (start_col) through B12_ADD (end_col).

userID_col <- match("UserName", names(totals_2))
BMI_col   <-  match("BMI"     , names(totals_2))
start_col <-  match("PROT"    , names(totals_2))
end_col   <-  match("B12_ADD" , names(totals_2))

Select the UserName, BMI, and nutrient variables.

user_BMI_nut <- totals_2[ , c(userID_col, BMI_col, start_col:end_col)]

Ensure user_BMI_nut has only the selected columns (variables).

colnames(user_BMI_nut)
##  [1] "UserName" "BMI"      "PROT"     "TFAT"     "CARB"     "ALC"     
##  [7] "CAFF"     "THEO"     "SUGR"     "FIBE"     "CALC"     "IRON"    
## [13] "MAGN"     "PHOS"     "POTA"     "SODI"     "ZINC"     "COPP"    
## [19] "SELE"     "VC"       "VB1"      "VB2"      "NIAC"     "VB6"     
## [25] "FOLA"     "FA"       "FF"       "FDFE"     "VB12"     "VARA"    
## [31] "RET"      "BCAR"     "ACAR"     "CRYP"     "LYCO"     "LZ"      
## [37] "ATOC"     "VK"       "CHOLE"    "SFAT"     "S040"     "S060"    
## [43] "S080"     "S100"     "S120"     "S140"     "S160"     "S180"    
## [49] "MFAT"     "M161"     "M181"     "M201"     "M221"     "PFAT"    
## [55] "P182"     "P183"     "P184"     "P204"     "P205"     "P225"    
## [61] "P226"     "VITD"     "CHOLN"    "VITE_ADD" "B12_ADD"
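
The selection above relies on two properties: match() returns NA when a column name is not found, and start_col:end_col only works as intended when the wanted columns are contiguous and in order. The following sketch, on a hypothetical toy data frame with made-up values, adds an explicit check before subsetting. The check is a suggestion and is not part of the DietR code.

```r
# Toy data frame (made-up values); Age sits between BMI and the nutrients.
toy <- data.frame(
  UserName = c("A", "B"),
  BMI      = c(22, 27),
  Age      = c(30, 40),
  PROT     = c(70, 60),
  TFAT     = c(50, 30),
  B12_ADD  = c(0, 1)
)

# Positions of the ID, BMI, and the first and last nutrient columns.
cols <- match(c("UserName", "BMI", "PROT", "B12_ADD"), names(toy))

# match() gives NA for a missing name; stop early rather than subset with NA.
stopifnot(!anyNA(cols))

# ID, BMI, then every column from PROT through B12_ADD.
selected <- toy[, c(cols[1], cols[2], cols[3]:cols[4])]
print(names(selected))
```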

Process this input, user_BMI_nut, for clustering analysis as follows.

  1. Take complete cases in your variables of interest,
  2. Save the original totals of the complete-case individuals as a .txt,
  3. Keep columns with non-zero variance,
  4. Remove the userID,
  5. Identify correlated variables and remove them,
  6. Save the data with uncorrelated variables as a .txt,
  7. Save correlation matrix as a .txt.

PrepForClustering(input_df = user_BMI_nut,
                  userID = "UserName",
                  original_totals_df= totals, 
                  complete_cases_fn=   "VVKAJ_Tot_m_QCed_Nut_asis_c.txt",
                  clustering_input_fn= "VVKAJ_Tot_m_QCed_Nut_asis_c_rv.txt",
                  corr_matrix_fn=      "VVKAJ_Tot_m_QCed_Nut_asis_c_corr_matrix.txt")
## user_BMI_nut has 45 rows and 65 variables.
## The following column(s) in user_BMI_nut have missing data shown below.
## Rows (samples) containing those missing data will be removed.
##  named numeric(0)
## No columns had zero variance. 
## The numeric data without ID has 45 rows and 64 variables.
## Clustering 64 features...getting means...choosing reps...collapsed from 64 to 39.
## After removing correlated variables, 45 rows and 39 variables remained.
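
To make the in-memory steps concrete, here is a simplified base R sketch on a hypothetical toy data frame (made-up values, file-saving steps omitted). It is not the implementation of PrepForClustering: the log above indicates that the function clusters features and chooses representatives, whereas this sketch swaps in a plain pairwise correlation filter. The 0.75 threshold is an assumption chosen for illustration.

```r
# Toy input: an ID, three numeric variables, and one constant column.
x1 <- 1:11
toy <- data.frame(
  UserName = paste0("U", 1:11),
  x1   = x1,
  x2   = 2 * x1 + 1,                               # perfectly correlated with x1
  x3   = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6, NA),     # last row has missing data
  zero = 0                                         # zero variance
)

# Take complete cases (drops the row with NA).
cc <- toy[complete.cases(toy), ]

# Remove the ID column, keeping only numeric variables.
num <- cc[, names(cc) != "UserName", drop = FALSE]

# Keep columns whose variance is greater than zero.
num <- num[, vapply(num, var, numeric(1)) > 0, drop = FALSE]

# Absolute correlation matrix of the remaining variables.
cm <- abs(cor(num))

# Flag a variable as redundant if it is highly correlated with any
# variable that appears earlier in the column order.
threshold <- 0.75                                  # assumed value
lower <- cm
lower[upper.tri(lower, diag = TRUE)] <- 0
redundant <- apply(lower, 1, function(r) any(r > threshold))

reduced <- num[, !redundant, drop = FALSE]
print(dim(reduced))
print(names(reduced))
```

In this toy case the constant column and x2 are removed, leaving 10 rows and the variables x1 and x3. A pairwise filter like this depends on column order, which is one reason a clustering-based collapse may be preferred in practice.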


NUTRIENTS: Take average of each user across all days

Obtain the column numbers for UserName, BMI, and the nutrient range from PROT (start_col) through B12_ADD (end_col) in totals_mean_2.

UserName_col <- match("UserName" , names(totals_mean_2)) 
BMI_col   <-    match("BMI"      , names(totals_mean_2)) 
start_col <-    match("PROT"     , names(totals_mean_2))  
end_col   <-    match("B12_ADD"  , names(totals_mean_2)) 

Select the UserName, BMI, and nutrient variables.

m_user_BMI_nut <- totals_mean_2[ , c(UserName_col, BMI_col, start_col:end_col)]

Process this input for clustering analyses.

PrepForClustering(input_df = m_user_BMI_nut,
                  userID = "UserName",
                  original_totals_df= totals_mean, 
                  complete_cases_fn=   "VVKAJ_Tot_mean_m_QCed_Nut_ave_c.txt",
                  clustering_input_fn= "VVKAJ_Tot_mean_m_QCed_Nut_ave_c_rv.txt",
                  corr_matrix_fn=      "VVKAJ_Tot_mean_m_QCed_Nut_ave_c_corr_matrix.txt")
## m_user_BMI_nut has 15 rows and 65 variables.
## The following column(s) in m_user_BMI_nut have missing data shown below.
## Rows (samples) containing those missing data will be removed.
##  named numeric(0)
## No columns had zero variance. 
## The numeric data without ID has 15 rows and 64 variables.
## Clustering 64 features...getting means...choosing reps...collapsed from 64 to 29.
## After removing correlated variables, 15 rows and 29 variables remained.


FOOD CATEGORIES: Use data as is

Obtain the column numbers for UserName, BMI, and the food category range from F_TOTAL (start_col) through A_DRINKS (end_col).

userID_col <- match("UserName" , names(totals_2))
BMI_col   <-  match("BMI"      , names(totals_2))
start_col <-  match("F_TOTAL"  , names(totals_2))
end_col   <-  match("A_DRINKS" , names(totals_2))

Select the UserName, BMI, and food category variables.

user_BMI_cat <- totals_2[ , c(userID_col, BMI_col, start_col:end_col)]

Process this input for clustering analyses.

PrepForClustering(input_df = user_BMI_cat,
                  userID = "UserName",
                  original_totals_df= totals, 
                  complete_cases_fn=   "VVKAJ_Tot_m_QCed_Cat_asis_c.txt",
                  clustering_input_fn= "VVKAJ_Tot_m_QCed_Cat_asis_c_rv.txt",
                  corr_matrix_fn=      "VVKAJ_Tot_m_QCed_Cat_asis_c_corr_matrix.txt")
## user_BMI_cat has 45 rows and 39 variables.
## The following column(s) in user_BMI_cat have missing data shown below.
## Rows (samples) containing those missing data will be removed.
##  named numeric(0)
## The following column(s) in  input_df_c_wo_ID  had zero variance and were removed. 
## The numeric data without ID now has 45 rows and 37 variables.
## PF_ORGAN 
##       23 
## Clustering 37 features...getting means...choosing reps...collapsed from 37 to 30.
## After removing correlated variables, 45 rows and 30 variables remained.


FOOD CATEGORIES: Take average of each user across all days

Obtain the column numbers for UserName, BMI, and the food category range from F_TOTAL (start_col) through A_DRINKS (end_col) in totals_mean_2.

UserName_col <- match("UserName" , names(totals_mean_2)) 
BMI_col   <-    match("BMI"      , names(totals_mean_2)) 
start_col <-    match("F_TOTAL"     , names(totals_mean_2))  
end_col   <-    match("A_DRINKS"  , names(totals_mean_2)) 

Select the UserName, BMI, and food category variables.

m_user_BMI_cat <- totals_mean_2[ , c(UserName_col, BMI_col, start_col:end_col)]

Process this input for clustering analyses.

PrepForClustering(input_df = m_user_BMI_cat,
                  userID = "UserName",
                  original_totals_df= totals_mean, 
                  complete_cases_fn=   "VVKAJ_Tot_mean_m_QCed_Cat_ave_c.txt",
                  clustering_input_fn= "VVKAJ_Tot_mean_m_QCed_Cat_ave_c_rv.txt",
                  corr_matrix_fn=      "VVKAJ_Tot_mean_m_QCed_Cat_ave_c_corr_matrix.txt")
## m_user_BMI_cat has 15 rows and 39 variables.
## The following column(s) in m_user_BMI_cat have missing data shown below.
## Rows (samples) containing those missing data will be removed.
##  named numeric(0)
## The following column(s) in  input_df_c_wo_ID  had zero variance and were removed. 
## The numeric data without ID now has 15 rows and 37 variables.
## PF_ORGAN 
##       23 
## Clustering 37 features...getting means...choosing reps...collapsed from 37 to 25.
## After removing correlated variables, 15 rows and 25 variables remained.



Return to the main directory before you start running another script.

setwd(main_wd)