In this section, we will take the phylogeny of food items into account in clustering individuals according to their dietary data. In order to do so, we will use the phyloseq package, which uses phylogeny of microbes and their abundance. We will replace microbes with food items consumed by our dietary study participants.
Name your main directory for future use.
main_wd <- "~/GitHub/DietR"
If you have not downloaded and installed the phyloseq package yet, you can do so by first installing BiocManager (if you have not done so):
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
Then download and install the phyloseq package.
BiocManager::install("phyloseq")
Install the devtools package necessary for installing the pairwiseAdonis package.
if (!require("devtools", quietly = TRUE))install.packages("devtools")
## Warning: package 'devtools' was built under R version 4.1.3
## Warning: package 'usethis' was built under R version 4.1.3
##
## Attaching package: 'devtools'
## The following object is masked from 'package:BiocManager':
##
## install
Install pairwise adonis function from Github. (https://github.com/pmartinezarbizu/pairwiseAdonis)
devtools::install_github("pmartinezarbizu/pairwiseAdonis/pairwiseAdonis")
load the necessary packages.
library(phyloseq)
##
## Attaching package: 'phyloseq'
## The following object is masked _by_ '.GlobalEnv':
##
## distance
library(ggtree)
library(pairwiseAdonis)
## Loading required package: vegan
## Loading required package: permute
##
## Attaching package: 'permute'
## The following object is masked from 'package:devtools':
##
## check
## Loading required package: lattice
## This is vegan 2.5-7
Load necessary functions and ggplot formatting themes.
source("lib/specify_data_dir.R")
source("lib/ordination.R")
source("lib/ggplot2themes.R")
source("lib/sort_IFC_by_ID.R")
source("lib/plot.axis.1to4.by.factor.R")
Load the distinct 100 colors for use.
distinct100colors <- readRDS("lib/distinct100colors.rda")
You can come back to the main directory by:
setwd(main_wd)
Specify the directory where the data is.
SpecifyDataDirectory(directory.name = "eg_data/VVKAJ/")
## [1] "The data directory has been set as"
## [1] "~/GitHub/DietR/eg_data/VVKAJ/"
Load IFC table, and sort the columnnames (userID), leaving the last column (taxonomy) intact. This dataframe will be saved as food
. Also, save food
as a .txt file to be used in the “correlation between Axes and foods” section.
SortIFCByID(ifc.input = "Foodtree/VVKAJ_Items_f_id_s_m_QCed_4Lv.food.ifc.txt",
outfn.for.corr.axis = "Foodtree/VVKAJ_Items_f_id_s_m_QCed_4Lv.food.ifc_sorted.txt")
food
is a matrix of Food descriptions (rows) x SampleID (columns).
head(food)[1:6, 1:4]
## vvkaj.00001 vvkaj.00002 vvkaj.00003 vvkaj.00004
## Milk NFS 0 0 0 0
## Milk whole 0 183 0 0
## Milk reduced fat 2 0 0 0 0
## Milk low fat 1 0 0 61 0
## Soy milk 0 0 0 0
## Almond milk sweetened 0 0 0 0
Format the food object and create an ifc_table called IFC
.
PrepFood(data= food)
tax <- read.delim("Foodtree/VVKAJ_Items_f_id_s_m_QCed_4Lv.tax.txt")
Format the tax file and create a taxonomy table called TAX
.
PrepTax(data= tax)
meta <- read.table( "ind_metadata_UserxDay.txt", sep="\t", header=T)
Format the metadata file and save it as SAMPLES
.
PrepMeta(data= meta)
foodtree <- read_tree("Foodtree/VVKAJ_Items_f_id_s_m_QCed_4Lv.tree.nwk")
It is OK to see a message saying that
“Found more than one class”phylo” in cache; using the first, from namespace ‘phyloseq’
Also defined by ‘tidytree’“.
Format foodtree and save it as TREE
.
PrepTree(data= foodtree)
Again, it is OK to see the same message as the previous line.
phyfoods <- phyloseq(IFC, TAX, SAMPLES, TREE)
It is OK to see a message (or multiple of them) saying that
Found more than one class “phylo” in cache; using the first, from namespace ‘phyloseq’
Also defined by ‘tidytree’.
Show the sample names and ensure they are vvkaj.00xxx.
sample_names(phyfoods)
## [1] "vvkaj.00001" "vvkaj.00002" "vvkaj.00003" "vvkaj.00004" "vvkaj.00005"
## [6] "vvkaj.00006" "vvkaj.00007" "vvkaj.00008" "vvkaj.00009" "vvkaj.00010"
## [11] "vvkaj.00011" "vvkaj.00012" "vvkaj.00013" "vvkaj.00014" "vvkaj.00015"
## [16] "vvkaj.00016" "vvkaj.00017" "vvkaj.00018" "vvkaj.00019" "vvkaj.00020"
## [21] "vvkaj.00021" "vvkaj.00022" "vvkaj.00023" "vvkaj.00024" "vvkaj.00025"
## [26] "vvkaj.00026" "vvkaj.00027" "vvkaj.00028" "vvkaj.00029" "vvkaj.00030"
## [31] "vvkaj.00031" "vvkaj.00032" "vvkaj.00033" "vvkaj.00034" "vvkaj.00035"
## [36] "vvkaj.00036" "vvkaj.00037" "vvkaj.00038" "vvkaj.00039" "vvkaj.00040"
## [41] "vvkaj.00041" "vvkaj.00042" "vvkaj.00043" "vvkaj.00044" "vvkaj.00045"
Show metadata.
head(sample_data(phyfoods), n=3)
## SampleID UserName StudyDayNo Diet Gender Age Weight Height
## vvkaj.00001 vvkaj.00001 VVKAJ101 1 Vegetarian M 31 79 186
## vvkaj.00002 vvkaj.00002 VVKAJ101 2 Vegetarian M 31 79 186
## vvkaj.00003 vvkaj.00003 VVKAJ101 3 Vegetarian M 31 79 186
## BMI Waist.Circumference
## vvkaj.00001 22.83501 80
## vvkaj.00002 22.83501 80
## vvkaj.00003 22.83501 80
Check the level 1 foods in your food tree.
L1s = tax_table(phyfoods)[, "L1"]
as.vector(unique(L1s))
## [1] "L1_Milk_and_Milk_Products"
## [2] "L1_Meat_Poultry_Fish_and_Mixtures"
## [3] "L1_Eggs"
## [4] "L1_Dry_Beans_Peas_Other_Legumes_Nuts_and_Seeds"
## [5] "L1_Grain_Product"
## [6] "L1_Fruits"
## [7] "L1_Vegetables"
## [8] "L1_Fats_Oils_and_Salad_Dressings"
## [9] "L1_Sugars_Sweets_and_Beverages"
Change to the folder called “Ordination” in your “VVKAJ” folder.
SpecifyDataDirectory(directory.name = "eg_data/VVKAJ/Ordination/")
Perform Principal Coordinate Analysis (PCoA) with weighted unifrac distance of your food data. Ordination by UNweighted unifrac distances can be done by having the “weighted” argument as FALSE. This may take a few minutes depending on your data size. e.g. a large phyloseq object (7.9 MB) takes ~ 1 min.
ordinated <- phyloseq::ordinate(phyfoods, method="PCoA", distance="unifrac", weighted=TRUE)
Save the percent variance explained as a txt file.
Eigen(eigen.input = ordinated$values$Relative_eig,
output.fn="VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_WEIGHTED_eigen_percent.txt")
Merge the first n axes to the metadata and save it as a txt file. This will be used for plotting ordination results.
MergeAxesAndMetadata(ord.object=ordinated, number.of.axes=10, meta.data= meta,
output.fn= "VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_WEIGHTED_meta_users.txt")
Read in the eigenvalues for axis labels of biplots.
eigen_loaded <- read.table("VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_WEIGHTED_eigen_percent.txt", header=T)
Make a vector that contains the variance explained.
eigen_loaded_vec <- eigen_loaded[, 2]
Read in the metadata and users’ Axis values.
meta_usersdf <- read.table("VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_WEIGHTED_meta_users.txt", header=T)
Change Diet to a factor so that factor levels will be displayed in order.
meta_usersdf$Diet <- factor(meta_usersdf$Diet,
levels= c("Vegetarian", "Vegan", "Keto", "American", "Japanese"))
Take a look at meta_usersdf that has been loaded.
head(meta_usersdf, 3)
## Row.names SampleID UserName StudyDayNo Diet Gender Age Weight
## 1 vvkaj.00001 vvkaj.00001 VVKAJ101 1 Vegetarian M 31 79
## 2 vvkaj.00002 vvkaj.00002 VVKAJ101 2 Vegetarian M 31 79
## 3 vvkaj.00003 vvkaj.00003 VVKAJ101 3 Vegetarian M 31 79
## Height BMI Waist.Circumference Axis.1 Axis.2 Axis.3
## 1 186 22.83501 80 -0.0442801041 0.02759311 -0.01355771
## 2 186 22.83501 80 -0.0007933678 -0.04075574 0.07943972
## 3 186 22.83501 80 -0.0842314600 0.02669820 -0.02079188
## Axis.4 Axis.5 Axis.6 Axis.7 Axis.8 Axis.9
## 1 -0.04497503 0.06643215 -0.04633301 0.03528304 -0.03953240 0.04564185
## 2 -0.02767431 -0.05829929 0.02566912 -0.01358921 -0.03447211 -0.11926416
## 3 -0.10074383 0.03985384 0.01108057 -0.02331700 -0.03071878 -0.04394001
## Axis.10
## 1 0.03792176
## 2 -0.04040412
## 3 -0.03060788
Save Axes 1 & 2, 1 & 3, 2 & 3, 3 & 4, 2 & 4 biplots with and without ellipses with specified confidence interval.
The results are saved with filenames with the specified “prefix_AxisXY.pdf” or “prefix_AxisXY_ellipses.pdf”. You need to supply the same number of colors in the order of the factor level to be used. dot.colors
are for datapoints, and ellipses.colors
are for ellipses outlines.
PlotAxis1to4ByFactor(axis.meta.df = meta_usersdf,
factor.to.color = "Diet",
eigen.vector = eigen_loaded_vec,
dot.colors = distinct100colors,
ellipses.colors = distinct100colors,
ellipses.cflevel = 0.95,
out.prefix = "VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_WEIGHTED_diet"
)
The function above saves 10 biplots in the PDF format. Let us take a look at Axis 1 vs. Axis 2.
It seems that Axis1 separated the Japanese diet group from the rest, and Axis2 separated the Keto diet group from the rest.
Some of the Diet groups seem to form distinct clusters. Use beta-diversity and adonis (permanova) tests to see if they are actually distinct from one another.
Generate a weighted unifrac distance matrix.
dist_matrix <- phyloseq::distance(phyfoods, method = "wunifrac")
Perform dispersion test.
vegan::betadisper computes centeroids and distance of each datapoint from it.
dispr <- vegan::betadisper(d=dist_matrix, phyloseq::sample_data(phyfoods)$Diet)
Show the centroids and dispersion of each group.
plot(dispr)
Use dispr to do a permutation test for homogeneity of multivariate dispersion. The set.seed function ensures the same permutation results will be obtained every time; otherwise, the p-values will slightly differ each run, as it is a permutation test.
set.seed(123)
vegan::permutest(dispr, perm=5000)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 5000
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 4 0.022031 0.0055078 2.6938 5000 0.04839 *
## Residuals 40 0.081785 0.0020446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If p>0.05, the dispersion of each group are not different, and the assumption for adonis is met. The results here indicate that the dispersion of each group may be different, so we should consider this information in discussion. Nevertheless, we will proceed for demonstration purposes.
Use adonis to test whether there is a difference between groups’ composition. i.e., composition among groups (food they consumed) is similar or not.
set.seed(123)
vegan::adonis(dist_matrix ~ phyloseq::sample_data(phyfoods)$Diet, permutations = 5000)
##
## Call:
## vegan::adonis(formula = dist_matrix ~ phyloseq::sample_data(phyfoods)$Diet, permutations = 5000)
##
## Permutation: free
## Number of permutations: 5000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2
## phyloseq::sample_data(phyfoods)$Diet 4 0.9401 0.235020 3.4251 0.25513
## Residuals 40 2.7447 0.068617 0.74487
## Total 44 3.6848 1.00000
## Pr(>F)
## phyloseq::sample_data(phyfoods)$Diet 2e-04 ***
## Residuals
## Total
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If overall adonis is significant, which is true in this case, you can run pairwise adonis to see which group pairs are different.
pairwise.adonis(dist_matrix, phyloseq::sample_data(phyfoods)$Diet, perm = 5000,
p.adjust.m = "none")
## pairs Df SumsOfSqs F.Model R2 p.value
## 1 Vegetarian vs Vegan 1 0.07557735 1.178911 0.06862547 0.20815837
## 2 Vegetarian vs Keto 1 0.13097005 1.888662 0.10557872 0.00739852
## 3 Vegetarian vs American 1 0.15189915 2.540675 0.13703252 0.00019996
## 4 Vegetarian vs Japanese 1 0.31482162 5.654587 0.26112652 0.00039992
## 5 Vegan vs Keto 1 0.19900379 2.430678 0.13188217 0.00019996
## 6 Vegan vs American 1 0.16773768 2.319599 0.12661843 0.00019996
## 7 Vegan vs Japanese 1 0.36815946 5.398094 0.25226984 0.00019996
## 8 Keto vs American 1 0.21818296 2.813418 0.14954314 0.00019996
## 9 Keto vs Japanese 1 0.32249324 4.391284 0.21535104 0.00019996
## 10 American vs Japanese 1 0.40135753 6.282904 0.28196075 0.00019996
## p.adjusted sig
## 1 0.20815837
## 2 0.00739852 *
## 3 0.00019996 **
## 4 0.00039992 **
## 5 0.00019996 **
## 6 0.00019996 **
## 7 0.00019996 **
## 8 0.00019996 **
## 9 0.00019996 **
## 10 0.00019996 **
This table indicates all the combination of the five diets are significantly different (p<0.05) except Vegetarian vs. Vegan. We expect Vegetarian and Vegan diets to be similar, so this makes sense.
Generate and save a weighted unifrac distance matrix of “Samples”.
WeightedUnifracDis(input.phyloseq.obj = phyfoods,
output.fn = "VVKAJ_Items_f_id_s_m_QCed_4Lv_WEIGHTED_uni_dis.txt")
You can perform Principal Coordinate Analysis (PCoA) with UNweighted unifrac distance of your food data.
ordinated_u = phyloseq::ordinate(phyfoods, method="PCoA", distance="unifrac", weighted=FALSE)
Use the same code above for creating plots, but now with ordinated_u for the ord.object argument, and change WEIGHTED to UNweighted, or an appropriate name for the method you selected.
Save the percent variance explained as a txt file.
Eigen(eigen.input = ordinated_u$values$Relative_eig,
output.fn="VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_UNweighted_eigen_percent.txt")
Merge the first n axes to the metadata and save it as a txt file. This will be used for plotting ordination results.
MergeAxesAndMetadata(ord.object=ordinated_u, number.of.axes=10, meta.data= meta,
output.fn= "VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_UNweighted_meta_users.txt")
Read in the eigenvalues for axis labels of biplots.
eigen_loaded <- read.table("VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_UNweighted_eigen_percent.txt", header=T)
Make a vector that contains the variance explained.
eigen_loaded_vec <- eigen_loaded[, 2]
Read in the metadata and users’ Axis values.
meta_usersdf <- read.table("VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_UNweighted_meta_users.txt", header=T)
Change Diet to a factor so that factor levels will be displayed in order.
meta_usersdf$Diet <- factor(meta_usersdf$Diet,
levels= c("Vegetarian", "Vegan", "Keto", "American", "Japanese"))
Take a look at meta_usersdf that has been loaded.
head(meta_usersdf, 3)
## Row.names SampleID UserName StudyDayNo Diet Gender Age Weight
## 1 vvkaj.00001 vvkaj.00001 VVKAJ101 1 Vegetarian M 31 79
## 2 vvkaj.00002 vvkaj.00002 VVKAJ101 2 Vegetarian M 31 79
## 3 vvkaj.00003 vvkaj.00003 VVKAJ101 3 Vegetarian M 31 79
## Height BMI Waist.Circumference Axis.1 Axis.2 Axis.3
## 1 186 22.83501 80 -0.09008072 0.12526150 0.224690714
## 2 186 22.83501 80 -0.10016605 -0.29241014 0.008973829
## 3 186 22.83501 80 -0.17290956 -0.04453934 0.065796514
## Axis.4 Axis.5 Axis.6 Axis.7 Axis.8 Axis.9
## 1 -0.09324589 0.008335857 -0.08133357 -0.1488169 0.02877081 -0.06084911
## 2 -0.19562671 -0.091767769 0.08512958 -0.1743513 0.05350784 -0.03272327
## 3 -0.18999341 0.134560378 0.14815162 -0.1078155 -0.13799218 0.25383971
## Axis.10
## 1 0.11004841
## 2 -0.04213860
## 3 0.03525768
Save Axes 1 & 2, 1 & 3, 2 & 3, 3 & 4, 2 & 4 biplots with and without ellipses with specified confidence interval. The results are saved with filenames with the specified “prefix_AxisXY.pdf” or “prefix_AxisXY_ellipses.pdf”. You need to supply the same number of colors in the order of the factor level to be used. dot.colors
are for datapoints, and ellipses.colors
are for ellipses outlines.
PlotAxis1to4ByFactor(axis.meta.df = meta_usersdf,
factor.to.color = "Diet",
eigen.vector = eigen_loaded_vec,
dot.colors = distinct100colors,
ellipses.colors = distinct100colors,
ellipses.cflevel = 0.95,
out.prefix = "VVKAJ_Items_f_id_s_m_QCed_4Lv_ord_UNweighted_diet"
)
Some of the Diet groups seem to form distinct clusters. Use beta-diversity and adonis tests to see if they are actually distinct from one another.
Generate a weighted unifrac distance matrix.
dist_matrix <- phyloseq::distance(phyfoods, method = "unifrac") # UNweighted
Dispersion test and plot
vegan::betadisper computes centeroids and distance of each datapoint from it.
dispr <- vegan::betadisper(d=dist_matrix, phyloseq::sample_data(phyfoods)$Diet)
Show the centroids and dispersion of each group.
plot(dispr)
Use dispr to do a permutation test for homogeneity of multivariate dispersion.
set.seed(123)
vegan::permutest(dispr, perm=5000)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 5000
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 4 0.047486 0.0118715 10.91 5000 2e-04 ***
## Residuals 40 0.043525 0.0010881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If p>0.05, the dispersion of each group are not different, and the assumption for adonis is met. The results here indicate that the dispersion of each group may be different, so we should consider this information in discussion. Nevertheless, we will proceed for demonstration purposes.
Use adonis to test whether there is a difference between groups’ composition. i.e., composition among groups (food they consumed) is similar or not.
set.seed(123)
vegan::adonis(dist_matrix ~ phyloseq::sample_data(phyfoods)$Diet, permutations = 5000)
##
## Call:
## vegan::adonis(formula = dist_matrix ~ phyloseq::sample_data(phyfoods)$Diet, permutations = 5000)
##
## Permutation: free
## Number of permutations: 5000
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2
## phyloseq::sample_data(phyfoods)$Diet 4 3.696 0.92399 2.2278 0.18219
## Residuals 40 16.590 0.41475 0.81781
## Total 44 20.286 1.00000
## Pr(>F)
## phyloseq::sample_data(phyfoods)$Diet 2e-04 ***
## Residuals
## Total
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If overall adonis is significant, which is true in this case, you can run pairwise adonis to see which group pairs are different.
pairwise.adonis(dist_matrix, phyloseq::sample_data(phyfoods)$Diet, perm = 5000,
p.adjust.m = "none")
## pairs Df SumsOfSqs F.Model R2 p.value
## 1 Vegetarian vs Vegan 1 0.4374127 0.9882078 0.05817022 0.50789842
## 2 Vegetarian vs Keto 1 0.6542830 1.5148347 0.08648867 0.00139972
## 3 Vegetarian vs American 1 0.7838799 1.7984651 0.10104608 0.00019996
## 4 Vegetarian vs Japanese 1 1.1927048 3.1013470 0.16236274 0.00039992
## 5 Vegan vs Keto 1 0.7691534 1.7648633 0.09934573 0.00059988
## 6 Vegan vs American 1 0.7535836 1.7136319 0.09674085 0.00019996
## 7 Vegan vs Japanese 1 1.2403751 3.1929421 0.16636022 0.00019996
## 8 Keto vs American 1 0.9196267 2.1434374 0.11813844 0.00019996
## 9 Keto vs Japanese 1 1.1584619 3.0666706 0.16083933 0.00019996
## 10 American vs Japanese 1 1.3304323 3.4855237 0.17887760 0.00019996
## p.adjusted sig
## 1 0.50789842
## 2 0.00139972 *
## 3 0.00019996 **
## 4 0.00039992 **
## 5 0.00059988 **
## 6 0.00019996 **
## 7 0.00019996 **
## 8 0.00019996 **
## 9 0.00019996 **
## 10 0.00019996 **
Similarly to the results with weighted unifrac distances, this table also indicates all the combination of the five diets are significantly different (p<0.05) except Vegetarian vs. Vegan. We expect Vegetarian and Vegan diets to be similar, so this makes sense.
Generate and save an unweighted unifrac distance matrix of “Samples”.
UnweightedUnifracDis(input.phyloseq.obj = phyfoods,
output.fn = "VVKAJ_Items_f_id_s_m_QCed_4Lv_UNweighted_uni_dis.txt")
Come back to the main directory before you start running another script.
setwd(main_wd)