Ecological analysis with QIIME2
By George Kalogiannis & Balig Panossian, Designed from the official QIIME2 tutorials
You’ve reached the point in the workflow where you can confidently inspect amplicon reads, clean them up, and generate reliable ASV tables and taxonomic summaries. In other words, you now know what sequences are present and how many of each occur in every sample. The next step is to turn those counts into ecological insight: measuring within-sample diversity, comparing whole communities across treatments, and linking patterns to phylogenetic relationships. The sections that follow introduce the core ecological tools you’ll use—alpha and beta diversity metrics, phylogeny-based distances, and statistical tests that reveal how environmental or experimental factors shape the microbiome.
Learning Outcomes
By the end of this session you will be able to:
- Compute alpha- and beta-diversity metrics and visualise them in interactive .qzv files
- Construct a phylogenetic tree and run UniFrac-based ordination and PERMANOVA tests
- Export QIIME artifacts to R with and run statistical analyses on
Directories
Yesterday you worked in the directory called wednesday_outputs
. We will be using the files you created for today’s ecological analysis and will output files to a directory called thursday_outputs
.
mkdir thursday_outputs
Phylogenetics
A phylogenetic tree is a diagram that represents the evolutionary relationships among different organisms. By aligning and comparing the sequences, we can infer how closely related they are and construct a tree that reflects their shared evolutionary history. Some communities may appear different in terms of ASV presence but are composed of closely related taxa. To assess this, we construct a phylogenetic tree of all ASVs.
This tree is essential for certain types of diversity metrics, such as UniFrac, which measure not just the presence or absence of microbes, but how evolutionarily different the communities are. Understanding these relationships helps us interpret microbial functions and ecological roles more effectively. It’s also useful for certain types of diversity comparisons between samples.
Multiple sequence alignment
We begin by using Mafft, which aligns all representative ASV sequences so that homologous nucleotide positions are lined up across sequences:
qiime alignment mafft \ --i-sequences wednesday_outputs/rep-seqs-dada2-filtered.qza \ --o-alignment thursday_outputs/aligned-rep-seqs.qza
Output:
aligned-rep-seqs.qza
Masking sites
Masking is the process of removing highly variable, gappy, or uninformative positions from a multiple sequence alignment. These positions often arise from sequencing noise, misalignments, or non-homologous regions, and can distort phylogenetic inference by introducing noise into the tree-building process.
In QIIME 2, this command takes the aligned ASV sequences and filters out alignment columns (positions) that don’t contain reliable, conserved sequence information—leaving a cleaner dataset for more accurate and robust tree construction:
qiime alignment mask \ --i-alignment thursday_outputs/aligned-rep-seqs.qza \ --o-masked-alignment thursday_outputs/masked-aligned-rep-seqs.qza
Output:
masked-aligned-rep-seqs.qza
Creating a tree
FastTree builds a phylogenetic tree from the masked, aligned ASV sequences using an approximate maximum-likelihood method. This tree reflects the evolutionary relationships among ASVs and is essential for phylogeny-based diversity metrics like UniFrac.
The resulting tree is unrooted, meaning it shows relationships but not direction of ancestry:
qiime phylogeny fasttree \ --i-alignment thursday_outputs/masked-aligned-rep-seqs.qza \ --o-tree thursday_outputs/unrooted-tree.qza
Output:
unrooted-tree.qza
Midpoint rooting
Rooting the tree defines a starting point for evolutionary comparisons. Since our tree is initially unrooted (no known ancestor), we use midpoint rooting, which places the root at the midpoint of the longest distance between any two tips—giving a balanced view of divergence.
qiime phylogeny midpoint-root \ --i-tree thursday_outputs/unrooted-tree.qza \ --o-rooted-tree thursday_outputs/rooted-tree.qza
Output:
rooted-tree.qza
Visualisation
Once you’ve generated a rooted phylogenetic tree and cleaned, filtered abundance data, it’s time to visualise these relationships in an interactive, intuitive way. QIIME 2’s Empress plugin allows you to explore phylogenetic trees alongside sample metadata and taxonomic annotations. This is particularly powerful for understanding not only which taxa are present, but how community shifts map onto the evolutionary history of your organisms.
The tree-plot command displays the phylogenetic tree and overlays taxonomic information. This is useful for checking the structure of the tree and seeing how ASVs group by lineage. It’s also a helpful tool to explore diversity patterns across evolutionary lineages.
qiime empress tree-plot \ --i-tree thursday_outputs/rooted-tree.qza \ --m-feature-metadata-file wednesday_outputs/taxonomy.qza \ --o-visualization thursday_outputs/empress-tree-tax.qzv
Output:
empress-tree-tax.qzv
The community-plot command takes things further by integrating the phylogenetic tree, ASV abundance table, sample metadata, and taxonomy. This interactive plot lets you explore which lineages dominate particular samples or treatments, trace shifts in community composition, and highlight specific branches or taxa across groups.
qiime empress community-plot \ --i-tree thursday_outputs/rooted-tree.qza \ --i-feature-table wednesday_outputs/table-dada2-filtered.qza \ --m-sample-metadata-file /mnt/lustre/groups/WCHPC/wednesday_data/wednesday_metadata.tsv \ --m-feature-metadata-file wednesday_outputs/taxonomy.qza \ --o-visualization thursday_outputs/community-empress-tree-tax.qzv
Output:
community-empress-tree-tax.qzv
Diversity
Microbial ecology is the study of how microbes interact with each other and their environment. In agricultural settings, these interactions can profoundly affect crop health, nutrient cycling, and disease suppression. Ecological analysis allows us to characterize microbial diversity, compare communities between samples (like different soil types or plant treatments), and identify patterns that inform how microbial communities function and respond to changes.
Alpha Diversity: Within-Sample Diversity
Alpha diversity refers to the number and evenness of microbial species in a single sample. Three common metrics that Qiime2 can calculate are these:
Observed Features: The count of unique sequence variants or species.
qiime diversity alpha \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --p-metric observed_features \ --o-alpha-diversity thursday_outputs/alpha-observed-features.qza
Output:
alpha-observed-features.qza
Shannon Diversity Index: Takes into account both richness and evenness; higher values indicate more diverse communities.
qiime diversity alpha \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --p-metric shannon \ --o-alpha-diversity thursday_outputs/alpha-shannon.qza
Output:
alpha-shannon.qza
Faith’s Phylogenetic Diversity: Considers how evolutionarily diverse a community is, using a phylogenetic tree.
qiime diversity alpha-phylogenetic \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --i-phylogeny thursday_outputs/rooted-tree.qza \ --p-metric faith_pd \ --o-alpha-diversity thursday_outputs/alpha-faith-pd.qza
Output:
alpha-faith-pqd.qza
As we have seen throughout the day, Qiime .qza
files don’t contain data in an interpretable format. We thus want to export the alpha diversity metrics into three separate tables, using the export
function.
Repeat this for each sample:
qiime tools export \ --input-path thursday_outputs/alpha-<metric>.qza \ --output-path thursday_outputs/alpha-<metric>
Output: three directories containing alpha-diversity outputs.
Beta Diversity: Between-Sample Differences
Beta diversity compares microbial composition across samples. It helps you answer questions like “Do different fertilizers cause different microbial communities?” or “Are microbes in root zones different from those in bulk soil?”. Common metrics include:
Bray-Curtis Dissimilarity: Based on differences in counts. This abundance-based metric calculates dissimilarity between two communities based on the counts of shared features.
qiime diversity beta \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --p-metric braycurtis \ --o-distance-matrix thursday_outputs/beta-braycurtis.qza
Output:
beta-braycurtis.qza
Jaccard Distance: Based on shared presence/absence of species. This metric compares samples based only on presence or absence of features, ignoring their abundance.
qiime diversity beta \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --p-metric jaccard \ --o-distance-matrix thursday_outputs/beta-jaccard.qza
Output:
beta-jaccard.qza
UniFrac (weighted/unweighted): Measures how phylogenetically different two communities are.
a. Unweighted UniFrac
qiime diversity beta-phylogenetic \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --i-phylogeny thursday_outputs/rooted-tree.qza \ --p-metric unweighted_unifrac \ --o-distance-matrix thursday_outputs/beta-unweighted-unifrac.qza
Output:
beta-unweighted-unifrac.qza
b. Weighted UniFrac
qiime diversity beta-phylogenetic \ --i-table wednesday_outputs/table-dada2-filtered.qza \ --i-phylogeny thursday_outputs/rooted-tree.qza \ --p-metric weighted_unifrac \ --o-distance-matrix thursday_outputs/beta-weighted-unifrac.qza
Output:
beta-weighted-unifrac.qza
Lets export the Weighted UniFrac beta-diversity so we can then use it in R
qiime tools export \ --input-path thursday_outputs/beta-weighted-unifrac.qza \ --output-path thursday_outputs/beta-weighted-unifrac
Output:
beta-braycurtis/distance-matrix.tsv
Statistical Analysis of Diversity
It’s time to move the data outputs into the ARUA directory we were working in on Tuesday and open up R. Place the data into the data
directory, we will be running our code from the code
directory. You can bring the data onto your local machine using sftp
.
# Load libraries library(dplyr) library(vegan) library(readr) library(ggplot2) set.seed(719885) # Load data metadata <- read.csv("../data/wednesday_metadata.tsv", sep = "\t") alpha <- read.csv("../data/alpha-diversity.tsv", sep = "\t") distance_matrix <- read.csv("../data/distance-matrix.tsv", sep = "\t") # Prepare and merge metadata colnames(alpha)[1] <- "sample.id" colnames(distance_matrix)[1] <- "sample.id" colnames(alpha)[1] <- "sample.id" # Change row names <<<< R expects a true matrix (you need to remove any text columns) rownames(distance_matrix) <- distance_matrix[,1] distance_matrix$sample.id <- NULL metadata <- metadata %>% filter(Genotype != "Soil") %>% inner_join(alpha, by = "sample.id") %>% mutate( GenotypeGroup = case_when( Genotype == "phr1" ~ "PHR1", Genotype == "SPX1/SPX2" ~ "SPX", TRUE ~ "non_psr" ), GenotypeGroup = factor(GenotypeGroup), Genotype = factor(Genotype), Experiment = factor(Experiment) ) # Match sample IDs with distance matrix valid_ids <- intersect(rownames(distance_matrix), metadata$sample.id) distance_matrix <- distance_matrix[valid_ids, valid_ids] metadata <- metadata %>% filter(sample.id %in% valid_ids) %>% arrange(factor(sample.id, levels = rownames(distance_matrix))) # Alpha diversity model summary(lm(shannon_entropy ~ GenotypeGroup + Experiment, data = metadata)) # Beta diversity (PERMANOVA) adonis_result <- adonis2(as.dist(distance_matrix) ~ GenotypeGroup + Experiment, data = metadata) print(adonis_result) # Plot: Alpha diversity boxplot ggplot(metadata, aes(x = GenotypeGroup, y = shannon_entropy, fill = GenotypeGroup)) + geom_boxplot(alpha = 0.8) + labs(title = "Alpha Diversity (Shannon Index)", x = "Genotype Group", y = "Shannon Entropy") + theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
PCoA Ordination (Optional)
As an extra step, you can plot a Principal Coordinate Analysis (PCoA) of the Weighted UniFrac distance matrix:
You must first convert the distance matrix (e.g., beta-weighted-unifrac.qza) into a PCoA object:
qiime diversity pcoa \ --i-distance-matrix wednesday_outputs/beta-weighted-unifrac.qza \ --o-pcoa wednesday_outputs/beta-weighted-unifrac-pcoa.qza
Make the plot in Qiime:
qiime emperor plot \ --i-pcoa wednesday_outputs/beta-weighted-unifrac-pcoa.qza \ --m-metadata-file /mnt/lustre/groups/WCHPC/wednesday_data/wednesday_metadata.tsv \ --o-visualization wednesday_outputs/unifrac-emperor.qzv
Then you can export it and bring the output file into R to visualise:
qiime tools export \ --input-path wednesday_outputs/beta-weighted-unifrac-pcoa.qza \ --output-path wednesday_outputs/unifrac-pcoa
Now, in R (with the other files you had before):
pcoa <- read_delim("../data/unifrac-pcoa/ordination.txt", delim = "\t", skip = 9, col_names = FALSE) colnames(pcoa) <- c("sample.id", paste0("PC", 1:ncol(pcoa))) pcoa <- pcoa[-((nrow(pcoa)-1):nrow(pcoa)), ] plot_data <- left_join(pcoa, metadata, by = "sample.id") ggplot(plot_data, aes(x = PC1, y = PC2, color = Genotype)) + geom_point(size = 4, alpha = 0.8) + theme_minimal() + labs(x = "PC1", y = "PC2", color = "Lifestyle")
Other analyses
This tutorial covered a range of analyses that can be done with microbiome data but there are other types on analyses that can be done too.
- Functional analysis - Several packages attempt to impute function from taxonomy including PiCrust, Tax4fun, Piphillin
- Inferring ecological interaction networks -SPIEC-EASI. Co-Variance is an issue, but SPIEC-EASI attempts to model conditional independence.
- Data management tools - Qiita, and from ARS scientist Dan Manter myPhyloDB
- Set analysis - From ARS Scientist Devin Coleman-Derr MetaComet
- Other general analysis tools - Mothur and the R-based Phyloseq