Ecological analysis with QIIME2

By George Kalogiannis & Balig Panossian, Designed from the official QIIME2 tutorials

You’ve reached the point in the workflow where you can confidently inspect amplicon reads, clean them up, and generate reliable ASV tables and taxonomic summaries. In other words, you now know what sequences are present and how many of each occur in every sample. The next step is to turn those counts into ecological insight: measuring within-sample diversity, comparing whole communities across treatments, and linking patterns to phylogenetic relationships. The sections that follow introduce the core ecological tools you’ll use—alpha and beta diversity metrics, phylogeny-based distances, and statistical tests that reveal how environmental or experimental factors shape the microbiome.

Learning Outcomes

By the end of this session you will be able to:

Compute alpha- and beta-diversity metrics and visualise them in interactive .qzv files
Construct a phylogenetic tree and run UniFrac-based ordination and PERMANOVA tests
Export QIIME artifacts to R with and run statistical analyses on

Directories

Yesterday you worked in the directory called wednesday_outputs. We will be using the files you created for today’s ecological analysis and will output files to a directory called thursday_outputs.

mkdir thursday_outputs

Phylogenetics

A phylogenetic tree is a diagram that represents the evolutionary relationships among different organisms. By aligning and comparing the sequences, we can infer how closely related they are and construct a tree that reflects their shared evolutionary history. Some communities may appear different in terms of ASV presence but are composed of closely related taxa. To assess this, we construct a phylogenetic tree of all ASVs.

This tree is essential for certain types of diversity metrics, such as UniFrac, which measure not just the presence or absence of microbes, but how evolutionarily different the communities are. Understanding these relationships helps us interpret microbial functions and ecological roles more effectively. It’s also useful for certain types of diversity comparisons between samples.

Multiple sequence alignment

We begin by using Mafft, which aligns all representative ASV sequences so that homologous nucleotide positions are lined up across sequences:

qiime alignment mafft \
 --i-sequences wednesday_outputs/rep-seqs-dada2-filtered.qza \
 --o-alignment thursday_outputs/aligned-rep-seqs.qza

Output: aligned-rep-seqs.qza

Masking sites

Masking is the process of removing highly variable, gappy, or uninformative positions from a multiple sequence alignment. These positions often arise from sequencing noise, misalignments, or non-homologous regions, and can distort phylogenetic inference by introducing noise into the tree-building process.

In QIIME 2, this command takes the aligned ASV sequences and filters out alignment columns (positions) that don’t contain reliable, conserved sequence information—leaving a cleaner dataset for more accurate and robust tree construction:
qiime alignment mask \
 --i-alignment thursday_outputs/aligned-rep-seqs.qza \
 --o-masked-alignment thursday_outputs/masked-aligned-rep-seqs.qza
Output: masked-aligned-rep-seqs.qza

Creating a tree

FastTree builds a phylogenetic tree from the masked, aligned ASV sequences using an approximate maximum-likelihood method. This tree reflects the evolutionary relationships among ASVs and is essential for phylogeny-based diversity metrics like UniFrac.

The resulting tree is unrooted, meaning it shows relationships but not direction of ancestry:
qiime phylogeny fasttree \
 --i-alignment thursday_outputs/masked-aligned-rep-seqs.qza \
 --o-tree thursday_outputs/unrooted-tree.qza
Output: unrooted-tree.qza

Midpoint rooting

Rooting the tree defines a starting point for evolutionary comparisons. Since our tree is initially unrooted (no known ancestor), we use midpoint rooting, which places the root at the midpoint of the longest distance between any two tips—giving a balanced view of divergence.

qiime phylogeny midpoint-root \
 --i-tree thursday_outputs/unrooted-tree.qza \
 --o-rooted-tree thursday_outputs/rooted-tree.qza

Output: rooted-tree.qza

Visualisation

Once you’ve generated a rooted phylogenetic tree and cleaned, filtered abundance data, it’s time to visualise these relationships in an interactive, intuitive way. QIIME 2’s Empress plugin allows you to explore phylogenetic trees alongside sample metadata and taxonomic annotations. This is particularly powerful for understanding not only which taxa are present, but how community shifts map onto the evolutionary history of your organisms.

The tree-plot command displays the phylogenetic tree and overlays taxonomic information. This is useful for checking the structure of the tree and seeing how ASVs group by lineage. It’s also a helpful tool to explore diversity patterns across evolutionary lineages.
qiime empress tree-plot \
 --i-tree thursday_outputs/rooted-tree.qza \
 --m-feature-metadata-file wednesday_outputs/taxonomy.qza \
 --o-visualization thursday_outputs/empress-tree-tax.qzv
Output: empress-tree-tax.qzv

The community-plot command takes things further by integrating the phylogenetic tree, ASV abundance table, sample metadata, and taxonomy. This interactive plot lets you explore which lineages dominate particular samples or treatments, trace shifts in community composition, and highlight specific branches or taxa across groups.
qiime empress community-plot \
 --i-tree thursday_outputs/rooted-tree.qza \
 --i-feature-table wednesday_outputs/table-dada2-filtered.qza \
 --m-sample-metadata-file /mnt/lustre/groups/WCHPC/wednesday_data/wednesday_metadata.tsv \
 --m-feature-metadata-file wednesday_outputs/taxonomy.qza \
 --o-visualization thursday_outputs/community-empress-tree-tax.qzv
Output: community-empress-tree-tax.qzv

Diversity

Microbial ecology is the study of how microbes interact with each other and their environment. In agricultural settings, these interactions can profoundly affect crop health, nutrient cycling, and disease suppression. Ecological analysis allows us to characterize microbial diversity, compare communities between samples (like different soil types or plant treatments), and identify patterns that inform how microbial communities function and respond to changes.

Alpha Diversity: Within-Sample Diversity

Alpha diversity refers to the number and evenness of microbial species in a single sample. Three common metrics that Qiime2 can calculate are these:

Observed Features: The count of unique sequence variants or species.
qiime diversity alpha \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --p-metric observed_features \
 --o-alpha-diversity thursday_outputs/alpha-observed-features.qza
Output: alpha-observed-features.qza

Shannon Diversity Index: Takes into account both richness and evenness; higher values indicate more diverse communities.
qiime diversity alpha \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --p-metric shannon \
 --o-alpha-diversity thursday_outputs/alpha-shannon.qza
Output: alpha-shannon.qza

Faith’s Phylogenetic Diversity: Considers how evolutionarily diverse a community is, using a phylogenetic tree.
qiime diversity alpha-phylogenetic \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --i-phylogeny thursday_outputs/rooted-tree.qza \
 --p-metric faith_pd \
 --o-alpha-diversity thursday_outputs/alpha-faith-pd.qza
Output: alpha-faith-pqd.qza

As we have seen throughout the day, Qiime .qza files don’t contain data in an interpretable format. We thus want to export the alpha diversity metrics into three separate tables, using the export function.

Repeat this for each sample:
qiime tools export \
 --input-path thursday_outputs/alpha-<metric>.qza \
 --output-path thursday_outputs/alpha-<metric>
Output: three directories containing alpha-diversity outputs.

Beta Diversity: Between-Sample Differences

Beta diversity compares microbial composition across samples. It helps you answer questions like “Do different fertilizers cause different microbial communities?” or “Are microbes in root zones different from those in bulk soil?”. Common metrics include:

Bray-Curtis Dissimilarity: Based on differences in counts. This abundance-based metric calculates dissimilarity between two communities based on the counts of shared features.
qiime diversity beta \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --p-metric braycurtis \
 --o-distance-matrix thursday_outputs/beta-braycurtis.qza
Output: beta-braycurtis.qza

Jaccard Distance: Based on shared presence/absence of species. This metric compares samples based only on presence or absence of features, ignoring their abundance.
qiime diversity beta \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --p-metric jaccard \
 --o-distance-matrix thursday_outputs/beta-jaccard.qza
Output: beta-jaccard.qza

UniFrac (weighted/unweighted): Measures how phylogenetically different two communities are.

a. Unweighted UniFrac

qiime diversity beta-phylogenetic \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --i-phylogeny thursday_outputs/rooted-tree.qza \
 --p-metric unweighted_unifrac \
 --o-distance-matrix thursday_outputs/beta-unweighted-unifrac.qza

Output: beta-unweighted-unifrac.qza

b. Weighted UniFrac

qiime diversity beta-phylogenetic \
 --i-table wednesday_outputs/table-dada2-filtered.qza \
 --i-phylogeny thursday_outputs/rooted-tree.qza \
 --p-metric weighted_unifrac \
 --o-distance-matrix thursday_outputs/beta-weighted-unifrac.qza

Output: beta-weighted-unifrac.qza

Lets export the Weighted UniFrac beta-diversity so we can then use it in R
qiime tools export \
 --input-path thursday_outputs/beta-weighted-unifrac.qza \
 --output-path thursday_outputs/beta-weighted-unifrac
Output: beta-braycurtis/distance-matrix.tsv

Statistical Analysis of Diversity

It’s time to move the data outputs into the ARUA directory we were working in on Tuesday and open up R. Place the data into the data directory, we will be running our code from the code directory. You can bring the data onto your local machine using sftp.

# Load libraries
library(dplyr)
library(vegan)
library(readr)
library(ggplot2)
set.seed(719885)

# Load data
metadata <- read.csv("../data/wednesday_metadata.tsv", sep = "\t")
alpha <- read.csv("../data/alpha-diversity.tsv", sep = "\t")
distance_matrix <- read.csv("../data/distance-matrix.tsv", sep = "\t")

# Prepare and merge metadata
colnames(alpha)[1] <- "sample.id"
colnames(distance_matrix)[1] <- "sample.id"
colnames(alpha)[1] <- "sample.id"

# Change row names <<<< R expects a true matrix (you need to remove any text columns)
rownames(distance_matrix) <- distance_matrix[,1]
distance_matrix$sample.id <- NULL

metadata <- metadata %>%
 filter(Genotype != "Soil") %>%
 inner_join(alpha, by = "sample.id") %>%
 mutate(
   GenotypeGroup = case_when(
     Genotype == "phr1" ~ "PHR1",
     Genotype == "SPX1/SPX2" ~ "SPX",
     TRUE ~ "non_psr"
   ),
   GenotypeGroup = factor(GenotypeGroup),
   Genotype = factor(Genotype),
   Experiment = factor(Experiment)
 )

# Match sample IDs with distance matrix
valid_ids <- intersect(rownames(distance_matrix), metadata$sample.id)
distance_matrix <- distance_matrix[valid_ids, valid_ids]
metadata <- metadata %>% filter(sample.id %in% valid_ids) %>%
 arrange(factor(sample.id, levels = rownames(distance_matrix)))

# Alpha diversity model
summary(lm(shannon_entropy ~ GenotypeGroup + Experiment, data = metadata))

# Beta diversity (PERMANOVA)
adonis_result <- adonis2(as.dist(distance_matrix) ~ GenotypeGroup + Experiment, data = metadata)
print(adonis_result)

# Plot: Alpha diversity boxplot
ggplot(metadata, aes(x = GenotypeGroup, y = shannon_entropy, fill = GenotypeGroup)) +
 geom_boxplot(alpha = 0.8) +
 labs(title = "Alpha Diversity (Shannon Index)",
      x = "Genotype Group",
      y = "Shannon Entropy") +
 theme_minimal() +
 theme(legend.position = "none",
       axis.text.x = element_text(angle = 45, hjust = 1))

PCoA Ordination (Optional)

As an extra step, you can plot a Principal Coordinate Analysis (PCoA) of the Weighted UniFrac distance matrix:

You must first convert the distance matrix (e.g., beta-weighted-unifrac.qza) into a PCoA object:
qiime diversity pcoa \
 --i-distance-matrix wednesday_outputs/beta-weighted-unifrac.qza \
 --o-pcoa wednesday_outputs/beta-weighted-unifrac-pcoa.qza

Make the plot in Qiime:

qiime emperor plot \
 --i-pcoa wednesday_outputs/beta-weighted-unifrac-pcoa.qza \
 --m-metadata-file /mnt/lustre/groups/WCHPC/wednesday_data/wednesday_metadata.tsv \
 --o-visualization wednesday_outputs/unifrac-emperor.qzv

Then you can export it and bring the output file into R to visualise:

qiime tools export \
  --input-path wednesday_outputs/beta-weighted-unifrac-pcoa.qza  \
  --output-path wednesday_outputs/unifrac-pcoa

Now, in R (with the other files you had before):

pcoa <- read_delim("../data/unifrac-pcoa/ordination.txt", delim = "\t", skip = 9, col_names = FALSE)

colnames(pcoa) <- c("sample.id", paste0("PC", 1:ncol(pcoa)))

pcoa <- pcoa[-((nrow(pcoa)-1):nrow(pcoa)), ]
plot_data <- left_join(pcoa, metadata, by = "sample.id")
ggplot(plot_data, aes(x = PC1, y = PC2, color = Genotype)) +
 geom_point(size = 4, alpha = 0.8) +
 theme_minimal() +
 labs(x = "PC1", y = "PC2", color = "Lifestyle")

Other analyses

This tutorial covered a range of analyses that can be done with microbiome data but there are other types on analyses that can be done too.

Functional analysis - Several packages attempt to impute function from taxonomy including PiCrust, Tax4fun, Piphillin
Inferring ecological interaction networks -SPIEC-EASI. Co-Variance is an issue, but SPIEC-EASI attempts to model conditional independence.
Data management tools - Qiita, and from ARS scientist Dan Manter myPhyloDB
Set analysis - From ARS Scientist Devin Coleman-Derr MetaComet
Other general analysis tools - Mothur and the R-based Phyloseq