Shotgun metagenomics

mbtools supports aligning shotgun metagenomic reads. Before proceeding, we recommend preprocessing the reads as described in an earlier vignette.

library(mbtools)

Aligning reads

We use minimap2 throughout since it performs as well as other aligners but does not require building an index for the reference in a separate step. This way your reference database can simply be a (compressed) FASTA file.

As example data we will use 3 samples generated with the polyester read sampler from a set of 10 reference genomes at equal abundances.

Preparing the example data

Let’s create our file list for the example data and reference database:

fi <- system.file("extdata/shotgun", package = "mbtools") %>%
      find_read_files()
ref <- system.file("extdata/genomes/zymo_mock.fna.gz",
                   package = "mbtools")

This gives us 3 paired-end samples.

Aligning

As always, we will need a config object.

## $reference
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz"
## 
## $build_index
## [1] FALSE
## 
## $threads
## [1] 3
## 
## $alignment_dir
## [1] "alignments"
## 
## $max_hits
## [1] 100
## 
## $use_existing
## [1] FALSE
## 
## $limited_memory
## [1] FALSE
## 
## attr(,"class")
## [1] "config"
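
The printed config is a named list carrying the class `config`. Purely to illustrate that structure, the following base R sketch builds a similar list from defaults plus overrides. `make_align_config` and the shortened reference file name are hypothetical and not part of mbtools. The option names are copied from the output above, while the default of one thread is an assumption.

```r
# Hypothetical helper (not part of mbtools): merge user overrides into a
# set of defaults and tag the result with class "config".
make_align_config <- function(...) {
    defaults <- list(
        reference = NA,
        build_index = FALSE,
        threads = 1,                 # assumed default, not taken from mbtools
        alignment_dir = "alignments",
        max_hits = 100,
        use_existing = FALSE,
        limited_memory = FALSE
    )
    overrides <- list(...)
    # Reject misspelled or unsupported options early.
    unknown <- setdiff(names(overrides), names(defaults))
    if (length(unknown) > 0) {
        stop("unknown config option(s): ", paste(unknown, collapse = ", "))
    }
    cfg <- utils::modifyList(defaults, overrides)
    class(cfg) <- "config"
    cfg
}

cfg <- make_align_config(reference = "zymo_mock.fna.gz", threads = 3)
cfg$max_hits      # individual settings are read like list elements
cfg$threads <- 2  # and can be changed after creation
```

The sketch only mirrors the shape of the object. The config printed above remains the one used in the rest of this section.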

This config is sufficient to align the reads. As always, the first argument can also be an artifact returned by quality_control or preprocess.

## INFO [2019-05-28 16:44:18] Aligning 3 samples on 3 threads. Keeping up to 100 secondary alignments.
## INFO [2019-05-28 16:44:20] Finished aligning even1.
## INFO [2019-05-28 16:44:23] Finished aligning even2.
## INFO [2019-05-28 16:44:25] Finished aligning even3.
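
The first log message reflects two config settings: 3 threads and up to 100 secondary alignments (`max_hits`). The minimap2 command recorded in the logs at the end of this section passes them as `-t 3` and `-N 100`. As a rough sketch of that mapping, the base R snippet below reassembles such a command. `minimap2_command` is a hypothetical helper and not the function mbtools uses internally, the file names are shortened placeholders, and the remaining flags are copied from the logged command.

```r
# Hypothetical helper: turn alignment settings into a minimap2 command line.
minimap2_command <- function(reference, reads, threads = 3, max_hits = 100) {
    args <- c(
        "-acx", "sr",        # SAM output (-a), CIGAR (-c), short-read preset (-x sr)
        "-t", threads,       # number of threads
        "--secondary=yes",   # report secondary alignments
        "-N", max_hits,      # keep at most this many secondary alignments
        "-I", "100G",        # index batch size, as in the logged command
        shQuote(reference),  # reference FASTA (may be gzipped)
        shQuote(reads)       # one file, or forward and reverse files
    )
    paste(c("minimap2", args), collapse = " ")
}

cmd <- minimap2_command(
    reference = "zymo_mock.fna.gz",
    reads = c("even1_R1.fasta.gz", "even1_R2.fasta.gz")
)
cat(cmd, "\n")
```

Building the command as a string makes it easy to inspect or log before anything is executed, which is also what makes the `CMD` line in the logs useful for debugging.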

You will get an output artifact that logs the created alignments…

##       id            alignment success
## 1: even1 alignments/even1.bam    TRUE
## 2: even2 alignments/even2.bam    TRUE
## 3: even3 alignments/even3.bam    TRUE

…the size of all the alignments on disk…

print(alns$disk_size, units = "auto")
## 2.4 Mb

…and the logs in case something goes wrong.

cat(alns$logs[[1]])
## [M::mm_idx_gen::1.824*1.17] collected minimizers
## [M::mm_idx_gen::2.019*1.34] sorted minimizers
## [M::main::2.020*1.34] loaded/built the index for 10 target sequence(s)
## [M::mm_mapopt_update::2.020*1.34] mid_occ = 1000
## [M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 10
## [M::mm_idx_stat::2.102*1.33] distinct minimizers: 10112254 (98.45% are singletons); average occurrences: 1.040; average spacing: 5.997
## [M::worker_pipeline::2.234*1.35] mapped 12598 sequences
## [M::main] Version: 2.17-r941
## [M::main] CMD: minimap2 -acx sr -t 3 --secondary=yes -N 100 -I 100G /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/shotgun/even1_S1_L001_R1_001.fasta.gz /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/shotgun/even1_S1_L001_R2_001.fasta.gz
## [M::main] Real time: 2.307 sec; CPU: 3.088 sec; Peak RSS: 0.476 GB
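
Since the artifact records a `success` flag per sample and keeps the logs, a failed run can be traced without rerunning anything. The sketch below uses a stand-in list, not the real artifact: the values are made up, `even2` is marked as failed only to exercise the check, and it assumes that the logs are stored in the same order as the rows of the alignment table.

```r
# Stand-in for the artifact (made-up values for illustration only).
alns_example <- list(
    alignments = data.frame(
        id = c("even1", "even2", "even3"),
        alignment = file.path(
            "alignments", paste0(c("even1", "even2", "even3"), ".bam")
        ),
        success = c(TRUE, FALSE, TRUE),
        stringsAsFactors = FALSE
    ),
    logs = list("log for even1", "log for even2", "log for even3")
)

# Report every failed sample together with its log.
failed <- which(!alns_example$alignments$success)
for (i in failed) {
    message("alignment failed for ", alns_example$alignments$id[i], ":")
    message(alns_example$logs[[i]])
}

# Alignment files that are safe to pass on to later steps.
bam_files <- alns_example$alignments$alignment[alns_example$alignments$success]
```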