Long read metagenomics

mbtools has support for aligning long metagenomic reads for Oxford Nanopore data. Before proceeding we recommend you preprocess the reads first as described in an earlier vignette.

library(mbtools)

Aligning reads

We use minimap2 for everything since it performs as good as other aligners but does not require explicit building of the reference. This way your reference database can just be a (compressed) fasta file.

As example data we will use 3 samples which are subsampled data from MinION sequencing for a list of 10 reference genomes in equal abundances.

Preparing the example data

Let’s create our file list for the example data and reference database:

fi <- system.file("extdata/nanopore", package = "mbtools") %>%
      find_read_files()
ref <- system.file("extdata/genomes/zymo_mock.fna.gz",
                   package = "mbtools")

Which are 3 single-end files.

Aligning

As always we will need A config object.

## $reference
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz"
## 
## $build_index
## [1] FALSE
## 
## $threads
## [1] 3
## 
## $alignment_dir
## [1] "alignments"
## 
## $max_hits
## [1] 100
## 
## $use_existing
## [1] FALSE
## 
## $limited_memory
## [1] FALSE
## 
## attr(,"class")
## [1] "config"

This will be sufficient to align reads. As always the first argument can also be an artifact from quality_control or preprocess.

## INFO [2019-05-28 16:44:36] Aligning 3 samples on 3 threads. Keeping up to 100 secondary alignments.
## INFO [2019-05-28 16:44:39] Finished aligning even1.
## INFO [2019-05-28 16:44:43] Finished aligning even2.
## INFO [2019-05-28 16:44:46] Finished aligning even3.

You will get an output artifact that logs the created alignments…

##       id            alignment success
## 1: even1 alignments/even1.bam    TRUE
## 2: even2 alignments/even2.bam    TRUE
## 3: even3 alignments/even3.bam    TRUE

…the size of all the alignments on disk…

print(alns$disk_size, unit = "auto")
## 12.4 Mb

…and the logs in case something goes wrong.

cat(alns$logs[[1]])
## [M::mm_idx_gen::1.946*1.16] collected minimizers
## [M::mm_idx_gen::2.226*1.39] sorted minimizers
## [M::main::2.226*1.39] loaded/built the index for 10 target sequence(s)
## [M::mm_mapopt_update::2.355*1.37] mid_occ = 25
## [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 10
## [M::mm_idx_stat::2.442*1.35] distinct minimizers: 10418059 (91.27% are singletons); average occurrences: 1.132; average spacing: 5.347
## [M::worker_pipeline::3.214*1.59] mapped 1000 sequences
## [M::main] Version: 2.17-r941
## [M::main] CMD: minimap2 -acx map-ont -t 3 --secondary=yes -N 100 -I 100G /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/nanopore/even1_S1_L001_R1_001.fastq.gz
## [M::main] Real time: 3.289 sec; CPU: 5.182 sec; Peak RSS: 0.518 GB