Preprocessing of raw data

To filter and trim the raw reads we use the DADA2 functions, wrapped in a reproducible workflow step.

library(mbtools)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Also loading:
##   - dada2=1.12.0
##   - data.table=1.12.2
##   - ggplot2=3.1.1
##   - magrittr=1.5
##   - phyloseq=1.28.0
##   - ShortRead=1.42.0
##   - yaml=2.2.0
## Found tools:
##   - minimap2=2.17-r941
##   - slimm=0.3.4
##   - samtools=1.9
## 
## Attaching package: 'mbtools'
## The following object is masked _by_ 'package:BiocGenerics':
## 
##     normalize

Finding your files

We will again use our helper function to get a list of sequencing files.

path <- system.file("extdata/16S", package = "mbtools")
files <- find_read_files(path)
print(files)
##                                                                                                              forward
## 1: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D0_S188_L001_R1_001.fastq.gz
## 2: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D1_S189_L001_R1_001.fastq.gz
## 3: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D2_S190_L001_R1_001.fastq.gz
## 4: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D3_S191_L001_R1_001.fastq.gz
## 5: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/Mock_S280_L001_R1_001.fastq.gz
##                                                                                                              reverse
## 1: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D0_S188_L001_R2_001.fastq.gz
## 2: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D1_S189_L001_R2_001.fastq.gz
## 3: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D2_S190_L001_R2_001.fastq.gz
## 4: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/F3D3_S191_L001_R2_001.fastq.gz
## 5: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/16S/Mock_S280_L001_R2_001.fastq.gz
##      id injection_order lane
## 1: F3D0             188    1
## 2: F3D1             189    1
## 3: F3D2             190    1
## 4: F3D3             191    1
## 5: Mock             280    1
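
The id, injection_order and lane columns correspond to fields in the Illumina-style file names. As a rough sketch of how such annotations can be pulled out of a file name with base R (the regular expression and the `direction` column are assumptions made for illustration, not necessarily what find_read_files does internally):

```r
# File names copied from the listing above (directory part omitted).
fnames <- c(
  "F3D0_S188_L001_R1_001.fastq.gz",
  "F3D0_S188_L001_R2_001.fastq.gz",
  "Mock_S280_L001_R1_001.fastq.gz",
  "Mock_S280_L001_R2_001.fastq.gz"
)

# Hypothetical pattern:
# <id>_S<injection order>_L<lane>_R<direction>_001.fastq.gz
illumina_pattern <- "^(.+)_S([0-9]+)_L([0-9]+)_R([12])_001\\.fastq\\.gz$"

# regexec() returns the full match plus one entry per capture group.
parts <- do.call(rbind, regmatches(fnames, regexec(illumina_pattern, fnames)))

annotations <- data.frame(
  file = parts[, 1],
  id = parts[, 2],
  injection_order = as.integer(parts[, 3]),
  lane = as.integer(parts[, 4]),
  direction = ifelse(parts[, 5] == "1", "forward", "reverse"),
  stringsAsFactors = FALSE
)
print(annotations)
```

Files that do not follow this naming scheme would not match the pattern and would need their own expression.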

Configuration

Every mbtools workflow step comes with a corresponding config_* function that returns an example/default configuration. You can change the configuration afterwards or pass the parameters in directly. Here we specify a temporary directory as the storage location for the preprocessed data and truncate the forward reads to 240 bp and the reverse reads to 200 bp (based on our previous quality assessment).

config <- config_preprocess(out_dir = tempdir(), truncLen = c(240, 200))
config
## $threads
## [1] 1
## 
## $out_dir
## [1] "/var/folders/55/dv0p21y96g1cq84sr1zd3kym0000gr/T//RtmpXCaPpi"
## 
## $trimLeft
## [1] 10
## 
## $truncLen
## [1] 240 200
## 
## $maxEE
## [1] 2
## 
## $truncQ
## [1] 2
## 
## $maxN
## [1] 0
## 
## attr(,"class")
## [1] "config"

The printed configuration shows further parameters that we could set, such as trimLeft, maxEE, truncQ and maxN.
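
Since the printed object is a plain list with class "config", settings can presumably be changed after the fact with ordinary list assignment. The sketch below uses a stand-in list built by hand from the printed values so that it runs without mbtools; the new values for threads and maxEE are illustrative only. It also works out the read lengths to expect, assuming trimLeft and truncLen are handed to DADA2's filterAndTrim unchanged, in which case filtered reads have length truncLen - trimLeft.

```r
# Stand-in for the object printed above: a plain list with class "config".
# Values are copied from the printed output; out_dir uses tempdir().
config <- structure(
  list(
    threads = 1,
    out_dir = tempdir(),
    trimLeft = 10,
    truncLen = c(240, 200),
    maxEE = 2,
    truncQ = 2,
    maxN = 0
  ),
  class = "config"
)

# Change settings after the fact (illustrative values, not recommendations).
config$threads <- 4
config$maxEE <- 3

# Assumption: trimLeft and truncLen reach dada2::filterAndTrim as-is,
# so filtered reads have length truncLen - trimLeft.
expected_len <- config$truncLen - config$trimLeft
names(expected_len) <- c("forward", "reverse")
expected_len
```

With the settings above this gives 230 bp for the forward reads and 190 bp for the reverse reads, which is worth keeping in mind when checking that the reads still overlap enough to be merged.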

Running the preprocessing step

We can now run our preprocessing step.

filtered <- preprocess(files, config)
## INFO [2019-05-28 16:35:48] Preprocessing reads for 5 paired-end samples...
## INFO [2019-05-28 16:35:55] 4.03e+04/4.48e+04 (89.75%) reads passed preprocessing.

The percentage of reads that passed is reported on the logging interface, but you can also inspect the per-sample counts in detail:

##      raw preprocessed   id
## 1:  7793         6992 F3D0
## 2:  5869         5210 F3D1
## 3: 19620        17706 F3D2
## 4:  6758         6114 F3D3
## 5:  4779         4280 Mock
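
To see how evenly the filter treated the samples, the counts can be turned into per-sample fractions. The following is a minimal base R sketch that re-enters the numbers from the table above by hand; the 85% tolerance is a hypothetical threshold chosen for illustration.

```r
# Counts copied from the per-sample table above.
passed <- data.frame(
  id = c("F3D0", "F3D1", "F3D2", "F3D3", "Mock"),
  raw = c(7793, 5869, 19620, 6758, 4779),
  preprocessed = c(6992, 5210, 17706, 6114, 4280)
)

# Fraction of reads retained in each sample.
passed$fraction <- passed$preprocessed / passed$raw

# Flag samples that fall below a hypothetical minimum retained fraction.
min_fraction <- 0.85
passed$ok <- passed$fraction >= min_fraction
print(passed)

# Two ways to summarise: mean of per-sample fractions vs. pooled reads.
mean_fraction <- mean(passed$fraction)
pooled_fraction <- sum(passed$preprocessed) / sum(passed$raw)
```

With these counts the mean per-sample fraction is about 0.8975, which agrees with the percentage in the log line, while the pooled fraction (all passed reads over all raw reads) is slightly higher at about 0.8992. All five samples retain between roughly 89% and 90% of their reads, so no sample stands out as problematic.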