Map Module¶
Welcome to the technical documentation for the map module. This page aims to be a detailed documentation of each rule within the module by stating its inputs, outputs (and how they relate to other rules), configurable parameters, and the software used. Moreover, when needed, there will be explanations and examples of what that particular rule does.
The schema below is a visual representation of the individual module steps and how they are related.
Third-party software used¶
Tag lines were taken from the developers' websites (code repository or manual)
| Name | License | Tag line | More info |
|---|---|---|---|
| cutadapt | MIT | "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" | code / manual / publication |
| FASTX-Toolkit | AGPL-3.0 | "[...] collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing" | code / manual |
| Oligomap | GPLv3 | "[...] fast identification of nearly-perfect matches of small RNAs in sequence databases. It allows to exhaustively identify all the perfect and 1-error (where an error is defined to be a mismatch, insertion or deletion) matches of large sets of small RNAs to target sequences" | code / publication |
| SAMtools | MIT | "[...] suite of programs for interacting with high-throughput sequencing data" | code / manual / publication |
| segemehl | GPLv3 | "[...] map short sequencer reads to reference genomes" | manual / publication |
Configuration file¶
Some parameters within the workflow can be modified. Refer to the configuration template for a detailed explanation of each option.
Map Workflow¶
finish_map¶
Target rule as required by Snakemake.
Local rule
BAM index file (.bam.bai); from
index_all_alns_bam
start¶
(Workflow input) miRNA sequencing library, gziped
(.fa.gz/.fasta.gz or fq.gz/.fastaq.gz)
miRNA sequencing library, copied, renamed (.fa, .fastq); used in
fasta_quality_filter or
format_fasta
fastq_quality_filter¶
Conduct quality control for reads library with fastx_toolkit.
miRNA sequencing library, copied, renamed (.fastq); from
start
- config_template.yaml
q_value: Minimum Q (Phred) score to keep (default: 10)p_value: Minimum % of bases that must have a Q (Phred) quality (default: 50)
miRNA sequencing library, filtered (.fastq); used in
fastq_to_fasta
fastq_to_fasta¶
Convert reads file from FASTQ to FASTA with fastx_toolkit.
Sequence identifiers are renamed to numbers.
miRNA sequencºing library, filtered (.fastq); from
fastq_quality_filter
miRNA sequencing library (.fa); used in
format_fasta
format_fasta¶
Format read's sequences to appear on a single line with fastx_toolkit.
miRNA sequencing library (.fa); from start or
fastq_to_fasta
miRNA sequencing library, formatted (.fasta); used in
remove_adapters
remove_adapters¶
Trim 3' adapters and N bases at either end. Filter reads by minimum length
and number of inner N bases with cutadapt.
miRNA sequencing library, formatted (.fasta); from
format_fasta
- samples.tsv
- Adapter to be removed; specified in the sample's table column
adapter
- Adapter to be removed; specified in the sample's table column
- config_template.yaml
error_rate: Fraction of allowed errors in the matched adapters (default: 0.1)overlap: Minimum overlap length between adapter and read to trim the bases (default: 3)minimum_length: Minimum length for a processed read to be kept (default: 15)max_n: Maximum number of innerNbases for a processed read to be kept (default: 0)
miRNA sequencing library, filtered, without adapters (.fasta); used in
collapse_identical_reads
collapse_identical_reads¶
Collapse and rename identical reads fastx_toolkit.
Sequences are renamed in the format
R-N, whereRis the assigned number to the unique entry, andNis the amount of identical sequences within the library collapsed in it.
miRNA sequencing library, filtered, without adapters (.fasta); from
remove_adapters
miRNA sequencing library, collapsed, renamed (.fasta); used in
filter_fasta_for_oligomap,
map_genome_segemehl and
map_transcriptome_segemehl
map_genome_segemehl¶
Align short reads to reference genome with segemehl.
- miRNA sequencing library, collapsed, renamed (
.fasta); from collapse_identical_reads - Genome sequence, trimmed IDs (
.fa); from trim_genome_seq_ids - segemehl genome index (
.idx); from generate_segemehl_index_genome
Alignments file (.sam); used in
merge_genome_maps
map_transcriptome_segemehl¶
Align short reads to reference transcriptome with segemehl.
- miRNA sequencing library, collapsed, renamed (
.fasta); from collapse_identical_reads - Transcriptome sequence, trimmed IDs (
.fa); from trim_transcriptome_seq_ids - segemehl transcriptome index (
.idx); from generate_segemehl_index_transcriptome
Alignments file (.sam); used in
merge_transcriptome_maps
filter_fasta_for_oligomap¶
Filter reads by length with a custom script.
Required for an optimal mapping speed. Oligomap is specifically written for short reads. Therefore, reads with more bases than the default maximum (30 nts) makes the mapping slower.
miRNA sequencing library, collapsed, renamed (.fasta); from
collapse_identical_reads
- config_template.yaml
max_length_reads: Maximum length of processed reads to be mapped with oligomap (default: 30)
miRNA sequencing library, collapsed, filtered (.fasta); used in
map_genome_oligomap and
map_transcriptome_oligomap
map_genome_oligomap¶
Align short reads to reference genome with oligomap.
Refer to Oligomap's Output format section for a specific explanation and examples on the output format.
- miRNA sequencing library, collapsed, filtered (
.fasta); from filter_fasta_for_oligomap - Genome sequence, trimmed IDs (
.fa); from trim_genome_seq_ids
- Alignments file (
.oligomap); used in sort_genome_oligomap - Alignments report (
.txt)
sort_genome_oligomap¶
Sort oligomap alignments by query name with a custom script.
Alignments file (.oligomap); from
map_genome_oligomap
Alignments file, sorted (.oligomap); used in
convert_genome_to_sam_oligomap
convert_genome_to_sam_oligomap¶
Convert aligned reads .oligomap to .sam and filter alignments by number of
hits with a custom script.
If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count
1/NwhereNis the number of genomic loci it aligns to and a largeNmakes the contribution negligible.
Alignments file, sorted (.oligomap); from
sort_genome_oligomap
- config_template.yaml
nh: Maximum number of mappings per read to be kept (default: 100)
Alignments file, filtered (.sam); used in
merge_genome_maps
map_transcriptome_oligomap¶
Align short reads to reference transcriptome with oligomap.
Refer to Oligomap's Output format section for a specific explanation and examples on the output format.
- miRNA sequencing library, collapsed, filtered (
.fasta); from filter_fasta_for_oligomap - Transcriptome sequence, trimmed IDs (
.fa); from trim_transcriptome_seq_ids
- Alignments file (
.oligomap); used in sort_transcriptome_oligomap - Alignments report (
.txt)
sort_transcriptome_oligomap¶
Sort oligomap alignments by query name with a custom script.
Alignments file (.oligomap); from
map_transcriptome_oligomap
Alignments file, sorted (.oligomap); used in
convert_transcriptome_to_sam_oligomap
convert_transcriptome_to_sam_oligomap¶
Convert aligned reads .oligomap to .sam and filter alignments by number of
hits with a custom script.
If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count
1/NwhereNis the number of genomic loci it aligns to and a largeNmakes the contribution negligible.
Alignments file, sorted (.oligomap); from
sort_transcriptome_oligomap
- config_template.yaml
nh: Maximum number of mappings per read to be kept (default: 100)
Alignments file, filtered (.sam); used in
merge_transcriptome_maps
merge_genome_maps¶
Concatenate segemehl and oligomap genome alignments.
- Alignments file (
.sam); from map_genome_segemehl - Alignments file, filtered (
.sam); from convert_genome_to_sam_oligomap
Alignments file (.sam); used in
filter_genome_by_nh
merge_transcriptome_maps¶
Concatenate segemehl and oligomap transcriptome alignments.
- Alignments file (
.sam); from map_transcriptome_segemehl - Alignments file, filtered (
.sam); from convert_transcriptome_to_sam_oligomap
Alignments file (.sam); used in
filter_transcriptome_by_nh
filter_genome_by_nh¶
Filter merged genome alignments by the number of hits with a custom script.
If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count
1/NwhereNis the number of genomic loci it aligns to and a largeNmakes the contribution negligible.
Alignments file (.sam); from
merge_genome_maps
- config_template.yaml
nh: Maximum number of mappings per read to be kept (default: 100)
Alignments file, filtered (.sam); used in
remove_header_genome_mappings
filter_transcriptome_by_nh¶
Filter merged transcriptome alignments by the number of hits with a custom script.
If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count
1/NwhereNis the number of genomic loci it aligns to and a largeNmakes the contribution negligible.
Alignments file (.sam); from
merge_transcriptome_maps
- config_template.yaml
nh: Maximum number of mappings per read to be kept (default: 100)
Alignments file, filtered (.sam); used in
remove_header_transcriptome_mappings
remove_header_genome_mappings¶
Remove the SAM header of the genome alignments file with SAMtools.
Alignments file (.sam); from
filter_genome_by_nh
Alignments file, without SAM header (.sam); used in
merge_all_maps
remove_header_transcriptome_mappings¶
Remove the SAM header of the transcriptome alignments file with SAMtools.
Alignments file (.sam); from
filter_transcriptome_by_nh
Alignments file, without SAM header (.sam); used in
transcriptome_to_genome_maps
transcriptome_to_genome_maps¶
Convert the alignments' transcriptome coordinates to genomic ones with a custom script.
- Alignments file, without SAM header (
.sam); from remove_header_transcriptome_mappings - Exon annotations (
.bed); from convert_exons_gtf_to_bed
Alignments file, without SAM header (.sam); used in
merge_all_maps
merge_all_maps¶
Concatenate the four alignment files into a single one.
Alignments files, without SAM header (.sam); from
remove_header_genome_mappings and
transcriptome_to_genome_maps
Alignments file, without SAM header (.sam); used in
add_header_all_maps
add_header_all_maps¶
Add the SAM header to the aligned reads merged file with SAMtools.
Alignments file, without SAM header (.sam); from
merge_all_maps
Alignments file (.sam); used in sort_maps_by_id
sort_maps_by_id¶
Sort alignments by reads ID with SAMtools.
Alignments file (.sam); from
add_header_all_maps
Alignments file, sorted (.sam); used in
remove_inferiors
remove_inferiors¶
Remove duplicate and inferior alignments with a custom script.
Alignments are considered to be duplicates if having identical entries for the fields
QNAME,FLAG,RNAME,POSandCIGAR. Alignments are considered to be inferiors if having the sameQNAMEand a bigger edit distance than the smaller one within the group. The tagsNH(number of hits) andHI(query hit index) are updated accordingly.
Alignments file, sorted (.sam); from
sort_maps_by_id
Alignments file, filtered (.sam); used in
filter_by_indels
Example 1 | Remove duplicates
IN:
1-2 0 19 44414 1 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * HI:i:0 NH:i:1 NM:i:0 MD:Z:21 RG:Z:A1 YZ:Z:0
1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * NM:i:0 MD:Z:21 NH:i:1
OUT:
1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * MD:Z:21 NH:i:1 NM:i:0
Example 2 | Remove inferiors single alignment
IN:
1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:2 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0
1-704 272 19 471264 1 16M1I1M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:1 NH:i:2 NM:i:2 MD:Z:11G5 RG:Z:A1 YZ:Z:0
OUT:
1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:1 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0
Example 3 | Remove inferiors multiple alignments
IN:
1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 68983 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:4 NM:i:3 MD:Z:1C10AT1 RG:Z:A1 YZ:Z:0
1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:2 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 92363 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:4 NH:i:4 NM:i:3 MD:Z:1C11TA RG:Z:A1 YZ:Z:0
OUT:
1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
filter_by_indels¶
Filter multimappers favoring indels over mismatches with a custom script.
Given that indels are more frequent in miRNAs than mismatches, as demonstrated by Saunders et al. (2017), Neilsen et al. (2012) and Schmauch et al. (2024), only those "multimappers" (defined here as alignments of the same read mapping to different genomic loci with the same edit distance) that contain a higher or equal number of indels compared to mismatches are retained.
Alignments file, sorted, filtered (.sam); from
remove_inferiors
Alignments file, sorted, filtered (.sam); used in
convert_all_alns_sam_to_bam and
filter_sam_by_intersecting_primir
Example 1 | Different number of indels
IN:
1-1 0 19 77595 255 8M1D14M * 0 0 CTGACATCAGTGATTCTCCTGC * MD:Z:3G1T2^A14 NH:i:2 NM:i:3 XA:Z:Q XI:i:1
1-1 0 19 330456 255 4M1D1M1I3M1D13M * 0 0 CTGACATCAGTGATTCTCCTGC * MD:Z:4^G4^A13 NH:i:2 NM:i:3 XA:Z:Q XI:i:0
Alignments:
CTGACATC-AGTGATTCTCCTGC
||| | || |||||||||||||| (1 indel, 2 mismatches, discarded)
CTGGCTTCAAGTGATTCTCCTGC
CTGA-CATCA-GTGATTCTCCTGC
|||| | ||| ||||||||||||| (3 indels, 0 mismatches, retained)
CTGAGC-TCAAGTGATTCTCCTGC
OUT:
1-1 0 19 330456 255 4M1D1M1I3M1D13M * 0 0 CTGACATCAGTGATTCTCCTGC * MD:Z:4^G4^A13 NH:i:1 HI:i:1 NM:i:3 XA:Z:Q XI:i:0
Example 2 | Equal number of indels
IN:
1-1 0 19 142777 255 5M1I15M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:0
1-1 0 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:2
1-1 0 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:1
Alignments:
GCTTCAAGCCTCCCACCTAGC
||||| ||||||||| |||| (1 Indel, 2 mismatches, retained)
GCTTC-AGCCTCCCAAGTAGC
GCTTCAAGCCTCCCACCTAGC
|||||| |||||||| |||| (1 Indel, 2 mismatches, retained)
GCTTCA-GCCTCCCAGGTAGC
GCTTCAAGCCTCCCACCTAGC
|||||| |||||||| |||| (1 Indel, 2 mismatches, retained)
GCTTCA-GCCTCCCAAGTAGC
OUT:
1-1 0 19 142777 255 5M1D15M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 HI:i:1 NM:i:3 XA:Z:Q XI:i:0
1-1 0 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 HI:i:2 NM:i:3 XA:Z:Q XI:i:2
1-1 0 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 HI:i:3 NM:i:3 XA:Z:Q XI:i:1
convert_all_alns_sam_to_bam¶
Convert alignments .sam file to .bam with
SAMtools.
Required by BEDTools to intersect alignments with pri-miR annotations.
Alignments file, filtered (.sam); from
filter_by_indels
Alignments file (.bam); used in
sort_all_alns_bam_by_position
sort_all_alns_bam_by_position¶
Sort alignments by position with SAMtools.
Required by BEDTools to intersect alignments with pri-miR annotations more efficiently.
Alignments file (.bam); from
convert_all_alns_sam_to_bam
Alignments file, sorted (.bam); used in
index_all_alns_bam and
intersect_extended_primir
index_all_alns_bam¶
Create index BAM file with SAMtools.
Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.
Alignments file, sorted (.bam); from
sort_all_alns_bam_by_position
BAM index file (.bam.bai); used in
intersect_extended_primir