Skip to content

module filter_multimappers.py

Filter miRNA reads mapped to multiple locations by indel count.

From all the alignments with the same query name and edit distance, keep the ones with the higher amount of indels. Supplementary alignments are dismissed as they are part of a chimeric alignment, which is composed of multiple linear alignments with minimal overlap.

Additionally, the NH and HI tags are updated to match the new amount of alignments and keep their identifier within the new set respectively.

If the CLI flag --nh is set, query names are appended the suffix _# were # is the updated alignment's NH tag.

The following assumptions are made:

  • The input SAM file is sorted by query name.

Examples

Example 1 | Different number of indels

IN SAM records:

read-1  0       19      77595   255     8M1D14M *       0       0       CTGACATCAGTGATTCTCCTGC  *       MD:Z:3G1T2^A14  NH:i:2  NM:i:3  XA:Z:Q  XI:i:1
read-1  0       19      330456  255     4M1D1M1I3M1D13M *       0       0       CTGACATCAGTGATTCTCCTGC  *       MD:Z:4^G4^A13   NH:i:2  NM:i:3  XA:Z:Q  XI:i:0

Alignments:

CTGACATC-AGTGATTCTCCTGC
||| | || |||||||||||||| (1 indel, 2 mismatches, discarded)
CTGGCTTCAAGTGATTCTCCTGC

CTGA-CATCA-GTGATTCTCCTGC
|||| | ||| ||||||||||||| (3 indels, 0 mismatches, retained)
CTGAGC-TCAAGTGATTCTCCTGC

Command:

filter_multimappers.py SAM > out_SAM

OUT SAM record:

read-1  0       19      330456  255     4M1D1M1I3M1D13M *       0       0       CTGACATCAGTGATTCTCCTGC  *       MD:Z:4^G4^A13   NH:i:1  HI:i:1  NM:i:3  XA:Z:Q  XI:i:0

Example 2 | Equal number of indels

IN SAM records:

read-2  0       19      142777  255     5M1I15M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14A0G4     NH:i:3  NM:i:3  XA:Z:Q  XI:i:0
read-2  0       19      270081  255     6M1I14M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14G0G4     NH:i:3  NM:i:3  XA:Z:Q  XI:i:2
read-2  0       19      545543  255     6M1I14M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14A0G4     NH:i:3  NM:i:3  XA:Z:Q  XI:i:1

Alignments:

GCTTCAAGCCTCCCACCTAGC
||||| |||||||||  |||| (1 Indel, 2 mismatches, retained)
GCTTC-AGCCTCCCAAGTAGC

GCTTCAAGCCTCCCACCTAGC
|||||| ||||||||  |||| (1 Indel, 2 mismatches, retained)
GCTTCA-GCCTCCCAGGTAGC

GCTTCAAGCCTCCCACCTAGC
|||||| ||||||||  |||| (1 Indel, 2 mismatches, retained)
GCTTCA-GCCTCCCAAGTAGC

Command:

filter_multimappers.py SAM > out_SAM

OUT SAM records:

read-2_3        0       19      142777  255     5M1D15M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14A0G4     NH:i:3  HI:i:1  NM:i:3  XA:Z:Q  XI:i:0
read-2_3        0       19      270081  255     6M1I14M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14G0G4     NH:i:3  HI:i:2  NM:i:3  XA:Z:Q  XI:i:2
read-2_3        0       19      545543  255     6M1I14M *       0       0       GCTTCAAGCCTCCCACCTAGC   *       MD:Z:14A0G4     NH:i:3  HI:i:3  NM:i:3  XA:Z:Q  XI:i:1

Example 3 | Add NH tag as read's name suffix

IN SAM record:

read-3  0       19      5338    1       15M1I7M *       0       0       TCAAAACTGAGGGGCTATTTTCT *       HI:i:1  NH:i:1  NM:i:1  MD:Z:22 RG:Z:A1 YZ:Z:0

Alignments:

TCAAAACTGAGGGGCTATTTTCT
||||||||||||||| ||||||| (1 Indel, 0 mismatches, retained)
TCAAAACTGAGGGGC-ATTTTCT

Command:

filter_multimappers.py SAM --nh > out_SAM

OUT SAM record:

read-3_1        0       19      5338    1       15M1I7M *       0       0       TCAAAACTGAGGGGCTATTTTCT *       HI:i:1  NH:i:1  NM:i:1  MD:Z:22 RG:Z:A1 YZ:Z:0

function parse_arguments

parse_arguments()

Command-line arguments parser.


function count_indels

count_indels(aln: AlignedSegment)  int

Count the number of indels in an alignment based on its CIGAR string.

This function counts the number of indels in the alignment based on the alignment CIGAR string returned by .cigartuples. Insertions are encoded as 1s whereas the deletions are encoded as 2. Refer to the .cigartuples documentation for more information.

Arguments:

  • aln: alignment to count insertions and deletions for.

Returns:

  • int: sum of insertions and deletions in that alignment.

function find_best_alignments

find_best_alignments(
    alns: List[AlignedSegment],
    nh: bool = False
)  List[AlignedSegment]

Find alignments with more indels.

Using the function count_indels each alignment is assigned with its number of indels. Then, only those alignments with the higher amount of indels are returned. In addition, the NH and HI tags are updated to match the new amount of alignments and keep its identifier within the set respectively.

If nh is set to True, all query names are appended a suffix _# were # is the new alignment's NH tag.

Arguments:

  • alns: alignments with the same query name
  • nh: whether to suffix read names with the alignment's NH tag

Returns:

  • best_alignments: alignments with the more indels

function write_output

write_output(alns: List[AlignedSegment])  None

Write the output to the standard output (stdout).

Arguments:

  • alns: alignments with the same query name

function main

main(args)  None

Filter multimappers by indels count.