`module` `oligomap_output_to_sam_nh_filtered.py`¶

Transform oligomap output FASTA file to SAM keeping the best alignments.

Read the input file, pick the best alignment(s) for each read (i.e, all of the alignments have either 0 or 1 error). In addition, if the --nh-filter CLI argument is set, filter the reads with more hits than the number provided. The following sections will cover what the input file must be like and how the output can look like.

EXPECTED INPUT

The FASTA file must be the output of mapping your library with the tool oligomap. Refer to the Output format section in its main README.md for more information.

In addition, alignments have to be sorted by the read name. An example of how an entry would look like in the EXAMPLE section.

OUTPUT FORMAT

The output consist on the filtered set of alignments from the input file in SAM format. Only the best alignments per read (i.e., either all the alignments have 0 or 1 error) are written to the standard output. Moreover, if the --nh-filter CLI argument is given, reads with more hits than the provided value are discarded. If none of the alignments meet the criteria, nothing is returned. The fields in the SAM entry are:

field | value
--------------
QNAME | Read's name
FLAG  | Set to 0 if the reference sequence is found in the positive strand,
      | and to 16 otherwise.
RNAME | Reference sequence name
POS   | Alignment's first position in the reference sequence
MAPQ  | Value set to 255 (mapping quality not available)
CIGAR | Alignment's CIGAR string
RNEXT | Value set to * (unavailable information)
PNEXT | Value set to 0 (unavailable information)
TLEN  | Value set to 0 (unavailable information)
SEQ   | Mapped sequence
QUAL  | Value set to * (Phred-scaled base quality not stored)
NM:i: | Alignment's edit distance (Optional field)
MD:Z: | Alignment's MD string (Optional field)
NH:i: | Alignment's NH (Optional field)

Example¶

Input format:

82-1 (23 nc) 1..23      19      44377..44398
19
errors: 1 orientation: +
CTACAAAGGGAAGCACTTGTCTC
|||||||||||||||||| ||||
CTACAAAGGGAAGCACTT-TCTC



97-1 (22 nc) 1..22      19      5338..5359
19
errors: 1 orientation: +
TCAAAACTGAGGTGCATTTTCT
|||||||||||| |||||||||
TCAAAACTGAGGGGCATTTTCT

Output format:

82-1    0       19      44377   255     18M1I4M *       0       0       CTACAAAGGGAAGCACTTGTCTC *       NM:i:1  MD:Z:23         NH:i:1
97-1    0       19      5338    255     22M     *       0       0       TCAAAACTGAGGTGCATTTTCT  *       NM:i:1  MD:Z:12G9       NH:i:1

Paula Iborra. Zavolan Lab. Adapted version of Alessandro Crippa script. Refactored and documented by Iris Mestres-Pascual.

`function` `parse_arguments`¶

parse_arguments()

Command-line arguments parser.

`function` `get_cigar_md`¶

get_cigar_md(
    errors: str,
    sequence: str,
    bars_line: str,
    ref_seq: str
) → tuple[str, str]

Get the CIGAR and MD strings.

Given the read and target sequences, the number of errors and the alignment's bar representation, this function first checks if there is any error in the mapping. If there is no errors, the CIGAR and MD strings are the read's length followed by and "M" and the read's length respectively. If there is an error, the function checks if the read has a deletion (the read sequence contains a dash), an insertion (the reference sequence contains a dash), or a single-point mutation. Finally, it checks the error's position in the read sequence and builds the CIGAR and MD strings.

Arguments:

errors: number of mapping errors (either 0 or 1)
sequence: read sequence
bars_line: alignment's bar representation
ref_seq: reference sequence

Returns:

cigarStr: alignment's CIGAR string
matchingString: alignment's MD string with its corresponding tag

Examples¶

Example 1 | Perfect match¶

read sequence ->               GAAGGCGCTTCACCTTTGGAGT
alignment representation ->    ||||||||||||||||||||||
reference sequence ->          GAAGGCGCTTCACCTTTGGAGT
errors -> 0

cigarStr: 21M
matchingString: MD:Z:21

Example 2 | One insertion¶

read sequence ->               GAAGGCGCTTCACCTTTGGAGT
alignment representation ->    ||||||||||| ||||||||||
reference sequence ->          GAAGGCGCTTC-CCTTTGGAGT
errors -> 1

cigarStr: 11M1I10M
matchingString: MD:Z:21

Example 3 | One deletion¶

read sequence ->               GAAGGCGCTTC-CCTTTGGAGT
alignment representation ->    ||||||||||| ||||||||||
reference sequence ->          GAAGGCGCTTCCCCTTTGGAGT
errors -> 1

cigarStr: 11M1D10M
matchingString: MD:Z:11C10

Example 4 | One mismatch¶

read sequence ->               GAAGGCGCTTCACCTTTGGAGT
alignment representation ->    ||||||||||| ||||||||||
reference sequence ->          GAAGGCGCTTCCCCTTTGGAGT
errors -> 1

cigarStr: 21M
matchingString: MD:Z:11C10

`function` `get_sam_fields`¶

get_sam_fields(aln: list[str]) → Fields

Create the read's alignment in SAM format.

Given the set of lines oligomap provides as an alignment representation, this function returns a Fields class object with the mandatory fields in a SAM file and three optional ones.

field | value
--------------
QNAME | Read's name
FLAG  | Set to 0 if the reference sequence is found in the positive strand,
      | and to 16 otherwise.
RNAME | Reference sequence name
POS   | Alignment's first position in the reference sequence
MAPQ  | Value set to 255 (mapping quality not available)
CIGAR | Alignment's CIGAR string
RNEXT | Value set to * (unavailable information)
PNEXT | Value set to 0 (unavailable information)
TLEN  | Value set to 0 (unavailable information)
SEQ   | Mapped sequence
QUAL  | Value set to * (Phred-scaled base quality not stored)
NM:i: | Alignment's edit distance (Optional field)
MD:Z: | Alignment's MD string (Optional field)
NH:i: | Alignment's NH (Optional field)

Arguments:

aln: list made up of the lines that form the output alignment after mapping with oligomap.

Returns:

fields: Fields class NamedTuple with the alignment's data as SAM fields.

`function` `eval_aln`¶

eval_aln(
    nhfilter: int,
    d: Dict[str, list],
    min_err_nh: Dict[str, list],
    fields: Fields
) → None

Evaluate an alignment to add, discard or write it to the STDOUT.

Given a read's alignment, this function first checks if the dictionary storing the read alignments is empty. If it is, the function checks if the read has an entry in the dictionary storing the current minimum error and the NH. If there is not, or the minimum number of errors is higher than the alignment under evaluation, the entry for that read in both dictionaries is set to the data of the alignment under evaluation.

If the alignments dictionary is not empty, the function checks if the read under evaluation is the first in the dictionary. If it is, and the number of errors in the current alignment is the same as the minimum error for that read, the number of hits is increased by 1. Then:

If the number of hits does not exceed the maximum NH or the NH is not provided, the alignment under evaluation is appended to the read's entry.
If a maximum NH is set, and the read exceeds it, it is removed.
If the minimum number of errors is higher than the number of errors of the alignment under evaluation, the entry for that read in both dictionaries is set to the data of the current alignment.

If the read is not the first in the dictionary, the function writes all the alignments for the read in the first position to the standard output, and both dictionaries are reset to only include the data for the alignments under evaluation.

Arguments:

nhfilter: maximum number of hits an read can have to be kept
d: dictionary with read names as keys and a list with Fields class NamedTuples as values
min_err_nh: dictionary with read names as keys and a list with the current minimum error and the number of hits in this order as values
fields: read's alignment as a Fields class NamedTuple

`function` `main`¶

main(arguments) → None

Convert the alignments in the oligomap output file to SAM format.

`class` `Fields`¶

Class to store an alignment in its different SAM fields.

Attributes:

read_name: read's name
flags: bitwise SAM flags
ref_seq_name: reference sequence name
position_in_ref: 1-based leftmost mapping position
map_q: mapping quality
cigar_str: CIGAR string
r_next: reference name of the mate/next read
p_next: position of the mate/next read
t_len: observed template length
sequence: segment sequence
qual: ASCII of Phred-scaled base quality + 33
edit_dist: Alignment's edit distance
md_str: MD string

module oligomap_output_to_sam_nh_filtered.py¶

Example¶

function parse_arguments¶

function get_cigar_md¶

Examples¶

Example 1 | Perfect match¶

Example 2 | One insertion¶

Example 3 | One deletion¶

Example 4 | One mismatch¶

function get_sam_fields¶

function eval_aln¶

function main¶

class Fields¶

`module` `oligomap_output_to_sam_nh_filtered.py`¶

`function` `parse_arguments`¶

`function` `get_cigar_md`¶

`function` `get_sam_fields`¶

`function` `eval_aln`¶

`function` `main`¶

`class` `Fields`¶