module mirna_quantification.py¶
Quantify miRNAs and corresponding isomiRs.
Read the input SAM file, calculate the contribution sum of the intersecting alignments for each feature and write the result in tab-delimited format. The following sections will cover what the SAM file must be like, how the counting is done, and how the output can look like. The final section contain examples covering the main modes.
EXPECTED INPUT
The SAM file must contain only the alignments that intersect with canonical
miRNAs and/or isomiRs. For each alignment, the intersecting feature name(s)
must be stored in a tag. If the tag determined by the CLI argument --tag is
empty, the alignment is ignored. In addition, the SAM file must be sorted by
the tag storing the feature names. Optionally, the SAM file can contain
collapsed reads, as, e.g., produced by the fastx_collapser tool of the
FASTX-Toolkit. Finally, read names can also contain the alignment
NH vale (see READ NAME FORMAT section).
FEATURE NAME FORMAT
The name of the intersecting feature(s) has to follow the format:
feat_name|5p-shift|3p-shift|CIGAR|MD|READ_SEQ
- For isomiRs, the whole name is used
- For canonical miRNAs, the
feat_nameis used
A feature is classified as canonical if the 5p and 3p shifts are both 0 and
the CIGAR and MD strings are the same. If there are different features names
separated by a semi-colon, they are treated as a unique isomiR name. For example,
the feature name feat_name_1|0|0|23M|23|read_seq would represent a canonical
miRNA and the feature names feat_name_2|0|0|22M|0A0A2T17|read_seq and
feat_name_3|0|0|22M|22|read_seq;feat_name_4|1|0|23M|17A3T0C0|read_seq would
represent isomiRs.
READ NAME FORMAT
The read name must have one of the following formats:
NAMENAME_NHNAME-COUNTNAME-COUNT_NH
where:
NAMEis an arbitrary unique nameCOUNTis the number of identical reads that were collapsedNHis the alignment NH tag.
For example, the query name my_query_name-13_2 would indicate that the read
represented by this alignment occurred 13 times before collapsing and the
alignment's NH value is 2.
Only if COUNT is in the name the CLI flag --collapsed can be set. Likewise,
the CLI flag --nh can only be set if NH is in the name.
COUNT TABULATION
The count for each feature is the sum of all the intersecting alignment's contribution. The appropriate contribution is the ratio between the number of reads and the alignment's NH.
The CLI flags --collapsed and --nh determine if these values are taken
from the alignment's name:
- If
--collapsedis not set, the number of reads is considered to be 1 - If
--nhis not set, the NH value is obtained from the NH tag - If the NH tag is missing, its value is considered to be 1
TABULATED FEATURES
Which kind of features will appear in the output file is determined by the CLI
argument --mir-list. The three possible options are:
mirnato only include canonical miRNAsisomirto only include isomiRsmirnaandisomirto include both canonical miRNAs and isomiRs
OUTPUT FORMAT
The output is tab-delimited file named mir_counts_LIB, where LIB is the
read's library name set in the CLI option --lib.
If the SAM file is empty, an empty file is produced.
By default each row contains the feature name and its partial count.
Three extra columns can be added by using the CLI flags --count, --len
and --read-ids:
- If
--countis set, the output table will contain the number of best alignments for each feature - If
--lenis set, the output table will contain the read's length - If both
--countand--lenare set, the count will always be followed by the read's length - If
--read-idsis set, a semicolon-separated list of all alignment IDs that overlap with the feature will always be in the last column
Examples¶
Example 1 | Feature name stored in the tag XN¶
Input:
SAM with intersecting feature names in stored in the tag XN
Command:
mirna_quantification.py SAM --tag XN
Output:
hsa-miR-516b-5p 3.0
hsa-miR-517-5p|-1|0|23M|22T|ACCTCTAGATGGAAGCACTGTCG 0.6000000000000001
Example 2 | Quantify mature miRNA¶
Input:
SAM meeting the minimal characteristics
Command:
mirna_quantification.py SAM --mir-list mirna
Output:
hsa-miR-517b-3p 1.3333333333333333
hsa-miR-518b 1.0
Example 3 | Quantify isomiR¶
Input:
SAM meeting the minimal characteristics
Command:
mirna_quantification.py SAM --mir-list isomir
Output:
hsa-miR-512-3p|0|1|23M|22C|AAGTGCTGTCATAGCTGAGGTAA 1.6666666666666665
hsa-miR-517-5p|-1|0|23M|22T|ACCTCTAGATGGAAGCACTGTCG 0.6000000000000001
Example 4 | Collapsed reads with NH tag in the name¶
Input:
SAM with read names following the format NAME-COUNT_NH
Command:
mirna_quantification.py SAM --collpased --nh
Output:
hsa-miR-516b-5p 3.0
hsa-miR-517-5p|-1|0|23M|22T|ACCTCTAGATGGAAGCACTGTCG 0.6000000000000001
Example 5 | Collapsed reads¶
Input:
SAM with read names following the format NAME-COUNT
Command:
mirna_quantification.py SAM --collapsed
Output:
hsa-miR-516b-5p 3.0
hsa-miR-517-5p|-1|0|23M|22T|ACCTCTAGATGGAAGCACTGTCG 0.6000000000000001
Example 6 | Reads with the NH tag in their name¶
Input:
SAM with read names following the format NAME_NH
Command:
mirna_quantification.py SAM --nh
Output:
hsa-miR-516b-5p 3.0
hsa-miR-517-5p|-1|0|23M|22T|ACCTCTAGATGGAAGCACTGTCG 0.6000000000000001
Example 7 | Quantify mature miRNAs and add read IDs to the final table¶
Input:
SAM with read names following the format NAME_NH
Command:
mirna_quantification.py SAM --read-ids --nh --mir-list mirna
Output:
hsa-miR-498-5p 4.333333333333333 270396_3
hsa-miR-1323 12.0 673650_2;673650_2
Example 8 | Quantify isomiRs, add isomiR length and absolute counts¶
Input:
SAM meeting the minimal characteristics
Command:
mirna_quantification.py SAM --count --len --mir-list isomir
Output:
hsa-miR-512-3p|0|1|23M|22C0|AAGTGCTGTCATAGCTGAGGTAA 4.333333333333333 6 23
hsa-miR-512-3p|0|1|23M|3T18C0|AAGGGCTGTCATAGCTGAGGTAA 12.0 8 19
function parse_arguments¶
parse_arguments()
Command-line arguments parser.
function collapsed_nh_contribution¶
collapsed_nh_contribution(aln: AlignedSegment) → float
Get the contribution of the alignment to the overall count.
The contribution is computed as the ratio of the number of reads collapsed in
the alignment and the NH value. It is assumed that the alignment query name
contains the number of collapsed reads as well as the NH value in the format
NAME-COUNT_NH.
Arguments:
aln: Alignment to which the overall contribution is calculated
Returns: Contribution of alignment to overall count
function collapsed_contribution¶
collapsed_contribution(aln: AlignedSegment) → float
Get the contribution of the alignment to the overall count.
The contribution is computed as the ratio of the number of reads collapsed in
the alignment and the value stored in the NH tag. If the tag is missing, the NH
value is 1. It is assumed that the alignment query name contains the number of
collapsed reads in the format NAME-COUNT.
Arguments:
aln: Alignment to which the overall contribution is calculated
Returns: Contribution of alignment to overall count
function nh_contribution¶
nh_contribution(aln: AlignedSegment) → float
Get the contribution of the alignment to the overall count.
The contribution is computed as the ratio of the number of reads collapsed in
the alignment and the value stored in the NH tag. If the tag is missing, the NH
value is 1. It is assumed that the alignment query name contains the NH value in
the format NAME_NH.
Arguments:
aln: Alignment to which the overall contribution is calculated
Returns: Contribution of alignment to overall count
function contribution¶
contribution(aln: AlignedSegment) → float
Get the contribution of the alignment to the overall count.
The contribution is computed as the ratio of the number of reads collapsed in the alignment and the value stored in the NH tag. If the tag is missing, the overall contribution is 1.
Argumens:
aln: Alignment to which the overall contribution is calculated
Returns: Contribution of alignment to overall count
function get_name¶
get_name(pre_name: str) → list[str]
Get the final name for the species name.
Take a string and processes it to obtain the final name for the species and the
type of miRNA the string belongs to. Only the feat_name is returned if the
3p-shift and 5p-shift are 0 and the CIGAR and MD are the same. Otherwise, the
whole input string is returned.
Arguments:
pre_name: string with the formatfeat_name|5-shift|3-shift|CIGAR|MD|read_seq
Returns: list with the species name to be found in the final table and its type
function write_output¶
write_output(
name: str,
species: list[str],
mir_list: list[str],
mirna_out: Path
) → None
Write to the output the correct miRNA type.
Arguments:
name: miRNA species typespecies: List of candidate miRNA species to write in the output filemir_list: List of miRNA types to have in the output tablemirna_out: Path to the output file
function main¶
main(args) → None
Quantify miRNAs and corresponding isomiRs.