module mirna_extension.py
Extend miRNA start and end coordinates and ensure name uniqueness.
This method updates the attributes of precursor and mature miRNA entries to ensure consistent naming based on their paralog or sequence variant status.
For precursors:
- A suffix indicates distinct genomic loci (paralogs) that express identical mature sequences. This is typically extracted from the 'Name' or 'ID' attribute.
- Format: SPECIES-mir-NUMBER[LETTER]-# (Name) and ALIAS_# (ID), where:
  - LETTER denotes a sequence variant of the mature miRNA (paralogous variant with similar but not identical sequences).
  - # indicates the paralog number (replica/locus index), included when multiple loci express the same or similar miRNAs.
For mature miRNAs:
- The replica number is added or replaced as an infix/suffix in the name.
- Formats:
  - SPECIES-miR-NUMBER[LETTER]-#-ARM
  - SPECIES-miR-NUMBER[LETTER]-#
  - SPECIES-miR-NUMBER[LETTER]-ARM
Cases:
- If a precursor has multiple genomic instances (paralogs), the first occurrence typically lacks a numeric suffix; subsequent ones are numbered incrementally.
- The 'Derives_from' attribute of each mature miRNA is updated to match the precursor's 'ID'.
- If a precursor has a different sequence variant designation ('LETTER') than its associated matures, the mature miRNA names are updated to match the precursor's designation.
Note that the 'Alias' attribute remains unchanged, so repeated values may still be present.
Extend mature miRNA start and end coordinates in a GFF3 file by the indicated
number of nucleotides without exceeding chromosome boundaries. If the
extension causes the mature miRNA coordinates to exceed the boundaries of the
corresponding precursor (see note in "Constraints" below), the precursor
coordinates are extended accordingly. The suffixes _-n and/or _+m are appended
to the precursor name, where n and m are the extensions at the 5' and 3' end,
respectively (or 0 otherwise). The modified mature miRNA and precursor
annotations will be written to separate GFF3 files.
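The precursor adjustment described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation. The helper name `widen_precursor` and the example coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GFF3. The sketch measures the extension at the lower and upper coordinate and always writes both suffixes (with 0 where no extension was needed), which is only one reading of the naming rule. How the lower and upper coordinates map to the 5' and 3' ends on the minus strand is not covered.

```python
def widen_precursor(name, pre_start, pre_end, mat_start, mat_end):
    """Widen a precursor so that it contains an extended mature miRNA.

    Hypothetical helper: returns the updated name and coordinates.
    Coordinates are 1-based and inclusive, as in GFF3.
    """
    # Number of nucleotides by which the mature miRNA sticks out on each
    # side (0 when it still lies within the precursor).
    left = max(0, pre_start - mat_start)
    right = max(0, mat_end - pre_end)

    if left == 0 and right == 0:
        # Nothing to do: the precursor already contains the mature miRNA.
        return name, pre_start, pre_end

    new_name = f"{name}_-{left}_+{right}"
    return new_name, pre_start - left, pre_end + right


# Made-up coordinates: the extended mature miRNA starts 3 nt upstream
# of the precursor start.
print(widen_precursor("hsa-mir-16-1", 100, 180, 97, 130))
# ('hsa-mir-16-1_-3_+0', 97, 180)
```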
Constraints:
The script was written for GFF3 files obtained from miRBase.
The following assumptions are made:
- The input GFF3 file contains miRNA annotations for a single species.
- The input GFF3 file contains features of type "miRNA_primary_transcript" (referred to as "precursors") and "miRNA" (referred to as "mature miRNAs").
- The input GFF3 file contains a 'Name' and 'ID' attribute for each precursor.
- Each precursor contains one or more mature miRNAs.
- Each mature miRNA is a child of exactly one precursor and is completely within the boundaries of the precursor.
function parse_arguments
parse_arguments()
Parse command-line arguments.
function main
main(args) → None
Extend miRNA start and end coordinates.
class AnnotationException
Custom exception class for the MirnaExtension class.
class MirnaExtension
Class for updating miRNA annotated coordinates and names.
Attributes:
db: In-memory database of the input GFF3 records.
db_out: In-memory database of the updated GFF3 records.
seq_lengths: Dictionary mapping reference sequence IDs to their lengths.
function __init__
__init__() → None
Initialize class.
function adjust_names
adjust_names(precursor: Feature, matures: list[Feature]) → None
Adjust miRNA attributes for uniqueness and consistency.
This method updates the attributes of precursor and mature miRNA entries to ensure consistent naming based on their paralog or sequence variant status.
For precursors:
- A suffix indicates distinct genomic loci (paralogs) that express identical mature sequences. This is typically extracted from the Name or ID attribute.
- Format: SPECIES-mir-NUMBER[LETTER]-# (Name) and ALIAS_# (ID), where:
  - LETTER denotes a sequence variant of the mature miRNA (paralogous variant with similar but not identical sequences).
  - # indicates the paralog number (replica/locus index), included when multiple loci express the same or similar miRNAs.
For mature miRNAs:
- The replica number is added or replaced as an infix/suffix in the name.
- Formats:
  - SPECIES-miR-NUMBER[LETTER]-#-ARM
  - SPECIES-miR-NUMBER[LETTER]-#
  - SPECIES-miR-NUMBER[LETTER]-ARM
Cases:
- If a precursor has multiple genomic instances (paralogs), the first occurrence typically lacks a numeric suffix; subsequent ones are numbered incrementally.
- The Derives_from attribute of each mature miRNA is updated to match the precursor's ID.
- If a precursor has a different sequence variant designation (LETTER) than its associated matures, the mature miRNA names are updated to match the precursor's designation.
The Alias attribute remains unchanged.
Arguments:
precursor: "miRNA primary transcript" feature entry.
matures: List of the corresponding "mature miRNA" feature entries.
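The mature miRNA name formats listed above can be handled with a single regular expression. The following is a minimal sketch, assuming names follow exactly the SPECIES-miR-NUMBER[LETTER][-#][-ARM] layout with arms written as 5p or 3p. The pattern and the helper `set_replica` are illustrative and are not taken from the module's code. Names outside this layout (for example, let-7 family members) are rejected.

```python
import re

# Assumed pattern for mature miRNA names of the form
# SPECIES-miR-NUMBER[LETTER][-#][-ARM]; not taken from the module's code.
MATURE_NAME = re.compile(
    r"^(?P<species>[a-z]+)-miR-(?P<number>\d+)(?P<letter>[a-z]*)"
    r"(?:-(?P<replica>\d+))?(?:-(?P<arm>[35]p))?$"
)


def set_replica(name, replica):
    """Return `name` with its replica number added or replaced."""
    match = MATURE_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected mature miRNA name: {name}")
    # Rebuild the name: base, then the new replica number, then the arm
    # (if the original name had one).
    parts = [
        f"{match['species']}-miR-{match['number']}{match['letter']}",
        str(replica),
    ]
    if match["arm"]:
        parts.append(match["arm"])
    return "-".join(parts)


print(set_replica("hsa-miR-16-5p", 2))     # hsa-miR-16-2-5p
print(set_replica("hsa-miR-92a-1-3p", 2))  # hsa-miR-92a-2-3p
print(set_replica("hsa-miR-92a", 2))       # hsa-miR-92a-2
```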
function process_precursor
process_precursor(precursor: Feature, n: int = 6) → list[Feature]
Extend miRNA start and end coordinates and ensure name uniqueness.
This method extends the start and end coordinates of mature miRNAs by n
nucleotides. If the extension pushes a start or end coordinate beyond the
boundaries of the corresponding primary miRNA, the precursor is extended
as far as the miRNA coordinates. If chromosome sizes are provided, the
extension is limited so that coordinates do not exceed them.
In addition, the method adjust_names is called to ensure uniqueness in the
Name attribute for both the precursor and its arms.
Arguments:
precursor: "miRNA primary transcript" feature entry.
n: Number of nucleotides by which to extend the miRNA start and end coordinates.
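The coordinate extension itself comes down to a clamped interval update. The sketch below is an illustration under stated assumptions, not the method's actual code. The function `extend_coordinates` is hypothetical, coordinates are taken to be 1-based and inclusive (GFF3 convention), and the upper bound is applied only when a sequence length is supplied.

```python
def extend_coordinates(start, end, n=6, seq_length=None):
    """Extend a 1-based, inclusive interval by n nucleotides on each side."""
    # Never move the start before the first position of the sequence.
    new_start = max(1, start - n)
    new_end = end + n
    # Clamp the end only if the reference sequence length is known.
    if seq_length is not None:
        new_end = min(seq_length, new_end)
    return new_start, new_end


print(extend_coordinates(4, 25))                      # (1, 31)
print(extend_coordinates(980, 998, seq_length=1000))  # (974, 1000)
```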
function set_db
set_db(path: Path) → None
Load GFF3 file into gffutils.FeatureDB.
Arguments:
path: Path to a GFF3 file.
function set_seq_lengths
set_seq_lengths(path: Optional[Path] = None) → None
Set the reference sequence lengths.
Arguments:
path: Path to a tabulated file containing the names of reference sequences and the corresponding total length, in base pairs.
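As an illustration of the expected input, the sketch below reads such a table into a dictionary. It assumes a tab-separated file with the sequence name in the first column, the length in the second, and no header line. The function `read_seq_lengths`, the file name, and the values are made up for the example, and the method itself may parse the file differently.

```python
import csv
import tempfile
from pathlib import Path


def read_seq_lengths(path):
    """Read a two-column, tab-separated file into a {name: length} dict."""
    lengths = {}
    with Path(path).open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            lengths[row[0]] = int(row[1])
    return lengths


# Write a small made-up table to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "chr_sizes.tsv"
    table.write_text("chr1\t1000\nchr2\t500\n")
    example = read_seq_lengths(table)

print(example)  # {'chr1': 1000, 'chr2': 500}
```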
function update_db
update_db(n: int = 6) → None
Update miRNA annotations in the local database.
Using the method process_precursor, annotated coordinates are extended by
n nucleotides and, if needed, the Name attribute is modified to contain the
sequence replica number.
Arguments:
n: Number of nucleotides by which to extend the miRNA start and end coordinates.
function write_gff
write_gff(path: Path, feature_type: Optional[str] = None) → None
Write features to a GFF3 file.
Arguments:
path: Path to the output file.
feature_type: Feature type to write. If None, all features will be written.
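Taken together, the methods documented above suggest the following workflow. This is a sketch only. The wrapper `extend_annotations`, the import path, and the order of the calls are assumptions, while the method names, parameters, and feature types are those listed in this reference.

```python
from pathlib import Path


def extend_annotations(gff_in, seq_lengths, precursors_out, matures_out, n=6):
    """Run the documented steps in order (hypothetical wrapper)."""
    # Assumed import path; adjust it to wherever the module lives.
    from mirna_extension import MirnaExtension

    extension = MirnaExtension()
    extension.set_db(path=Path(gff_in))
    extension.set_seq_lengths(path=Path(seq_lengths))
    extension.update_db(n=n)

    # Precursors and mature miRNAs go to separate GFF3 files.
    extension.write_gff(
        path=Path(precursors_out), feature_type="miRNA_primary_transcript"
    )
    extension.write_gff(path=Path(matures_out), feature_type="miRNA")
```

The import is placed inside the function so that the sketch can be loaded without the module being installed; it is resolved only when the wrapper is called.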