script sam_trx_to_sam_gen.pl¶
Re-maps a SAM file resulting from aligning a library of sequencing reads against a transcriptome to genomic coordinates.
All reads that do not cross an exon-exon boundary by a specified minimum overlap are discarded by default.
CAUTION
Only marginal validation of the input file type/format is performed!
Comments
- The script was written for SAM files produced by a recent (>= 0.1.4) version of segemehl. SAM files of other origin may or may not work.
- Only the 0x10 bit (which indicates whether a read sequence was reverse complemented for the alignment) of the FLAG field is considered; all other bits are ignored.
Usage¶
perl sam_trx_to_sam_gen.pl [OPTIONS] --in [FILE|SAM] --exons [FILE|BED] --out [FILE|SAM]
Arguments¶
- --in [FILE|SAM] (required): Path to the input SAM file (transcript coordinates).
- --exons [FILE|BED] (required): Path to the input BED file of exons (genomic coordinates).
- --out [FILE|SAM] (required): Path to the output SAM file (genomic coordinates).
Options¶
- --min-overlap [INT]: Minimum required overlap between read and any of the exons.
- --tag [STRING]: Tag of the form "ZZ:Z:STRING" that is appended to the end of each line in the output file.
- --no-strand-info: The library preparation protocol used does not preserve strand information. If unset, all reads mapping to the opposite strand of annotated transcripts (i.e., reads for which the appropriate SAM flag is set) are discarded.
- --include-monoexonic: Do not discard alignments against single exons (default: skip).
- --head: Print SAM header.
- --print-report: Print statistics on the number of processed, printed and discarded alignments to STDOUT.
- -h|--help: Show this information and die.
- -u|--usage: Show this information and die.
- --quiet: Shut up!
Requirements¶
- Perl version: >= 5.40.2
- Modules:
  - Getopt::Long: >= 2.58
subroutine usage¶
Returns usage information for current script
Accepts
N/A
Returns
String with usage information
Type
Specialized
subroutine exons_bed_to_hoaoa¶
Reads a BED file of exons and loads them into a hash (key: transcript ID) of arrays (chromosome, strand, exons) of arrays (exon genomic start, exon genomic stop).
Accepts
Sorted (name, start, end) BED6 file of exons (coordinates relative to genome; 0-based, open-ended)
Returns
Reference to hash of arrays of arrays
Type
Generic
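The data structure described above can be illustrated with a minimal sketch. It is not the script's own code: the helper name `bed_lines_to_hoaoa` is hypothetical, it takes BED6 lines from memory rather than from a file, and it assumes the layout `[ chromosome, strand, [ start, end ], ... ]` for each transcript ID.

```perl
use strict;
use warnings;

# Hypothetical helper (not the script's own code): builds the structure from
# BED6 lines already held in memory instead of reading a file.
# Assumed layout: $hoaoa{$id} = [ $chr, $strand, [ $start, $end ], ... ]
sub bed_lines_to_hoaoa {
    my @lines = @_;
    my %hoaoa;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#.*)?$/;    # skip blank and comment lines
        my ( $chr, $start, $end, $id, undef, $strand ) = split /\t/, $line;
        # First exon of a transcript: store chromosome and strand up front
        $hoaoa{$id} //= [ $chr, $strand ];
        push @{ $hoaoa{$id} }, [ $start, $end ];
    }
    return \%hoaoa;
}

my $exons = bed_lines_to_hoaoa(
    "chr1\t100\t200\ttrx1\t0\t+",
    "chr1\t300\t400\ttrx1\t0\t+",
    "chr2\t50\t80\ttrx2\t0\t-",
);
for my $id ( sort keys %{$exons} ) {
    my ( $chr, $strand, @ex ) = @{ $exons->{$id} };
    print join( "\t", $id, $chr, $strand,
        join( ',', map { join( '-', @{$_}[ 0, 1 ] ) } @ex ) ), "\n";
}
```

Keeping chromosome and strand in the first two slots means every later step has to skip index 0 and 1 when iterating over exons.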
subroutine exons_hoaoa_reorder_exons_on_minus_strand¶
Reverses the order of exons for transcripts annotated on the Crick/minus strand
in the exons hoaoa generated by subroutine exons_bed_to_hoaoa.
Accepts
Reference to hash of arrays of arrays
Returns
Reference to hash of arrays of arrays
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
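A minimal sketch of the reordering step, assuming each transcript entry has the layout `[ chromosome, strand, [ start, end ], ... ]`. The helper name `reorder_minus_strand_exons` is hypothetical and the code is an illustration rather than the script's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: reverses the exon order of minus-strand transcripts
# in place. Assumed layout: [ $chr, $strand, [ $start, $end ], ... ]
sub reorder_minus_strand_exons {
    my ($hoaoa) = @_;
    for my $trx ( values %{$hoaoa} ) {
        my ( $chr, $strand, @exons ) = @{$trx};
        next unless $strand eq '-';
        @{$trx} = ( $chr, $strand, reverse @exons );
    }
    return $hoaoa;
}

my $reordered = reorder_minus_strand_exons(
    {
        trx1 => [ 'chr1', '+', [ 100, 200 ], [ 300, 400 ] ],
        trx2 => [ 'chr2', '-', [ 50,  80 ],  [ 500, 600 ] ],
    }
);
print "trx2 first exon starts at $reordered->{trx2}[2][0]\n";
```

After reordering, the first exon in each list is the 5' exon of the transcript regardless of strand.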
subroutine exons_hoaoa_add_cumulative_length¶
Adds exon length and cumulative exon length to the exons hoaoa generated by subroutine
exons_bed_to_hoaoa (added as third and fourth elements to inner arrays).
Accepts
Reference to hash of arrays of arrays
Returns
Reference to hash of arrays of arrays
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
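The following sketch shows how the two extra elements could be computed. It assumes the `[ chromosome, strand, [ start, end ], ... ]` layout and BED-style 0-based, half-open coordinates, so that an exon's length is `end - start`. The helper name `add_cumulative_length` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: appends exon length and cumulative length to each
# inner array, turning [ start, end ] into [ start, end, length, cumulative ].
sub add_cumulative_length {
    my ($hoaoa) = @_;
    for my $trx ( values %{$hoaoa} ) {
        my $cumulative = 0;
        # Elements 0 and 1 are chromosome and strand; exons start at index 2
        for my $exon ( @{$trx}[ 2 .. $#{$trx} ] ) {
            my $length = $exon->[1] - $exon->[0];    # 0-based, half-open
            $cumulative += $length;
            push @{$exon}, $length, $cumulative;
        }
    }
    return $hoaoa;
}

my $with_lengths = add_cumulative_length(
    { trx1 => [ 'chr1', '+', [ 100, 200 ], [ 300, 450 ] ] } );
print join( ',', @{$_} ), "\n" for @{ $with_lengths->{trx1} }[ 2, 3 ];
```

The cumulative length of an exon is the transcript coordinate of its last base, which is what makes a lookup from transcript to genomic position possible.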
subroutine exons_hoaoa_add_intron_length¶
Adds the distance between each exon and the previous exon to the exons hoaoa generated by
subroutine exons_bed_to_hoaoa (added as fifth element to inner arrays).
Accepts
Reference to hash of arrays of arrays
Returns
Reference to hash of arrays of arrays
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
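As a sketch of the intron length step, the code below stores the gap to the previous exon as the fifth element of each inner array. Several points are assumptions rather than documented behavior: the helper name `add_intron_length`, the value 0 for the first exon, and the handling of exons listed in descending order for minus-strand transcripts.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes each exon is [ start, end, length, cumulative ]
# and that exons of minus-strand transcripts may be listed in reverse order.
sub add_intron_length {
    my ($hoaoa) = @_;
    for my $trx ( values %{$hoaoa} ) {
        my $prev;
        for my $exon ( @{$trx}[ 2 .. $#{$trx} ] ) {
            my $gap = 0;    # assumption: the first exon gets 0
            if ( defined $prev ) {
                $gap =
                    $exon->[0] >= $prev->[1]
                  ? $exon->[0] - $prev->[1]     # exons in ascending order
                  : $prev->[0] - $exon->[1];    # exons in descending order
            }
            $exon->[4] = $gap;
            $prev = $exon;
        }
    }
    return $hoaoa;
}

my $with_introns = add_intron_length(
    {
        trx1 => [ 'chr1', '+', [ 100, 200, 100, 100 ], [ 300, 400, 100, 200 ] ],
        trx2 => [ 'chr2', '-', [ 500, 600, 100, 100 ], [ 50,  80,  30,  130 ] ],
    }
);
print "$_: intron of $with_introns->{$_}[3][4] nt\n" for sort keys %{$with_introns};
```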
subroutine trx_sam_to_gen_sam¶
Reads transcript alignments from a SAM file and maps them to genomic coordinates.
Accepts
- SAM input file
- Hash of arrays of arrays of exon genomic coordinates generated by subroutine exons_bed_to_hoaoa or derivatives
- Filename for SAM output file
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
subroutine get_coords_and_frags¶
Gets the reference sequence (i.e., chromosome), strand, and starting position of the alignment in genomic coordinates and calculates the lengths of overlaps with intersected exons as well as the lengths of spanned introns.
Accepts
- Reference to hash of arrays of arrays generated by subroutine exons_bed_to_hoaoa or derivatives
- RNAME (i.e., transcript ID) of the transcript SAM entry
- POS (i.e., starting position) of the transcript SAM entry
- end position of the transcript alignment
- the allowed minimum overlap
Returns
Reference to array of
- chromosome,
- strand,
- starting position of the alignment in genomic coordinates,
- overlap with first exon (integer),
- for alignments spanning multiple fragments, alternating (A) the length of the spanned intron and (B) the overlap with the next fragment, ending with:
  - N-1: length of the last spanned intron,
  - N: overlap with the last fragment
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
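The coordinate conversion can be sketched as follows. This is a simplified illustration under stated assumptions, not the script's implementation: it handles plus-strand transcripts only, omits chromosome, strand and the minimum-overlap check, takes a plain list of `[ start, end ]` exons (0-based, half-open), and treats the transcript start and end positions as 1-based and inclusive. The helper name `trx_to_gen_frags` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch for plus-strand transcripts.
# $exons: array ref of [ start, end ] pairs (0-based, half-open, ascending)
# $pos, $end: 1-based, inclusive transcript coordinates of the alignment
# Returns [ 1-based genomic start, overlap, intron, overlap, ... ]
sub trx_to_gen_frags {
    my ( $exons, $pos, $end ) = @_;
    my ( $gen_start, $prev_end, @frags );
    my $offset = 0;    # transcript bases covered by preceding exons
    for my $exon ( @{$exons} ) {
        my ( $ex_start, $ex_end ) = @{$exon};
        my $length = $ex_end - $ex_start;
        # This exon covers transcript positions $offset + 1 .. $offset + $length
        my $from = $pos > $offset + 1       ? $pos : $offset + 1;
        my $to   = $end < $offset + $length ? $end : $offset + $length;
        if ( $from <= $to ) {
            if ( defined $gen_start ) {
                push @frags, $ex_start - $prev_end;    # spanned intron
            }
            else {
                $gen_start = $ex_start + ( $from - $offset - 1 ) + 1;
            }
            push @frags, $to - $from + 1;              # overlap with exon
            $prev_end = $ex_end;
        }
        $offset += $length;
    }
    return [ $gen_start, @frags ];
}

my $result = trx_to_gen_frags(
    [ [ 100, 200 ], [ 300, 400 ], [ 1000, 1100 ] ], 76, 125 );
print join( ',', @{$result} ), "\n";
```

In this example a 50 nt alignment starting at transcript position 76 covers the last 25 nt of the first exon and the first 25 nt of the second, with a 100 nt intron in between.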
subroutine complement¶
Returns the complement of the input sequence.
Accepts
String (all characters but A, C, G, T and their lower case versions are ignored)
Returns
Complement of input sequence
Type
Generic
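Complementing a sequence while leaving all other characters untouched can be done with Perl's transliteration operator. The sketch below uses the hypothetical name `complement_sketch` and is meant as an illustration of the described behavior, not as the script's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: complements A, C, G, T (both cases) and leaves every
# other character, such as N, unchanged.
sub complement_sketch {
    my ($seq) = @_;
    ( my $comp = $seq ) =~ tr/ACGTacgt/TGCAtgca/;
    return $comp;
}

print complement_sketch('ACGTNacgt'), "\n";
```

Note that this returns the complement only; a reverse complement additionally requires reversing the string.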
subroutine reverse_CIGAR¶
Returns the input CIGAR string with the order of its operations reversed.
Accepts
CIGAR string
Returns
Reversed CIGAR string
Type
Generic
subroutine reverse_complement_MD¶
Returns the reverse complement of the input MD tag string.
Accepts
MD tag string
Returns
Reverse complement of input MD tag string
Type
Generic
subroutine add_introns_to_CIGAR¶
For split/spliced alignments, includes introns (Ns) in a CIGAR string.
Accepts
- CIGAR string
- array of integers, containing the lengths of exon overlaps and, interspersed, the length(s) of the spanned intron(s), e.g.: 25 (length overlap exon 3), 10000 (length intron between exons 3 and 4), 50 (overlap exon 4), 5000 (length intron between exons 4 and 5), 25 (overlap exon 5)
Returns
Updated CIGAR string
Type
Generic
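One way to insert the introns is to walk through the CIGAR operations, count the reference bases each one consumes against the current exon overlap, and emit an `N` operation whenever an overlap is used up. The sketch below follows that idea. The helper name `add_introns_sketch` is hypothetical, and the treatment of `I`, `S`, `H` and `P` as operations that do not consume the reference follows the SAM specification rather than anything stated in this document.

```perl
use strict;
use warnings;

# Hypothetical helper. @frags alternates exon overlaps and intron lengths:
# ( overlap, intron, overlap, intron, overlap, ... )
sub add_introns_sketch {
    my ( $cigar, @frags ) = @_;
    my $remaining = shift(@frags) // 0;    # reference bases left in this exon
    my $out       = '';
    while ( $cigar =~ /(\d+)([MIDNSHP=X])/g ) {
        my ( $len, $op ) = ( $1, $2 );
        # Operations that do not consume the reference are copied unchanged
        if ( $op =~ /[ISHP]/ ) {
            $out .= $len . $op;
            next;
        }
        while ( $len > 0 ) {
            if ( $remaining == 0 ) {
                # No further exon fragment: keep the rest of the operation
                if ( @frags < 2 ) { $out .= $len . $op; last; }
                $out .= shift(@frags) . 'N';    # spanned intron
                $remaining = shift @frags;      # overlap with next exon
            }
            my $take = $len < $remaining ? $len : $remaining;
            $out .= $take . $op if $take > 0;
            $len       -= $take;
            $remaining -= $take;
        }
    }
    return $out;
}

print add_introns_sketch( '100M', 25, 10000, 50, 5000, 25 ), "\n";
print add_introns_sketch( '20M2I30M', 25, 100, 25 ), "\n";
```

The first call uses the example fragment list given above; the second shows that an insertion does not count towards the exon overlap.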
subroutine rc_bit¶
Extract the 0x10 (sequence reverse complemented) bit of the SAM FLAG field.
Accepts
SAM FLAG field value
Returns
Value of 0x10 bit
Type
Generic
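Testing the 0x10 bit is a single bitwise AND. The sketch below uses the hypothetical name `rc_bit_sketch` and normalizes the result to 1 or 0, which is an assumption; the documented subroutine may return the masked value instead.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the 0x10 (reverse complemented) bit of a
# SAM FLAG value is set, 0 otherwise.
sub rc_bit_sketch {
    my ($flag) = @_;
    return ( $flag & 0x10 ) ? 1 : 0;
}

printf "FLAG %3d: reverse complemented = %d\n", $_, rc_bit_sketch($_)
  for ( 0, 16, 83, 99 );
```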
subroutine print_trx¶
Debugging function that prints transcript information.
subroutine exons_hoaoa_remove_single_exons¶
Removes transcripts with only a single exon entry from the exons hoaoa generated by
subroutine exons_bed_to_hoaoa.
Accepts
Reference to hash of arrays of arrays
Returns
Reference to hash of arrays of arrays
Dependencies
Subroutine exons_bed_to_hoaoa (Alexander Kanitz)
Type
Specialized
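A minimal sketch of the filtering step, again assuming the layout `[ chromosome, strand, [ start, end ], ... ]`, so that a transcript's exon count is its array length minus two. The helper name `remove_single_exon_transcripts` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: deletes transcripts with fewer than two exons in place.
sub remove_single_exon_transcripts {
    my ($hoaoa) = @_;
    for my $id ( keys %{$hoaoa} ) {
        # Elements 0 and 1 are chromosome and strand; the rest are exons
        delete $hoaoa->{$id} if @{ $hoaoa->{$id} } - 2 < 2;
    }
    return $hoaoa;
}

my $multi_exonic = remove_single_exon_transcripts(
    {
        trx1 => [ 'chr1', '+', [ 100, 200 ], [ 300, 400 ] ],
        trx2 => [ 'chr2', '-', [ 50,  80 ] ],
    }
);
print join( ',', sort keys %{$multi_exonic} ), "\n";
```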