script sam_trx_to_sam_gen.pl

Re-maps a SAM file resulting from aligning a library of sequencing reads against a transcriptome to genomic coordinates.

By default, all reads that do not span an exon-exon boundary with at least the specified minimum overlap are discarded.

CAUTION

Only marginal validation of the input file type/format is performed!

Comments

  • The script was written for SAM files produced by a recent (>= 0.1.4) version of segemehl. SAM files of other origin may or may not work.
  • Only the 0x10 bit (which informs whether a read sequence was reverse complemented for the alignment) of the FLAG field is considered; all other bits are ignored.

Usage

perl sam_trx_to_sam_gen.pl [OPTIONS] --in [FILE|SAM] --exons [FILE|BED] --out [FILE|SAM]

Arguments

  • --in [FILE|SAM] (required): Path to the input SAM file (transcript coordinates).
  • --exons [FILE|BED] (required): Path to the input BED file of exons (genomic coordinates).
  • --out [FILE|SAM] (required): Path to the output SAM file (genomic coordinates).

Options

  • --min-overlap [INT]: Minimum required overlap between read and any of the exons.
  • --tag [STRING]: Tag of the form "ZZ:Z:STRING" that is appended to the end of each line in the output file.
  • --no-strand-info: The library preparation protocol used does not preserve strand information (if unset, all reads mapping to the opposite strand of annotated transcripts, i.e., those with the appropriate SAM flag set, are discarded).
  • --include-monoexonic: Do not discard alignments against single exons (default: skip).
  • --head: Print SAM header.
  • --print-report: Print statistics on the number of processed, printed and discarded alignments to STDOUT.
  • -h | --help: Show this information and die.
  • -u | --usage: Show this information and die.
  • --quiet: Shut up!

Requirements

  • Perl version: >= 5.40.2
  • Modules:
    • Getopt::Long: >= 2.58

subroutine usage

Returns usage information for current script

Accepts

N/A

Returns

String with usage information

Type

Specialized


subroutine exons_bed_to_hoaoa

Reads a BED file of exons and loads them into a hash (key: transcript ID) of arrays (chromosome, strand, exons) of arrays (exon genomic start, exon genomic stop).

Accepts

BED6 file of exons, sorted by name, start and end (coordinates relative to genome; 0-based, half-open, i.e., end-exclusive)

Returns

Reference to hash of arrays of arrays

Type

Generic
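
The data structure described above can be illustrated with a short Python sketch (not the script's Perl code). It assumes that BED column 4 holds the transcript ID and that the input is already sorted. The function name `exons_bed_to_dict` and the dictionary layout with the keys `chrom`, `strand` and `exons` are hypothetical choices made for readability.

```python
def exons_bed_to_dict(lines):
    """Group BED6 exon lines by transcript ID (assumed to be in column 4)."""
    transcripts = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip empty lines and common BED header/comment lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, _score, strand = line.split("\t")[:6]
        record = transcripts.setdefault(
            name, {"chrom": chrom, "strand": strand, "exons": []}
        )
        # Coordinates stay 0-based and end-exclusive, as in the BED file
        record["exons"].append([int(start), int(end)])
    return transcripts


bed = [
    "chr1\t100\t200\ttx1\t0\t+",
    "chr1\t300\t450\ttx1\t0\t+",
    "chr2\t50\t80\ttx2\t0\t-",
]
transcripts = exons_bed_to_dict(bed)
# transcripts["tx1"]["exons"] is [[100, 200], [300, 450]]
```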


subroutine exons_hoaoa_reorder_exons_on_minus_strand

Reverses the order of exons for transcripts annotated on the Crick/minus strand in the exons hoaoa generated by subroutine exons_bed_to_hoaoa.

Accepts

Reference to hash of arrays of arrays

Returns

Reference to hash of arrays of arrays

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized
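
A minimal Python sketch of this step (not the script's Perl code), assuming a hypothetical dictionary layout with the keys `strand` and `exons`. Reversing presumably puts the exons of minus-strand transcripts into transcript (5' to 3') order, since a BED file sorted by start lists them in genomic order.

```python
def reorder_exons_on_minus_strand(transcripts):
    """Reverse the exon list in place for transcripts on the minus strand."""
    for record in transcripts.values():
        if record["strand"] == "-":
            record["exons"].reverse()
    return transcripts


transcripts = {
    "tx_plus": {"strand": "+", "exons": [[100, 200], [300, 450]]},
    "tx_minus": {"strand": "-", "exons": [[100, 200], [300, 450]]},
}
reorder_exons_on_minus_strand(transcripts)
# tx_plus is unchanged; tx_minus now lists [300, 450] first
```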


subroutine exons_hoaoa_add_cumulative_length

Adds the exon length and the cumulative exon length to the exons hoaoa generated by subroutine exons_bed_to_hoaoa (added as third and fourth elements to the inner arrays).

Accepts

Reference to hash of arrays of arrays

Returns

Reference to hash of arrays of arrays

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized
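
The bookkeeping can be sketched in Python as follows (not the script's Perl code). The dictionary layout and the function name are hypothetical; exon coordinates are assumed to be 0-based and end-exclusive, so the length is simply end minus start.

```python
def add_cumulative_length(transcripts):
    """Append [length, cumulative length] to each [start, end] exon entry."""
    for record in transcripts.values():
        total = 0
        for exon in record["exons"]:
            length = exon[1] - exon[0]  # end-exclusive coordinates
            total += length
            exon[2:] = [length, total]  # third and fourth elements
    return transcripts


transcripts = {"tx1": {"strand": "+", "exons": [[100, 200], [300, 450]]}}
add_cumulative_length(transcripts)
# exons are now [[100, 200, 100, 100], [300, 450, 150, 250]]
```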


subroutine exons_hoaoa_add_intron_length

Adds the distance between each exon and the previous exon to the exons hoaoa generated by subroutine exons_bed_to_hoaoa (added as fifth element to the inner arrays).

Accepts

Reference to hash of arrays of arrays

Returns

Reference to hash of arrays of arrays

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized
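
A hedged Python sketch of the intron-length step (not the script's Perl code). It assumes each exon entry already holds four elements (start, end, length, cumulative length), that the first exon of a transcript gets an intron length of 0, and that exons of minus-strand transcripts may be listed in reverse genomic order. The function name and dictionary layout are hypothetical.

```python
def add_intron_length(transcripts):
    """Append the gap to the previously listed exon as a fifth element."""
    for record in transcripts.values():
        previous = None
        for exon in record["exons"]:
            if previous is None:
                gap = 0  # assumption: no intron precedes the first exon
            else:
                # Works for ascending and descending genomic order alike
                gap = max(exon[0], previous[0]) - min(exon[1], previous[1])
            exon.append(gap)
            previous = exon
    return transcripts


transcripts = {
    "tx_plus": {"exons": [[100, 200, 100, 100], [300, 450, 150, 250]]},
    "tx_minus": {"exons": [[300, 450, 150, 150], [100, 200, 100, 250]]},
}
add_intron_length(transcripts)
# In both transcripts the second exon gets an intron length of 100
```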


subroutine trx_sam_to_gen_sam

Reads transcript alignments from a SAM file and maps them to genomic coordinates.

Accepts

  1. SAM input file
  2. Hash of arrays of arrays of exon genomic coordinates generated by subroutine exons_bed_to_hoaoa or derivatives
  3. Filename for SAM output file

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized


subroutine get_coords_and_frags

Gets the reference sequence (i.e., chromosome), strand, and starting position of the alignment in genomic coordinates and calculates the lengths of overlaps with intersected exons as well as the lengths of spanned introns.

Accepts

  1. Reference to hash of arrays of arrays generated by subroutine exons_bed_to_hoaoa or derivatives
  2. RNAME (i.e., transcript ID) of the transcript SAM entry
  3. POS (i.e., starting position) of the transcript SAM entry
  4. End position of the transcript alignment
  5. Allowed minimum overlap

Returns

Reference to array of

  • chromosome,
  • strand,
  • starting position of the alignment in genomic coordinates,
  • overlap with the first intersected exon (integer); for alignments spanning several exons, followed by alternating values:
    • length of the spanned intron,
    • overlap with the next exon,
    • and so on, ending with the overlap with the last intersected exon

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized
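
The coordinate conversion can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the script's Perl implementation: exons are kept in genomic order (0-based, end-exclusive), transcript positions are 1-based and inclusive, the returned position is the 1-based leftmost genomic position, and an alignment is rejected (None) when its overlap with any intersected exon is below the minimum. The dictionary layout is hypothetical.

```python
def get_coords_and_frags(record, pos, end, min_overlap=1):
    """Map a transcript interval to [chrom, strand, genomic pos, frags...].

    frags alternate: exon overlap, intron length, exon overlap, ...
    """
    exons = record["exons"]
    total = sum(exon[1] - exon[0] for exon in exons)
    lo, hi = pos - 1, end  # 0-based, end-exclusive on the transcript
    if record["strand"] == "-":
        # Measure offsets from the genomic left end of the transcript
        lo, hi = total - hi, total - lo
    if not 0 <= lo < hi <= total:
        return None
    frags, genomic_pos, offset, prev_end = [], None, 0, None
    for exon in exons:
        start, stop = exon[0], exon[1]
        length = stop - start
        ov_lo, ov_hi = max(lo, offset), min(hi, offset + length)
        if ov_lo < ov_hi:  # the alignment intersects this exon
            if genomic_pos is None:
                genomic_pos = start + (ov_lo - offset) + 1  # 1-based POS
            else:
                frags.append(start - prev_end)  # length of spanned intron
            if ov_hi - ov_lo < min_overlap:
                return None
            frags.append(ov_hi - ov_lo)
            prev_end = stop
        offset += length
    return [record["chrom"], record["strand"], genomic_pos] + frags


plus = {"chrom": "chr1", "strand": "+", "exons": [[100, 200], [300, 450]]}
minus = {"chrom": "chr1", "strand": "-", "exons": [[100, 200], [300, 450]]}
result_plus = get_coords_and_frags(plus, 91, 120)
result_minus = get_coords_and_frags(minus, 131, 160)
# Both give ["chr1", strand, 191, 10, 100, 20]
```

The alternating list in the result has the same shape as the array of overlaps and intron lengths that add_introns_to_CIGAR expects.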


subroutine complement

Returns the complement of the input sequence.

Accepts

String (all characters but A, C, G, T and their lower case versions are ignored)

Returns

Complement of input sequence

Type

Generic
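
For illustration, the same behaviour in Python (a sketch, not the script's Perl code): only A, C, G, T and their lower-case versions are translated, all other characters pass through unchanged. The helper name mirrors the subroutine but is otherwise hypothetical.

```python
# Translation table covering A, C, G, T in both cases only
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the complement of seq without reversing it."""
    return seq.translate(_COMPLEMENT)


result = complement("ACGTNacgt")
# result is "TGCANtgca"
```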


subroutine reverse_CIGAR

Reverses the order of the operations in a CIGAR string.

Accepts

CIGAR string

Returns

CIGAR string with its operations in reverse order

Type

Generic
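
As its name suggests, reverse_CIGAR reverses a CIGAR string, which is needed when a transcript on the minus strand is projected onto the genomic plus strand. A minimal Python sketch of that idea (not the script's Perl code), assuming the standard SAM operation letters and passing the placeholder `*` through unchanged:

```python
import re


def reverse_cigar(cigar):
    """Reverse the order of the length/operation pairs of a CIGAR string."""
    if cigar == "*":  # SAM placeholder for "no CIGAR available"
        return cigar
    ops = re.findall(r"\d+[MIDNSHP=X]", cigar)
    if "".join(ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return "".join(reversed(ops))


result = reverse_cigar("5S20M1I30M")
# result is "30M1I20M5S"
```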


subroutine reverse_complement_MD

Returns the reverse complement of an MD tag string (the mismatch string of the SAM optional field MD).

Accepts

MD string

Returns

Reverse complement of the input MD string

Type

Generic
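
As its name suggests, reverse_complement_MD reverse complements an MD string. A hedged Python sketch of one way to do this (not the script's Perl code): the string is split into match counts, mismatched bases and deletions (`^` followed by bases), the token order is reversed, and the bases are complemented. It assumes the input is the bare MD value without the `MD:Z:` prefix.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_md(md):
    """Reverse the tokens of an MD string and complement its bases."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    if "".join(tokens) != md:
        raise ValueError(f"malformed MD string: {md!r}")
    out = []
    for token in reversed(tokens):
        if token.isdigit():
            out.append(token)  # match count stays as it is
        elif token.startswith("^"):
            # Deletion: reverse complement the deleted reference bases
            out.append("^" + token[1:].translate(_COMPLEMENT)[::-1])
        else:
            out.append(token.translate(_COMPLEMENT))  # single mismatch
    return "".join(out)


result = reverse_complement_md("10A5^AC6")
# result is "6^GT5T10"
```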


subroutine add_introns_to_CIGAR

For split/spliced alignments, includes introns (Ns) in a CIGAR string.

Accepts

  1. CIGAR string
  2. Array of integers containing the lengths of the exon overlaps and, interspersed, the length(s) of the spanned intron(s), e.g.: 25 (overlap with exon 3), 10000 (length of intron between exons 3 and 4), 50 (overlap with exon 4), 5000 (length of intron between exons 4 and 5), 25 (overlap with exon 5)

Returns

Updated CIGAR string

Type

Generic
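
The splitting can be sketched in Python as follows (not the script's Perl code). The sketch assumes that exon overlaps are measured along the reference, so only reference-consuming operations (M, D, N, =, X) count towards an overlap, while insertions and clipping are copied through. An insertion that falls exactly on an exon boundary is kept before the intron, which is an arbitrary choice.

```python
import re

REF_CONSUMING = set("MDN=X")


def add_introns_to_cigar(cigar, frags):
    """Insert N operations at exon boundaries given overlap/intron lengths."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    overlaps, introns = frags[0::2], frags[1::2]
    if sum(n for n, op in ops if op in REF_CONSUMING) != sum(overlaps):
        raise ValueError("CIGAR reference length does not match overlaps")
    out = []
    exon = 0  # index of the exon overlap currently being filled
    remaining = overlaps[0]
    for length, op in ops:
        if op not in REF_CONSUMING:
            out.append((length, op))
            continue
        while length > 0:
            if remaining == 0:
                # Overlap used up: emit the intron and move to the next exon
                out.append((introns[exon], "N"))
                exon += 1
                remaining = overlaps[exon]
            step = min(length, remaining)
            out.append((step, op))
            length -= step
            remaining -= step
    return "".join(f"{n}{op}" for n, op in out)


result = add_introns_to_cigar("100M", [25, 10000, 50, 5000, 25])
# result is "25M10000N50M5000N25M"
```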


subroutine rc_bit

Extract the 0x10 (sequence reverse complemented) bit of the SAM FLAG field.

Accepts

SAM FLAG field value

Returns

Value of 0x10 bit

Type

Generic
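
In Python, the same bit test could look like this (a sketch, not the script's Perl code); 0x10 equals 16, i.e. bit 4 of the FLAG value.

```python
def rc_bit(flag):
    """Return 1 if the 0x10 (reverse complemented) bit is set, else 0."""
    return (int(flag) >> 4) & 1


forward = rc_bit(0)
reverse = rc_bit(16)
# forward is 0, reverse is 1
```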


subroutine print_trx

Debugging function that prints transcript information.


subroutine exons_hoaoa_remove_single_exons

Removes transcripts with only a single exon entry from the exons hoaoa generated by subroutine exons_bed_to_hoaoa.

Accepts

Reference to hash of arrays of arrays

Returns

Reference to hash of arrays of arrays

Dependencies

Subroutine exons_bed_to_hoaoa (Alexander Kanitz)

Type

Specialized
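
A minimal Python sketch of this filter (not the script's Perl code), assuming a hypothetical dictionary layout in which each transcript record keeps its exons under the key `exons`:

```python
def remove_single_exon_transcripts(transcripts):
    """Delete transcripts with fewer than two exons, in place."""
    monoexonic = [
        tid for tid, record in transcripts.items() if len(record["exons"]) < 2
    ]
    for tid in monoexonic:
        del transcripts[tid]
    return transcripts


transcripts = {
    "tx1": {"exons": [[100, 200], [300, 450]]},
    "tx2": {"exons": [[50, 80]]},
}
remove_single_exon_transcripts(transcripts)
# Only "tx1" remains
```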