
script sam_remove_duplicates_inferior_alignments_multimappers.pl

From a sorted SAM file, the script first removes duplicate records (defined by identical entries for the fields QNAME, FLAG, RNAME, POS and CIGAR), then all QNAME duplicates except for the one(s) with the shortest edit distance. Finally, unless --keep-mm is set, all alignments of queries with the same edit distance but different coordinates ("multimappers") are discarded.

CAUTION

Only marginal validation of the input file type/format is performed!

Comments

  • The script assumes the SAM file is sorted by QNAME.
  • The script requires NM tags (i.e., edit distances) to be present in all records of the input SAM file.
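Because every record must carry an NM tag, a pre-check of the input can save a failed run. The following is a minimal Python sketch of such a check, not code taken from the Perl script; the function name `edit_distance` and the sample record are made up for illustration.

```python
def edit_distance(sam_line: str) -> int:
    """Return the NM (edit distance) value of one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    # Hypothetical error handling: the script's own behavior may differ.
    raise ValueError(f"record {fields[0]!r} has no NM tag")


record = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\tNH:i:1"
print(edit_distance(record))  # prints 1
```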

Usage

perl sam_remove_duplicates_inferior_alignments_multimappers.pl [OPTIONS] --in [FILE|SAM] --out [FILE|SAM]

Arguments

  • --in [FILE|SAM] (required): Path to the input SAM file sorted by QNAME.
  • --out [FILE|SAM] (required): Path to the output SAM file.

Options

  • --new-header [FILE]: Path to the file to be used as header.
  • --print-header: Print header. Keep input file header if --new-header is not specified.
  • --keep-mm [INT]: Keep queries with up to INT different alignments. Set INT to "0" to keep all alignments for each query. By default, all alignments of "multimappers" are removed.
  • --mm [TAB]: Print the QNAMEs and mapping counts of all "multimappers" to TAB (format: QNAME /TAB/ number of mappings; one entry per line).
  • --heavy-mm [TAB]: Like --mm, but only prints alignments for reads that map more than --keep-mm times.
  • -h | --help: Show this information and die.
  • -u | --usage: Show this information and die.
  • --quiet: Shut up!
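The interplay of the default behavior and --keep-mm can be summarized as a small decision rule. The Python sketch below encodes one reading of the option descriptions above (option absent: drop all multimappers; 0: keep everything; INT: keep queries with at most INT alignments). The names `keep_query` and `mm_line` are hypothetical and do not appear in the script.

```python
from typing import Optional


def keep_query(n_alignments: int, keep_mm: Optional[int] = None) -> bool:
    """Decide whether the best alignments of one query are written out."""
    if keep_mm is None:
        # --keep-mm not given: only uniquely mapping queries survive.
        return n_alignments == 1
    if keep_mm == 0:
        # --keep-mm 0: keep all alignments of every query.
        return True
    # --keep-mm INT: keep queries with up to INT different alignments.
    return n_alignments <= keep_mm


def mm_line(qname: str, n_alignments: int) -> str:
    """Format one --mm entry: QNAME, a tab, then the number of mappings."""
    return f"{qname}\t{n_alignments}\n"


print(keep_query(3))             # prints False
print(keep_query(3, keep_mm=5))  # prints True
```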

Requirements

  • Perl version: >= 5.40.2
  • Modules:
    • Getopt::Long: >= 2.58

subroutine usage

Returns usage information for the current script.

Accepts

N/A

Returns

String with usage information

Type

Specialized


subroutine filter_sam

From a sorted SAM file, first removes duplicate records (i.e. same entry name and coordinates, specifically the fields: QNAME, FLAG, RNAME, POS & CIGAR), then all QNAME duplicates except for the one(s) with the shortest edit distance, then (optionally) all alignments of "multimappers" (same QNAME, same edit distance, but different coordinates).

Accepts

  1. Input file [FILE|SAM]
  2. Output file [FILE|SAM]
  3. Multimapper switch: 0 = remove multimappers, 1 = keep multimappers
  4. Output file for multimapper IDs/QNAMEs
  5. Header switch: FALSE = do not print header, TRUE = print header
  6. Header file (prepends SAM records in output)

Returns

N/A

Type

Specialized
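The requirement that the input be sorted by QNAME is what allows all alignments of one query to be processed together without loading the whole file. A minimal Python sketch of that grouping step follows; it illustrates the idea only and is not a port of filter_sam. The name `records_by_qname` and the sample lines are assumptions.

```python
from itertools import groupby


def records_by_qname(lines):
    """Yield (QNAME, list of field lists) for each run of records sharing a QNAME."""
    # Header lines start with "@"; a QNAME may not contain that character.
    alignments = (
        line.rstrip("\n").split("\t") for line in lines if not line.startswith("@")
    )
    # groupby only merges consecutive items, hence the sorted-input requirement.
    for qname, group in groupby(alignments, key=lambda fields: fields[0]):
        yield qname, list(group)


sam = [
    "@HD\tVN:1.0\tSO:queryname\n",
    "r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n",
    "r1\t0\tchr2\t50\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n",
    "r2\t16\tchr1\t99\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1\n",
]
print([(qname, len(recs)) for qname, recs in records_by_qname(sam)])
# prints [('r1', 2), ('r2', 1)]
```

Only one group is held in memory at a time, so the approach also works when `lines` is an open file handle.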


subroutine sam_AoH_filter_records_w_indent_QNAME

Compares records of a SAM file that share the same QNAME/read ID: True duplicates (i.e., QNAME and coordinates are equal) are discarded first. Then all records except those with the lowest edit distance are discarded. The remaining entry or entries are returned in an array of hashes (array length > 1 if the read is a "multimapper").

Accepts

Array of hashes of SAM records with identical QNAME field (as generated by the subroutine filter_sam, written by Alexander Kanitz, 29-AUG-2013)

Returns

Reference to array of hashes

Type

Generic
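The two filtering steps described for this subroutine can be sketched in Python as follows, with a list of dicts standing in for the array of hashes. This is an illustration under assumptions, not the subroutine itself: the key names mirror the SAM field names, NM is assumed to be stored as an integer, and `filter_records` is a hypothetical name.

```python
def filter_records(records):
    """Drop true duplicates, then keep only the records with the lowest NM."""
    seen = set()
    unique = []
    for rec in records:
        # Duplicates are defined by QNAME, FLAG, RNAME, POS and CIGAR.
        key = (rec["QNAME"], rec["FLAG"], rec["RNAME"], rec["POS"], rec["CIGAR"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    if not unique:
        return []
    best = min(rec["NM"] for rec in unique)
    # More than one survivor means the read is a "multimapper".
    return [rec for rec in unique if rec["NM"] == best]


records = [
    {"QNAME": "r1", "FLAG": 0, "RNAME": "chr1", "POS": 100, "CIGAR": "4M", "NM": 1},
    {"QNAME": "r1", "FLAG": 0, "RNAME": "chr1", "POS": 100, "CIGAR": "4M", "NM": 1},
    {"QNAME": "r1", "FLAG": 0, "RNAME": "chr2", "POS": 500, "CIGAR": "4M", "NM": 1},
    {"QNAME": "r1", "FLAG": 16, "RNAME": "chr3", "POS": 900, "CIGAR": "4M", "NM": 3},
]
kept = filter_records(records)
print([(rec["RNAME"], rec["POS"]) for rec in kept])
# prints [('chr1', 100), ('chr2', 500)]
```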


subroutine sam_AoH_join_records

Joins the fields of SAM records stored in an array of hashes in the proper order (additional tags in alphanumerical order); re-computes or adds the NG tag field.

Accepts

Array of hashes of SAM records with identical QNAME field (as generated by the subroutine filter_sam, written by Alexander Kanitz, 29-AUG-2013)

Returns

Array of strings, one element for each record, appended with a newline character for easy printing

Type

Generic
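A minimal Python sketch of the joining step is given below: the eleven mandatory SAM fields come first in their fixed order, followed by the optional tags sorted by tag name. It is an illustration only. The storage of tags as "TYPE:VALUE" strings and the name `join_records` are assumptions, and the NG tag handling is left out because its definition is not given here.

```python
MANDATORY = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)


def join_records(records):
    """Return one tab-separated, newline-terminated line per record."""
    lines = []
    for rec in records:
        fields = [str(rec[name]) for name in MANDATORY]
        # Every other key is treated as an optional tag (assumed layout).
        tags = sorted(key for key in rec if key not in MANDATORY)
        fields += [f"{tag}:{rec[tag]}" for tag in tags]
        lines.append("\t".join(fields) + "\n")
    return lines


rec = {
    "QNAME": "r1", "FLAG": 0, "RNAME": "chr1", "POS": 100, "MAPQ": 255,
    "CIGAR": "4M", "RNEXT": "*", "PNEXT": 0, "TLEN": 0,
    "SEQ": "ACGT", "QUAL": "IIII", "NM": "i:1", "MD": "Z:2A1",
}
print(join_records([rec])[0], end="")
# prints the eleven mandatory fields, then MD:Z:2A1 and NM:i:1
```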