script sam_remove_duplicates_inferior_alignments_multimappers.pl
From a sorted SAM file, first removes duplicate records (defined by identical
entries for the fields QNAME, FLAG, RNAME, POS and CIGAR), then all
QNAME duplicates except for the one(s) with the shortest edit distance.
Finally, unless --keep-mm is set, all alignments of queries that retain several
best alignments with the same edit distance but different coordinates ("multimappers") are discarded.
CAUTION
Only marginal validation of the input file type/format is performed!
Comments
- The script assumes the SAM file is sorted by QNAME.
- The script requires NM tags (i.e., edit distances) to be present in all records of the input SAM file.
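Both assumptions can be checked before filtering. The following is a minimal Python sketch, not the script's Perl code, of how the NM tag of a record might be read and how QNAME grouping might be verified. The function names `edit_distance` and `qnames_are_grouped` are hypothetical, and the sketch only checks that equal QNAMEs are adjacent, which is an assumption about what the script needs from the sort order.

```python
def edit_distance(sam_line):
    """Return the integer value of the NM tag of one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM fields.
    for tag in fields[11:]:
        name, _type, value = tag.split(":", 2)
        if name == "NM":
            return int(value)
    raise ValueError(f"record for {fields[0]} has no NM tag")


def qnames_are_grouped(qnames):
    """Return True if every QNAME occupies one contiguous run of the input."""
    seen = set()
    previous = None
    for qname in qnames:
        if qname != previous:
            if qname in seen:
                return False  # QNAME reappears after a different one
            seen.add(qname)
            previous = qname
    return True
```

A record without an NM tag raises a `ValueError` here; how the script itself reacts to a missing tag is not stated on this page.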
Usage
perl sam_remove_duplicates_inferior_alignments_multimappers.pl [OPTIONS] --in [FILE|SAM] --out [FILE|SAM]
Arguments
- --in [FILE|SAM] (required): Path to the input SAM file sorted by QNAME.
- --out [FILE|SAM] (required): Path to the output SAM file.
Options
- --new-header [FILE]: Path to the file to be used as header.
- --print-header: Print header. Keep input file header if --new-header is not specified.
- --keep-mm [INT]: Keep queries with up to INT different alignments. Set INT to "0" to keep all alignments for each query. By default, all alignments of "multimappers" are removed.
- --mm [TAB]: Print the QNAMEs and mapping counts of all "multimappers" to TAB (format: QNAME / TAB / number of mappings; one entry per line).
- --heavy-mm [TAB]: Like --mm, but only prints alignments for reads that map more than --keep-mm times.
- -h|--help: Show this information and die.
- -u|--usage: Show this information and die.
- --quiet: Shut up!
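The --keep-mm rules and the --mm table format described above can be sketched as follows. This is an illustrative Python sketch, not the script's Perl code: `keep_query` and `format_mm_entry` are hypothetical names, `keep_mm=None` is assumed to model the option not being given, and queries with a single best alignment are assumed to always be kept.

```python
def keep_query(n_best_alignments, keep_mm=None):
    """Decide whether a query's best alignments are written to the output.

    keep_mm=None models the option not being given (assumption).
    """
    if n_best_alignments <= 1:
        return True            # not a multimapper
    if keep_mm is None:
        return False           # default: drop all multimappers
    if keep_mm == 0:
        return True            # "0" keeps all alignments
    return n_best_alignments <= keep_mm


def format_mm_entry(qname, n_mappings):
    """Format one line of the --mm table: QNAME, a tab, the mapping count."""
    return f"{qname}\t{n_mappings}\n"
```

Under these assumptions, a query with three equally good alignments is dropped by default, kept with --keep-mm 0 or --keep-mm 5, and dropped again with --keep-mm 2.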
Requirements
- Perl version: >= 5.40.2
- Modules:
  - Getopt::Long: >= 2.58
subroutine usage
Returns usage information for current script
Accepts
N/A
Returns
String with usage information
Type
Specialized
subroutine filter_sam
From a sorted SAM file, first removes duplicate records (i.e. same entry name and coordinates, specifically the fields: QNAME, FLAG, RNAME, POS & CIGAR), then all QNAME duplicates except for the one(s) with the shortest edit distance, then (optionally) all alignments of "multimappers" (same QNAME, same edit distance, but different coordinates).
Accepts
- Input file [FILE|SAM]
- Output file [FILE|SAM]
- Multimapper switch: 0 = remove multimappers, 1 = keep multimappers
- Output file for multimapper IDs/QNAMEs
- Header switch: FALSE = do not print header, TRUE = print header
- Header file (prepends SAM records in output)
Returns
N/A
Type
Specialized
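Because the input is sorted by QNAME, all records of one read are adjacent and can be handled one group at a time without loading the whole file. Below is a minimal Python sketch of that grouping step, assuming header lines start with "@" as in the SAM format. `iter_qname_groups` is a hypothetical name; the script itself is written in Perl and may organise this step differently.

```python
from itertools import groupby


def iter_qname_groups(lines):
    """Yield (qname, records) for each run of consecutive lines sharing a QNAME.

    Header lines (starting with "@") are skipped.
    """
    records = (line for line in lines if not line.startswith("@"))
    # QNAME is the first tab-separated field of an alignment line.
    for qname, group in groupby(records, key=lambda line: line.split("\t", 1)[0]):
        yield qname, list(group)
```

Note that `groupby` only merges adjacent lines, so an unsorted input would yield the same QNAME in several separate groups rather than raising an error.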
subroutine sam_AoH_filter_records_w_indent_QNAME
Compares records of a SAM file that share the same QNAME/read ID: True
duplicates (i.e., QNAME and coordinates are equal) are discarded first.
Then all records but the ones with the lowest edit distances are discarded.
The remaining entry or entries are returned in an array of hashes (the array has
more than one element if the read is a "multimapper").
Accepts
Array of hashes of SAM records with identical QNAME field (as generated by the
subroutine filter_sam, written by Alexander Kanitz, 29-AUG-2013)
Returns
Reference to array of hashes
Type
Generic
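The two filtering steps described for this subroutine can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Perl implementation: the record layout (a dict with the SAM field names as keys and a `tags` dict mapping tag names to `(type, value)` pairs) and the function name are hypothetical, and keeping the first of several true duplicates is an assumption.

```python
def filter_records_with_identical_qname(records):
    """Drop true duplicates, then keep only the records with the lowest NM."""
    seen = set()
    unique = []
    for rec in records:
        # True duplicates share QNAME and coordinates.
        key = (rec["QNAME"], rec["FLAG"], rec["RNAME"], rec["POS"], rec["CIGAR"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    if not unique:
        return []

    def nm(rec):
        # Edit distance, stored as the value of the NM tag.
        return int(rec["tags"]["NM"][1])

    best = min(nm(rec) for rec in unique)
    return [rec for rec in unique if nm(rec) == best]
```

If the returned list has more than one element, the read is a "multimapper" in the sense used on this page.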
subroutine sam_AoH_join_records
Joins the fields of SAM records stored in an array of hashes in the proper order (additional tags in alphanumerical order); re-computes or adds the NG tag field.
Accepts
Array of hashes of SAM records with identical QNAME field (as generated by
the subroutine filter_sam, written by Alexander Kanitz, 29-AUG-2013)
Returns
Array of strings, one element for each record, appended with a newline character for easy printing
Type
Generic
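The join step can be sketched in Python as shown below. The sketch assumes a hypothetical record layout (a dict keyed by the eleven mandatory SAM field names plus a `tags` dict mapping tag names to `(type, value)` pairs) and uses a plain string sort as an approximation of "alphanumerical order". Handling of the NG tag is left out because this page does not define how its value is computed.

```python
# The eleven mandatory SAM fields, in the order required by the format.
SAM_MANDATORY_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)


def join_records(records):
    """Return one tab-separated, newline-terminated string per record."""
    lines = []
    for rec in records:
        mandatory = [str(rec[name]) for name in SAM_MANDATORY_FIELDS]
        # Optional tags follow the mandatory fields, sorted by tag name.
        tags = [f"{name}:{typ}:{value}"
                for name, (typ, value) in sorted(rec["tags"].items())]
        lines.append("\t".join(mandatory + tags) + "\n")
    return lines
```

Because each string already ends with a newline, the result can be written out directly, for example with `out_file.writelines(join_records(records))`.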