module validation_fasta.py¶
Filter FASTA files.
Process both uncompressed and gzip-compressed FASTA files by trimming,
filtering, and validating sequence records based on user-defined criteria.
Sequence IDs are trimmed at the first occurrence of any of the characters
passed to --trim, which standardizes naming conventions. If no characters
are provided, IDs are trimmed at the first whitespace character.
To filter the FASTA file by sequence ID, pass a text file with one (trimmed)
ID per line to --filter, and specify whether to keep (--mode k) or discard
(--mode d) the sequences with those IDs.
Sequences exceeding a given length threshold (--remove) are excluded.
If a path is provided to --idlist, the resulting sequence IDs are written
to that file, one per line.
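The keep/discard and length rules described above can be sketched in a few lines. This is an illustration only, not the module's implementation: the name `filter_records`, the `(id, sequence)` tuple representation, and the order in which the two checks run are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

# Hypothetical stand-in for a parsed FASTA record: (trimmed ID, sequence).
Record = Tuple[str, str]


def filter_records(
    records: Iterable[Record],
    id_set: Optional[Set[str]],
    mode: Optional[str],
    max_len: Optional[int],
) -> Iterator[Record]:
    """Yield the records that pass the ID filter and the length threshold."""
    for rec_id, seq in records:
        if id_set is not None:
            listed = rec_id in id_set
            # mode "k": keep only listed IDs; mode "d": discard listed IDs.
            if mode == "k" and not listed:
                continue
            if mode == "d" and listed:
                continue
        # Sequences longer than the threshold are excluded.
        if max_len is not None and len(seq) > max_len:
            continue
        yield rec_id, seq
```

With `mode="k"` only listed IDs survive, and with `mode="d"` only unlisted ones do. The length check is applied independently of the ID filter.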
function parse_and_validate_arguments¶
parse_and_validate_arguments()
Parse and validate command-line arguments.
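A minimal `argparse` sketch of how the options named in the module description might be declared and cross-checked. Only the flag names (--trim, --filter, --mode, --remove, --idlist) come from the description above. The function name `parse_args_sketch`, the argument types, the defaults, and the validation rules are assumptions, and input/output arguments are omitted because the source does not name them.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args_sketch(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the documented options and apply two assumed validation rules."""
    parser = argparse.ArgumentParser(description="Filter FASTA files.")
    parser.add_argument("--trim", default="",
                        help="characters at which sequence IDs are trimmed")
    parser.add_argument("--filter", type=Path, default=None,
                        help="text file with one (trimmed) ID per line")
    parser.add_argument("--mode", choices=["k", "d"], default=None,
                        help="k: keep listed IDs, d: discard listed IDs")
    parser.add_argument("--remove", type=int, default=None,
                        help="exclude sequences longer than this length")
    parser.add_argument("--idlist", type=Path, default=None,
                        help="file to which the resulting IDs are written")
    args = parser.parse_args(argv)

    # Assumed rule: --filter is meaningless without a keep/discard mode.
    if args.filter is not None and args.mode is None:
        parser.error("--mode (k or d) is required when --filter is given")
    # Assumed rule: a length threshold cannot be negative.
    if args.remove is not None and args.remove < 0:
        parser.error("--remove must be a non-negative integer")
    return args
```

`parser.error` prints a usage message and exits with status 2, so invalid combinations are rejected before any file is opened.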
function open_fasta¶
open_fasta(in_file: Path) → TextIO
Open a FASTA or gzip-compressed FASTA (.gz) file for reading in text mode.
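One common way to get a text-mode handle for either kind of file is sketched below. It assumes compression is detected from the `.gz` suffix. The actual function may instead inspect the file's magic bytes, and the name `open_fasta_sketch` is hypothetical.

```python
import gzip
from pathlib import Path
from typing import TextIO


def open_fasta_sketch(in_file: Path) -> TextIO:
    """Return a text-mode handle for a plain or gzip-compressed file."""
    # Assumption: a ".gz" suffix marks a gzip-compressed file.
    if in_file.suffix == ".gz":
        # "rt" makes gzip.open decode bytes to str, like the built-in open().
        return gzip.open(in_file, "rt")
    return open(in_file, "r")
```

Both branches return an object that can be used as a context manager and iterated line by line, so downstream parsing code does not need to know which kind of file it was given.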
function write_id_file¶
write_id_file(out_file: Path, id_list: List[str]) → None
Write the final sequence IDs, one per line.
Arguments:
out_file: Path to the file to which the IDs are written.
id_list: FASTA IDs to be written.
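A function with this signature could be as small as the following sketch. The name `write_id_file_sketch`, the UTF-8 encoding, and the trailing newline after the last ID are assumptions.

```python
from pathlib import Path
from typing import List


def write_id_file_sketch(out_file: Path, id_list: List[str]) -> None:
    """Write each ID on its own line, overwriting any existing file."""
    with open(out_file, "w", encoding="utf-8") as handle:
        for seq_id in id_list:
            handle.write(f"{seq_id}\n")
```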
function compile_trim_pattern¶
compile_trim_pattern(trim_str: str) → Pattern[str]
Get a compiled regex pattern to trim at a character's first occurrence.
Arguments:
trim_str: Characters used to determine where trimming occurs. If empty, white space is used as the default delimiter.
Returns:
A compiled regex pattern that captures (1) everything up to the first match and (2) the rest of the string.
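A two-group pattern of this kind can be built from a negated character class, as in the sketch below. The exact regex used by the module is not shown in this reference, so the construction and the name `compile_trim_pattern_sketch` are assumptions.

```python
import re
from typing import Pattern


def compile_trim_pattern_sketch(trim_str: str) -> Pattern[str]:
    """Build a pattern whose group 1 stops at the first trim character."""
    if trim_str:
        # re.escape keeps characters such as "]", "^" or "-" literal
        # inside the character class.
        not_delim = "[^" + re.escape(trim_str) + "]"
    else:
        # Default: trim at the first whitespace character.
        not_delim = r"\S"
    # Group 1: everything before the first delimiter.
    # Group 2: the delimiter and the rest (empty if no delimiter occurs).
    return re.compile(rf"({not_delim}*)(.*)", re.DOTALL)


pattern = compile_trim_pattern_sketch("|")
match = pattern.match("seq1|extra info")
print(match.group(1))  # seq1
print(match.group(2))  # |extra info
```

If none of the trim characters occurs, group 1 holds the whole string and group 2 is empty.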
function trim_id¶
trim_id(seq_rec: SeqRecord, _pattern: Pattern[str]) → SeqRecord
Trim a FASTA ID at the first occurrence of any character matched by _pattern.
All parameters must be passed by keyword.
Arguments:
seq_rec: A Bio.SeqRecord.SeqRecord to be trimmed in place.
_pattern: (internal) a pre-compiled regex from compile_trim_pattern.
Returns:
The same SeqRecord, with .id and .description possibly updated.
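The keyword-only, in-place behavior can be illustrated without Biopython by using a small stand-in class. Everything below is a sketch under stated assumptions: `RecordStub` is a hypothetical substitute for `SeqRecord`, the pattern is assumed to expose the two groups described for compile_trim_pattern, and clearing `.description` after a trim is a guess at what "possibly updated" means.

```python
import re
from dataclasses import dataclass
from typing import Pattern


@dataclass
class RecordStub:
    """Hypothetical stand-in for Bio.SeqRecord.SeqRecord (id and description only)."""
    id: str
    description: str


def trim_id_sketch(*, seq_rec: RecordStub, _pattern: Pattern[str]) -> RecordStub:
    """Trim seq_rec.id in place; the bare "*" forces keyword-only arguments."""
    match = _pattern.match(seq_rec.id)
    # Only modify the record if something follows the trimmed part.
    if match and match.group(2):
        seq_rec.id = match.group(1)
        # Assumption: the old description no longer matches the ID,
        # so it is cleared.
        seq_rec.description = ""
    return seq_rec


record = RecordStub(id="seq1|extra", description="seq1|extra some text")
trim_id_sketch(seq_rec=record, _pattern=re.compile(r"([^|]*)(.*)", re.DOTALL))
print(record.id)  # seq1
```

Because the record is mutated and then returned, the return value can be ignored or used in a generator expression, whichever reads better at the call site.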
function main¶
main(arguments) → None
Filter and process a FASTA file.