module validation_fasta.py

Filter FASTA files.

Process both uncompressed and gzip-compressed FASTA files by trimming, filtering, and validating sequence records based on user-defined criteria.

Sequence IDs are trimmed at the first occurrence of any character passed to --trim, which standardizes naming conventions. If no characters are provided, IDs are trimmed at the first whitespace character.

To filter the FASTA file by sequence ID, pass a text file containing one (trimmed) ID per line to --filter. You must also specify whether the sequences with those IDs are kept (--mode k) or discarded (--mode d).
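
The keep/discard logic can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names load_filter_ids and passes_id_filter are hypothetical, and only the mode values k and d come from the description above.

```python
from typing import Iterable, Set


def load_filter_ids(lines: Iterable[str]) -> Set[str]:
    """Collect one ID per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def passes_id_filter(seq_id: str, filter_ids: Set[str], mode: str) -> bool:
    """Return True if a record with this (trimmed) ID should be retained."""
    if mode == "k":  # keep only the listed IDs
        return seq_id in filter_ids
    if mode == "d":  # discard the listed IDs
        return seq_id not in filter_ids
    raise ValueError(f"mode must be 'k' or 'd', got {mode!r}")


# Example: the same ID list applied in both modes.
filter_ids = load_filter_ids(["seq1\n", "seq3\n", "\n"])
all_ids = ["seq1", "seq2", "seq3"]
kept = [s for s in all_ids if passes_id_filter(s, filter_ids, "k")]
dropped = [s for s in all_ids if passes_id_filter(s, filter_ids, "d")]
```

Storing the IDs in a set keeps each membership check constant-time, which matters for large FASTA files.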

Sequences exceeding a given length threshold (--remove) are excluded.
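
A length filter of this kind might look like the sketch below. The function name drop_long_sequences and the (ID, sequence) tuple representation are assumptions made for illustration. The sketch also assumes that a sequence whose length equals the threshold is kept, since only sequences exceeding it are excluded.

```python
from typing import Iterable, Iterator, Tuple


def drop_long_sequences(
    records: Iterable[Tuple[str, str]], max_length: int
) -> Iterator[Tuple[str, str]]:
    """Yield only the (ID, sequence) pairs whose sequence is not longer than max_length."""
    for seq_id, sequence in records:
        if len(sequence) <= max_length:  # assumption: equal length is retained
            yield seq_id, sequence


records = [("seq1", "ACGT"), ("seq2", "ACGTACGT"), ("seq3", "ACGTA")]
short_enough = list(drop_long_sequences(records, max_length=5))
```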

If a path is provided to --idlist, the IDs of the resulting sequences are written, one per line, to a separate text file.


function parse_and_validate_arguments

parse_and_validate_arguments()

Parse and validate command-line arguments.
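
As a rough sketch of what such parsing could look like with argparse: only the flag names (--trim, --filter, --mode, --remove, --idlist) are taken from the module description. The argument types, defaults, help texts, the function name parse_arguments_sketch, and the omission of input/output file arguments are all assumptions.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_arguments_sketch(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the documented flags and check that --filter comes with --mode."""
    parser = argparse.ArgumentParser(description="Filter FASTA files.")
    parser.add_argument("--trim", default="", help="characters at which IDs are trimmed")
    parser.add_argument("--filter", type=Path, help="text file with one ID per line")
    parser.add_argument("--mode", choices=["k", "d"], help="k = keep, d = discard")
    parser.add_argument("--remove", type=int, help="length threshold")
    parser.add_argument("--idlist", type=Path, help="file to write the final IDs to")
    args = parser.parse_args(argv)

    # Validation step (assumed): filtering is ambiguous without a mode.
    if args.filter is not None and args.mode is None:
        parser.error("--mode (k or d) is required when --filter is given")
    return args


args = parse_arguments_sketch(
    ["--trim", "|", "--filter", "ids.txt", "--mode", "k", "--remove", "5000"]
)
```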


function open_fasta

open_fasta(in_file: Path) -> TextIO

Open a FASTA or FASTA.GZ file for text-mode reading.
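
One way to support both formats transparently is sketched below. How the real function detects compression is not stated here; this version assumes detection by the two gzip magic bytes rather than by file extension, and the name open_fasta_sketch is hypothetical.

```python
import gzip
from pathlib import Path
from typing import TextIO


def open_fasta_sketch(in_file: Path) -> TextIO:
    """Return a text-mode handle for a plain or gzip-compressed file."""
    # Assumption: gzip files are recognized by their leading magic bytes.
    with open(in_file, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(in_file, "rt")  # "rt" decodes the stream to text
    return open(in_file, "rt")
```

The caller is responsible for closing the returned handle, for example with a with statement.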


function write_id_file

write_id_file(out_file: Path, id_list: List[str]) -> None

Write the final sequence IDs, one per line.

Arguments:

  • out_file: Path of the file to write the IDs to.
  • id_list: FASTA IDs to be written.
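
A minimal sketch of the behavior described above, assuming UTF-8 output and a trailing newline after every ID (neither is confirmed by this page). The name write_id_file_sketch is hypothetical.

```python
from pathlib import Path
from typing import List


def write_id_file_sketch(out_file: Path, id_list: List[str]) -> None:
    """Write each ID on its own line."""
    with open(out_file, "w", encoding="utf-8") as handle:
        for seq_id in id_list:
            handle.write(f"{seq_id}\n")
```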

function compile_trim_pattern

compile_trim_pattern(trim_str: str) -> Pattern[str]

Get a compiled regex pattern to trim at a character's first occurrence.

Arguments:

  • trim_str: Characters at which trimming occurs. If empty, whitespace is used as the default delimiter.

Returns: A compiled regex pattern that captures (1) everything up to the first match and (2) the rest of the string.
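
A pattern with these two capture groups could be built as in the sketch below. The exact regex used by the module is not shown on this page, so the expression and the name compile_trim_pattern_sketch are assumptions.

```python
import re
from typing import Pattern


def compile_trim_pattern_sketch(trim_str: str = "") -> Pattern[str]:
    """Build a regex splitting a string at the first delimiter character."""
    # Escape the characters so that e.g. "]" or "-" are safe inside the class.
    delimiter = f"[{re.escape(trim_str)}]" if trim_str else r"\s"
    # Group 1: shortest prefix before the first delimiter.
    # Group 2: the delimiter and everything after it (None if no delimiter).
    return re.compile(rf"^(.*?)({delimiter}.*)?$")


dot_match = compile_trim_pattern_sketch(".").match("contig_1.2 length=100")
default_match = compile_trim_pattern_sketch().match("contig_1.2 length=100")
```

Here dot_match.group(1) is "contig_1", while default_match.group(1) is "contig_1.2" because the default delimiter is whitespace.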


function trim_id

trim_id(seq_rec: SeqRecord, _pattern: Pattern[str]) -> SeqRecord

Trim a FASTA ID at the first occurrence of any delimiter character matched by _pattern.

All parameters must be passed by keyword.

Arguments:

  • seq_rec: A Bio.SeqRecord.SeqRecord to be trimmed in place.
  • _pattern: (internal) a pre-compiled regex from compile_trim_pattern.

Returns: The same SeqRecord, with .id and .description possibly updated.
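
The in-place, keyword-only behavior can be illustrated without Biopython by using a small stand-in class. Everything here is a sketch under assumptions: RecordStandIn only mimics the .id and .description attributes of a SeqRecord, the two-group pattern is a guess at what compile_trim_pattern returns, and clearing .description after a trim is an assumed policy that this page does not confirm.

```python
import re
from dataclasses import dataclass
from typing import Pattern


@dataclass
class RecordStandIn:
    """Hypothetical stand-in exposing the two SeqRecord attributes used here."""
    id: str
    description: str = ""


def trim_id_sketch(*, seq_rec: RecordStandIn, _pattern: Pattern[str]) -> RecordStandIn:
    """Trim seq_rec.id in place; the bare * forces keyword-only arguments."""
    match = _pattern.match(seq_rec.id)
    if match and match.group(1) != seq_rec.id:
        seq_rec.id = match.group(1)
        seq_rec.description = ""  # assumption: drop the now-stale description
    return seq_rec


pattern = re.compile(r"^(.*?)([.|].*)?$")  # assumed shape of the trim pattern
record = RecordStandIn(id="contig_1.2", description="contig_1.2 length=100")
trimmed = trim_id_sketch(seq_rec=record, _pattern=pattern)
```

Because the record is modified in place, trimmed and record refer to the same object.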


function main

main(arguments) -> None

Filter and process a FASTA file.