module validate_bedtools_intersect.py
Validation utilities for bedtools intersect -wo -s output.
The input must be the result of:
- -a GFF3/GTF (annotated features, 1-based)
- -b BAM (alignments, 0-based)
- -wo (report only overlaps and include the overlap length)
- -s (strand-aware intersections)
Expected column order (16 columns):
- 1-9 Feature (GFF3/GTF): chr, source, type, start, end, score, strand, phase/frame, attributes
- 10-15 Read (BAM-derived): chr, start, end, name, score, strand
- 16 Overlap length (bp)
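As a rough illustration of the column layout above, the sketch below splits one line into its three groups. The helper name split_columns and the sample line are made up for this example and are not part of the module.

```python
from typing import List, Tuple

N_FEATURE_COLS = 9   # columns 1-9: GFF3/GTF feature
N_READ_COLS = 6      # columns 10-15: BAM-derived read
N_TOTAL_COLS = N_FEATURE_COLS + N_READ_COLS + 1  # column 16: overlap length


def split_columns(line: str, sep: str = "\t") -> Tuple[List[str], List[str], str]:
    """Hypothetical helper: split a line into (feature, read, overlap) parts."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != N_TOTAL_COLS:
        raise ValueError(f"expected {N_TOTAL_COLS} columns, got {len(fields)}")
    feature = fields[:N_FEATURE_COLS]
    read = fields[N_FEATURE_COLS:N_FEATURE_COLS + N_READ_COLS]
    overlap = fields[-1]
    return feature, read, overlap


# Invented sample line, for illustration only.
sample = "\t".join([
    "chr1", "example", "exon", "100", "200", ".", "+", ".", "ID=exon1",
    "chr1", "149", "250", "read_001", "60", "+",
    "51",
])
feature, read, overlap = split_columns(sample)
```

All values are still strings at this point; type coercion and validation are separate steps.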
Exposes:
- FileFormatError: exception raised for malformed lines/fields.
- Record: a validated, parsed view of a single output line.
- validate_first_n: fail-fast validation of the first N lines of a file.
- parse_all: a streaming generator of (line_number, Record) tuples.
function validate_first_n
validate_first_n(intersect_file: Path, n: int = 10, sep: str = '\t') → None
Validate the first n lines of a bedtools intersect -wo -s output file.
Fails fast if any of the first n lines is invalid.
Arguments:
- intersect_file: Path to the intersect output file.
- n: Number of leading lines to validate.
- sep: Field separator (default: TAB).
Raises:
FileFormatError: If any of the first n lines has an invalid format.
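The fail-fast behaviour can be sketched roughly as follows. This is not the module's implementation: check_first_n is a hypothetical stand-in that only checks the column count, and the FileFormatError class here is a local placeholder whose base class is an assumption.

```python
import itertools
import tempfile
from pathlib import Path


class FileFormatError(ValueError):
    """Local placeholder; the real exception's base class is not documented."""


def check_first_n(path: Path, n: int = 10, sep: str = "\t") -> None:
    """Hypothetical sketch: stop at the first bad line among the first n."""
    with path.open() as handle:
        # islice reads at most n lines, so the rest of the file is never touched.
        for line_number, line in enumerate(itertools.islice(handle, n), start=1):
            n_fields = len(line.rstrip("\n").split(sep))
            if n_fields != 16:
                raise FileFormatError(
                    f"line {line_number}: expected 16 columns, got {n_fields}"
                )


good = "\t".join(["x"] * 16)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tsv"
    demo.write_text(good + "\n" + "only\ttwo\n")
    check_first_n(demo, n=1)  # passes: line 2 is never read
    try:
        check_first_n(demo, n=2)
        outcome = "ok"
    except FileFormatError as exc:
        outcome = str(exc)
```

Validating only a few leading lines is a cheap way to reject a wrong file before a long streaming pass begins.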
function parse_all
parse_all(intersect_file: Path, sep: str = '\t') → Iterator[tuple[int, Record]]
Stream a file, yielding (line_number, Record) for each data line.
Arguments:
- intersect_file: Path to the intersect output file.
- sep: Field separator (default: TAB).
Yields:
Tuples of (line_number, Record).
Raises:
FileFormatError: If any line has an invalid format (iteration stops at the first error).
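A minimal sketch of the streaming pattern, assuming that blank lines are not data lines (the documentation does not say which lines are skipped). The name iter_rows is hypothetical, and it yields raw field lists instead of Record objects.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def iter_rows(handle: TextIO, sep: str = "\t") -> Iterator[Tuple[int, List[str]]]:
    """Hypothetical sketch: lazily yield (line_number, fields) per data line."""
    for line_number, line in enumerate(handle, start=1):
        stripped = line.rstrip("\n")
        if not stripped:  # assumption: blank lines are skipped
            continue
        yield line_number, stripped.split(sep)


# Line numbers refer to the file, so the skipped blank line still counts.
rows = list(iter_rows(io.StringIO("a\tb\n\nc\td\n")))
```

Because the function is a generator, only one line is held in memory at a time, and an exception raised while parsing a line ends the iteration at that line.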
class FileFormatError
Raised when a line or field does not match the expected format.
class Record
A single validated bedtools intersect -wo -s record.
The expected column order is listed in the module docstring.
Attributes:
- feat_chr: Feature chromosome or scaffold (with or without the chr prefix).
- feat_source: Program or data source that produced the feature.
- feat_type: Feature type name.
- feat_start: Feature start (1-based).
- feat_end: Feature end (1-based).
- feat_score: Floating-point value, or "." if missing.
- feat_strand: Feature strand, either + (forward) or - (reverse).
- feat_phase_frame: 0, 1, or 2, indicating the position of the feature's first base within a codon, or "." if missing.
- feat_attrs: Dictionary of parsed GFF3/GTF-style attributes (keys lower-cased).
- read_chr: Read chromosome or scaffold (with or without the chr prefix).
- read_start: Read start (0-based).
- read_end: Read end (0-based).
- read_name: Read name.
- read_score: Floating-point value, or "." if missing.
- read_strand: Read strand, either + (forward) or - (reverse).
- overlap_len: Number of overlapping base pairs between the feature and the read.
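To illustrate what a dictionary of GFF3/GTF-style attributes with lower-cased keys might look like, here is a minimal sketch. The function parse_attrs is hypothetical and is not the module's parser; it assumes values never contain a semicolon.

```python
from typing import Dict


def parse_attrs(raw: str) -> Dict[str, str]:
    """Hypothetical sketch: parse a GFF3 or GTF attribute column into a dict."""
    attrs: Dict[str, str] = {}
    for chunk in raw.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # trailing ";" leaves an empty chunk
        key, found, value = chunk.partition("=")
        if not found or " " in key:
            # No "=", or the "=" sits inside a quoted value: GTF style (key "value").
            key, _, value = chunk.partition(" ")
        attrs[key.strip().lower()] = value.strip().strip('"')
    return attrs


gff3 = parse_attrs("ID=gene1;Name=ABC")
gtf = parse_attrs('gene_id "G1"; transcript_id "T1";')
```

Only keys are lower-cased here; values keep their original case.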
function __init__
__init__(
feat_chr: str,
feat_source: str,
feat_type: str,
feat_start: int,
feat_end: int,
feat_score: Union[float, Literal['']],
feat_strand: Literal['+', '-'],
feat_phase_frame: Union[int, Literal['']],
feat_attrs: Dict[str, str],
read_chr: str,
read_start: int,
read_end: int,
read_name: str,
read_score: Union[float, Literal['']],
read_strand: Literal['+', '-'],
overlap_len: int
) → None
Initialize a validated record.
All arguments are expected to be already coerced to their target types.
function cross_checks
cross_checks() → None
Validate relationships across fields.
Ensures:
- Feature and read are on the same chromosome and strand.
- start <= end for both feature and read.
- Overlap > 0 (required by -wo) and equals the computed overlap.
Raises:
FileFormatError: If any rule is violated.
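One way the computed overlap could be derived is sketched below. It assumes the feature interval is 1-based and inclusive (GFF3/GTF) and the read interval is 0-based and half-open (BED-style), which this page does not state explicitly for read_end. The function name computed_overlap is hypothetical.

```python
def computed_overlap(feat_start: int, feat_end: int,
                     read_start: int, read_end: int) -> int:
    """Hypothetical sketch: overlap in bp between a feature and a read.

    Assumes feature coordinates are 1-based inclusive and read
    coordinates are 0-based half-open.
    """
    # Convert the feature to 0-based half-open: [feat_start - 1, feat_end).
    feat_start0 = feat_start - 1
    return max(0, min(feat_end, read_end) - max(feat_start0, read_start))


partial = computed_overlap(100, 200, 150, 250)   # read covers the feature's tail
adjacent = computed_overlap(100, 200, 200, 250)  # read starts right after the feature
```

Under these assumptions, a record would pass the overlap rule only if overlap_len is positive and equal to this value.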
classmethod from_line
from_line(line: str, sep: str = '\t') → Record
Parse and validate a single bedtools intersect -wo -s line.
Arguments:
- line: Raw line from the output file.
- sep: Field separator (default: TAB).
Returns:
A validated Record instance.
Raises:
FileFormatError: If the column count is wrong or any field is invalid.