module validate_bedtools_intersect.py

Validation utilities for bedtools intersect -wo -s output.

The input must be the output of bedtools intersect run with:

  • -a GFF3/GTF (annotated features, 1-based)
  • -b BAM (alignments, 0-based)
  • -wo (report only overlaps and include overlap length)
  • -s (strand-aware intersections)

Expected column order (16 columns):

  • 1-9 Feature (GFF3/GTF): chr, source, type, start, end, score, strand, phase/frame, attributes
  • 10-15 Read (BAM-derived): chr, start, end, name, score, strand
  • 16 Overlap length (bp)
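
As a rough illustration of this layout, the sketch below splits one tab-separated line into its feature, read, and overlap parts. The helper name `split_columns` and the sample line are hypothetical and not part of the module; only the 9 + 6 + 1 column split comes from the description above.

```python
from typing import List, Tuple

N_COLUMNS = 16  # 9 feature columns + 6 read columns + 1 overlap column


def split_columns(line: str, sep: str = "\t") -> Tuple[List[str], List[str], str]:
    """Split one intersect line into (feature, read, overlap) parts."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(fields)}")
    # Columns 1-9, 10-15, and 16 in the 1-based numbering used above.
    return fields[:9], fields[9:15], fields[15]


# Made-up sample line, for illustration only.
sample = "\t".join([
    "chr1", "HAVANA", "exon", "101", "200", ".", "+", ".", "gene_id=G1",
    "chr1", "150", "250", "read1", "60", "+",
    "50",
])
feature, read, overlap = split_columns(sample)
```

All values are still strings at this point; type coercion and field validation would happen in a later step.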

Exposes:

  • FileFormatError: exception raised for malformed lines/fields.
  • Record: a validated, parsed view of a single output line.
  • validate_first_n: fail-fast validation of the first N lines of a file.
  • parse_all: a streaming generator of (line_number, Record) tuples.

function validate_first_n

validate_first_n(intersect_file: Path, n: int = 10, sep: str = '\t') -> None

Validate the first n lines of a bedtools intersect -wo -s output file.

Fails fast if any of the first n lines is invalid.

Arguments:

  • intersect_file: Path to the intersect output file.
  • n: Number of leading lines to validate.
  • sep: Field separator (default: TAB).

Raises:

  • FileFormatError: If any of the first n lines has an invalid format.
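
The fail-fast behaviour can be sketched as follows. This is not the module's implementation: `check_first_n` and `FormatError` are hypothetical stand-ins, and the sketch checks only the column count, whereas the real function is documented to validate every field.

```python
import tempfile
from itertools import islice
from pathlib import Path

EXPECTED_COLUMNS = 16


class FormatError(ValueError):
    """Hypothetical stand-in for the module's FileFormatError."""


def check_first_n(path: Path, n: int = 10, sep: str = "\t") -> None:
    """Raise FormatError at the first malformed line among the first n."""
    with path.open() as handle:
        # islice stops after n lines, so the rest of the file is never read.
        for line_number, line in enumerate(islice(handle, n), start=1):
            n_fields = len(line.rstrip("\n").split(sep))
            if n_fields != EXPECTED_COLUMNS:
                raise FormatError(
                    f"line {line_number}: expected {EXPECTED_COLUMNS} "
                    f"columns, got {n_fields}"
                )


good = "\t".join(["x"] * 16) + "\n"
bad = "\t".join(["x"] * 3) + "\n"
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tsv"
    demo.write_text(good + bad)
    check_first_n(demo, n=1)      # passes: only the good line is inspected
    try:
        check_first_n(demo, n=2)  # the malformed second line is reached
    except FormatError as err:
        message = str(err)
```

Checking only a small prefix keeps the check cheap on large files, at the cost of missing errors further down.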

function parse_all

parse_all(intersect_file: Path, sep: str = '\t') -> Iterator[tuple[int, Record]]

Stream a file, yielding (line_number, Record) for each data line.

Arguments:

  • intersect_file: Path to the intersect output file.
  • sep: Field separator (default: TAB).

Yields: Tuples of (line_number, Record).

Raises:

  • FileFormatError: If any line has an invalid format (iteration stops at the first error).
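
A minimal sketch of the streaming pattern, under some assumptions: the hypothetical `iter_fields` works on any iterable of lines rather than a Path, yields raw field lists instead of Record objects, and treats blank lines as non-data lines (the documentation does not say how the module handles them).

```python
from typing import Iterable, Iterator, List, Tuple


def iter_fields(
    lines: Iterable[str], sep: str = "\t"
) -> Iterator[Tuple[int, List[str]]]:
    """Lazily yield (line_number, fields) for each non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # assumption: blank lines are not data lines
        fields = stripped.split(sep)
        if len(fields) != 16:
            raise ValueError(
                f"line {line_number}: expected 16 columns, got {len(fields)}"
            )
        yield line_number, fields


lines = ["\t".join(["x"] * 16) + "\n", "\n", "\t".join(["y"] * 16) + "\n"]
numbers = [number for number, _ in iter_fields(lines)]
```

Because this is a generator, an error on a given line surfaces only when the consumer reaches that line, and records yielded before it have already been processed.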

class FileFormatError

Raised when a line or field does not match the expected format.


class Record

A single validated bedtools intersect -wo -s record.

The expected column order is listed in the module docstring.

Attributes:

  • feat_chr: Feature chromosome or scaffold (with or without the chr prefix).
  • feat_source: Program or data source that produced the feature.
  • feat_type: Feature type name.
  • feat_start: Feature start (1-based).
  • feat_end: Feature end (1-based).
  • feat_score: Floating-point value or . if missing.
  • feat_strand: Feature's strand defined as + (forward) or - (reverse).
  • feat_phase_frame: 0, 1, or 2 indicating the feature's first base position within a codon, or . if missing.
  • feat_attrs: Dictionary of parsed GFF3/GTF-style attributes (keys lower-cased).

  • read_chr: Read chromosome or scaffold (with or without the chr prefix).
  • read_start: Read start (0-based).
  • read_end: Read end (0-based).
  • read_name: Read name.
  • read_score: Floating-point value or . if missing.
  • read_strand: Read's strand defined as + (forward) or - (reverse).

  • overlap_len: Number of overlapping base pairs between the feature and the read.
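
To illustrate what a parsed feat_attrs dictionary might look like, here is a small sketch that handles both GFF3-style (`key=value`) and GTF-style (`key "value"`) attribute columns and lower-cases the keys. The function name `parse_attributes` is hypothetical, and the module's actual parsing rules may differ. This sketch does not handle values that contain `;`, and it would mis-split a GTF value that contains `=`.

```python
from typing import Dict


def parse_attributes(raw: str) -> Dict[str, str]:
    """Parse a GFF3 (key=value) or GTF (key "value") attribute column."""
    attrs: Dict[str, str] = {}
    for chunk in raw.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        if "=" in chunk:
            key, value = chunk.split("=", 1)      # GFF3 style
        else:
            key, _, value = chunk.partition(" ")  # GTF style
        # Keys are lower-cased; surrounding quotes are dropped from values.
        attrs[key.strip().lower()] = value.strip().strip('"')
    return attrs


gff3_attrs = parse_attributes("ID=gene1;Name=ABC")
gtf_attrs = parse_attributes('gene_id "G1"; transcript_id "T1";')
```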

function __init__

__init__(
    feat_chr: str,
    feat_source: str,
    feat_type: str,
    feat_start: int,
    feat_end: int,
    feat_score: Union[float, Literal['']],
    feat_strand: Literal['+', '-'],
    feat_phase_frame: Union[int, Literal['']],
    feat_attrs: Dict[str, str],
    read_chr: str,
    read_start: int,
    read_end: int,
    read_name: str,
    read_score: Union[float, Literal['']],
    read_strand: Literal['+', '-'],
    overlap_len: int
) -> None

Initialize a validated record.

All arguments are expected to be already coerced to their target types.


function cross_checks

cross_checks() -> None

Validate relationships across fields.

Ensures:

  • Feature and read are on the same chromosome and strand.
  • start <= end for both feature and read.
  • Overlap length is > 0 (guaranteed by -wo) and equals the overlap computed from the coordinates.

Raises:

  • FileFormatError: If any rule is violated.
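
The rules above can be sketched as below. The function names are hypothetical, and the overlap arithmetic rests on an assumption the documentation does not state: that the read interval is BED-style half-open (0-based start, exclusive end), while the feature interval is 1-based and inclusive. Chromosome names are compared by plain string equality here.

```python
def expected_overlap(
    feat_start: int, feat_end: int, read_start: int, read_end: int
) -> int:
    """Overlap in bp between a 1-based inclusive feature and a 0-based
    half-open read (the half-open read end is an assumption)."""
    feat_lo = feat_start - 1  # convert the feature start to 0-based
    return max(0, min(feat_end, read_end) - max(feat_lo, read_start))


def check_relationships(
    feat_chr: str, feat_strand: str, feat_start: int, feat_end: int,
    read_chr: str, read_strand: str, read_start: int, read_end: int,
    overlap_len: int,
) -> None:
    """Raise ValueError if any cross-field rule is violated."""
    if feat_chr != read_chr:
        raise ValueError("feature and read are on different chromosomes")
    if feat_strand != read_strand:
        raise ValueError("feature and read are on different strands")
    if feat_start > feat_end or read_start > read_end:
        raise ValueError("start must not exceed end")
    computed = expected_overlap(feat_start, feat_end, read_start, read_end)
    if overlap_len <= 0 or overlap_len != computed:
        raise ValueError(
            f"overlap {overlap_len} does not match computed value {computed}"
        )


# Feature 101-200 (1-based) and read [150, 250) share positions 150..199.
overlap = expected_overlap(101, 200, 150, 250)
check_relationships("chr1", "+", 101, 200, "chr1", "+", 150, 250, overlap)
```

Mixing the two coordinate systems without the `- 1` conversion would over-count the overlap by one base whenever the feature starts inside the read.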

classmethod from_line

from_line(line: str, sep: str = '\t') -> Record

Parse and validate a single bedtools intersect -wo -s line.

Arguments:

  • line: Raw line from the output file.
  • sep: Field separator (default: TAB).

Returns:

A validated Record instance.

Raises:

  • FileFormatError: If the column count is wrong or any field is invalid.