Output files¶

Here you can find an overview of the output files for the different workflows.

Outputs of ZARP¶

After running the ZARP workflow, you will find several output files in the specified output directory. The output directory is defined in the config.yaml file and it is normally called results/. Here are some of the key output files:

Quality control: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.
Quantification files: These files contain the gene and transcript level expression values for each sample. They provide information about the abundance of each gene and transcript in the RNA-seq data.
Alignment files: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.

After a run, you will find all outputs within the results/ directory, arranged in the following general structure:

.
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca

Here is a description of the different subdirectories:

mus_musculus/: A subdirectory for organism-specific results.
multiqc_summary/: Summary files generated by MultiQC.
samples/: Sample-specific outputs. One directory is created for each sample.
summary_kallisto/: Summary files for Kallisto quantifications.
summary_salmon/: Summary files for Salmon quantifications.
zpca/: Output files for ZARP's principal component analysis.
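
To confirm that a run produced the layout described above, a small check can be sketched as follows. The subdirectory names are taken from the listing; the function name missing_outputs and the example path are hypothetical and not part of ZARP.

```python
from pathlib import Path

# Subdirectories expected inside the organism-specific directory,
# as listed in the structure above.
EXPECTED_SUBDIRS = (
    "multiqc_summary",
    "samples",
    "summary_kallisto",
    "summary_salmon",
    "zpca",
)


def missing_outputs(organism_dir):
    """Return the expected subdirectories that are absent from organism_dir."""
    organism_dir = Path(organism_dir)
    return [
        name for name in EXPECTED_SUBDIRS
        if not (organism_dir / name).is_dir()
    ]
```

For example, missing_outputs("results/mus_musculus") would return an empty list when all five subdirectories are present.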

Quality Control outputs¶

Within the multiqc_summary/ directory, you will find an interactive HTML file multiqc_report.html with various quality control (QC) metrics that can help you interpret your results. An example file is shown below:

On the left you can find a navigation bar that directs you to different sections and subsections for various tools:

The General Statistics section contains a summary of most tools. Here you can find statistics such as the number of mapped reads, the percentage of duplicate reads, and the percentage of adapter-trimmed reads:

The FastQC: raw reads section contains plots and quality statistics of your input FASTQ files. Some examples are shown below, like the number of duplicate reads in an experiment, the average sequencing quality per position, or the percentage of GC content:

The Cutadapt: adapter removal and Cutadapt: polyA tails removal sections show the number or percentage of reads that were trimmed:

The FastQC: trimmed reads section contains plots and quality statistics of the FASTQ files after adapter trimming. The plots are similar to the section FastQC: raw reads.
The STAR section shows the number and percentage of reads that are mapped using the STAR aligner:

The ALFA section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers:

The TIN section shows the Transcript Integrity Number of the samples:

The Salmon section shows the fragment length distribution of the reads:

The Kallisto section shows the number of reads that were aligned:

Finally, the zpca Salmon and Kallisto sections show PCA plots for expression levels of genes and transcripts:

Quantification outputs¶

Within the summary_kallisto directory, you can find the following files:

genes_counts.tsv: A table with gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
genes_tpm.tsv: A table with gene TPM estimates.
transcripts_counts.tsv: A table with transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
transcripts_tpm.tsv: A table with the transcript TPM estimates.
tx2geneID.tsv: A table mapping transcript IDs to gene IDs.

Within the summary_salmon/quantmerge directory, you can find the following files:

genes_numreads.tsv: A table with gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
genes_tpm.tsv: Matrix with the gene TPM estimates.
transcripts_numreads.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
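
As an illustration of the table layout described above (feature names in the first column, sample names in the first row), the following minimal sketch reads such a table with the Python standard library. It assumes the files are tab-separated and that the header row starts with a label for the index column; the function name read_expression_table and the miniature table are hypothetical.

```python
import csv
import io


def read_expression_table(handle):
    """Parse a wide table: first column = feature ID, header = sample names."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumption: first header cell labels the index column
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature, values = row[0], row[1:]
        table[feature] = {s: float(v) for s, v in zip(samples, values)}
    return samples, table


# Hypothetical miniature table, for illustration only.
EXAMPLE = "Name\tsample_A\tsample_B\nGENE1\t10\t0\nGENE2\t3.5\t7\n"
samples, counts = read_expression_table(io.StringIO(EXAMPLE))
```

With a real file, the same function can be called on an open file handle, e.g. for summary_kallisto/genes_counts.tsv or summary_salmon/quantmerge/genes_numreads.tsv.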

Alignment outputs¶

Within the samples directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can be opened directly in a genome browser or used for other downstream analyses:

In the map_genome directory you can find a file with the suffix .Aligned.sortedByCoord.out.bam and the corresponding index (.bai) file. This is the output of the STAR aligner.
In the bigWig directory you can find two folders: UniqueMappers and MultimappersIncluded. Within these folders you find the bigWig files for the plus and minus strands. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
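
To collect the STAR alignments of all samples for downstream use, the layout described above can be walked as in this sketch. It assumes the path pattern samples/<sample>/map_genome/ and that the index file is named by appending .bai to the BAM file name; both are assumptions drawn from the description, and find_alignments is a hypothetical helper.

```python
from pathlib import Path

# Suffix of the coordinate-sorted STAR output, as given above.
BAM_SUFFIX = ".Aligned.sortedByCoord.out.bam"


def find_alignments(samples_dir):
    """Map each sample name to (BAM path, whether a .bai index sits next to it)."""
    found = {}
    pattern = f"*/map_genome/*{BAM_SUFFIX}"
    for bam in sorted(Path(samples_dir).glob(pattern)):
        sample = bam.parent.parent.name  # samples/<sample>/map_genome/<file>
        index = bam.with_name(bam.name + ".bai")  # assumed index naming
        found[sample] = (bam, index.exists())
    return found
```

Calling find_alignments("results/mus_musculus/samples") would then return one entry per sample that has a BAM file, which makes it easy to spot samples with a missing index.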

Outputs of the SRA download workflow¶

Once you run the workflow that downloads data from the Sequence Read Archive (SRA), you can find the following file structure:

results/
`-- sra_downloads
    |-- compress
    |   |-- ERR2248142
    |   |   |-- ERR2248142.fastq.gz
    |   |   `-- ERR2248142.se.tsv
    |   |-- SRR18549672
    |   |   |-- SRR18549672.pe.tsv
    |   |   |-- SRR18549672_1.fastq.gz
    |   |   `-- SRR18549672_2.fastq.gz
    |   `-- SRR18552868
    |       |-- SRR18552868.fastq.gz
    |       `-- SRR18552868.se.tsv
    |-- fasterq_dump
    |   `-- tmpdir
    |-- get_layout
    |   |-- ERR2248142
    |   |   `-- SINGLE.info
    |   |-- SRR18549672
    |   |   `-- PAIRED.info
    |   `-- SRR18552868
    |       `-- SINGLE.info
    |-- prefetch
    |   |-- ERR2248142
    |   |   `-- ERR2248142.sra
    |   |-- SRR18549672
    |   |   `-- SRR18549672.sra
    |   `-- SRR18552868
    |       `-- SRR18552868.sra
    `-- sra_samples.out.tsv

All results are stored under the output directory you have specified in your config.yaml file (results/ in this case). The sra_samples.out.tsv file summarizes all the experiments that were fetched from SRA. It contains the accession of each run (e.g. SRR18552868) and the path(s) to the corresponding FASTQ file(s). An example output file looks like the following:

sample  fq1     fq2
SRR18552868     results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz 
SRR18549672     results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz       results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142      results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz

Single vs. paired-end sequencing

Some of the filenames indicate whether the sample was sequenced in single-end (SINGLE, se) or paired-end (PAIRED, pe) mode.
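
The sample table can also be used to tell the two layouts apart programmatically. The sketch below assumes that sra_samples.out.tsv is tab-separated and that an empty or missing fq2 field marks a single-end sample, as the example rows suggest; read_sra_samples is a hypothetical helper.

```python
import csv
import io


def read_sra_samples(handle):
    """Return {sample: {"fq1", "fq2", "layout"}} from a sample table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row: sample, fq1, fq2
    layouts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        row = (row + ["", ""])[:3]  # pad rows that lack a trailing fq2 field
        sample, fq1, fq2 = (field.strip() for field in row)
        layouts[sample] = {
            "fq1": fq1,
            "fq2": fq2 or None,
            "layout": "PAIRED" if fq2 else "SINGLE",
        }
    return layouts


# The example table from above, written with explicit tab separators.
EXAMPLE = (
    "sample\tfq1\tfq2\n"
    "SRR18552868\tresults/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz\t\n"
    "SRR18549672\tresults/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz"
    "\tresults/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz\n"
    "ERR2248142\tresults/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz\n"
)
layouts = read_sra_samples(io.StringIO(EXAMPLE))
```

Padding each row before unpacking keeps the parser tolerant of single-end rows that end without a trailing tab.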

Outputs of the metadata inference workflow¶

Once you run the workflow that infers metadata, you can find the following file structure:

results/
|-- FVKEQ
|   |-- library_source_testpath1.1.fastq.json
|   |-- library_source_testpath1.2.fastq.json
|   |-- read_layout_testpath1.1.fastq.json
|   `-- read_layout_testpath1.2.fastq.json
|-- HGLR5
|   |-- library_source_testpath2.1.fastq.json
|   `-- read_layout_testpath2.1.fastq.json
|-- htsinfer_SRR1.json
|-- htsinfer_SRR2.json
`-- samples_htsinfer.tsv

All results are stored under the output directory you have specified in your config.yaml file (results/ in this case). For each sample, a JSON file with the htsinfer_ prefix is generated that contains the inferred metadata. All information that could be determined is stored in the file samples_htsinfer.tsv, which can later be used in the main ZARP workflow.
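
To inspect the per-sample JSON files together, they can be gathered as in the following sketch. It relies only on the htsinfer_ file name prefix shown in the listing and makes no assumption about the keys inside each file; load_htsinfer_reports is a hypothetical helper.

```python
import json
from pathlib import Path

PREFIX = "htsinfer_"


def load_htsinfer_reports(results_dir):
    """Return {sample: parsed JSON} for every htsinfer_*.json in results_dir."""
    reports = {}
    for path in sorted(Path(results_dir).glob(f"{PREFIX}*.json")):
        sample = path.stem[len(PREFIX):]  # "htsinfer_SRR1.json" -> "SRR1"
        with path.open() as handle:
            reports[sample] = json.load(handle)
    return reports
```

For the listing above, load_htsinfer_reports("results") would return a dictionary with the keys SRR1 and SRR2.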