Output files¶
Here you can find an overview of the output files for the different workflows.
Outputs of ZARP¶
After running the ZARP workflow, you will find several output files in the
specified output directory. The output directory is defined in the config.yaml
file and is normally called results/. Here are some of the key output files:
- Quality control: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.
- Quantification files: These files contain the gene- and transcript-level expression values for each sample. They provide information about the abundance of each gene and transcript in the RNA-seq data.
- Alignment files: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.
After a run, you will find all outputs within the results/
directory, arranged in the following general structure:
.
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca
Here is a description of the different subdirectories:
- mus_musculus/: A subdirectory for organism-specific results.
- multiqc_summary/: Summary files generated by MultiQC.
- samples/: Sample-specific outputs. One directory is created for each sample.
- summary_kallisto/: Summary files for Kallisto quantifications.
- summary_salmon/: Summary files for Salmon quantifications.
- zpca/: Output files for ZARP's principal component analysis.
Quality Control outputs¶
Within the multiqc_summary/ directory, you will find an interactive HTML
file, multiqc_report.html, with various quality control (QC) metrics that can
help you interpret your results. An example file is shown below:

On the left you can find a navigation bar that directs you to different sections and subsections for various tools:
- The General Statistics section summarizes the output of most tools. It reports statistics such as the number of mapped reads, the percentage of duplicate reads, and the percentage of adapters trimmed:

- The FastQC: raw reads section contains plots and quality statistics of your input FASTQ files. Some examples are shown below, such as the number of duplicate reads in an experiment, the average sequencing quality per position, and the percentage of GC content:



- The Cutadapt: adapter removal and Cutadapt: polyA tails removal sections show the number or the percentage of reads trimmed:

- The FastQC: trimmed reads section contains plots and quality statistics of the FASTQ files after adapter trimming. The plots are similar to those in the FastQC: raw reads section.
- The STAR section shows the number and percentage of reads that are mapped using the STAR aligner:

- The ALFA section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers:


- The TIN section shows the Transcript Integrity Number of the samples:

- The Salmon section shows the fragment length distribution of the reads:

- The Kallisto section shows the number of reads that were aligned:

- Finally, the zpca Salmon and Kallisto sections show PCA plots for the expression levels of genes and transcripts:

Quantification outputs¶
Within the summary_kallisto directory, you can find the following files:
- genes_counts.tsv: A table with gene counts. The first column (index) contains the gene names and the first row (column names) contains the sample names. This file can later be used for downstream differential expression analysis.
- genes_tpm.tsv: A table with gene TPM estimates.
- transcripts_counts.tsv: A table with transcript counts. The first column (index) contains the transcript names and the first row (column names) contains the sample names. This file can later be used for downstream differential transcript analysis.
- transcripts_tpm.tsv: A table with the transcript TPM estimates.
- tx2geneID.tsv: A table mapping transcript IDs to gene IDs.
Within the summary_salmon/quantmerge directory, you can find the following
files:
- genes_numreads.tsv: A table with gene counts. The first column (index) contains the gene names and the first row (column names) contains the sample names. This file can later be used for downstream differential expression analysis.
- genes_tpm.tsv: Matrix with the gene TPM estimates.
- transcripts_numreads.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column names) contains the sample names. This file can later be used for downstream differential transcript analysis.
- transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
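The summary tables described above can be loaded with a few lines of Python. The following is a minimal sketch, not part of ZARP itself. It assumes the files are tab-separated and that the header row starts with an index label followed by one column per sample. The function name `read_expression_table` and the example path are hypothetical.

```python
import csv
from pathlib import Path


def read_expression_table(path):
    """Read a summary table into (sample_names, {feature_id: [values]}).

    Assumption: tab-separated file whose header row holds an index label
    in the first cell, followed by one column per sample.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # skip the index label in the first cell
        values = {}
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            values[row[0]] = [float(cell) for cell in row[1:]]
    return samples, values


if __name__ == "__main__":
    # Hypothetical location; adjust to your own output directory.
    table = Path("results/mus_musculus/summary_salmon/quantmerge/genes_tpm.tsv")
    if table.exists():
        samples, values = read_expression_table(table)
        print(f"{len(values)} genes across {len(samples)} samples")
```

The same function should work for the count tables and for the files in summary_kallisto, provided they follow the layout assumed here.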
Alignment outputs¶
Within the samples directory, you can find a directory for each sample, and
within these directories you can find the output files of the individual
steps. Some alignment files can be opened directly in a genome browser or used
for other downstream analyses:
- In the map_genome directory you can find a file with the suffix .Aligned.sortedByCoord.out.bam and the corresponding index (.bai) file. This is the output of the STAR aligner.
- In the bigWig directory you can find two folders, UniqueMappers and MultimappersIncluded. Within these folders you find the bigWig files for the plus and minus strand. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
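To collect the alignment files of all samples, for example to load them into a genome browser in one go, a short script can help. The following is a sketch under assumptions: the exact nesting of map_genome inside each sample directory is not spelled out here, so the search is recursive. The function name `find_alignments` and the example path are hypothetical.

```python
from pathlib import Path

# Suffix of the coordinate-sorted STAR output described above.
BAM_SUFFIX = ".Aligned.sortedByCoord.out.bam"


def find_alignments(samples_dir):
    """Map each sample directory name to its coordinate-sorted BAM files."""
    alignments = {}
    for sample_dir in sorted(Path(samples_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        # Search recursively, keeping only files located in a map_genome directory.
        alignments[sample_dir.name] = sorted(
            bam
            for bam in sample_dir.rglob(f"*{BAM_SUFFIX}")
            if bam.parent.name == "map_genome"
        )
    return alignments


if __name__ == "__main__":
    # Hypothetical location; adjust to your own output directory.
    samples_dir = Path("results/mus_musculus/samples")
    if samples_dir.is_dir():
        for sample, bams in find_alignments(samples_dir).items():
            print(sample, *bams, sep="\t")
```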
Outputs of the SRA download workflow¶
Once you run the workflow that downloads data from the Sequence Read Archive (SRA), you can find the following file structure:
results/
`-- sra_downloads
    |-- compress
    |   |-- ERR2248142
    |   |   |-- ERR2248142.fastq.gz
    |   |   `-- ERR2248142.se.tsv
    |   |-- SRR18549672
    |   |   |-- SRR18549672.pe.tsv
    |   |   |-- SRR18549672_1.fastq.gz
    |   |   `-- SRR18549672_2.fastq.gz
    |   `-- SRR18552868
    |       |-- SRR18552868.fastq.gz
    |       `-- SRR18552868.se.tsv
    |-- fasterq_dump
    |   `-- tmpdir
    |-- get_layout
    |   |-- ERR2248142
    |   |   `-- SINGLE.info
    |   |-- SRR18549672
    |   |   `-- PAIRED.info
    |   `-- SRR18552868
    |       `-- SINGLE.info
    |-- prefetch
    |   |-- ERR2248142
    |   |   `-- ERR2248142.sra
    |   |-- SRR18549672
    |   |   `-- SRR18549672.sra
    |   `-- SRR18552868
    |       `-- SRR18552868.sra
    `-- sra_samples.out.tsv
All results are stored under the output directory you have specified in your
config.yaml file (results/ in this case). The sra_samples.out.tsv file
summarizes all the experiments that were fetched from SRA. For each experiment,
it lists the run accession (for example, SRR18552868) and the path(s) to the
FASTQ file(s). An example output file looks like the following:
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
Single vs. paired-end sequencing
Some of the filenames indicate whether the sample was sequenced in
SINGLE-end (se) or PAIRED-end (pe) mode.
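The sra_samples.out.tsv table can be read programmatically, for instance to check which samples are paired-end before passing them on. The following is a minimal sketch that assumes the file is tab-separated with the header columns sample, fq1 and fq2 shown in the example, and that fq2 is empty or missing for single-end samples. The function name `read_sra_samples` is hypothetical.

```python
import csv
from pathlib import Path


def read_sra_samples(path):
    """Return one dict per sample with keys: sample, fq1, fq2, layout."""
    records = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # A missing or empty fq2 field is taken to mean single-end data.
            fq2 = (row.get("fq2") or "").strip()
            records.append(
                {
                    "sample": row["sample"],
                    "fq1": row["fq1"],
                    "fq2": fq2 or None,
                    "layout": "PAIRED" if fq2 else "SINGLE",
                }
            )
    return records


if __name__ == "__main__":
    table = Path("results/sra_downloads/sra_samples.out.tsv")
    if table.exists():
        for record in read_sra_samples(table):
            print(record["sample"], record["layout"])
```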
Outputs of the metadata inference workflow¶
Once you run the workflow that infers metadata you can find the following file structure:
results/
|-- FVKEQ
|   |-- library_source_testpath1.1.fastq.json
|   |-- library_source_testpath1.2.fastq.json
|   |-- read_layout_testpath1.1.fastq.json
|   `-- read_layout_testpath1.2.fastq.json
|-- HGLR5
|   |-- library_source_testpath2.1.fastq.json
|   `-- read_layout_testpath2.1.fastq.json
|-- htsinfer_SRR1.json
|-- htsinfer_SRR2.json
`-- samples_htsinfer.tsv
All results are stored under the output directory you have specified in your
config.yaml file (results/ in this case). For each sample, a JSON file with the
htsinfer_ prefix is generated, containing the inferred metadata. All
information that could be determined is stored in the file
samples_htsinfer.tsv, which can later be used in the main ZARP workflow.
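To inspect the per-sample JSON files, they can be loaded with Python's standard library. The sketch below makes no assumptions about the fields inside each file and only prints the top-level keys. It does assume that the part of the filename after the htsinfer_ prefix is the sample name, as the example tree suggests. The function name `load_htsinfer_results` is hypothetical.

```python
import json
from pathlib import Path

PREFIX = "htsinfer_"


def load_htsinfer_results(results_dir):
    """Map each sample name to the parsed content of its htsinfer_*.json file."""
    results = {}
    for json_path in sorted(Path(results_dir).glob(f"{PREFIX}*.json")):
        # Assumption: the filename stem after the prefix is the sample name.
        sample = json_path.stem[len(PREFIX):]
        with open(json_path) as handle:
            results[sample] = json.load(handle)
    return results


if __name__ == "__main__":
    for sample, metadata in load_htsinfer_results("results").items():
        # Only list the top-level keys; the nested structure is not assumed here.
        keys = sorted(metadata) if isinstance(metadata, dict) else []
        print(sample, keys)
```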