Output files¶
Here you can find an overview of the output files for the different workflows.
Outputs of ZARP¶
After running the ZARP workflow, you will find several output files in the specified output directory. The output directory is defined in the config.yaml file and is normally called results. Here are some of the key output files:
- Quality control: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.
- Quantification files: These files contain the gene and transcript level expression values for each sample. They provide information about the abundance of each gene / transcript in the RNA-seq data.
- Alignment files: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.
After a run you will find the following structure within the results directory:
.
├── multiqc_config.yaml
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca
A description of the different directories is shown below:
- results: The main output directory for the ZARP workflow.
- mus_musculus: A subdirectory for the organism-specific results.
- multiqc_summary: Summary files generated by MultiQC.
- samples: Sample-specific outputs. A directory is created for each sample.
- summary_kallisto: Summary files for Kallisto quantifications.
- summary_salmon: Summary files for Salmon quantifications.
- zpca: Output files for ZARP's principal component analysis.
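As a quick sanity check after a run, the layout described above can be verified with a few lines of Python. This is a minimal sketch, not part of ZARP: the helper name missing_output_dirs is hypothetical, and the organism directory (mus_musculus here) is taken from this example run and will differ for other organisms.

```python
from pathlib import Path

# Subdirectories named in the layout above. This tuple is an illustration,
# not an official list maintained by ZARP.
EXPECTED_SUBDIRS = (
    "multiqc_summary",
    "samples",
    "summary_kallisto",
    "summary_salmon",
    "zpca",
)


def missing_output_dirs(results_dir, organism="mus_musculus"):
    """Return the expected subdirectories that are absent under results_dir/organism."""
    organism_dir = Path(results_dir) / organism
    return [name for name in EXPECTED_SUBDIRS if not (organism_dir / name).is_dir()]
```

Calling missing_output_dirs("results") returns an empty list when every directory from the layout is present, and the names of the absent ones otherwise.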
Quality Control (QC) outputs¶
Within the multiqc_summary directory, you will find an interactive HTML file (multiqc_report.html) with various QC metrics that can help you interpret your results. An example file is shown below.
On the left you can find a navigation bar that takes you into different sections and subsections of the tools.
- The General Statistics section contains a summary of most tools, with statistics such as mapped reads, percent of duplicate reads, and percent of adapters trimmed.
- The FastQC: raw reads section contains plots and quality statistics of the fastq files. Examples include the number of duplicate reads in an experiment, the average quality of the fastq files per position, and the percent of GC content.
- The Cutadapt: adapter removal and Cutadapt: polyA tails removal sections show the number or percentage of reads trimmed.
- The FastQC: trimmed reads section contains plots and quality statistics of the fastq files after adapter trimming. The plots are similar to those in the FastQC: raw reads section.
- The STAR section shows the number and percentage of reads that are mapped using the STAR aligner.
- The ALFA section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers.
- The TIN section shows the Transcript Integrity Number of the samples.
- The Salmon section shows the fragment length distribution of the reads.
- The Kallisto section shows the number of reads that were aligned.
- Finally, the zpca salmon and kallisto sections show PCA plots for expression levels of genes and transcripts.
Quantification (Gene and transcript estimate) outputs¶
Within the summary_kallisto directory, you can find the following files:
- genes_counts.tsv: Matrix with the gene counts. The first column (index) contains the gene names and the first row (header) contains the sample names. This file can later be used for downstream differential expression analysis.
- genes_tpm.tsv: Matrix with the gene TPM estimates.
- transcripts_counts.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (header) contains the sample names. This file can later be used for downstream differential transcript analysis.
- transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
- tx2geneID.tsv: A table mapping transcript IDs to gene IDs.
Within the summary_salmon/quantmerge directory, you can find the following files:
- genes_numreads.tsv: Matrix with the gene counts. The first column (index) contains the gene names and the first row (header) contains the sample names. This file can later be used for downstream differential expression analysis.
- genes_tpm.tsv: Matrix with the gene TPM estimates.
- transcripts_numreads.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (header) contains the sample names. This file can later be used for downstream differential transcript analysis.
- transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
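The matrices described above can be loaded with Python's standard library before handing them to a downstream tool. The following is a minimal sketch under stated assumptions: the files are tab-separated, the header row includes a leading cell for the index column, and all values are numeric. The function names load_matrix and sample_totals are hypothetical helpers, not part of ZARP.

```python
import csv


def load_matrix(path):
    """Read a summary matrix into {feature: {sample: value}}.

    Assumes a tab-separated file whose first row holds the sample names
    (after one leading cell for the index column) and whose first column
    holds the gene or transcript names.
    """
    matrix = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # header[0] labels the index column
        for row in reader:
            if not row:
                continue  # skip blank lines
            feature, values = row[0], row[1:]
            matrix[feature] = {s: float(v) for s, v in zip(samples, values)}
    return matrix


def sample_totals(matrix):
    """Sum the values of every feature per sample (a rough library-size check)."""
    totals = {}
    for values in matrix.values():
        for sample, value in values.items():
            totals[sample] = totals.get(sample, 0.0) + value
    return totals
```

For example, sample_totals(load_matrix("results/mus_musculus/summary_kallisto/genes_counts.tsv")) gives one total per sample, which can help spot samples with unusually few assigned reads. The path shown follows the directory layout of this example run.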
Alignment outputs¶
Within the samples directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can be opened directly in a genome browser or used for other downstream analyses:
- In the map_genome directory you can find a file with the suffix .Aligned.sortedByCoord.out.bam and the corresponding index (.bai) file. This is the output of the STAR aligner.
- In the bigWig directory you can find two folders, UniqueMappers and MultimappersIncluded. Within these folders you find the bigWig files for the plus and minus strands. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
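To collect the alignment files of all samples, for instance to load them into a genome browser session, the directory layout described above can be walked with pathlib. This is a minimal sketch: find_star_bams and has_index are hypothetical helper names, and the exact name of the index file (either the BAM name plus .bai, or .bai replacing .bam) is an assumption, so both are checked.

```python
from pathlib import Path

# Suffix of the STAR output file named in the text above.
BAM_SUFFIX = ".Aligned.sortedByCoord.out.bam"


def find_star_bams(samples_dir):
    """Map each sample directory name to its coordinate-sorted BAM file(s)."""
    bams = {}
    for sample_dir in sorted(Path(samples_dir).iterdir()):
        map_dir = sample_dir / "map_genome"
        if not map_dir.is_dir():
            continue  # not a sample directory, or alignment step not run
        hits = sorted(map_dir.glob("*" + BAM_SUFFIX))
        if hits:
            bams[sample_dir.name] = hits
    return bams


def has_index(bam_path):
    """Check for a .bai index next to the BAM file (file naming is assumed)."""
    bam_path = Path(bam_path)
    candidates = (
        bam_path.with_name(bam_path.name + ".bai"),  # e.g. x.bam.bai
        bam_path.with_suffix(".bai"),                # e.g. x.bai
    )
    return any(candidate.exists() for candidate in candidates)
```

With the layout of this example run, find_star_bams("results/mus_musculus/samples") returns one entry per sample that has a BAM file in its map_genome directory.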
Outputs of download SRA data¶
Once you run the pipeline that downloads data from the Sequence Read Archive (SRA) you can find the following file structure:
results/
`-- sra_downloads
    |-- compress
    |   |-- ERR2248142
    |   |   |-- ERR2248142.fastq.gz
    |   |   `-- ERR2248142.se.tsv
    |   |-- SRR18549672
    |   |   |-- SRR18549672.pe.tsv
    |   |   |-- SRR18549672_1.fastq.gz
    |   |   `-- SRR18549672_2.fastq.gz
    |   `-- SRR18552868
    |       |-- SRR18552868.fastq.gz
    |       `-- SRR18552868.se.tsv
    |-- fasterq_dump
    |   `-- tmpdir
    |-- get_layout
    |   |-- ERR2248142
    |   |   `-- SINGLE.info
    |   |-- SRR18549672
    |   |   `-- PAIRED.info
    |   `-- SRR18552868
    |       `-- SINGLE.info
    |-- prefetch
    |   |-- ERR2248142
    |   |   `-- ERR2248142.sra
    |   |-- SRR18549672
    |   |   `-- SRR18549672.sra
    |   `-- SRR18552868
    |       `-- SRR18552868.sra
    `-- sra_samples.out.tsv
All results are stored under the output directory you have specified in your config.yaml file (results in this case). The sra_samples.out.tsv file summarizes all the experiments that were fetched from SRA. It contains the run accession of each experiment (for example SRR18552868) and the path to its fastq file(s). An example output file looks like the following:
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
The se and pe labels in the file names indicate whether an experiment was sequenced in SINGLE (se) or PAIRED (pe) end mode.
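The summary table can be read programmatically to separate single-end from paired-end experiments. The sketch below assumes the file is tab-separated with the header columns sample, fq1 and fq2 as in the example above, and treats an empty or missing fq2 field as single-end. The function name read_sra_samples is a hypothetical helper.

```python
import csv


def read_sra_samples(path):
    """Return one dict per sample with its fastq paths and inferred layout.

    Assumes a tab-separated file with the columns sample, fq1 and fq2;
    a sample without an fq2 entry is treated as single-end.
    """
    records = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # DictReader fills missing trailing fields with None.
            fq1 = (row.get("fq1") or "").strip()
            fq2 = (row.get("fq2") or "").strip()
            records.append({
                "sample": row["sample"],
                "fq1": fq1,
                "fq2": fq2,
                "layout": "PAIRED" if fq2 else "SINGLE",
            })
    return records
```

Applied to the example file above, the SRR18549672 record would be labeled PAIRED and the other two SINGLE, matching the SINGLE.info and PAIRED.info files in the get_layout directory.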
Outputs of HTSinfer¶
Once you run the pipeline that infers metadata you can find the following file structure:
results/
|-- FVKEQ
|   |-- library_source_testpath1.1.fastq.json
|   |-- library_source_testpath1.2.fastq.json
|   |-- read_layout_testpath1.1.fastq.json
|   `-- read_layout_testpath1.2.fastq.json
|-- HGLR5
|   |-- library_source_testpath2.1.fastq.json
|   `-- read_layout_testpath2.1.fastq.json
|-- htsinfer_SRR1.json
|-- htsinfer_SRR2.json
`-- samples_htsinfer.tsv
All results are stored under the output directory you have specified in your config.yaml file (results in this case). For each sample, a JSON file with the htsinfer_ prefix is generated containing the inferred metadata. All information that could be determined is stored in the file samples_htsinfer.tsv, which can later be used in the main ZARP pipeline.
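To inspect the inferred metadata before passing it on, the per-sample JSON files and the summary table can be loaded with the standard library. This is a minimal sketch that makes no assumption about the keys inside the JSON files or the column names of the table; it only assumes the table is tab-separated with a header row. Both function names are hypothetical helpers.

```python
import csv
import json
from pathlib import Path


def load_htsinfer_results(results_dir):
    """Load every htsinfer_*.json file in results_dir, keyed by file stem."""
    results = {}
    for json_path in sorted(Path(results_dir).glob("htsinfer_*.json")):
        with open(json_path) as handle:
            results[json_path.stem] = json.load(handle)
    return results


def read_samples_table(path):
    """Read samples_htsinfer.tsv as a list of row dicts.

    Assumes a tab-separated file with a header row; column names are
    taken from the file itself rather than hard-coded here.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For the example tree above, load_htsinfer_results("results") would return entries keyed htsinfer_SRR1 and htsinfer_SRR2, whose contents can then be compared against the rows of samples_htsinfer.tsv.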