
Output files

Here you can find an overview of the output files for the different workflows.

Outputs of ZARP

After running the ZARP workflow, you will find several output files in the specified output directory. The output directory is defined in the config.yaml file and is normally called results. Here are some of the key output files:

  • Quality control: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.

  • Quantification files: These files contain the gene and transcript level expression values for each sample. They provide information about the abundance of each gene / transcript in the RNA-seq data.

  • Alignment files: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.

After a run you will find the following structure within the results directory:

.
├── multiqc_config.yaml
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca

A description of the different directories is shown below:

  • results: The main output directory for the ZARP workflow.
    • mus_musculus: A subdirectory for the organism-specific results.
      • multiqc_summary: Summary files generated by MultiQC.
      • samples: Sample-specific outputs. A directory is created for each sample.
      • summary_kallisto: Summary files for Kallisto quantifications.
      • summary_salmon: Summary files for Salmon quantifications.
      • zpca: Output files for ZARP's principal component analysis.

Quality Control (QC) outputs

Within the multiqc_summary directory, you will find an interactive HTML file (multiqc_report.html) with various QC metrics that can help you interpret your results. An example file is shown below.

On the left you can find a navigation bar that takes you to the sections and subsections for the individual tools.

  • The General Statistics section summarizes most tools. It reports statistics such as the number of mapped reads, the percentage of duplicate reads, and the percentage of adapters trimmed.
  • The FastQC: raw reads section contains plots and quality statistics of the FASTQ files. Examples include the number of duplicate reads in an experiment, the average quality of the reads per position, and the percentage of GC content.
  • The Cutadapt: adapter removal and Cutadapt: polyA tails removal sections show the number or percentage of reads trimmed.
  • The FastQC: trimmed reads section contains plots and quality statistics of the FASTQ files after adapter trimming. The plots are similar to those in the FastQC: raw reads section.

  • The STAR section shows the number and percentage of reads that are mapped using the STAR aligner.

  • The ALFA section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers.
  • The TIN section shows the Transcript Integrity Number of the samples.
  • The Salmon section shows the fragment length distribution of the reads.
  • The Kallisto section shows the number of reads that were aligned.
  • Finally, the zpca Salmon and Kallisto sections show PCA plots for expression levels of genes and transcripts.

Quantification (Gene and transcript estimate) outputs

Within the summary_kallisto directory, you can find the following files:

  • genes_counts.tsv: Matrix with the gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
  • genes_tpm.tsv: Matrix with the gene TPM estimates.
  • transcripts_counts.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
  • transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
  • tx2geneID.tsv: A table mapping transcript IDs to gene IDs.

Within the summary_salmon/quantmerge directory, you can find the following files:

  • genes_numreads.tsv: Matrix with the gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
  • genes_tpm.tsv: Matrix with the gene TPM estimates.
  • transcripts_numreads.tsv: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
  • transcripts_tpm.tsv: Matrix with the transcript TPM estimates.
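As a rough illustration of how these matrices could be read for downstream work, the sketch below parses a tab-separated matrix using only the Python standard library. It assumes the layout described above (feature names in the first column, sample names in the header row) and parses values as floats, since estimated counts can be fractional. The function name read_matrix and the tiny inline example are hypothetical and not part of ZARP.

```python
import csv
import io


def read_matrix(handle):
    """Read a tab-separated matrix.

    Assumed layout: the header row holds the sample names and the first
    column holds the gene or transcript names.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell labels the index column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        # Values are parsed as floats because estimates may be fractional.
        matrix[row[0]] = {s: float(v) for s, v in zip(samples, row[1:])}
    return samples, matrix


# Made-up data standing in for a file such as genes_numreads.tsv.
example = "gene\tsample_A\tsample_B\nGene1\t10\t12.5\nGene2\t0\t3\n"
samples, counts = read_matrix(io.StringIO(example))
print(samples)
print(counts["Gene1"]["sample_B"])
```

For a real file, pass an open file handle instead of the io.StringIO object, for example open("results/mus_musculus/summary_salmon/quantmerge/genes_numreads.tsv"), adjusting the path to your own output directory and organism.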

Alignment outputs

Within the samples directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can easily be opened in a genome browser or used for other downstream analyses:

  • In the map_genome directory you can find a file with the suffix .Aligned.sortedByCoord.out.bam and the corresponding index (.bai) file. This is the output of the STAR aligner.
  • In the bigWig directory you can find two folders, UniqueMappers and MultimappersIncluded. Within these folders you find the bigWig files for the plus and minus strand. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
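To collect the STAR alignments of all samples, something along the following lines may be useful. It is a minimal sketch that assumes the directory layout described above (an organism directory containing samples/<sample>/map_genome/). The helper name find_bams is hypothetical, and the default path "results" is only the usual output directory name.

```python
from pathlib import Path

# Suffix of the STAR output files, as described above.
BAM_SUFFIX = ".Aligned.sortedByCoord.out.bam"


def find_bams(results_dir):
    """Map each sample directory name to its coordinate-sorted BAM file.

    Assumed layout: <results_dir>/<organism>/samples/<sample>/map_genome/
    """
    pattern = f"*/samples/*/map_genome/*{BAM_SUFFIX}"
    bams = {}
    for bam in sorted(Path(results_dir).glob(pattern)):
        sample = bam.parent.parent.name  # the directory above map_genome
        bams[sample] = bam
    return bams


# Prints nothing if the directory does not exist or contains no matches.
for sample, bam in find_bams("results").items():
    print(sample, bam)
```

The glob pattern matches only file names that end with the BAM suffix, so the accompanying index files are not picked up.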

Outputs of download SRA data

Once you run the pipeline that downloads data from the Sequence Read Archive (SRA) you can find the following file structure:

results/
`-- sra_downloads
    |-- compress
    |   |-- ERR2248142
    |   |   |-- ERR2248142.fastq.gz
    |   |   `-- ERR2248142.se.tsv
    |   |-- SRR18549672
    |   |   |-- SRR18549672.pe.tsv
    |   |   |-- SRR18549672_1.fastq.gz
    |   |   `-- SRR18549672_2.fastq.gz
    |   `-- SRR18552868
    |       |-- SRR18552868.fastq.gz
    |       `-- SRR18552868.se.tsv
    |-- fasterq_dump
    |   `-- tmpdir
    |-- get_layout
    |   |-- ERR2248142
    |   |   `-- SINGLE.info
    |   |-- SRR18549672
    |   |   `-- PAIRED.info
    |   `-- SRR18552868
    |       `-- SINGLE.info
    |-- prefetch
    |   |-- ERR2248142
    |   |   `-- ERR2248142.sra
    |   |-- SRR18549672
    |   |   `-- SRR18549672.sra
    |   `-- SRR18552868
    |       `-- SRR18552868.sra
    `-- sra_samples.out.tsv

All results are stored under the output directory you have specified in your config.yaml file (results in this case). The sra_samples.out.tsv file summarizes all the experiments that were fetched from SRA. It contains the accession of each experiment (for example SRR or ERR) and the path to its FASTQ file(s). An example output file looks like the following:

sample  fq1     fq2
SRR18552868     results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz 
SRR18549672     results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz       results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
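ERR2248142      results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz 

Some of the filenames indicate whether the experiment was sequenced in SINGLE (se) or PAIRED (pe) end mode: the .se.tsv and .pe.tsv files in the compress directory and the SINGLE.info and PAIRED.info files in the get_layout directory.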
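The layout can also be read off the summary table itself, since single-end samples have no second FASTQ file. The following sketch assumes that sra_samples.out.tsv is tab-separated with the columns sample, fq1 and fq2 as in the example above, and that fq2 is empty for single-end samples. The function name read_sra_table is hypothetical.

```python
import csv
import io


def read_sra_table(handle):
    """Return a mapping from sample accession to 'SINGLE' or 'PAIRED'.

    Assumes a tab-separated table with the columns sample, fq1 and fq2,
    where fq2 is empty (or missing) for single-end samples.
    """
    layouts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        fq2 = (row.get("fq2") or "").strip()
        layouts[row["sample"]] = "PAIRED" if fq2 else "SINGLE"
    return layouts


# The example table from above, written with explicit tab separators.
base = "results/sra_downloads/compress"
example = (
    "sample\tfq1\tfq2\n"
    f"SRR18552868\t{base}/SRR18552868/SRR18552868.fastq.gz\t\n"
    f"SRR18549672\t{base}/SRR18549672/SRR18549672_1.fastq.gz"
    f"\t{base}/SRR18549672/SRR18549672_2.fastq.gz\n"
    f"ERR2248142\t{base}/ERR2248142/ERR2248142.fastq.gz\t\n"
)
layouts = read_sra_table(io.StringIO(example))
for sample, layout in layouts.items():
    print(sample, layout)
```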

Outputs of HTSinfer

Once you run the pipeline that infers metadata you can find the following file structure:

results/
|-- FVKEQ
|   |-- library_source_testpath1.1.fastq.json
|   |-- library_source_testpath1.2.fastq.json
|   |-- read_layout_testpath1.1.fastq.json
|   `-- read_layout_testpath1.2.fastq.json
|-- HGLR5
|   |-- library_source_testpath2.1.fastq.json
|   `-- read_layout_testpath2.1.fastq.json
|-- htsinfer_SRR1.json
|-- htsinfer_SRR2.json
`-- samples_htsinfer.tsv

All results are stored under the output directory you have specified in your config.yaml file (results in this case). For each sample, a JSON file with the htsinfer_ prefix is generated that contains the inferred metadata. All information that could be determined is stored in the file samples_htsinfer.tsv, which can later be used in the main ZARP pipeline.
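To inspect the per-sample JSON files programmatically, a sketch like the one below could be used. It only assumes the htsinfer_<sample>.json naming shown in the tree above and makes no assumption about the keys inside each file. The function name load_htsinfer_results is hypothetical.

```python
import json
from pathlib import Path

PREFIX = "htsinfer_"


def load_htsinfer_results(results_dir):
    """Load every htsinfer_<sample>.json file found directly in results_dir.

    Returns a dict mapping the sample name (taken from the file name)
    to the parsed JSON content, whatever its structure is.
    """
    results = {}
    for path in sorted(Path(results_dir).glob(f"{PREFIX}*.json")):
        sample = path.stem[len(PREFIX):]  # "htsinfer_SRR1" -> "SRR1"
        with path.open() as handle:
            results[sample] = json.load(handle)
    return results


# Prints nothing if the directory does not exist or holds no such files.
for sample, metadata in load_htsinfer_results("results").items():
    print(sample, type(metadata).__name__)
```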