# Execution of pipelines
In addition to the ZARP workflow for RNA-Seq analysis, this project comes with two auxiliary workflows: one for fetching samples from the Sequence Read Archive (SRA) and one for populating a sparse sample table with inferred sample metadata. This section describes how to run each of these workflows.
**Prerequisites:** All usage examples in this section assume that you have already installed ZARP.
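For instance, if ZARP was installed into a Conda environment named `zarp`, as the examples below assume, the setup can be checked with:

```bash
# Activate the environment the examples below assume and verify that
# Snakemake is available on the PATH.
conda activate zarp
snakemake --version
```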
## How to analyze your RNA-Seq samples?
1. Assuming that your current directory is the workflow repository's root directory, create a directory for your workflow run and move into it with:

    ```bash
    mkdir config/my_run
    cd config/my_run
    ```
2. Create an empty sample table and a workflow configuration file:

    ```bash
    touch samples.tsv
    touch config.yaml
    ```
3. Use your editor of choice to populate these files with appropriate values. Have a look at the examples in the `tests/` directory to see what the files should look like, specifically:

    - `tests/input_files/samples.tsv`
    - `tests/input_files/config.yaml`
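    For illustration, a minimal sample table might look like the sketch below. The column names `sample`, `fq1` and `fq2` are the ones used throughout this section; the file paths are hypothetical placeholders, and the files under `tests/` remain the authoritative reference, in particular for `config.yaml`, whose keys are not covered here.

    ```bash
    # Minimal sketch of a tab-separated sample table; paths are
    # hypothetical placeholders. See tests/ for complete examples.
    cat << "EOF" > samples.tsv
    sample	fq1	fq2
    my_sample	/path/to/my_sample_R1.fastq.gz	/path/to/my_sample_R2.fastq.gz
    EOF
    ```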
4. Create a runner script. Pick one of the following choices for either local or cluster execution. Before running the respective command, remember to update the argument of the `--singularity-args` option in the corresponding profile (file: `profiles/{profile}/config.yaml`) so that it contains a comma-separated list of all directories containing input data files (samples, annotation files, etc.) required for your run; see the sketch at the end of this step.

    Runner script for local execution:
cat << "EOF" > run.sh #!/bin/bash snakemake \ --profile="../../profiles/local-singularity" \ --configfile="config.yaml" EOF
    **OR**
    Runner script for Slurm cluster execution (note that you may need to modify the arguments to `--jobs` and `--cores` in the file `profiles/slurm-singularity/config.yaml`, depending on your HPC and workload manager configuration):

    ```bash
    cat << "EOF" > run.sh
    #!/bin/bash
    mkdir -p logs/cluster_log
    snakemake \
        --profile="../../profiles/slurm-singularity" \
        --configfile="config.yaml"
    EOF
    ```
    **Note:** When running the pipeline with Conda, use the `local-conda` and `slurm-conda` profiles instead.

    **Note:** The Slurm profiles are adapted to a cluster that uses the quality-of-service (QOS) keyword. If QOS is not supported by your Slurm instance, remove all lines containing "qos" from `profiles/slurm-config.json`.
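    If it helps, the bind paths can also be passed on the command line instead of editing the profile; the following sketch assumes hypothetical placeholder paths:

    ```bash
    # Sketch: supply the bind paths directly. Options given on the command
    # line take precedence over the profile's "singularity-args" entry.
    # Replace the placeholder paths with the directories holding your data.
    snakemake \
        --profile="../../profiles/local-singularity" \
        --configfile="config.yaml" \
        --singularity-args="--bind /path/to/samples,/path/to/annotations"
    ```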
5. Start your workflow run:

    ```bash
    bash run.sh
    ```
6. For more information on the output files, please refer to the output files section.
## How to fetch sequencing samples from SRA?
An independent Snakemake workflow, `workflow/rules/sra_download.smk`, is included for downloading sequencing libraries from the Sequence Read Archive and converting them into FASTQ.
The workflow expects the following parameters in the configuration file:

- `samples`, a sample table (TSV) with a column `sample` containing SRR identifiers (ERR and DRR are also supported), as in the example file `tests/input_files/sra_samples.tsv`
- `outdir`, an output directory
- `samples_out`, the path to a modified sample table with the locations of the corresponding FASTQ files
- `log_dir`, a log directory
- `cluster_log_dir`, the cluster log directory
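For illustration, a minimal input table might look like the following sketch (the accessions are taken from the example output below; the file name `my_sra_samples.tsv` is a hypothetical placeholder, and the shipped example is `tests/input_files/sra_samples.tsv`):

```bash
# Minimal sketch of an SRA sample table: a single column "sample"
# holding SRR/ERR/DRR accessions. The file name is a hypothetical
# placeholder; see tests/input_files/sra_samples.tsv for the shipped one.
cat << "EOF" > my_sra_samples.tsv
sample
SRR18552868
SRR18549672
ERR2248142
EOF
```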
To execute the example with Conda environments, use the following command (from within the activated `zarp` Conda environment):
```bash
snakemake --snakefile="workflow/rules/sra_download.smk" \
    --profile="profiles/local-conda" \
    --config samples="tests/input_files/sra_samples.tsv" \
    outdir="results/sra_downloads" \
    samples_out="results/sra_downloads/sra_samples.out.tsv" \
    log_dir="logs" \
    cluster_log_dir="logs/cluster_log"
```
Alternatively, change the argument to `--profile` from `local-conda` to `local-singularity` to execute the workflow steps within Singularity containers.
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:
| sample | fq1 | fq2 |
| --- | --- | --- |
| SRR18552868 | results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz | |
| SRR18549672 | results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz | results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz |
| ERR2248142 | results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz | |
## How to infer sample metadata?
An independent Snakemake workflow, `workflow/rules/htsinfer.smk`, populates the `samples.tsv` required by ZARP with the sample-specific parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size`. These parameters are inferred from the provided `fastq.gz` files by HTSinfer.
**Note:** The workflow uses Snakemake's implicit temporary directory, accessed via `resources.tmpdir`.
The workflow expects the following parameters in the configuration:

- `samples`, a sample table (TSV) with a column `sample` containing sample identifiers, as well as columns `fq1` and `fq2` containing the paths to the input FASTQ files (see the example file `tests/input_files/htsinfer_samples.tsv`). If the table contains further ZARP-compatible columns (see the pipeline documentation), the values specified there by the user are given priority over HTSinfer's results; see the sketch after this list
- `outdir`, an output directory
- `samples_out`, the path to a modified sample table with the inferred parameters
- `records`, set to 100000 by default
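To illustrate the priority rule, the following sketch pre-fills the `organism` column, so the workflow would keep that value rather than HTSinfer's inference; all names, paths and values are hypothetical placeholders:

```bash
# Sketch: "organism" is supplied by the user, so it takes priority over
# HTSinfer's inferred value. All names, paths and values are hypothetical
# placeholders; the table is tab-separated.
cat << "EOF" > my_htsinfer_samples.tsv
sample	fq1	fq2	organism
my_sample	/path/to/R1.fastq.gz	/path/to/R2.fastq.gz	my_organism
EOF
```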
To execute the example, run the following commands (with the `zarp` Conda environment activated):
```bash
cd tests/test_htsinfer_workflow
snakemake \
    --snakefile="../../workflow/rules/htsinfer.smk" \
    --restart-times=0 \
    --profile="../../profiles/local-conda" \
    --config outdir="results" \
    samples="../input_files/htsinfer_samples.tsv" \
    samples_out="samples_htsinfer.tsv" \
    log_dir="logs" \
    cluster_log_dir="logs/cluster_log" \
    --notemp \
    --keep-incomplete
```
However, this call will exit with an error, as not all parameters can be inferred from the example files. The `--keep-incomplete` argument ensures that the `samples_htsinfer.tsv` file can nevertheless be inspected.
After successful execution, that is, if all parameters could either be inferred or were specified by the user, `[OUTDIR]/[SAMPLES_OUT]` should contain a populated table with the parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size`.
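To inspect the result of the example call above, something like the following can be used (the path follows the `[OUTDIR]/[SAMPLES_OUT]` convention just described):

```bash
# Align the tab-separated columns of the populated table for reading;
# the path assumes the example call above was run from
# tests/test_htsinfer_workflow with outdir="results".
column -t -s $'\t' results/samples_htsinfer.tsv
```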