Execution of pipelines

ZARP consists of three pipelines: the main pipeline processes the data, the second downloads sequencing libraries from the Sequence Read Archive (SRA), and the third populates the sample table and determines sample-specific parameters.

Once you have created a samples.tsv file and filled in the metadata for your sequencing experiments, the main pipeline can analyze your data.

Prerequisites

The code below assumes that you have already installed ZARP.
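
If you followed a Conda-based installation, you typically also need to activate the corresponding environment before running any of the commands below. The environment name zarp used here is an assumption, matching the environment referred to in the examples further down; adjust it if yours differs:

conda activate zarp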

How to run ZARP

  1. Assuming that your current directory is the workflow repository's root directory, create a directory for your workflow run and move into it with:

    mkdir config/my_run
    cd config/my_run
    
  2. Create an empty sample table and a workflow configuration file:

    touch samples.tsv
    touch config.yaml
    
  3. Use your editor of choice to populate these files with appropriate values. Have a look at the examples in the tests/ directory to see what the files should look like.
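
    As a purely illustrative sketch (the column names sample, fq1 and fq2 are the ones referred to elsewhere in this documentation; the file paths are placeholders and additional columns may be required), samples.tsv is a tab-separated table with one row per sequencing library:

    sample                  fq1                           fq2
    my_paired_end_sample    /path/to/reads_1.fastq.gz     /path/to/reads_2.fastq.gz
    my_single_end_sample    /path/to/reads.fastq.gz

    For config.yaml, the example shown in the "Execution with Docker" section below lists the required fields (samples, output_dir, log_dir, cluster_log_dir and the four index directories); the same fields apply here, with paths adjusted to your setup.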

  4. Create a runner script. Pick one of the following options for either local or cluster execution. Before running the respective command, remember to update the argument of the --singularity-args option of the corresponding profile (file: profiles/{profile}/config.yaml) so that it contains a comma-separated list of all directories containing the input data files (samples, annotation files, etc.) required for your run.
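
    For example (a sketch only; the existing value and any additional arguments in the profile file take precedence, and the directory paths are placeholders), the relevant line in profiles/{profile}/config.yaml could end up looking like this:

    singularity-args: "--bind /path/to/fastq_files,/path/to/annotation_files"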

    Runner script for local execution:

    cat << "EOF" > run.sh
    #!/bin/bash
    
    snakemake \
        --profile="../../profiles/local-singularity" \
        --configfile="config.yaml"
    
    EOF
    

    OR

    Runner script for Slurm cluster execution (note that you may need to modify the arguments to --jobs and --cores in the file profiles/slurm-singularity/config.yaml, depending on your HPC and workload manager configuration):

    cat << "EOF" > run.sh
    #!/bin/bash
    mkdir -p logs/cluster_log
    snakemake \
        --profile="../profiles/slurm-singularity" \
        --configfile="config.yaml"
    EOF
    

    Note: When running the pipeline with Conda, use the local-conda and slurm-conda profiles instead.

    Note: The Slurm profiles are adapted to a cluster that uses the quality-of-service (QOS) keyword. If QOS is not supported by your Slurm instance, remove all lines containing "qos" from profiles/slurm-config.json.
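
    For orientation, in a typical Snakemake cluster configuration such entries look roughly like the following (a generic illustration with a placeholder QOS name, not the literal content of the ZARP file); these are the kind of lines to remove if your cluster has no QOS:

    {
        "__default__":
        {
            "qos": "6hours"
        }
    }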

  5. Start your workflow run:

    bash run.sh
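
    Optionally, before starting the actual run you can preview the jobs that would be executed by using Snakemake's standard dry-run flag (-n / --dry-run), for example with the local profile:

    snakemake \
        --profile="../../profiles/local-singularity" \
        --configfile="config.yaml" \
        -n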
    
  6. For more information on the output files, please see the output files section.

How to download data from SRA?

An independent Snakemake workflow, workflow/rules/sra_download.smk, is included for downloading sequencing libraries from the Sequence Read Archive (SRA) and converting them into FASTQ files.

The workflow expects the following parameters in the configuration file:

  • samples, a sample table (tsv) with a column sample containing SRR identifiers (ERR and DRR are also supported), as in this example samples.tsv file
  • outdir, an output directory
  • samples_out, a pointer to a modified sample table with the locations of the corresponding FASTQ files
  • cluster_log_dir, the cluster log directory
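
For this example, based on the identifiers that appear in the expected output further below, the input table tests/input_files/sra_samples.tsv presumably contains little more than a single sample column, e.g.:

sample
SRR18552868
SRR18549672
ERR2248142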

For executing the example with Conda environments, one can use the following command (from within the activated zarp Conda environment):

snakemake --snakefile="workflow/rules/sra_download.smk" \
          --profile="profiles/local-conda" \
          --config samples="tests/input_files/sra_samples.tsv" \
                   outdir="results/sra_downloads" \
                   samples_out="results/sra_downloads/sra_samples.out.tsv" \
                   log_dir="logs" \
                   cluster_log_dir="logs/cluster_log"

Alternatively, change the argument to --profile from local-conda to local-singularity to execute the workflow steps within Singularity containers.

After successful execution, results/sra_downloads/sra_samples.out.tsv should contain:

sample  fq1     fq2
SRR18552868     results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz 
SRR18549672     results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz       results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142      results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz 
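
Since this table uses the sample, fq1 and fq2 columns expected by the main ZARP sample table, one way to proceed (a suggestion, not a documented step of the workflow) is to point the samples field of your main run's config.yaml at this file, adjusting the path to your setup, and then add any remaining required metadata:

# hypothetical snippet for a main-run config.yaml
samples: "results/sra_downloads/sra_samples.out.tsv"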

How to determine sample information (HTSinfer)?

An independent Snakemake workflow, workflow/rules/htsinfer.smk, populates the samples.tsv required by ZARP with the sample-specific parameters seqmode, f1_3p, f2_3p, organism, libtype and index_size. These parameters are inferred from the provided fastq.gz files by HTSinfer.

Note: The workflow uses the implicit temporary directory from Snakemake, which is referenced via resources.tmpdir.
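
If you need to redirect this temporary directory, e.g. to a scratch space, one generic Snakemake mechanism (not specific to this workflow; the path below is a placeholder) is to add the tmpdir default resource as an extra line to the snakemake call shown further down:

    --default-resources "tmpdir='/scratch/my_tmp'" \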

The workflow expects the following configuration parameters:

  • samples, a sample table (tsv) with a column sample containing sample identifiers, as well as columns fq1 and fq2 containing the paths to the input fastq files (see the example here). If the table contains further ZARP-compatible columns (see the pipeline documentation), the values specified there by the user take priority over HTSinfer's results.
  • outdir, an output directory
  • samples_out, the path to a modified sample table with the inferred parameters
  • records, set to 100000 by default

To execute the example, one can use the following command (from within the activated zarp environment):

cd tests/test_htsinfer_workflow
snakemake \
    --snakefile="../../workflow/rules/htsinfer.smk" \
    --restart-times=0 \
    --profile="../../profiles/local-conda" \
    --config outdir="results" \
             samples="../input_files/htsinfer_samples.tsv" \
             samples_out="samples_htsinfer.tsv" \
             log_dir="logs" \
             cluster_log_dir="logs/cluster_log" \
    --notemp \
    --keep-incomplete

However, this call will exit with an error, as not all parameters can be inferred from the example files. The argument --keep-incomplete makes sure the samples_htsinfer.tsv file can nevertheless be inspected.

After successful execution, if all parameters could either be inferred or were specified by the user, [OUTDIR]/[SAMPLES_OUT] should contain a populated table with the parameters seqmode, f1_3p, f2_3p, organism, libtype and index_size.

Execution with Docker

ZARP is optimised for Linux, as all required packages are available via Conda or Apptainer (Singularity). On other systems such as macOS these options may not work, in particular because of the transition from Intel to ARM processors (M series). We therefore built a Docker container that can be used to run ZARP in such environments.

  1. Install Docker following the instructions here

  2. Pull the Docker image that contains the necessary dependencies:

    docker pull zavolab/zarp:1.0.0-rc.1
    

  3. Create a directory (e.g. data) and store in it all the files required for a run (an example layout is sketched after the config file below):

    • The genome sequence fasta file
    • The annotation gtf file
    • The fastq files of your experiments
    • The rule_config.yaml for the parameters
    • The samples.tsv containing the metadata of your samples
    • The config.yaml file with the workflow parameters. Below you can find an example file; note that it points to files in the data directory.
      ---
        # Required fields
        samples: "data/samples_docker.tsv"
        output_dir: "data/results"
        log_dir: "data/logs"
        cluster_log_dir: "data/logs/cluster"
        kallisto_indexes: "data/results/kallisto_indexes"
        salmon_indexes: "data/results/salmon_indexes"
        star_indexes: "data/results/star_indexes"
        alfa_indexes: "data/results/alfa_indexes"
        # Optional fields
        rule_config: "data/rule_config.yaml"
        report_description: "No description provided by user"
        report_logo: "../../images/logo.128px.png"
        report_url: "https://zavolan.biozentrum.unibas.ch/"
        author_name: "NA"
        author_email: "NA"
      ...
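
    Putting this together, the contents of the data directory might then look roughly like the following (all file names other than config.yaml, rule_config.yaml and samples_docker.tsv are hypothetical placeholders):

    data/
    ├── genome.fa              # genome sequence (placeholder name)
    ├── annotation.gtf         # gene annotation (placeholder name)
    ├── sample1_R1.fastq.gz    # sequencing reads (placeholder names)
    ├── sample1_R2.fastq.gz
    ├── rule_config.yaml
    ├── samples_docker.tsv
    └── config.yaml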
      
  4. Execute ZARP as follows:

    docker run \
        --platform linux/x86_64 \
        --mount type=bind,source=$PWD/data,target=/data \
        zavolab/zarp:1.0.0-rc.1 \
        snakemake \
        -p \
        --snakefile /workflow/Snakefile \
        --configfile data/config.yaml \
        --cores 4 \
        --use-conda \
        --verbose
    
    The command runs the Docker container zavolab/zarp:1.0.0-rc.1 that we pulled earlier. The --platform linux/x86_64 flag executes it as it would run on a Linux x86_64 platform. The --mount option binds the local data directory that contains the input files to the /data directory inside the container. The pipeline's Snakefile is stored in the container at the path /workflow/Snakefile. Once ZARP is complete, the results will be stored in the data/results directory.
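
    As a quick sanity check after the run (a suggestion, not part of the documented procedure), you can list the mounted output and log directories from the host; the paths follow the output_dir and log_dir values in the example config.yaml above:

    ls data/results
    ls data/logs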