Usage

General concepts

SNS consists of multiple routes (or workflows). Each route contains multiple segments (or steps).

The sample sheets and most output files are in CSV format for compatibility with macOS Quick Look (spacebar file preview).

Brief overview

Download the code into an empty project directory.

git clone --depth 1 https://github.com/igordot/sns

Specify the reference genome.

sns/generate-settings <genome>

Generate a sample sheet based on a directory of FASTQ files.

sns/gather-fastqs <fastq_dir>

Run the analysis using a specific route.

sns/run <route>

Check for problems.

grep "ERROR:" logs-sbatch/*

Detailed description for new users

Add git to the environment (git is not available on UltraViolet/BigPurple by default).

module add git

Navigate to a clean new project directory. This is where all the results will end up.

cd <project_dir>

Download the code from GitHub, which will create the sns sub-directory with all the pipeline code.

git clone --depth 1 https://github.com/igordot/sns

Each project directory should have its own copy of the code for reproducibility. If you modify the code for one project, the changes will not affect other projects. If you repeat the analysis with additional samples at a later time, the same code will be used.

Specify the reference genome (such as hg38 or hg19 for human, mm10 for mouse, dm6 or dm3 for fly).

sns/generate-settings <genome>

This will create settings.txt, which contains the information about the reference files and certain project settings.

Search a directory of FASTQ files to be used as input and generate a sample sheet.

sns/gather-fastqs <fastq_dir>

All FASTQ files found will be added to the samples.fastq-raw.csv file. This command can be run multiple times if there are FASTQs in different directories.

The first column is the sample name, the second column is the R1 FASTQ, and the third column is the R2 FASTQ (if available). Each line contains a single FASTQ (or a pair of FASTQs for paired-end experiments). If a single sample has multiple FASTQs, each one will be on a separate line. Multiple FASTQs for the same sample will be merged based on the sample name.

The sample names are automatically set based on the FASTQ file names. The file can be edited to add or remove samples or to change the sample names. All downstream file names will be based on the sample names specified here, so it’s helpful to set them to something easily interpretable.
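Before starting a run, it can help to confirm how many FASTQ entries each sample name will merge. The following is a minimal sketch, not part of SNS: the count_fastqs_per_sample function is a hypothetical helper, and the demo sample names and paths are invented. It assumes the sample sheet has no header row and no quoted commas, which may not hold for every file.

```shell
# Hypothetical helper (not part of SNS): count sample sheet lines per sample name.
# Assumes a plain comma-separated file with no header row and no quoted commas.
count_fastqs_per_sample() {
  awk -F',' 'NF && $1 != "" { n[$1]++ } END { for (s in n) print s "," n[s] }' "$1" | sort
}

# Demo on a made-up sample sheet (sample names and paths are invented).
demo_sheet="$(mktemp)"
cat > "$demo_sheet" <<'EOF'
sampleA,/data/run1/sampleA_L001_R1.fastq.gz,/data/run1/sampleA_L001_R2.fastq.gz
sampleA,/data/run1/sampleA_L002_R1.fastq.gz,/data/run1/sampleA_L002_R2.fastq.gz
sampleB,/data/run1/sampleB_L001_R1.fastq.gz,
EOF
count_fastqs_per_sample "$demo_sheet"
rm -f "$demo_sheet"
```

In a project directory, the same function could be pointed at the real file with count_fastqs_per_sample samples.fastq-raw.csv. A sample with more lines than expected usually means that two different samples ended up with the same name.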

Run the analysis using a specific route (a set of analysis steps).

sns/run <route>

If the sample sheet includes previously processed samples, those will be skipped.

Check if the jobs are submitted and running.

squeue -u $USER

Check for potential problems.

grep "ERROR:" logs-sbatch/*

You can start checking for errors as soon as the pipeline starts running, but the check must be repeated after all the jobs complete. If everything ran without problems, this command produces no output. If any errors are detected, examine the log file where they were found to see the full context.
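When there are many log files, a per-file summary can make it easier to decide which ones to open first. This is a minimal sketch, not part of SNS: list_error_logs is a hypothetical helper that wraps the same grep pattern and defaults to the logs-sbatch directory used above.

```shell
# Hypothetical helper (not part of SNS): list log files that contain "ERROR:"
# along with the number of matching lines, in the form "path:count".
list_error_logs() {
  log_dir="${1:-logs-sbatch}"
  # -H always prints the file name, -c prints the count of matching lines per file.
  # awk keeps only the files with at least one match.
  grep -Hc "ERROR:" "$log_dir"/* 2>/dev/null | awk -F':' '$NF > 0'
}

# Usage from the project directory:
#   list_error_logs
#   list_error_logs logs-sbatch
```

No output means that no log file in the directory contains the pattern, which matches the expected result of the plain grep command.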

If there is a problem with any of the results, delete the problematic files and re-run SNS. The pipeline will skip existing files and generate any missing output. Similarly, you can add entries to the sample sheet, and only the new ones will be processed when the route is re-run.