Example Cluster Jobs

From UMass GHPCC User Wiki

Example BWA job

Fruitfly example

bwa_script.sh

#!/bin/bash
# Load BWA
module load bwa/0.7.5a

# Create the index files for BWA's aln step; these may already exist if you
# are using a common reference genome.
# Example: we use a fruitfly.fasta reference (you would need to download this);
# it is assumed that this fasta file already exists at LOCATION_OF_REF.
# Also note that any existing index files for this copy will be overwritten, so keep a backup.
cd ~
mkdir -p index
cd index
cp LOCATION_OF_REF/fruitfly.fasta .
bwa index fruitfly.fasta

# Now align the reads to a .sai file, using 8 threads
bwa aln -t 8 fruitfly.fasta YOUR_FILES.fastq > YOUR_FILES.sai

# Now convert to a SAM file (samse for single-end reads; paired-end data would use sampe)
bwa samse fruitfly.fasta YOUR_FILES.sai YOUR_FILES.fastq > YOUR_FILES.sam

Executing with LSF

To submit the script above with LSF, the following command requests 8 cores on one host and 16GB total RAM (2GB per core):

 bsub -n 8 -q long -W 8:0 -R "rusage[mem=2000]" -R "span[hosts=1]" < bwa_script.sh


Demultiplexing qseq files with barcodes

In the Bustard directory, barcoded single-end (SE) and paired-end (PE) runs are represented by 2 (SE) or 3 (PE) files per tile.

 s_1_1_0001_qseq.txt  reads
 s_1_2_0001_qseq.txt  barcodes
 s_1_3_0001_qseq.txt  second-end reads (paired-end runs only)

There are 120 tiles so the 0001 runs through 0120.
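Anything that needs to touch every tile can loop over the zero-padded tile numbers; a minimal sketch (the file-name pattern follows the listing above):

```shell
# Print the reads file name for each of the 120 tiles (0001 .. 0120)
for tile in $(seq -f '%04g' 1 120); do
    echo "s_1_1_${tile}_qseq.txt"
done
```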

In /share/bin/hpctools there is a utility called demultiplexSeqs, which reads the SE or PE files and sorts reads by barcodes into as many files as there are barcodes, twice that for paired ends.

demultiplexSeqs -1/-2 lane# barcodes.lst
 -1 for SE
 -2 for PE

The barcodes.lst file is just a list of barcodes, one per line, like this; the Unknown entry creates a bin for sequences with unrecognized or damaged barcodes:

ATCACG
CGATGT
TTAGGC
TGACCA
ACAGTG
GCCAAT
Unknown

Run this program in the directory with the *qseq.txt files (in an interactive session or as a batch job). After the program runs there will be a set of files like this, named by barcode, lane, and read end (r1 or r2).

ACAGTG_L003_r1_qseq.txt  CGATGT_L003_r1_qseq.txt  TGACCA_L003_r1_qseq.txt  Unknown_L003_r1_qseq.txt
ACAGTG_L003_r2_qseq.txt  CGATGT_L003_r2_qseq.txt  TGACCA_L003_r2_qseq.txt  Unknown_L003_r2_qseq.txt
ATCACG_L003_r1_qseq.txt  GCCAAT_L003_r1_qseq.txt  TTAGGC_L003_r1_qseq.txt
ATCACG_L003_r2_qseq.txt  GCCAAT_L003_r2_qseq.txt  TTAGGC_L003_r2_qseq.txt
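demultiplexSeqs itself is a compiled site utility, but its core sorting idea can be sketched with paste and awk. The sketch below is hypothetical: it assumes tab-separated qseq records where field 9 holds the sequence, line-aligned reads/barcode files, a hardcoded L001 lane tag, and two made-up demo records.

```shell
# Made-up demo inputs (real tiles have thousands of records each)
printf 'ATCACG\nCGATGT\nUnknown\n' > barcodes.lst
printf 'M\t1\t1\t1\t10\t20\t0\t1\tAAAATTTT\tBBBBBBBB\t1\n' >  s_1_1_0001_qseq.txt
printf 'M\t1\t1\t1\t11\t21\t0\t1\tGGGGCCCC\tBBBBBBBB\t1\n' >> s_1_1_0001_qseq.txt
printf 'M\t1\t1\t1\t10\t20\t0\t2\tATCACG\tBBBBBB\t1\n'     >  s_1_2_0001_qseq.txt
printf 'M\t1\t1\t1\t11\t21\t0\t2\tNNNNNN\tBBBBBB\t1\n'     >> s_1_2_0001_qseq.txt

# Pair each read record with its barcode record, then route by barcode;
# anything not in barcodes.lst falls into the Unknown bin
paste s_1_1_0001_qseq.txt s_1_2_0001_qseq.txt |
awk -F'\t' -v OFS='\t' '
    NR == FNR { known[$1]; next }   # first file: load the barcode list
    {
        bc  = $20                   # sequence field of the pasted barcode record
        bin = (bc in known) ? bc : "Unknown"
        line = $1
        for (i = 2; i <= 11; i++) line = line OFS $i
        print line >> (bin "_L001_r1_qseq.txt")
    }
' barcodes.lst -
```

For paired-end data the same routing would be applied once per read file, mirroring the r1/r2 outputs shown above.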


Concatenating fastq files from HiSeq runs

The HiSeq returns results as compressed FASTQ files split into several parts. To uncompress and concatenate them into one file, run this one-liner in the directory where you have placed the *.fastq.gz files. If there are paired ends, run it twice, once for each set: change R1_ to R2_ for the second run, and change the output file name accordingly.

 ls -1|grep R1_|sort|xargs -i zcat {} >>BASE_NAME_R1.fastq


where BASE_NAME is the filename of the experiment, e.g. PFA2_CGTATT_L003

Example

$ ls -1
PFA2_CGTATT_L003_R1_001.fastq.gz 
PFA2_CGTATT_L003_R1_002.fastq.gz 
PFA2_CGTATT_L003_R1_003.fastq.gz
 ... 
PFA2_CGTATT_L003_R1_040.fastq.gz 
PFA2_CGTATT_L003_R2_001.fastq.gz 
PFA2_CGTATT_L003_R2_002.fastq.gz 
PFA2_CGTATT_L003_R2_003.fastq.gz 
  ...
PFA2_CGTATT_L003_R2_040.fastq.gz 
$ ls -1|grep R1_|sort|xargs -i zcat {} >>PFA2_CGTATT_L003_R1.fastq
$ ls -1|grep R2_|sort|xargs -i zcat {} >>PFA2_CGTATT_L003_R2.fastq
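The pattern can be tried end to end with tiny throwaway files (the DEMO_* names here are made up; -I{} is the portable spelling of xargs -i):

```shell
# Create two small gzipped FASTQ parts (one 4-line record each)
printf '@r1\nACGT\n+\nIIII\n' | gzip > DEMO_L003_R1_001.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > DEMO_L003_R1_002.fastq.gz

# Concatenate them in sorted order into a single uncompressed file;
# the output name contains "R1." rather than "R1_", so grep never matches it
ls -1 | grep R1_ | sort | xargs -I{} zcat {} >> DEMO_L003_R1.fastq
```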


Bowtie Example script

The script below is an example of running Bowtie. It requests 8 CPUs (8 Bowtie threads) and 16GB of RAM for the job.

--bowtie.sh--

#!/bin/bash
#BSUB -n 8
#BSUB -q long
#BSUB -W 8:00
#BSUB -R "rusage[mem=2000]" # request 2000MB per job slot, or 16000MB total
#BSUB -R "span[hosts=1]" 

# Load Bowtie
module load bowtie/1.0.0
  
# Some options we are requesting
# -a     = report all alignments per read
# --best = hits guaranteed best stratum; ties broken by quality
# -v 2   = report end-to-end hits w/ <=2 mismatches; ignore qualities
# -p 4   = number of alignment threads to launch (4 in these examples)
# 
# e_coli below is the basename of a Bowtie index (ebwt files)
# Examples:
#bowtie -p 4 -a --best -v 2 e_coli $HOME/sequence.fastq  $HOME/map_file.map >> output
#bowtie -p 4 -a --best -v 2 e_coli -c ATGCATCATGCGCCAT >> output

# Next example (see the bowtie man page for more information)
# -S       - Output in SAM Format
# -f       - The query input files are FASTA files (usually having extension .fa, .mfa, .fna or similar).
# -p       - Use 8 cores/CPUs
# --best   - Hits guaranteed best stratum; ties broken by quality
# --fullref - Print the full reference sequence name, including whitespace, in alignment output. 
# /share/data/umw_biocore/genome_data/human/hg19/ - location of the ebwt index files for hg19
# /share/data/umw_biocore/genome_data/ - location of all genomes
# test.fa  - Our test fasta file (see below)
 
bowtie -S -f -p 8 --best --fullref /share/data/umw_biocore/genome_data/human/hg19 $HOME/test.fa >> bowtie_output.sam

test.fa file

test.fa - this file contains a few random short sequences taken from chr5, chr17, chr18, and chr20

>test1
taaacctcttttctttataaatt
>test2
tatctttgccaactag
>test3
ATTCTATGTGGATTGTTTTCCTT
>testchr20 
agaatagattttctaattctgtgaaaaatga
>test2chr20
gtcagcttagagagctaatttgactttctct

Submitting bowtie script with LSF

bsub < bowtie.sh

R example script

An example script to create a PNG image file:

$ module load R/3.0.1
$ R --slave --args test.txt < test.R 2>&1
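When capturing output in a log file, the stderr duplication must be written 2>&1 and placed after the stdout target; it sends stderr to wherever stdout currently points. The effect can be checked with any command:

```shell
# Both streams end up in the same log file
{ echo "to stdout"; echo "to stderr" >&2; } > combined.log 2>&1
```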

test.R - Script

Args=commandArgs();
FileName=Args[4]   # with "--slave --args test.txt", element 4 is the file name
data=read.table(FileName,header=TRUE,sep=",")

# PNG name
png(filename="test.png",width=500,height=500) 

# order file
dor=order(data[,2],decreasing=FALSE)

v1=paste("Test data from file: ",Args[4],sep=" ")
# create chart
dotchart(data[dor,2],labels=data[dor,1],main=v1)

# Create a title
title(xlab="Total", col.lab=rgb(0,0.5,0)) 

# turn back off and save
dev.off()

R test.txt

Sequence,Percent
AAAACCCC,22.25
CTAGCTAG,18.75
CTAGACTG,18.00
CTAGCCTG,17.00
ACTGACGT,8.75
CATGAGAG,7.25
CATCATAG,3.75
CCACACAC,1.25
CCCCCCCA,0.95
CCCCCACT,0.90
CCTAGGAT,0.90
TTTATTAT,0.25

Faster FASTQ alignment with jobs array

This example requires two scripts: one executes Bowtie on a FASTQ file and a reference genome; the other splits a FASTQ file into small pieces, aligns each piece separately with the first script, then merges the resulting BAMs into a single BAM.

--Script1: bowtie-align.sh (processes single FASTQ file)--

 #!/bin/bash
 # Usage: This script can be executed locally in an interactive shell, or as a job, to process a single FASTQ file.
 # Notice the command line arguments, 1=reference genome, 2=reads file
 # Interactive node submission:
 #./bowtie-align.sh ref reads.fq
 # Compute node submission:
 # bsub -n 8 -R "span[hosts=1]" -W 4:00 -q short -R "rusage[mem=4096]" -J "bowtie-job" -o ngs.out -e ngs.err ./bowtie-align.sh ref reads.fq
 START=`date +%s`                # record start time
 # Process command line arguments:
 refGenomeIndex=$1               # should be bowtie index prefix
 readfile=$2                     # should be single end filename
 # Load modules
 module load bowtie2/2-2.1.0
 module load samtools/1.2
 # Align these reads with Bowtie
 bowtie2 -p 8 -x $refGenomeIndex $readfile -S $readfile.sam
 if [ -f $readfile.sam ]; then
       # Convert SAM to BAM
       samtools view -b $readfile.sam -o $readfile.bam
       rm -f $readfile.sam
 else
       echo "SAM file not created"
       exit 127
 fi
 END=`date +%s`                  # record end time
 ELAPSED=$((END-START))          # calculate elapsed time for benchmarking purposes
 echo "Alignment took $ELAPSED seconds to complete"

--Script2: bowtie-align-jobs-array.sh (processes single FASTQ file by splitting into chunks)--

 #!/bin/bash
 # Usage:
 # bsub < ngs-hpc-workflow.sh
 #BSUB -J "SeqAlignJob"
 #BSUB -R rusage[mem=4096]
 #BSUB -q short
 #BSUB -W 4:00
 #BSUB -o ngs.out
 #BSUB -e ngs.err
 START=`date +%s`
 readfile=reads.fq
 prefix=$readfile.
 # Split the main reads FASTQ file into batches of 500K reads (2,000,000 lines; 4 lines per read)
 split -l 2000000 -a 1 -d $readfile $prefix
 NoOfBatches=`ls -l $readfile.* | wc -l`
 echo "$NoOfBatches batches of 500K reads created."
 if [ $NoOfBatches -gt "10" ]; then
       echo "Max number of batches (10, the limit of the -a 1 suffix) exceeded; program will now exit"
       exit 127
 fi
 echo "Submitting parallel jobs"
 # Create jobs array to align these reads with Bowtie; -K makes bsub block
 # until the whole array has finished (a bare `wait` would not, since bsub
 # returns as soon as the jobs are submitted)
 bsub -K -q short -W 4:00 -R "rusage[mem=2048]" -J "bowtie-split[1-$NoOfBatches]" -o "ngs.align.log.\$LSB_JOBINDEX" -e "ngs.align.err.\$LSB_JOBINDEX" "./bowtie-align.sh ref $readfile.\$((\$LSB_JOBINDEX-1))"
 # Load samtools
 module load samtools/1.2
 # Merge the BAMs
 samtools merge $readfile.bam $readfile.*.bam
 # Clean up
 rm -f $readfile.[0-9]
 rm -f $readfile.[0-9].bam
 END=`date +%s`
 ELAPSED=$(($END-$START))
 echo "Alignment took $ELAPSED seconds to complete"
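The split naming used above can be checked locally with tiny numbers (4 lines per batch here stands in for the 2,000,000-line / 500K-read batches; GNU split is assumed, as on the cluster):

```shell
# An 8-line stand-in for reads.fq, split into numbered single-digit suffixes
seq 8 > reads.fq
split -l 4 -a 1 -d reads.fq reads.fq.
ls -1 reads.fq.?
```

This produces reads.fq.0 and reads.fq.1, matching the `$readfile.\$((\$LSB_JOBINDEX-1))` naming in the jobs array.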

Example NAMD MPI+GPU Job

namdgpu.sh

 #!/bin/bash

 #BSUB -n 4
 #BSUB -q gpu
 #BSUB -R "rusage[mem=1000]"
 #BSUB -W 0:30
 #BSUB -R "span[ptile=2]"
 #BSUB -R "rusage[ngpus_excl_p=1]"

 module load NAMD/2.9_Linux-x86_64-ibverbs-smp-CUDA
 module load openmpi/1.6.5

 charmrun ++mpiexec +p4 ++ppn 2 /share/pkg/NAMD/2.9_Linux-x86_64-ibverbs-smp-CUDA/namd2 +idlepoll /home/yourhome/some/file.namd

This script will allocate a total of 2 GPUs, 4000MB of RAM, and 4 CPU cores, split across two nodes: each node gets 1 GPU, 2 cores, and 2000MB of RAM.

The charmrun parameters of +p4 ++ppn 2 tell NAMD to create 4 worker threads with 2 threads per process, i.e. 2 processes.
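The arithmetic connecting the BSUB and charmrun numbers can be spelled out (illustrative only; the values mirror the script above):

```shell
cores=4      # BSUB -n 4           : total job slots
ptile=2      # span[ptile=2]       : slots per host
ppn=2        # charmrun ++ppn 2    : worker threads per process
echo "$((cores / ptile)) hosts, $((cores / ppn)) processes, $ppn threads each"
# -> 2 hosts, 2 processes, 2 threads each
```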

'/home/yourhome/some/file.namd' needs to be replaced with a valid namd file.

Submit with:

 bsub < namdgpu.sh