span

Introduction: SPAN Semi-supervised Peak Analyzer

      __.-'|'-.__.-'|'-.__
    ='=====|========|====='=
    ~_^~-^~~_~^-^~-~~^_~^~^~^

SPAN Peak Analyzer version 2 is a universal HMM-based peak caller capable of processing a broad range of ChIP-seq, ATAC-seq, and single-cell ATAC-seq datasets of different quality.

NOTICE

Future development of SPAN Peak Analyzer version 2 continues under the OmniPeak peak caller.

Features

Supports both narrow and broad footprint experiments (ChIP-seq, ATAC-seq, DNase-seq)
Produces robust results on datasets of different signal-to-noise ratio, including Ultra-Low-Input ChIP-seq
Produces highly consistent results in multi-replicate experimental setups
Tolerates missing control experiment
Integrated into the JetBrains Research ChIP-seq analysis pipeline from raw reads to visualization and peak calling
Integrated with the JBR Genome Browser: the uploaded data model allows for interactive visualization and fine-tuning
Experimentally supports multi-replicate mode and differential peak calling mode
In semi-supervised mode, it can robustly handle multiple replicates and noise by leveraging limited manual annotation information.

Latest release

See the releases section for up-to-date information.

SPAN 2.0 enhancements compared to version 1.0

SPAN version 2.0 introduces several key improvements over the original semi-supervised SPAN 1.0, most notably eliminating the need for manual markup annotations. It now operates in a fully unsupervised mode with robust default parameters.
Key changes include:

Automated Setup: SPAN 2.0 no longer requires semi-supervised markup to function. It runs directly with improved default settings.
Enhanced Preprocessing: The data preprocessing pipeline has been redesigned, featuring better control regression and smarter initialization of HMM parameters.
Constraint-Driven Model Fitting: The HMM now includes adaptive constraints for noise floor and signal-to-noise ratio, enhancing robustness across datasets with variable quality.
New Peak Detection Framework: Peak identification now leverages post-model analysis and a unified strategy for extracting peaks from HMM output.
Improved Replicates-model: These enhancements significantly boost performance in replicate-based analyses.
Expanded Applicability: SPAN 2.0 is more effective for diverse data types, including ATAC-seq, CUT&RUN, and CUT&Tag, and supports explicit input format declaration, BigWig signal visualization, and summit calling for fine resolution.

The original SPAN 1.0, which required semi-supervised input, is described in:
Shpynov O, Dievskii A, Chernyatchik R, Tsurinov P, Artyomov MN. Semi-supervised peak calling with SPAN and JBR Genome Browser. Bioinformatics. 2021 May 21. https://doi.org/10.1093/bioinformatics/btab376

Requirements

Download and install Java 8+.

Peak calling

To analyze a single (possibly replicated) biological condition, use the analyze command. See the available options with:

$ java -jar span.jar analyze --help

The <output.bed> file will contain predicted and FDR-controlled peaks in the ENCODE broadPeak (BED 6+3) format:

<chromosome> <peak start offset> <peak end offset> <peak_name> <score> . <coverage or fold/change> <-log p-value> <-log Q-value>
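Downstream scripts often need to read this file back. The following is a minimal sketch of a parser for one output line, assuming the nine columns appear in the order listed above and are separated by whitespace (tabs in a typical BED file). The `Peak` type, its field names, and the sample values in the usage note are hypothetical and are not part of SPAN.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """One line of the peaks file; field names are illustrative only."""
    chromosome: str
    start: int
    end: int
    name: str
    score: float       # parsed as float to tolerate non-integer scores
    strand: str        # "." in the format shown above
    signal: float      # coverage or fold change
    neg_log_p: float   # -log p-value
    neg_log_q: float   # -log Q-value


def parse_peak_line(line: str) -> Peak:
    """Parse a single peak line into a Peak record.

    Assumes whitespace-separated fields with no whitespace inside
    any field (an assumption, not something the format guarantees).
    """
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    chrom, start, end, name, score, strand, signal, p, q = fields
    return Peak(chrom, int(start), int(end), name, float(score),
                strand, float(signal), float(p), float(q))
```

For example, `parse_peak_line("chr1\t100\t250\tpeak_1\t500\t.\t3.5\t6.2\t4.1")` yields a record whose `end - start` is the peak length in base pairs.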

Examples:

Regular peak calling
java -Xmx8G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -p Results.peak
Semi-supervised peak calling
java -Xmx8G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -l Labels.bed -p Results.peak
Model fitting only
java -Xmx8G -jar span.jar analyze -t ChIP.bam -c Control.bam --cs Chrom.sizes -m Model.span
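When several replicates are processed, the command line gets long. The sketch below assembles the regular peak calling command from Python lists, using only the flags shown in the examples above (`-t`, `-c`, `--cs`, `-p`) and the comma-separated convention for multiple files. The function name `build_analyze_command` and its parameters are hypothetical helpers, not part of SPAN.

```python
from typing import List, Optional, Sequence


def build_analyze_command(
    treatment: Sequence[str],
    chrom_sizes: str,
    control: Optional[Sequence[str]] = None,
    peaks: Optional[str] = None,
    jar: str = "span.jar",
    heap: str = "8G",
) -> List[str]:
    """Build an argument list for `span.jar analyze` (illustrative helper)."""
    if not treatment:
        raise ValueError("at least one treatment file is required")
    # Either a single control file or one per treatment file is accepted.
    if control and len(control) not in (1, len(treatment)):
        raise ValueError("pass one control file, or one per treatment file")

    cmd = ["java", f"-Xmx{heap}", "-jar", jar, "analyze",
           "-t", ",".join(treatment)]
    if control:
        cmd += ["-c", ",".join(control)]
    cmd += ["--cs", chrom_sizes]
    if peaks:
        # Without a peaks file only the model fitting step is performed.
        cmd += ["-p", peaks]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with unusual file names.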

Differential peak calling

Experimental! To compare two (possibly replicated) biological conditions, use the compare command. See help for details:

$ java -jar span.jar compare --help

Command line options

Parameter	Description
`-t, --treatment TREATMENT` required	Treatment file. Supported formats: BAM, BED, or BED.gz. Multiple files should be separated by commas (`-t A,B,C`) and are processed as replicates on the model level.
`-c, --control CONTROL`	Control file. Multiple files should be separated by commas. Provide either a single control file or a separate control file for each treatment file. See `-t, --treatment` for details on supported files.
`-cs, --chrom.sizes CHROMOSOMES_SIZES` required	Chromosome sizes file for the genome build used in TREATMENT and CONTROL files. Can be downloaded at UCSC.
`-b, --bin BIN_SIZE`	Peak analysis is performed on read coverage tiled into consecutive bins of configurable size.
`-f, --fdr FDR`	False Discovery Rate cutoff to call significant regions.
`-p, --peaks PEAKS`	Resulting peaks file in ENCODE broadPeak* (BED 6+3) format. If omitted, only the model fitting step is performed.
`-chr, --chromosomes CHROMOSOMES_LIST`	Chromosomes to process, multiple chromosomes should be separated by commas.
`--format FORMAT`	Reads file format. Supported: BAM, SAM, CRAM, BED. Text formats may be compressed with zip or gzip. If not provided, the format is guessed from the file extension.
`--fragment FRAGMENT`	Fragment size. If provided, reads are shifted appropriately. If not provided, the shift is estimated from the data. `--fragment 0` is recommended for ATAC-Seq data processing.
`-kd, --keep-duplicates`	Keep duplicates. By default, SPAN filters out redundant reads aligned at the same genomic position. Recommended for bulk single cell ATAC-Seq data processing.
`--blacklist BLACKLIST_BED`	Blacklisted regions of the genome to be excluded from peak calling results.
`--labels LABELS`	Labels BED file. Used in semi-supervised peak calling.
`-m, --model MODEL`	Path to the SPAN model file. Required for further semi-supervised peak calling.
`-w, --workdir PATH`	Path to the working directory. Used to save coverage and model cache.
`--bigwig`	Create beta-control corrected counts per million normalized track.
`--hmm-snr SNR`	Fraction of coverage to estimate and guard signal to noise ratio, `0` to disable constraint check.
`--hmm-low LOW`	Minimal low state mean threshold, guards against too broad peaks, `0` to disable constraint check.
`--sensitivity SENSITIVITY`	Configures the log PEP threshold sensitivity for candidate selection. Automatically estimated from the data, or during semi-supervised peak calling.
`--gap GAP`	Configures minimal gap between peaks. Generally, not required, but used in semi-supervised peak calling.
`--summits`	Calls summits within peaks. Recommended for ATAC-seq and single-cell ATAC-seq analysis.
`--f-light LIGHT`	Lightest fragmentation threshold to apply compensation gap. Not available when `gap` is explicitly provided.
`--f-hard HARD`	Hardest fragmentation threshold to apply compensation gap. Not available when `gap` is explicitly provided.
`--f-speed SPEED`	Fragmentation acceleration threshold to compute gap. Not available when `gap` is explicitly provided.
`--clip CLIP_TRESHOLD`	Maximum clipping threshold for fine-tuning peak boundaries according to the local signal, `0` to disable.
`--multiple TEST`	Method applied for multiple hypothesis testing. `BH` for Benjamini-Hochberg, `BF` for Bonferroni.
`-i, --iterations`	Maximum number of iterations for Expectation Maximisation (EM) algorithm.
`--tr, --threshold`	Convergence threshold for EM algorithm, use `--debug` option to see detailed info.
`--ext`	Save extended states information to model file. Required for model visualization in JBR Genome Browser.
`--deep-analysis`	Perform additional track analysis: computes coverage (roughness) and creates a multi-sensitivity BED track.
`--threads THREADS`	Configure the parallelism level.
`-l, --log LOG`	Path to log file, if not provided, it will be created in working directory.
`-d, --debug`	Print debug information, useful for troubleshooting.
`-q, --quiet`	Turn off standard output.
`-kc, --keep-cache`	Keep cache files. By default, SPAN creates cache files in the working directory and removes them afterwards.
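To make the `--multiple` choice more concrete, here is a minimal sketch of the two textbook corrections, Benjamini-Hochberg and Bonferroni, applied to a plain list of p-values. This illustrates the general procedures only; how SPAN applies them internally is not described here, and the function names are hypothetical.

```python
from typing import List, Sequence


def bonferroni(pvalues: Sequence[float]) -> List[float]:
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(adjusted: Sequence[float], fdr: float) -> List[bool]:
    """Flag tests whose adjusted p-value passes the cutoff."""
    return [q <= fdr for q in adjusted]
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more regions at the same cutoff.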

Build from sources

Clone the bioinf-commons library under the project root.

git clone git@github.com:JetBrains-Research/bioinf-commons.git

Run the following command to build the SPAN jar:

./gradlew shadowJar

The SPAN jar file will be generated in the folder build/libs.

FAQ

Q: What is the average running time?
A: SPAN is capable of processing a single ChIP-Seq track in less than 10 minutes on an average laptop.
Q: Which operating systems are supported?
A: SPAN is developed in the Kotlin programming language and can be executed on any platform supported by Java.
Q: Where did you get this lovely span picture?
A: From ascii.co.uk, the original author goes by the name jgs.

Error Reporting

Use GitHub issues to suggest new features or report bugs.

Authors

JetBrains Research BioLabs
