Friday, February 6, 2015

Software: 3C-analyzer--A New Computational Method for Discovery of Long-range Chromatin Interactions from 3C-seq



Abstract: 3C-analyzer covers the full analytic workflow, from raw sequencing data to significance detection of chromatin interactions, and provides a user-friendly interface for data management and analysis.



Chromosome conformation capture (3C) technology has been widely used to map physical proximity between two genomic regions in the nucleus. Initially reported by Dekker et al. in 2002, the 3C procedure involves restriction enzyme digestion, inter-/intra-molecular ligation of cross-linked chromatin, and quantification of ligation frequencies between two genomic loci by quantitative polymerase chain reaction (qPCR). The ligation frequency, or cross-link frequency, reveals how likely non-neighbouring genomic regions are to be in contact and gives insight into chromosome topology. Characterizing cross-link frequency with conventional 3C requires prior knowledge of the interacting partners (one vs. one): because qPCR is low-throughput, only interactions between two pre-selected genomic loci can be tested. To overcome the limitations of 3C-qPCR, various 3C-derived technologies have been developed to explore unknown interactions across the whole genome, including 4C (chromosome conformation capture-on-chip and circular chromosome conformation capture), 5C (chromosome conformation capture carbon copy), Hi-C, Capture-C, 3C-MTS (3C-based multiple target sequencing), T2C (Targeted Chromatin Capture), and Capture Hi-C. In terms of the cross-linking ligation events detectable in a single experiment, Hi-C can in theory detect all chromatin interactions (all vs. all), but because the chromatin interactome is huge, inadequate sequencing depth often results in loss of resolution or coverage. 4C-seq has demonstrated excellent resolution for genome-wide interactions, but only one specific locus can be screened in a single experiment (one vs. all). 3C-MTS, Capture-C, T2C, and Capture Hi-C were developed more recently to detect chromatin interactions between many genomic loci and other regions across the whole genome (many vs. all).
To date, no 3C-based technology outperforms the others in every aspect of chromatin interaction detection.

To facilitate 3C-seq data analysis, we developed 3C-analyzer, a graphical user interface (GUI)-based tool. The user manual is included in the published 3C-analyzer package. Compared with previous software packages, 3C-analyzer offers two distinguishing features: a user-friendly experience and high-throughput processing.

User-friendly experience
3C-analyzer processes 3C-seq data in a user-friendly environment. The package integrates all the pipelines required for 3C-MTS/Capture-C and 4C-seq data analysis and covers the full workflow: raw data processing, genome mapping, co-localization detection, and significance analysis. All analytic work can be operated through graphical user interfaces (GUIs) in 3C-analyzer. The GUI-based pipeline is divided into three modules: 'Pre-processing', 'One-to-All', and 'Many-to-All' (Figure 1).


Figure 1: GUI of 3C-analyzer

After 'Pre-processing', users follow different modules depending on the 3C-seq technology used. The 'One-to-All' module is used for 4C-seq and the 'Many-to-All' module for 3C-MTS/Capture-C. Figure 5 shows the layout of the sub-modules 'Lock Viewpoints', 'Trim FASTQ', 'Detect Co-localization', and 'Count Co-localization' in the 'Many-to-All' module. By following the arrows through the pipeline, users can complete all operations required for 3C-MTS data analysis. Another feature of 3C-analyzer is that only a few hands-on steps are required. 3C-analyzer automatically recognizes raw data and establishes the connections between the FASTQ files and 3C libraries in the sub-module 'Sample management' (Figure 2A). All options related to viewpoint locations and co-localization detection are set in the sub-module 'Lock Viewpoints' (Figure 2B). 3C-analyzer provides multiple data outputs, including text files of RC and Tscore in CSV format and R data files of all statistical results in RData format.
Figure 2: GUIs of sample management and probe design

High-throughput processing
3C-seq data analysis is usually computation-intensive, but the load varies with the complexity of the 3C-seq libraries, such as the numbers of 3C libraries, GWs, and viewpoints, as well as the technology used and the scale of the reference genome (cis-/trans-interactions). 3C-analyzer can perform genome-scale 3C-seq analysis at high throughput. The pipelines place no limit on the size of the raw data or the number of FASTQ files, as long as they stay within the hardware capacity of the user's computer. Our testing showed that the entire workflow can be prepared easily, even with many sequencing datasets, before the 3C-seq data analysis is started. To speed up computation, we applied multi-threading in 3C-analyzer. During parallel processing, 3C-analyzer first splits the raw data into multiple partitions and then uses multiple threads to compute these partitions simultaneously. Analysis of the 3C-MTS data showed that computation time was about five times shorter with 16 CPU cores (4 CPUs) than with one CPU core. With 16 CPU cores, it took ~8 minutes per million read pairs per one hundred viewpoints on average. However, users are discouraged from increasing the number of threads on one computer without limit: in our computing environment, the I/O capacity of the hardware restricted analysis speed beyond 4 cores per CPU at full system load.
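
The split-then-compute pattern described above can be illustrated with a minimal Python sketch. This is not 3C-analyzer's own code: the function names, the round-robin partitioning, and the per-partition task (counting reads that contain a given restriction site) are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def partition_records(lines, n_partitions):
    """Group FASTQ lines into 4-line records and deal them round-robin
    into n_partitions lists (hypothetical partitioning scheme)."""
    it = iter(lines)
    partitions = [[] for _ in range(n_partitions)]
    index = 0
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:  # stop at end of input or an incomplete record
            break
        partitions[index % n_partitions].append(record)
        index += 1
    return partitions


def count_reads_with_site(partition, site):
    """Hypothetical per-partition task: count reads whose sequence
    (second line of each record) contains the restriction site."""
    return sum(1 for record in partition if site in record[1])


def parallel_count(lines, site, n_workers=4):
    """Split the input into n_workers partitions and process them concurrently."""
    partitions = partition_records(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        counts = pool.map(count_reads_with_site, partitions,
                          [site] * len(partitions))
        return sum(counts)
```

In CPython, threads running pure-Python code do not execute in parallel on multiple cores, so a CPU-bound task like this one would likely need `ProcessPoolExecutor` (same interface) to see a real speedup. Either way, the number of workers is best kept within what the machine's cores and disk I/O can sustain, in line with the caution above.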

Figure 3: Multi-threading in 3C-analyzer

Writing date: 2014.11.20, 2015.02.06




 
