Wednesday, February 4, 2015

NGS: Coverage of sequencing



Abstract: Coverage of sequencing is one important issue of evaluating sequencing depth in the design of high-throughput sequencing experiments.


Coverage of sequencing (CS) or redundancy of coverage is the times on average that a base in the reference is covered by an aligned read determined by sequencing. The general equation is:
C = L * N / G
C: coverage of depth
L: Length of reads
N: Number of reads
G: Total length of haploid genome references

CS is an overall standard of how many raw reads we should get from a reliable sequencing experiment. There is no absolute recommendation for the level of CS. As well as CS, coverage of breadth (CB) is the percentage of genome bases beyond a given times. CS and CB combined can used for telling how many reads we should sequence.

A suitable CS is required for reliable sequencing because reads are not distributed evenly over an entire genome references. The calculation of CS only tell us the average coverage. Some bases would covered by less reads than the average. So higher CS would indicate more reliable sequencing. However, high CS inevitably requires high costs. Therefore, the best coverage depends on the applications of your sequencing experiments.

1. Previous publications: Low-quality genome sequencing requires 10x depth, and high-quality would be at 30x depth or higher. Resequencing for the detection of mutations/SNPs/rearrangements requires 10-30x depth.

2. RNA-seq: Saturation analysis for RNA-seq as evaluating sequencing depth is better than calculation of CS. Both the detection of RNAs and expression levels should be assured from RNA-seq. Therefore, determining rarely expressed genes require an increase in CS. Second, mRNA-seq, miRNA-seq, or exome sequencing usually utilize different methods of RNA enrichment, and come to different CS.

3. Length of reads: Length of reads depend the sequencing analyzer. Hiseq2000 is 100bp output, and Hiseq2500/3000/4000 increase to 150bp. Longer reads would be good for sequencing repetitive sequences, sequencing DNA variants or RNA splicing.




This Perl script allows to analyze SAM format files, and calculate exact number of sequencing times of each base along the chromosome, and finally export into a text file with three columns(chromosome position, sequenced bases, sequencing times). Here is the Perl code:
#! /usr/bin/perl -w
use strict;
use warnings;
use List::Util;
use DB_File;

my %hash;
#open DBM database
dbmopen (%hash, 'CSdb', 0666) or die;
my $sam_file="/home/yuan/data_1/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00105/alignment/HG00105.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.sam";
open my($IN), "<", $sam_file or die;
while(my $line=<$IN>){
chomp($IN);
my @items=split("\t", $line);
my $chr=$items[2];
my $pos_start=$items[3];
my @seq=split("", $items[9]);
#only export coverage of sequencing of chromosome 1.
if($chr eq '1'){
print "$pos_start\n";
my $pos=$pos_start;
for my $base(@seq){
if (exists $hash{$base}) {
$hash{$pos} .= $base;
}
else{
$hash{$pos} = $base;
}
$pos++;
}
}
}
close($IN);

print "export %hash into cs file\n";
open my($CS), ">", '/home/yuan/mysql_pre/cookbooks/result_sequencing_coverage.txt' or die;
foreach my $pos(List::Util::min(keys %hash)..List::Util::max(keys %hash)){#1
if (exists $hash{$pos}){#2
print "export:$pos\n";
print $CS join("\t", $pos, $hash{$pos}, length($hash{$pos})), "\n";
}
else{
print $CS join("\t", $pos, 'N', 0), "\n";
}#2
}#1
close($CS);

#close DBM
dbmclose(%hash);

print "OK\n";

No comments:

Post a Comment