Abstract:
Coverage of sequencing is one important issue of evaluating
sequencing depth in the design of high-throughput sequencing
experiments.
Coverage of
sequencing (CS) or redundancy of coverage is the times on average
that a base in the reference is covered by an aligned read determined
by sequencing. The general equation is:
C = L * N / G
C: coverage of
depth
L: Length of
reads
N: Number of
reads
G: Total length
of haploid genome references
CS is an overall
standard of how many raw reads we should get from a reliable
sequencing experiment. There is no absolute recommendation for the
level of CS. As well as CS, coverage of breadth (CB) is the
percentage of genome bases beyond a given times. CS and CB combined
can used for telling how many reads we should sequence.
A suitable CS is
required for reliable sequencing because reads are not distributed
evenly over an entire genome references. The calculation of CS only
tell us the average coverage. Some bases would covered by less reads
than the average. So higher CS would indicate more reliable
sequencing. However, high CS inevitably requires high costs.
Therefore, the best coverage depends on the applications of your
sequencing experiments.
1. Previous
publications: Low-quality genome sequencing requires 10x depth,
and high-quality would be at 30x depth or higher. Resequencing for
the detection of mutations/SNPs/rearrangements requires 10-30x depth.
2. RNA-seq:
Saturation analysis for RNA-seq as evaluating sequencing depth is
better than calculation of CS. Both the detection of RNAs and
expression levels should be assured from RNA-seq. Therefore,
determining rarely expressed genes require an increase in CS. Second,
mRNA-seq, miRNA-seq, or exome sequencing usually utilize different
methods of RNA enrichment, and come to different CS.
3. Length of
reads: Length of reads depend the sequencing analyzer. Hiseq2000
is 100bp output, and Hiseq2500/3000/4000 increase to 150bp. Longer
reads would be good for sequencing repetitive sequences, sequencing
DNA variants or RNA splicing.
This Perl script
allows to analyze SAM format files, and calculate exact number of
sequencing times of each base along the chromosome, and finally
export into a text file with three columns(chromosome position,
sequenced bases, sequencing times). Here is the Perl code:
#! /usr/bin/perl -w
use strict;
use warnings;
use List::Util;
use DB_File;
my %hash;
#open DBM database
dbmopen (%hash, 'CSdb', 0666) or die;
my
$sam_file="/home/yuan/data_1/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00105/alignment/HG00105.mapped.ILLUMINA.bwa.GBR.low_coverage.20130415.sam";
open my($IN), "<", $sam_file or die;
while(my $line=<$IN>){
chomp($IN);
my @items=split("\t", $line);
my $chr=$items[2];
my $pos_start=$items[3];
my @seq=split("", $items[9]);
#only export coverage of sequencing of chromosome 1.
if($chr eq '1'){
print "$pos_start\n";
my $pos=$pos_start;
for my $base(@seq){
if (exists $hash{$base}) {
$hash{$pos} .= $base;
}
else{
$hash{$pos} = $base;
}
$pos++;
}
}
}
close($IN);
print "export %hash into cs file\n";
open my($CS), ">",
'/home/yuan/mysql_pre/cookbooks/result_sequencing_coverage.txt' or
die;
foreach my $pos(List::Util::min(keys %hash)..List::Util::max(keys
%hash)){#1
if (exists $hash{$pos}){#2
print "export:$pos\n";
print $CS join("\t", $pos, $hash{$pos},
length($hash{$pos})), "\n";
}
else{
print $CS join("\t", $pos, 'N', 0), "\n";
}#2
}#1
close($CS);
#close DBM
dbmclose(%hash);
print "OK\n";
No comments:
Post a Comment