Tiezheng Yuan Ph.D.: NGS: Saturation analysis for RNA-seq

Abstract: How is the expression profiling from RNA-seq deep enough? Or How many multiplexing samples can be in one lane?

How is the expression profiling from RNA-seq deep enough? Or How many multiplexing samples can be in one lane? These questions for RNA-seq often arise.

Sequencing depth of RNA-seq has serious effects on both the detection of RNAs and the reliability of expression levels in expression profiling, especially rarely expressed genes. Sequencing more would improve sequencing depth. However, it is unpractical for us to sequence as much as possible. It is the balance between the optimal sequencing depth and the lowest costs. There is no fixed levels on sequencing depther of RNA-seq. One thing is what RNA you want? There are many RNA species namely mRNAs or ncRNAs in RNA extractions, and their abundance might vary a lot. For mRNA only, the expression levels of genes vary a lot. Another thing is the objective of your study. The depth is so different on overall profiling and exact quantitations of as many as transcripts. The third one is the source of RNAs. The RNA distributions from tissue or blood would be so different.

Saturation curve of RNA-seq (number of raw pairs vs . Number of target references) can help the optimal sequencing depth. As the saturation curve from a mRNA-seq experiment (Figure 1), the mapped references quickly increased with the increase with raw pairs at first level, and increased slower, and finally level off to be asymptotic to a upper limit on the number of reference. Such ‘Saturation’ revealed that the optimal number of raw pairs determined by RNA-seq for the best detection of RNAs.

Figure 1: Saturation analysis of a mRNA-seq experiment

The next question is how many samples can be multiplexed into a lane once the optimal number of raw pairs is certified. HiSeq2000 can produce up to 1.5 billion pairs at the most level, but less raw pairs would be produced in some cases. Generally for human samples, 8-10 samples for mRNA sequencing in a lane would be fine, or 4 samples for exon sequencing, or 30-40 samples for miRNA-seq. However, saturation analysis is crucial for the best multiplexing design.

Here, I present a Perl script for analyzing SAM file (Sequence records should be ordered by query names) for saturation analysis of mRNA-seq.

#! /usr/bin/perl -w

use strict;

use warnings;

use List::Util;

use DB_File;

#subroutines

sub read_gff{

my $gff_file=$_[0];

print "read genome annotations\n";

my %annot_info;

open my($IN), "<", $gff_file or die;

while(<$IN>){

chomp($_);

my @items=split("\t", $_);

my $RNA_chr=$items[0];

my $RNA_specie=$items[2];

my $RNA_start=$items[3];

my $RNA_end=$items[4];

my $str_len=int(length($RNA_start)/1e4);

my %hash;

my @annot=split(/;\s/, $items[8]);

foreach my $ann(@annot){

my($type, $id)=split(/\s/, $ann);

$hash{$type}=$id;

}

if($RNA_specie eq 'gene' ){

my $key=$hash{gene_id};

$annot_info{$RNA_chr}->{$str_len}->{$key}->{start}=$RNA_start;

$annot_info{$RNA_chr}->{$str_len}->{$key}->{end}=$RNA_end;

$annot_info{$RNA_chr}->{$str_len-1}->{$key}->{start}=$RNA_start;

$annot_info{$RNA_chr}->{$str_len-1}->{$key}->{end}=$RNA_end;

$annot_info{$RNA_chr}->{$str_len+1}->{$key}->{start}=$RNA_start;

$annot_info{$RNA_chr}->{$str_len+1}->{$key}->{end}=$RNA_end;

}

close($IN);

return(\%annot_info);

}

#subroutines

sub RNA_saturation{

my ($sam_file, $annot_info_pointer)=@_;

my %annot_info=%$annot_info_pointer;

my $n=1;

my %saturation;

my $query="";

print "read sam file\n";

open my($SAM), "<", $sam_file or die;

while(<$SAM>){#2

chomp($_);

my @items=split("\t", $_);

#get alignment record

my %sam_info;

$sam_info{query}=$items[0];

$sam_info{mate_chr}=$items[6];

$sam_info{flagment_size}=$items[8];

for (my $i=11; $i<@items; $i++){

my($head, $middle, $value)=split(':', $items[$i]);

$sam_info{$head}=$value;

}

#matches between chr-offset and genome annotation

if($sam_info{mate_chr} eq '=' and $sam_info{NH}==1 and $sam_info{flagment_size}>0){#3

$sam_info{chr}=$items[2];

$sam_info{offset}=$items[3];

$sam_info{end}=$sam_info{offset}+$sam_info{flagment_size};

$sam_info{str_len}=int(length($sam_info{offset})/1e4);

my $found=0;

my $annot_pointer=$annot_info{$sam_info{chr}}->{$sam_info{str_len}};

my %annot=%$annot_pointer;

foreach my $id(keys %annot){#5

if($sam_info{offset}>=$annot{$id}->{start}){#6

if( $sam_info{end}<=$annot{$id}->{end} or $sam_info{offset}<=($annot{$id}->{end}-10) ){#7

$found=1;

}#7

}#6

else{#6

if ($sam_info{end}>=($annot{$id}->{start}+10) ){#7

$found=1;

}#7

}#6

if($found==1){

print "$n:$id\n";

$saturation{$id}=$n unless exists $saturation{$id};

last;

}

}#5

}#3

#renew number of raw data

if($query cmp $sam_info{query}){#3

$query=$sam_info{query};

$n++;

#last if $n==1e4;

}#3

}#2

close($SAM);

print "export to the txt file\n";

my $file='/home/yuan/mysql_pre/cookbooks/result_RNA_saturation_gene.txt';

open my($OUT), ">", $file or die;

my $ref_num=0;

foreach my $ref(sort {$saturation{$a}<=> $saturation{$b}} (keys %saturation)){

my $raw_num=$saturation{$ref};

$ref_num++;

print $OUT "$raw_num\t", "$ref_num\n";

}

close($OUT);

}

###############################

#main programm

#############################

#genome annotation

my $gff_file="/home/yuan/eRNA/ref_seq/human_Ensembl_GRCh38.gtf";

my $annot_info_pointer=read_gff($gff_file);

my %annot_info=%$annot_info_pointer;

#RNA satuation analysis

my $sam_file="/home/yuan/data_2/test_accepted_hits.sorted.sam";

RNA_saturation($sam_file, \%annot_info);

print "OK\n";

Writing date: 2014.08.10, 2015.02.05

Tiezheng Yuan Ph.D.

Wednesday, February 11, 2015

NGS: Saturation analysis for RNA-seq

1 comment: