Thursday, January 29, 2015

NGS: Get sequences from raw data



Abstract: Sequencing raw data in FASTQ format is a crucial starting point of NGS data analysis. Here, I show how to extract read sequences from FASTQ files.


The output of high-throughput sequencing instruments namely Illumina Genome Analyzer is stored into the file in FASTQ format. FASTQ format is originally developed at the Welcome Trust Sanger Institute. FASTQ file is a text-based file storing both nucleotide sequences and corresponding quality scores with ASCII characters, where the sequence letter and its quality score are matched one on one.

The FASTQ file is usually named with the file tail ‘.fq’ or ‘.fastq’. You can open the file directly using vi/vim editor in Linux system. It might be hard to open it in Window system because they are usually very big up to many Gbs in file size.
[yuan@localhost ~]$ gunzip test_R1_001.fq.gz #decompress the file
[yuan@localhost ~]$ vi test_R1_001.fq #open the file using vi editor

Therefore, we would see the contents of the file like this:
@R0230412_0111:5:1101:1552:1997#GAGTGG/1
NCTTCATGTGATTGAGGATTTCATTGACCAGGACACCCCCTTGGAGCCTA
+R0230412_0111:5:1101:1552:1997#GAGTGG/1
BP\ceeeegggggiiiiiiiiiiiihiiiiiiiiiiiiiiiiiiiiihhi
@R0230412_0111:5:1101:3216:1997#GAGTGG/1
NTGGAGCCAAGTGCTAACATGCCTTGGTTCAAGGGATGGAAAGTCACCCG
+R0230412_0111:5:1101:3216:1997#GAGTGG/1
BP\cceeefggegfghhhfgeghhhhhebgdghhhhgfhdhhhffgfg_f
@R0230412_0111:5:1101:4059:1994#GAGTGG/1
NAGTGGTTCTTCAATTGCTATTTGCATTTCTTTAAAGTACCCCAGTCCTT
+R0230412_0111:5:1101:4059:1994#GAGTGG/1
BP\cceccfggggiiiiiiiiiiiiiiiiiiiihiiihhiiiiiigghii
A FASTQ file contains tons of sequence records, which have 4 lines. The above example show first three sequence records in the file. For a sequence record:
Line 1: begins with @, for example,
@R0230412_0111:5:1101:1552:1997#GAGTGG/1
R0230412_0111
The unique instrument name
5
Flowcell lane
1101
Tile number within the flowcell lane No.5.
1552:1997
The coordinates (x,y) of the cluster within the tile No.1101. They are unique for this read across this lane, but are identical for R1 and R2 reads under paired-end sequencing.
#GAGTGG
barcode for a multiplexed sample
/1
The member of a pair, which is consistent with ‘R1’ in the file name, and /2 is mate-pair read known as ‘R2’ under paired-end sequencing.
Line 2: raw read sequences, which length depends on the circles sequencing. The usual circles for Hiseq2000 is 50-base or 100-base length.
Line 3: begin with +.
Line 4: quality scores with ASCII characters, which must contain the same number of symbols as the letters in the sequencing.

Our first job is to extract sequences, or quality scores sometimes are also required. I showed how to extract read sequences and their quality scores from the FASTQ file using the perl-bioperl module. Here is the Perl script:
#! /usr/bin/perl -w
use strict;
use warnings;
use Bio::SeqIO;
use Bio::Seq::Quality;

#read R1 raw data
my $fastq_file='/home/yuan/data_1/test_R1_001.fastq';
my $in_obj = Bio::SeqIO->new(-file => $fastq_file, -format => 'fastq');
while (my $seq_obj = $in_obj->next_seq() ) {
my @display_id=split(/:/, $seq_obj->display_id());
my $coordinate=$display_id[-2]."_".$display_id[-1]; #the coordinate
my $seq=$seq_obj->seq(); # the sequences
my $quality_pointer=$seq_obj->Bio::Seq::Quality::qual;
my @quality=@$quality_pointer; # the quality scores
print "$coordinate: $seq\n @quality\n";
}
print "Done! OK\n";

References

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71.

Writing: 2015.01.28

No comments:

Post a Comment