Abstract: Sequencing raw data in FASTQ format is a crucial
starting point of NGS data analysis. Here, I show how to extract read
sequences from FASTQ files.
The output of high-throughput sequencing instruments namely Illumina
Genome Analyzer is stored into the file in FASTQ format. FASTQ format
is originally developed at the Welcome Trust Sanger Institute. FASTQ
file is a text-based file storing both nucleotide sequences and
corresponding quality scores with ASCII characters, where the
sequence letter and its quality score are matched one on one.
The FASTQ file is usually named with the file tail ‘.fq’ or
‘.fastq’. You can open the file directly using vi/vim editor in
Linux system. It might be hard to open it in Window system because
they are usually very big up to many Gbs in file size.
[yuan@localhost
~]$ gunzip test_R1_001.fq.gz
#decompress the file
[yuan@localhost
~]$ vi test_R1_001.fq
#open the file using vi editor
Therefore, we would see the contents of the file like this:
@R0230412_0111:5:1101:1552:1997#GAGTGG/1
NCTTCATGTGATTGAGGATTTCATTGACCAGGACACCCCCTTGGAGCCTA
+R0230412_0111:5:1101:1552:1997#GAGTGG/1
BP\ceeeegggggiiiiiiiiiiiihiiiiiiiiiiiiiiiiiiiiihhi
@R0230412_0111:5:1101:3216:1997#GAGTGG/1
NTGGAGCCAAGTGCTAACATGCCTTGGTTCAAGGGATGGAAAGTCACCCG
+R0230412_0111:5:1101:3216:1997#GAGTGG/1
BP\cceeefggegfghhhfgeghhhhhebgdghhhhgfhdhhhffgfg_f
@R0230412_0111:5:1101:4059:1994#GAGTGG/1
NAGTGGTTCTTCAATTGCTATTTGCATTTCTTTAAAGTACCCCAGTCCTT
+R0230412_0111:5:1101:4059:1994#GAGTGG/1
BP\cceccfggggiiiiiiiiiiiiiiiiiiiihiiihhiiiiiigghii
A FASTQ file contains tons of sequence records, which have 4 lines.
The above example show first three sequence records in the file. For
a sequence record:
Line 1: begins with @, for example,
@R0230412_0111:5:1101:1552:1997#GAGTGG/1
- R0230412_0111The unique instrument name5Flowcell lane1101Tile number within the flowcell lane No.5.1552:1997The coordinates (x,y) of the cluster within the tile No.1101. They are unique for this read across this lane, but are identical for R1 and R2 reads under paired-end sequencing.#GAGTGGbarcode for a multiplexed sample/1The member of a pair, which is consistent with ‘R1’ in the file name, and /2 is mate-pair read known as ‘R2’ under paired-end sequencing.
Line 2: raw read sequences, which length depends on the circles
sequencing. The usual circles for Hiseq2000 is 50-base or 100-base
length.
Line 3: begin with +.
Line 4: quality scores with ASCII characters, which must contain the
same number of symbols as the letters in the sequencing.
Our first job is to extract sequences, or quality scores sometimes
are also required. I showed how to extract read sequences and their
quality scores from the FASTQ file using the perl-bioperl module.
Here is the Perl script:
#!
/usr/bin/perl -w
use strict;
use
warnings;
use
Bio::SeqIO;
use
Bio::Seq::Quality;
#read R1
raw data
my
$fastq_file='/home/yuan/data_1/test_R1_001.fastq';
my $in_obj
= Bio::SeqIO->new(-file => $fastq_file, -format => 'fastq');
while (my
$seq_obj = $in_obj->next_seq() ) {
my
@display_id=split(/:/, $seq_obj->display_id());
my
$coordinate=$display_id[-2]."_".$display_id[-1];
#the coordinate
my
$seq=$seq_obj->seq();
# the sequences
my
$quality_pointer=$seq_obj->Bio::Seq::Quality::qual;
my
@quality=@$quality_pointer;
# the quality scores
print
"$coordinate: $seq\n @quality\n";
}
print "Done! OK\n";
References
The Sanger FASTQ file format
for sequences with quality scores, and the Solexa/Illumina FASTQ
variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71.
Writing: 2015.01.28
No comments:
Post a Comment