Tiezheng Yuan Ph.D.: NGS: Get sequences from raw data

Abstract: Sequencing raw data in FASTQ format is a crucial starting point of NGS data analysis. Here, I show how to extract read sequences from FASTQ files.

The output of high-throughput sequencing instruments namely Illumina Genome Analyzer is stored into the file in FASTQ format. FASTQ format is originally developed at the Welcome Trust Sanger Institute. FASTQ file is a text-based file storing both nucleotide sequences and corresponding quality scores with ASCII characters, where the sequence letter and its quality score are matched one on one.

The FASTQ file is usually named with the file tail ‘.fq’ or ‘.fastq’. You can open the file directly using vi/vim editor in Linux system. It might be hard to open it in Window system because they are usually very big up to many Gbs in file size.

[yuan@localhost ~]$ gunzip test_R1_001.fq.gz #decompress the file

[yuan@localhost ~]$ vi test_R1_001.fq #open the file using vi editor

Therefore, we would see the contents of the file like this:

@R0230412_0111:5:1101:1552:1997#GAGTGG/1

NCTTCATGTGATTGAGGATTTCATTGACCAGGACACCCCCTTGGAGCCTA

+R0230412_0111:5:1101:1552:1997#GAGTGG/1

BP\ceeeegggggiiiiiiiiiiiihiiiiiiiiiiiiiiiiiiiiihhi

@R0230412_0111:5:1101:3216:1997#GAGTGG/1

NTGGAGCCAAGTGCTAACATGCCTTGGTTCAAGGGATGGAAAGTCACCCG

+R0230412_0111:5:1101:3216:1997#GAGTGG/1

BP\cceeefggegfghhhfgeghhhhhebgdghhhhgfhdhhhffgfg_f

@R0230412_0111:5:1101:4059:1994#GAGTGG/1

NAGTGGTTCTTCAATTGCTATTTGCATTTCTTTAAAGTACCCCAGTCCTT

+R0230412_0111:5:1101:4059:1994#GAGTGG/1

BP\cceccfggggiiiiiiiiiiiiiiiiiiiihiiihhiiiiiigghii

A FASTQ file contains tons of sequence records, which have 4 lines. The above example show first three sequence records in the file. For a sequence record:

Line 1: begins with @, for example,

@R0230412_0111:5:1101:1552:1997#GAGTGG/1

R0230412_0111	The unique instrument name
5	Flowcell lane
1101	Tile number within the flowcell lane No.5.
1552:1997	The coordinates (x,y) of the cluster within the tile No.1101. They are unique for this read across this lane, but are identical for R1 and R2 reads under paired-end sequencing.
#GAGTGG	barcode for a multiplexed sample
/1	The member of a pair, which is consistent with ‘R1’ in the file name, and /2 is mate-pair read known as ‘R2’ under paired-end sequencing.

Line 2: raw read sequences, which length depends on the circles sequencing. The usual circles for Hiseq2000 is 50-base or 100-base length.

Line 3: begin with +.

Line 4: quality scores with ASCII characters, which must contain the same number of symbols as the letters in the sequencing.

Our first job is to extract sequences, or quality scores sometimes are also required. I showed how to extract read sequences and their quality scores from the FASTQ file using the perl-bioperl module. Here is the Perl script:

#! /usr/bin/perl -w

use strict;

use warnings;

use Bio::SeqIO;

use Bio::Seq::Quality;

#read R1 raw data

my $fastq_file='/home/yuan/data_1/test_R1_001.fastq';

my $in_obj = Bio::SeqIO->new(-file => $fastq_file, -format => 'fastq');

while (my $seq_obj = $in_obj->next_seq() ) {

my @display_id=split(/:/, $seq_obj->display_id());

my $coordinate=$display_id[-2]."_".$display_id[-1]; #the coordinate

my $seq=$seq_obj->seq(); # the sequences

my $quality_pointer=$seq_obj->Bio::Seq::Quality::qual;

my @quality=@$quality_pointer; # the quality scores

print "$coordinate: $seq\n @quality\n";

}

print "Done! OK\n";

References

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71.

Writing: 2015.01.28

Tiezheng Yuan Ph.D.

Thursday, January 29, 2015

NGS: Get sequences from raw data

References

No comments:

Post a Comment