Friday, January 30, 2015

Perl: Extract sequences from FASTA format files



Abstract: extract sequences from FASTA format file using the Bio-perl module.


One straightforward way is utilize previous Perl models published in CPAN (search.cpan.org), such as Bioperl. There are tons of usages in Bio-Perl. The Bio-perl manual and example scripts are available at www.bioperl.org.

FASTA format file is so popular in bioinformatics study. Here is an example of FASTA file, contains form at least one sequence record to million s of records in a file. One sequence record have two lines. One line begins with ‘>’ including sequence-related information (namely accession number), and the other line is nucleotide or amino acid sequences.
>gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p9 Primary Assembly
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCTGGCGCAGGCGCAGAGACGCAAGCCTACGGGCGGGGGTTGGGGGGGCGTGTGTTGCAGGA

The code listed here allow you to extract sequences from a FASTA file using Bio-perl.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in_obj = Bio::SeqIO->new(-file => "/data/test.fasta", -format => "fasta");
while ( my $seq_obj=$in_obj->next_seq() ) {
my $display_id=$seq_obj->display_id();
my $description=$seq_obj->desc();
my $sequence=$seq_obj->seq();
print "$display_id\t", "$description\n";
}

The whole process would beopen test.fasta file firstlyand then extract sequences records one by one. For each record, we can get
display_id(“gi|224514618|ref|NT_077402.2”GI and accession number )
description(“Homo sapiens chromosome 1 genomic contig, GRCh37.p9 Primary Assembly”
DNA sequences

Writing data: 2013.01.24

No comments:

Post a Comment