Tiezheng Yuan Ph.D.: Perl: Extract sequences from FASTA format files

Abstract: extract sequences from FASTA format file using the Bio-perl module.

One straightforward way is utilize previous Perl models published in CPAN (search.cpan.org), such as Bioperl. There are tons of usages in Bio-Perl. The Bio-perl manual and example scripts are available at www.bioperl.org.

FASTA format file is so popular in bioinformatics study. Here is an example of FASTA file, contains form at least one sequence record to million s of records in a file. One sequence record have two lines. One line begins with ‘>’ including sequence-related information (namely accession number), and the other line is nucleotide or amino acid sequences.

>gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p9 Primary Assembly
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCTGGCGCAGGCGCAGAGACGCAAGCCTACGGGCGGGGGTTGGGGGGGCGTGTGTTGCAGGA

The code listed here allow you to extract sequences from a FASTA file using Bio-perl.

#!/usr/bin/perl

use strict;

use warnings;

use Bio::SeqIO;

my $in_obj = Bio::SeqIO->new(-file => "/data/test.fasta", -format => "fasta");

while ( my $seq_obj=$in_obj->next_seq() ) {

my $display_id=$seq_obj->display_id();

my $description=$seq_obj->desc();

my $sequence=$seq_obj->seq();

print "$display_id\t", "$description\n";

}

The whole process would be：open test.fasta file firstly，and then extract sequences records one by one. For each record, we can get

display_id（“gi|224514618|ref|NT_077402.2”，GI and accession number )

description（“Homo sapiens chromosome 1 genomic contig, GRCh37.p9 Primary Assembly”）

DNA sequences

Writing data: 2013.01.24

Tiezheng Yuan Ph.D.

Friday, January 30, 2015

Perl: Extract sequences from FASTA format files

No comments:

Post a Comment