Sunday, August 16, 2015

Bioinformatics in Python: filter DNA sequence

Abstract: filter DNA sequences from the sequence from the GenBank format

The DNA sequence would be like this in GenBank file:
1 ttaagatttg cgctttgcca actgtacacc caacctcggt
41 ttattgtcga acctcccgct tgtgccgcca tctgcatata
81 gatcccggtc agtccgtcac attctgccaa ttgagtatcc
..........................
The output should be like this:
TTAAGATTTGCGCTTTGCCAACTGTACACCCAACCTCGGTTTATTGTCGAACCTCCCGCTTGTGCCGCCATCTGCATATAGATCCCGGTCAGTCCGTCACATTCTGCCAATTGAGTATCCTCGAAGTCTTATTCCACGTGCTCAAAGCAAGGGTATCGTACAGTGATAACCGCCTCGTGCAGATCCAAATTCTCGATTAACACTCAAGTACTGA
...........................

The script:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 16 20:10:12 2015

@author: yuan
"""

import re

#filter DNA sequence
def filter_DNA(in_file):
   
    #read txt file
    in_obj=open(in_file, 'r')
    DNA_seq=[ line.rstrip("\n") for line in in_obj]
    DNA_seq=''.join(DNA_seq)
   
    #remove numeric character and white space
    DNA_seq=re.sub("[0-9]|\s","", DNA_seq)
    #replace not A/T/C/G with N
    DNA_seq=re.sub("[A|T|G|C]","N", DNA_seq)
    DNA_seq=DNA_seq.upper()
   
    return DNA_seq
       
if __name__=="__main__":
    in_file='/home/yuan/mysql_pre/cookbooks/mixed_DNA.txt'
    DNA_seq=filter_DNA(in_file)
    print DNA_seq
   
    print 'ok'

No comments:

Post a Comment