Friday, August 7, 2015
Bioinformatics in Python: calculate Kmer
Abstract: k-mer counting of a DNA sequence
The term k-mer typically refers to all the possible substrings of length k that are contained in a string.
The result:
{'ATGTGT': 1, 'TATTGT': 1, 'TGTGTA': 1, 'GTCGTA': 1, 'CGTAGC': 1, 'TACGTA': 1, 'TGTACT': 1, 'ATGTCT': 1, 'ACCAGT': 1, 'GCGTGT': 1, 'TACCAG': 1, 'GCTGTA': 1, 'TGTACG': 1, 'TGTCGT': 1, 'TGTACC': 1, 'TCGTAT': 1, 'GTGTCG': 1, 'TGTAGC': 1, 'CTGTAG': 1, 'TCATGT': 2, 'TAGCGT': 1, 'TTGTAC': 1, 'CGTGTC': 1, 'AGTCAT': 2, 'CATGTC': 1, 'GTGTAC': 1, 'CATGTG': 1, 'ATTGTA': 1, 'CGTATT': 1, 'AGCGTG': 1, 'GTAGCG': 1, 'CTGTAC': 1, 'GTCATG': 2, 'TAGCTG': 1, 'ACGTAG': 1, 'GTACGT': 1, 'CCAGTC': 1, 'CAGTCA': 1, 'GTACTG': 1, 'GTAGCT': 1, 'GTACCA': 1, 'AGCTGT': 1, 'GTCTGT': 1, 'GTATTG': 1, 'TCTGTA': 1, 'TGTCTG': 1}
The script:
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 7 22:22:32 2015
@author: yuan
"""
def kmer(seq, kmer_size=4):
kmer_counting={}
for start in range(len(seq)-(kmer_size-1)):
#print start
kmer = seq[start:start+kmer_size]
if kmer_counting.has_key(kmer):
kmer_counting[kmer] +=1
else:
kmer_counting[kmer] =1
#print(kmer)
return kmer_counting
if __name__=="__main__":
DNA_seq = "AGTCATGTCTGTAGCTGTACGTAGCGTGTCGTATTGTACCAGTCATGTGTACTG"
print DNA_seq
kmer=kmer(DNA_seq,6)
print kmer
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment