Friday, August 7, 2015

Bioinformatics in Python: calculate Kmer


Abstract: k-mer counting of a DNA sequence

The term k-mer typically refers to all the possible substrings of length k that are contained in a string.

The result:
{'ATGTGT': 1, 'TATTGT': 1, 'TGTGTA': 1, 'GTCGTA': 1, 'CGTAGC': 1, 'TACGTA': 1, 'TGTACT': 1, 'ATGTCT': 1, 'ACCAGT': 1, 'GCGTGT': 1, 'TACCAG': 1, 'GCTGTA': 1, 'TGTACG': 1, 'TGTCGT': 1, 'TGTACC': 1, 'TCGTAT': 1, 'GTGTCG': 1, 'TGTAGC': 1, 'CTGTAG': 1, 'TCATGT': 2, 'TAGCGT': 1, 'TTGTAC': 1, 'CGTGTC': 1, 'AGTCAT': 2, 'CATGTC': 1, 'GTGTAC': 1, 'CATGTG': 1, 'ATTGTA': 1, 'CGTATT': 1, 'AGCGTG': 1, 'GTAGCG': 1, 'CTGTAC': 1, 'GTCATG': 2, 'TAGCTG': 1, 'ACGTAG': 1, 'GTACGT': 1, 'CCAGTC': 1, 'CAGTCA': 1, 'GTACTG': 1, 'GTAGCT': 1, 'GTACCA': 1, 'AGCTGT': 1, 'GTCTGT': 1, 'GTATTG': 1, 'TCTGTA': 1, 'TGTCTG': 1}

The script:
# -*- coding: utf-8 -*-
"""
Created on Fri Aug  7 22:22:32 2015

@author: yuan
"""

def kmer(seq, kmer_size=4):
    kmer_counting={}
    for start in range(len(seq)-(kmer_size-1)):
        #print start
        kmer = seq[start:start+kmer_size]
        if kmer_counting.has_key(kmer):
            kmer_counting[kmer] +=1
        else:
            kmer_counting[kmer] =1
        #print(kmer)
    return kmer_counting

if __name__=="__main__":
    DNA_seq  = "AGTCATGTCTGTAGCTGTACGTAGCGTGTCGTATTGTACCAGTCATGTGTACTG"
    print DNA_seq
    kmer=kmer(DNA_seq,6)
    print kmer

   

No comments:

Post a Comment