Forums

Using two steps Markov chains to generate DNA sequences.

Hi everyone,

I've already know how to generate DNA sequences (A,T,C,G) by giving a transitional matrix for one step Markov Chain, which means latter nucleotide frequency depends on former one. But I wanna choose latter nucleotide based on former two nucleotides, can anyone help me on the code?

Thanks a lot.

EDIT: just realised that you want differing probabilities, so a simple map won't do. Still, my main point still stands - just use the same approach as currently but index on a 2-char string instead on 1 (or a 2-tuple if you're already using a list instead of a string).

So you want to generate strings of letters which represent DNA sequences, and currently you're using a rule giving a list of possible nucleotides based on just the previous one. However, you want to amend this to base the choice on both previous nucleotides. Correct?

I would treat the whole thing as a string and use slice notation to get the string representing the last two characters. Then just set up a dictionary which maps the 2-character string to a set of nucleotide letters, and use the choice() method from the random module to select from the set. You'll need to initialise the string with a pair of nucleotides to begin with, of course.

If that's not clear, I can explain in more detail, but there's a limit to how verbose I can be when bottle-feeding my baby daughter in the middle of the night! :-)

The issue of how best to implement Markov Chains piqued my interest, so here's a little script I crashed out off the top of my head. I spent about 5 minutes or so writing it, so don't expect the cleanest code, but hopefully it illustrates the point (I didn't use nucleotide sequences, I just invented a random sequence of X, Y and Z):

import random

# Define the chain.
chain = {
    "XX": {"X": 8, "Y": 1, "Z": 1},
    "XY": {"X": 1, "Y": 2, "Z": 5},
    "XZ": {"X": 1, "Y": 3, "Z": 2},
    "YX": {"X": 2, "Y": 4, "Z": 5},
    "YY": {"X": 4, "Y": 1, "Z": 4},
    "YZ": {"X": 3, "Y": 2, "Z": 1},
    "ZX": {"X": 2, "Y": 2, "Z": 4},
    "ZY": {"X": 1, "Y": 4, "Z": 2},
    "ZZ": {"X": 4, "Y": 4, "Z": 1}
}

def chain_gen(chain, start, length):
    prev = start[-2:]
    for i in start:
        yield i
    for i in xrange(length):
        try:
            choice = random.randint(1, sum(chain[prev].values()))
            cumulative = 0
            for item, chance in sorted(chain[prev].items()):
                cumulative += chance
                if choice <= cumulative:
                    yield item
                    prev = prev[-1] + item
                    break
        except KeyError:
            break

print "".join(chain_gen(chain, "ZZ", 40))

The chain is stored as a dictionary mapping the previous two characters to another dictionary mapping the next character to the relative chance of that character being chosen. These are relative to the other choices in the dictionary, so if the chances are 1, 3 and 2 then the corresponding probabilities will be 1 in 6, 3 in 6 and 2 in 6.

@husky113: Welcome to PA, I just emailed you my DNA profile...let's keep it private though okay?...☺

I think trading DNA on the forums is against the TOS.

Creating DNA sequences from markov chains is fascinating stuff. So then you would run each chain through a genetic fitness algorithm that decided whether or not they would actually produce anything interesting if used?