As we can see from the task, the talk is about DNA replication, as said my friends from team, we should find the most popular string in the given file.
First of all, I should say, that I'm using python. In the start I just tried to build index where keys are string that can be found in file and values are there frequency. But that didn't work. Because in python index always saved in the RAM, and for my counting, I should have more than 16GB (I think something like 32 or 64). That numbers is reachable, but I guessed that it should be better solution.
So there is two solutions:
First I think just install some datebase with indexies that could use disk for storage. But in this case you should do more admin stuff I think.
Second: I wrote variant just for lulz, with no math calculation and it worked:
At the begiging script builds index reading stings from random places. For me 10 000 000 strings was enough. After that I just take 10 00 most popular string, and count there frequency only:
from collections import defaultdict from operator import itemgetter from random import randint index = defaultdict(int) with open("word2") as f: s = f.read() for i in xrange(10000000): j = randint(0, len(s)) index[s[j:j+32]] += 1 index = dict(sorted(index.iteritems(), key=itemgetter(1), reverse=True)[:10000]) for i, el in enumerate(s): try: curS = s[i:i+32] except: break if i % 100000 == 0: print i if curS in index: index[curS] += 1 print max(index.iteritems(), key=itemgetter(1))
The answer is GGAACAAGTTACATGGGCCGAATGCTATTGTC, and it could be found 64 times in the file. Script works for 120 second.