
Server performance question - string processing

Hello Guys!

I would like to ask you a performance-related question.

I have some code for finding repeating words in a huge string. My question is: can there be any performance issues with running the code below? (Of course, I would like to use it on quite long strings. It works perfectly in my local environment when the string contains a few thousand characters, but with bigger strings the processing time increases a lot.)

Basically this is what I want (finding repeating words and counting them):

mystring = "abcdthisisatextwithsampletextforasampleabcd"

'text' 2 times
'sample' 2 times
'abcd' 2 times
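
If I already knew which words to look for, a quick str.count would of course confirm these numbers:

mystring = "abcdthisisatextwithsampletextforasampleabcd"

# count the non-overlapping occurrences of each known word
for word in ("text", "sample", "abcd"):
    print(word, mystring.count(word))

But the whole point is that I do not know the repeated words in advance, so I have to search for them.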

And this is the code snippet (from Stack Overflow) that I would like to use:

# Counter is used to tally how often each substring was found
from collections import Counter

# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []

    # start building the indices
    for i in range(0, length - 2):

        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])

            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break

    # return all of the calculated indices ranges
    return all_sets

# store search strings
tmplst = []; combos = []; found = []

# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

# get length of string
length = len(mystring)

# get all of the index ranges for substring lengths of 4 and greater
all_sets = getIndices(length)

# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []

# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])

# make a dictionary containing the strings and their counts
d1 = Counter(found)

# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print(k, v)

So can I run this code without worrying about the size of the source string? I understand that it can be time-consuming with long strings; I am more concerned about not eating up all of the server's RAM, etc. Is there anything I should take care of related to the above code?

You have multiple nested for loops that depend on the length of the string, which means you spend a lot of time in repeated loops. Your algorithm is at least O(n^2). Read up on Big O notation to understand what that means.
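
To give you an idea, the same counting can be done in a single pass over the substrings with one Counter instead of the repeated membership searches. This is only a rough sketch (the count_repeats function and the min_len parameter are just names I picked for the example, not part of your snippet):

from collections import Counter

def count_repeats(text, min_len=4):
    # tally every substring of length >= min_len in one pass
    counts = Counter()
    n = len(text)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            counts[text[start:end]] += 1
    # keep only the substrings that occur more than once
    return {s: c for s, c in counts.items() if c > 1}

for s, c in count_repeats("abcdthisisatextwithsampletextforasampleabcd").items():
    print(s, c)

It still has to look at every substring, so it is far from cheap on long strings, but it drops the extra loops over all_sets and combos that your version repeats for every index range.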

Thanks Glenn! I have read about it; it is quite a new concept for me. So the main issue is the time, not using up all of the RAM, for example. Am I right?

Memory is another issue, but Big O notation as used here describes the time complexity of your algorithm.
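
To put a rough number on that: your code stores every substring of length 4 or more, and the number of those grows with the square of the string length. A quick back-of-the-envelope check (illustrative numbers only):

# rough count of the substrings of length >= 4 for a string of length n
for n in (1000, 10000, 100000):
    substrings = sum(n - k + 1 for k in range(4, n + 1))
    print(n, substrings)

At 100,000 characters that is already around five billion substrings, so the lists your current approach builds would almost certainly exhaust the server's RAM long before the counting finishes.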

I will try to do it with Hadoop or something like that; maybe that would be a good idea. Thank you for the help!