
Server performance question - string processing

Hello Guys!

I would like to ask you a performance-related question.

I have some code for finding repeating words in a huge string. My question is: can there be any performance issues with running the code below? (Of course, I would like to use it on quite long strings. It works perfectly in my local environment when the string contains a few thousand characters, but with bigger strings the processing time increases a lot.)

Basically this is what I want (finding repeating words and counting them):

mystring = "abcdthisisatextwithsampletextforasampleabcd"

'text' 2 times
'sample' 2 times
'abcd' 2 times
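
If I already knew which words to look for, a quick str.count would of course confirm these numbers:

mystring = "abcdthisisatextwithsampletextforasampleabcd"

# count the non-overlapping occurrences of each known word
for word in ("text", "sample", "abcd"):
    print(word, mystring.count(word))

But the whole point is that I do not know the repeated words in advance, so I have to search for them.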

And this is the code snippet (from Stack Overflow) that I would like to use:

# Counter is used to tally how often each substring was found
from collections import Counter

# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []

    # start building the indices
    for i in range(0, length - 2):

        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])

            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break

    # return all of the calculated indices ranges
    return all_sets

# store search strings
tmplst = []; combos = []; found = []

# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"

# get length of string
length = len(mystring)

# get all of the index ranges for substring lengths of 4 and greater
all_sets = getIndices(length)

# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []

# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])

# make a dictionary containing the strings and their counts
d1 = Counter(found)

# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print(k, v)

So can I run this code without worrying about the size of the source string? I understand that it can be time-consuming with long strings; I am more concerned about not eating up all of the server's RAM, etc. Is there anything I should take care of related to the above code?

You have multiple nested for loops that depend on the length of the string, which means you spend a lot of time in repeated loops. Your algorithm is at least O(n^2). Read up on Big O notation to understand what that means.
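
To give you an idea, the same counting can be done in a single pass over the substrings with one Counter instead of the repeated membership searches. This is only a rough sketch (the count_repeats function and the min_len parameter are just names I picked for the example, not part of your snippet):

from collections import Counter

def count_repeats(text, min_len=4):
    # tally every substring of length >= min_len in one pass
    counts = Counter()
    n = len(text)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            counts[text[start:end]] += 1
    # keep only the substrings that occur more than once
    return {s: c for s, c in counts.items() if c > 1}

for s, c in count_repeats("abcdthisisatextwithsampletextforasampleabcd").items():
    print(s, c)

It still has to look at every substring, so it is far from cheap on long strings, but it drops the extra loops over all_sets and combos that your version repeats for every index range.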

Thanks Glenn! I have read about it; it is quite a new concept for me. So the main issue is the time, not using up all of the RAM, for example. Am I right?

Memory is another issue, but Big O notation as used here describes the time complexity of your algorithm.
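
To put a rough number on that: your code stores every substring of length 4 or more, and the number of those grows with the square of the string length. A quick back-of-the-envelope check (illustrative numbers only):

# rough count of the substrings of length >= 4 for a string of length n
for n in (1000, 10000, 100000):
    substrings = sum(n - k + 1 for k in range(4, n + 1))
    print(n, substrings)

At 100,000 characters that is already around five billion substrings, so the lists your current approach builds would almost certainly exhaust the server's RAM long before the counting finishes.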

I will try to do it with Hadoop or something like that; maybe that would be a good idea. Thank you for the help!