[python] Upper memory limit?

Is there a limit to memory for python? I've been using a python script to calculate the average values from a file which is a minimum of 150mb big.

Depending on the size of the file I sometimes encounter a MemoryError.

Can more memory be assigned to the python so I don't encounter the error?


EDIT: Code now below

NOTE: The file sizes can vary greatly (up to 20GB) the minimum size of the a file is 150mb

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count +=1

    for k in range(0,count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

This question is related to python memory

The answer is


You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.

Better iterate over each line:

for current_line in u:
    do_something_with(current_line)

is the recommended approach.

Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.

This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.

Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.


No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.

In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.

Edit:

Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.

A better approach would be to read the files one line at a time:

for u in files:
     for line in u: # This will iterate over each line in the file
         # Read values from the line, do necessary calculations

Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.

You have a secondary problem: your choices of variable names severely obfuscate what you are doing.

Here is your script rewritten with the readlines() caper removed and with meaningful names:

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.

As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.

Here is a revised version of the outer loop code:

for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1

Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about

1959167 [MiB]

On jython 2.5 it crashes earlier:

 239000 [MiB]

probably I can configure Jython to use more memory (it uses limits from JVM)

Test app:

import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))

In your app you read whole file at once. For such big files you should read the line by line.