Upper memory limit

Question

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB in size.

Depending on the size of the file I sometimes encounter a MemoryError.

Can more memory be assigned to Python so I don't encounter the error?

EDIT: Code now below.

NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count +=1

    for k in range(0,count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0,length):
        total = 0
        for p in range(0,count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

User · Answer

Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.

You have a secondary problem: your choices of variable names severely obfuscate what you are doing.

Here is your script rewritten with the readlines() caper removed and with meaningful names:

file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")
file_write = open ("average_generations.txt", "w")
mutation_average = open("mutation_average", "w") # not used
files = [file_A2_B2,file_A2_B2,file_A1_B2,file_A2_B1]
for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n') # why?
        table.append(values)
    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4
    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
file_write.write('\n')

It rapidly becomes apparent that (1) you are calculating column averages (2) the obfuscation led some others to think you were calculating row averages.

As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.

Here is a revised version of the outer loop code:

for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n') # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]
    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter +=1
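
For readers on Python 3, the same one-pass idea can be sketched as below. This is only an illustration, not the code from the answer: the function name column_averages is made up here, and it assumes each row is tab-separated numbers, possibly with a trailing tab before the newline (which is what the values.remove('\n') call suggests).

```python
def column_averages(lines):
    """Return per-column averages of tab-separated numeric rows.

    `lines` can be any iterable of strings, such as an open file, so only
    the current row and one running total per column are kept in memory.
    """
    column_totals = []
    row_count = 0
    for line in lines:
        # Drop the newline and any trailing tab before splitting.
        stripped = line.rstrip("\t\r\n")
        if not stripped:
            continue  # skip blank lines
        values = [float(field) for field in stripped.split("\t")]
        if row_count == 0:
            column_totals = values
        elif len(values) != len(column_totals):
            raise ValueError("row %d has %d columns, expected %d"
                             % (row_count + 1, len(values), len(column_totals)))
        else:
            for index, value in enumerate(values):
                column_totals[index] += value
        row_count += 1
    if row_count == 0:
        return []
    return [total / row_count for total in column_totals]


sample = ["1.0\t2.0\t\n", "3.0\t6.0\t\n"]
print(column_averages(sample))  # [2.0, 4.0]
```

Passing an open file object instead of the sample list keeps memory use proportional to the number of columns, whatever the file size.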

User · Answer

You're reading the entire file into memory (line = u.readlines()), which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.

Better iterate over each line:

for current_line in u:
    do_something_with(current_line)

is the recommended approach.

Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly?

What is the purpose of your script? I have the impression that this could be done much easier.

This is one of the advantages of high-level languages like Python (as opposed to C, where you do have to do these housekeeping tasks yourself): allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.

Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module, which will handle all the splitting, removing of \n characters etc. for you.
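
A rough sketch of what the csv suggestion could look like, assuming Python 3 and a tab-delimited file of numbers. The helper name tsv_column_averages is hypothetical; csv.reader with delimiter="\t" is the standard-library call. Empty fields (such as the one left by a trailing tab) are simply skipped here, which is an assumption about the data.

```python
import csv
import io


def tsv_column_averages(text_stream):
    """Average each column of a tab-separated stream, one row at a time."""
    reader = csv.reader(text_stream, delimiter="\t")
    totals = []
    row_count = 0
    for row in reader:
        # A trailing tab appears as an empty last field; ignore empty fields.
        values = [float(field) for field in row if field.strip()]
        if not values:
            continue  # blank line
        if not totals:
            totals = [0.0] * len(values)
        for index, value in enumerate(values):
            totals[index] += value
        row_count += 1
    return [total / row_count for total in totals] if row_count else []


# In real use the stream would come from open(path, newline="").
demo = io.StringIO("1\t2\t\n3\t4\t\n")
print(tsv_column_averages(demo))  # [2.0, 3.0]
```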

User · Answer

No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than is available on the machine you're running on.

In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.

Edit:

Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.

A better approach would be to read the files one line at a time:

for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
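
The difference between the two approaches can be measured with the standard-library tracemalloc module (Python 3.4 and later). This is an illustrative sketch only: the helper names and the small throwaway test file are assumptions made for the demonstration, and the actual numbers depend on the interpreter and platform.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(action):
    """Run action() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        action()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_all(path):
    with open(path) as handle:
        lines = handle.readlines()  # every line held in memory at once
    return len(lines)


def read_streaming(path):
    count = 0
    with open(path) as handle:
        for _ in handle:  # one line at a time
            count += 1
    return count


# Build a small throwaway file (roughly 1 MB) purely for the comparison.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for _ in range(20000):
        tmp.write("0.5\t" * 12 + "\n")
    demo_path = tmp.name

peak_all = peak_bytes(lambda: read_all(demo_path))
peak_stream = peak_bytes(lambda: read_streaming(demo_path))
os.remove(demo_path)
print("readlines peak:", peak_all, "bytes")
print("streaming peak:", peak_stream, "bytes")
```

The readlines() peak grows with the file size, while the streaming peak stays roughly constant, which is why only the latter scales to multi-gigabyte inputs.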

User · Answer

This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.

Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years, most not too major. This is so that if folks use it as a template, it will provide an even better basis.

As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.

Python's memory limits are determined by how much physical RAM and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.

Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.

To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.

try:
    from itertools import izip_longest
except ImportError:    # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))

        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                        izip_longest(totals, map(float, fields), fillvalue=0)]
        averages = [total/count for total in totals]

        for print_counter, average in enumerate(averages):
            print('  {:9.4f}'.format(average))
            if print_counter % GROUP_SIZE == 0:
                file_write.write(str(average)+'\n')

file_write.write('\n')
file_write.close()
mutation_average.close()
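
To see why zip_longest (izip_longest on Python 2) with fillvalue=0 works as a running-total update, here is a small Python 3 sketch on in-memory rows. The helper name add_row and the sample rows are hypothetical and exist only for this illustration.

```python
from itertools import zip_longest


def add_row(totals, row):
    """Return new per-column totals after adding one row of numbers.

    zip_longest pads the shorter sequence with 0, so an empty `totals`
    list on the first row behaves like a list of zeros.
    """
    return [total + value
            for total, value in zip_longest(totals, row, fillvalue=0)]


rows = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
totals = []
for count, row in enumerate(rows, 1):
    totals = add_row(totals, row)
averages = [total / count for total in totals]
print(totals)    # [4.0, 6.0, 8.0]
print(averages)  # [2.0, 3.0, 4.0]
```

One consequence of the padding is that a row with fewer fields than the others is silently treated as if its missing fields were 0, so a length check may be worth adding if the input is not trusted.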

User · Answer

Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about

1959167 [MiB]

On Jython 2.5 it crashes earlier:

239000 [MiB]

Probably I can configure Jython to use more memory (it uses limits from the JVM).

Test app:

import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))

In your app you read the whole file at once. For such big files you should read them line by line.
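
One part of that environment is whether the interpreter itself is a 32-bit or 64-bit build, since a 32-bit process cannot address more than 4 GiB no matter how much RAM is installed. The check below is a small sketch using only standard-library calls; the variable names are illustrative.

```python
import struct
import sys

# Size of a C pointer in this interpreter build, in bits (typically 32 or 64).
pointer_bits = struct.calcsize("P") * 8
# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64_bit = sys.maxsize > 2**32

print("pointer width:", pointer_bits, "bits")
print("64-bit build:", is_64_bit)
```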
