Incremental Processing of Data [Python]

One of the major problems in big data processing is memory management. Storing an entire file in a single variable for analysis consumes a great deal of memory, making the code less efficient and the overall execution much slower.

Incremental processing deals with this: each row of data is released from memory before the next row is read. This approach is far more memory efficient and, in theory, unlimited in how much data it can handle.
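The difference is easy to see with plain file reading: readlines() pulls every line into one list, while iterating over the file object streams one line at a time. Here is a minimal sketch (process is a hypothetical stand-in for whatever per-line calculation you need):

with open('file_name.txt') as file:
    all_lines = file.readlines()   # loads the entire file into memory at once

with open('file_name.txt') as file:
    for line in file:              # streams lazily: one line in memory at a time
        process(line)              # hypothetical per-line calculation

The same streaming pattern works for a file organised into blocks, where each block starts with a header row: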

blocks = {}                            # maps each block's ID to its list of data rows
with open('file_name.txt') as file:
    for line in file:
        fields = line.rstrip().split(",")
        if len(fields) == 4:           # header of each block
            ID = fields[0]             # first column of header row (header name)
            blocks[ID] = []
        else:
            blocks[ID].append(fields)  # data rows inside the current header's block

In the above code, I store blocks of data in a dictionary named blocks, and each calculation is then done on one block at a time, releasing the previous block of data from memory.
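If the goal is to hold only one block in memory at a time, a generator is a natural fit: it yields each finished block as soon as the next header appears, so the caller never holds more than one block at once. This is a minimal sketch under the same assumptions as above (4-field header rows introducing each block; read_blocks is a hypothetical helper name):

def read_blocks(path):
    # Yield (ID, rows) pairs one block at a time.
    ID, rows = None, []
    with open(path) as file:
        for line in file:
            fields = line.rstrip().split(",")
            if len(fields) == 4:        # a header row starts a new block
                if ID is not None:
                    yield ID, rows      # hand the finished block to the caller
                ID, rows = fields[0], []
            else:
                rows.append(fields)     # data row inside the current block
    if ID is not None:
        yield ID, rows                  # don't forget the final block

for ID, rows in read_blocks('file_name.txt'):
    print(ID, len(rows))                # replace with the real per-block calculation

Because each block is discarded once the loop moves on, memory usage stays bounded by the largest single block rather than by the whole file.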