File Buffering in Python
In the built-in open() function there is an optional argument called buffering, which specifies the file's desired buffer size:
- 1: line buffered (only usable in text mode)
- 0: unbuffered (only allowed in binary mode)
- any other positive value: a buffer of that size in bytes
- negative value: use the system default, which is usually line buffered for tty (teletypewriter) devices and fully buffered for other files. This is the default value of the buffering argument.
We'll look at buffers in detail after this snippet.
>>> fh1 = open('coding.py', 'r', 1)
>>> fh1.line_buffering
True
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

# OUTPUT
b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', 0)
Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    fh1 = open('coding.py', 'r', 0)
ValueError: can't have unbuffered text I/O
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', 5)
>>> fh1.line_buffering
False
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', -1)
>>> fh1.line_buffering
False
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()
A buffer stores a chunk of data from the operating system's file stream until it is consumed, at which point more data is brought into the buffer. The reason it is good practice to use buffers is that interacting with the raw stream can have high latency: considerable time is taken to fetch data from it and also to write to it. Let's take an example.
Let's say you want to read 100 characters from a file every 2 minutes over a network. Instead of trying to read from the raw file stream every 2 minutes, it is better to load a portion of the file into a buffer in memory, and then consume it when the time is right. Then the next portion of the file is loaded into the buffer, and so on.
Note that a sensible buffer size depends on the rate at which the data is consumed. In the example above, 100 characters are required every 2 minutes. So anything less than 100 will increase latency, since the consumer still has to wait on the raw stream; 100 itself will do just fine; and anything more than 100 lets several requests be served from a single raw read, at the cost of a little extra memory.
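The idea above can be sketched in a few lines. The class name, chunk size, and data source here are illustrative assumptions, not part of any library: one "slow" read of a larger chunk refills the in-memory buffer, and the consumer's requests are served from that buffer.

```python
import io

class BufferedConsumer:
    """A minimal sketch: serve small requests from a larger in-memory buffer."""

    def __init__(self, fileobj, chunk_size=400):
        self.fileobj = fileobj
        self.chunk_size = chunk_size  # one slow read refills several requests
        self.buffer = ''

    def take(self, n):
        """Return up to n characters, refilling the buffer as needed."""
        while len(self.buffer) < n:
            chunk = self.fileobj.read(self.chunk_size)
            if not chunk:              # end of stream: serve whatever is left
                break
            self.buffer += chunk
        result, self.buffer = self.buffer[:n], self.buffer[n:]
        return result

# io.StringIO stands in for the slow raw stream.
consumer = BufferedConsumer(io.StringIO('x' * 250), chunk_size=100)
print(len(consumer.take(100)))  # 100
print(len(consumer.take(100)))  # 100
print(len(consumer.take(100)))  # 50 -- only 50 characters remained
```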
Another reason for using a buffer is that it lets you read large files (or files of uncertain size) one chunk at a time. While dealing with files, there might be occasions when you are not sure of the size of the file you are trying to read. In an extreme scenario, if the file were larger than your computer's memory, reading it in one go would exhaust that memory. Therefore, it is always regarded as a safe option to proactively define the maximum size that can be read at once. You can use several buffer instalments to read and manipulate the entire file, as demonstrated below:
# The following code snippet reads a file containing 196 bytes, with a
# buffer of 20 bytes, and writes to a file, 20 bytes at a time.
# A practical example will have large-scale values of buffer and file size.
buffersize = 20  # maximum number of bytes to be read in one instance

inputFile = open('fileToBeReadFrom.txt', 'r')
outputFile = open('fileToBeWrittenInto.txt', 'a')  # append mode; creates the file if it doesn't exist

buffer = inputFile.read(buffersize)  # buffer contains data up to the specified cursor position

# Writing the contents of the buffer to another file, 20 bytes at a time
counter = 0  # a counter variable for us to see the instalments of 20 bytes
while len(buffer):
    counter = counter + 1
    outputFile.write(buffer)
    print(str(counter) + " ")
    buffer = inputFile.read(buffersize)  # next set of 20 bytes from the input file

outputFile.close()
inputFile.close()
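The same copy loop can be restated with context managers, so both files are closed even if an exception occurs; the walrus operator (Python 3.8+) folds the read into the loop condition. This is a sketch using the same illustrative file names as above, except that it creates its own input file and opens the output in 'w' mode so repeated runs don't append.

```python
buffersize = 20  # read at most 20 bytes per instalment

# Create a small input file so the sketch is self-contained.
with open('fileToBeReadFrom.txt', 'w') as f:
    f.write('some sample data to copy, twenty bytes at a time')

# Copy the file chunk by chunk; 'with' closes both handles automatically.
with open('fileToBeReadFrom.txt', 'r') as inputFile, \
     open('fileToBeWrittenInto.txt', 'w') as outputFile:
    while buffer := inputFile.read(buffersize):
        outputFile.write(buffer)

print(open('fileToBeWrittenInto.txt').read())
```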
In actuality, there are two types of buffers:
- Internal Buffers
- Operating System Buffers
The internal buffers are created by the language or runtime library you are using, for the purpose of speeding things up by preventing a system call for every write operation. So, when you write to a file, you write into its buffer, and whenever the buffer is brimming, so to speak, the data is written to the actual file using system calls. That said, due to the operating system buffers, this does not necessarily mean that the data is written to the file itself; it may only mean that the data has been copied from the internal buffers into the operating system buffers.
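You can observe the internal buffer at work with a small experiment (a sketch, assuming a writable temp directory): a short write stays in Python's internal buffer, so a second handle on the same file sees nothing until flush() is called.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

writer = open(path, 'w')
writer.write('buffered!')        # lands in the internal buffer only

print(repr(open(path).read()))   # '' -- nothing has reached the OS yet

writer.flush()                   # push the internal buffer to the OS
print(repr(open(path).read()))   # 'buffered!'

writer.close()
```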
So, when you perform a write operation, the data may still be sitting in a buffer until the file is closed, and if your machine loses power before then, the data never reaches the file. To help you with this, there are two functions in Python: fileHandler.flush() and os.fsync(fileHandler), where os is an imported module for performing operating system tasks.
flush() writes data from the internal buffer to the operating system buffer without closing the file. What this means is that if another process is reading the same file, it will be able to see the data you just flushed. However, this does not necessarily mean that the data has been written to the disk; it may or may not have been. To ensure that it has, call os.fsync(fileHandler), which forces the data from the operating system buffers onto the disk.
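The flush-then-fsync sequence can be wrapped in a small helper. The function name and file name here are ours, chosen for illustration; only flush() and os.fsync() come from the text.

```python
import os

def durable_write(path, data):
    """Write data and push it all the way to disk before returning."""
    with open(path, 'w') as f:
        f.write(data)
        f.flush()               # internal buffer -> operating system buffer
        os.fsync(f.fileno())    # operating system buffer -> disk

durable_write('important.txt', 'must survive a power failure\n')
```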
If you are uncertain whether what you write is actually reaching the file when you think it is, you can use these functions as shown below.
>>> fh = open('fileToBeWrittenInto.txt', 'w+')
>>> fh.write('Output line # 1')
15
>>> fh.write('\n')
1
# Open the file in a text editor: you will not see any data in it yet.
>>> fh.flush()
# Re-open the file in a text editor, you will now see the contents below.
# This data can now be read by any other process attempting to read it.
# Contents of fileToBeWrittenInto.txt
Output line # 1
>>> fh.write('Output line # 2')
15
# Open the file in a text editor; the contents are unchanged:
# Contents of fileToBeWrittenInto.txt
Output line # 1
>>> fh.flush()
# Open the file in a text editor, you will now see:
# Contents of fileToBeWrittenInto.txt
Output line # 1
Output line # 2
>>> fh.close()
# Using the os.fsync(fileHandler) function.
>>> fh = open('fileToBeWrittenInto2.txt', 'w+')
>>> fh.write('Output line # 1')
15
>>> fh.write('\n')
1
# Open the file in a text editor: it will be empty.
>>> fh.flush()
# Open the file in a text editor, it will have the following contents:
# Contents of fileToBeWrittenInto2.txt
Output line # 1
>>> fh.write('Output line # 2')
15
# Check the file contents; they are unchanged, as flush() hasn't been called again yet.
# Contents of fileToBeWrittenInto2.txt
Output line # 1

# Now let's use the fsync() function.
>>> import os
>>> help(os.fsync)
Help on built-in function fsync in module nt:

fsync(...)
    fsync(fildes)

    force write of file with filedescriptor to disk.

>>> os.fsync(fh)
# Check the file contents; they are unchanged. As we know, fsync() copies data
# from the operating system buffers to the file (i.e. to the disk). In this
# case there is no pending data in the operating system buffers, because
# flush() has not been called since the last write. Once flush() is called,
# the data will be in the operating system buffers, which may or may not have
# been copied to the file. If it has not, fsync() will force the write to the
# file when it is called.
# Contents of fileToBeWrittenInto2.txt
Output line # 1
>>> fh.flush()
# Check the file contents:
# Contents of fileToBeWrittenInto2.txt
Output line # 1
Output line # 2
>>> fh.close()
# In this interactive example, the data appears in the file as soon as flush()
# is called, so we don't really feel the need for fsync() here. But in a script
# containing hundreds of lines, it is not viable to check the contents of the
# file after each statement, so it is safe to call os.fsync(fileHandler) to err
# on the side of caution.
Default Buffer Size
You can check the default buffer size of your platform by importing the io module and checking its DEFAULT_BUFFER_SIZE attribute. The returned size is in bytes.
>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192
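One way to put the platform default to use is as a chunk size when reading a file of unknown size. This is a sketch; the file is generated on the fly (in a temp directory) purely so the example is self-contained.

```python
import io
import os
import tempfile

# Create a file larger than one default buffer, just for demonstration.
path = os.path.join(tempfile.mkdtemp(), 'big.bin')
with open(path, 'wb') as f:
    f.write(b'x' * 20000)

# Read it back in chunks of the platform's default buffer size.
total = 0
with open(path, 'rb') as f:
    while chunk := f.read(io.DEFAULT_BUFFER_SIZE):
        total += len(chunk)

print(total)  # 20000
```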