Python 101: File Manipulation


 

Hi, and welcome back to Python 101. In this chapter, we will look at how to manipulate files in Python.

Manipulate Files in Python

Python, like any other programming language, allows you to manipulate files on your computer. There are three basic actions you can perform on a file: read, write and append. Based on these three actions, Python provides the 12 file access modes covered in this chapter, which are listed a little further down.

Before we begin, note that the terms method and function are closely related but distinct. Function is a procedural term, whereas method is an object-oriented term. That is, a function belonging to an object is called a method.
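To make the distinction concrete, here is a tiny sketch. The variable names are made up for illustration only.

```python
text = 'Python'

# len() is a function: it stands on its own and receives the object as an argument.
length = len(text)

# upper() is a method: it belongs to the string object and is called through it.
shouted = text.upper()

print(length, shouted)
```

The same pattern shows up throughout this chapter: open() is a function, while read() and write() are methods of the file object it returns.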

Programs to whet your appetite

#1 Emulate 'tail' command of Unix in Python

# In Unix, the 'tail -n fileName' command displays the last n lines of the file 'fileName'.
# Write a program in Python to emulate the same functionality.
# Ask the user for the filename whose tail is to be displayed, and the number of lines comprising the tail.
# Output the last whatever-the-input lines.

fileName = input('Please enter the name of the file you wish to view the tail of: ')
numberOfLines = int(input('Please enter the number of lines in the tail that you wish to view: '))

fileHandler = open( fileName, 'r' )

# list to store the tail
lines = []

# maintaining a n-line window using pop() method of lists. pop(0) removes the first element in the list i.e. least recent of the n lines
for line in fileHandler.readlines():
    lines.append(line)
    if len(lines) > numberOfLines:
        lines.pop(0)

fileHandler.close()

# Some text to replicate Unix command syntax.
print('\nExecuting command: tail -' + str(numberOfLines) + " " + fileName + "\n")

# displaying the tail
for line in lines:
    print(line, end="")

# If you are trying this on an online interpreter such as repl.it, you will have to create the file first using the open() function, and fill it with some sample lines, such as below.
# fileHandler2 = open('log_file.log', 'w')
# fileHandler2.write('Checking environment variables...\n')
# fileHandler2.write('Connecting to database...\n')
# fileHandler2.write('Connected to database.\n')
# fileHandler2.write('Performing some pseudo-work...\n')
# fileHandler2.write('Pseudo-work done.\n')
# fileHandler2.write('Script executed successfully.\n')
# fileHandler2.write('Check errors.log for errors.\n')
# fileHandler2.close()
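
As an aside, the n-line window in the program above can also be kept with collections.deque, which discards the oldest item by itself once its maxlen is reached. The following is a minimal sketch rather than a replacement for the program; the function name tail and the sample file are assumptions made for illustration.

```python
import os
import tempfile
from collections import deque


def tail(file_name, number_of_lines):
    """Return the last number_of_lines lines of file_name as a list."""
    with open(file_name, 'r') as file_handler:
        # Iterating over the file yields one line at a time; the deque
        # keeps only the most recent number_of_lines of them.
        return list(deque(file_handler, maxlen=number_of_lines))


# Build a small sample file in a temporary directory so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), 'log_file.log')
with open(sample_path, 'w') as sample:
    sample.write('Connecting to database...\n')
    sample.write('Connected to database.\n')
    sample.write('Script executed successfully.\n')

last_two = tail(sample_path, 2)
for line in last_two:
    print(line, end='')
```

Because the file is read one line at a time, this version never holds more than n lines in memory, which matters for long log files.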

#2 Removing comments from a file

# Removing comments from a file
# Write a program to remove all the comments from a file. Create a new file and leave the original untampered.
# Such an activity is usually performed while creating a minified version of a script, to reduce file size.

# ask for file names
inputFileName = input('Please enter the name of the file you wish to remove comments from: ')
outputFileName = input('Please enter the name of the new file: ')

# open both files
inputFileHandler = open(inputFileName, 'r')
outputFileHandler = open(outputFileName, 'w')

# read the file line by line; write to new file as it is if no comment in line; modify and then write to new file if comment present
# find() returns the position of a character/substring in a string; returns -1 if not found
# line[0: positionOfHash] slices the line till the character just behind the # symbol. In the process, the new line character after the comment is also deleted, so we manually add it here
for line in inputFileHandler.readlines():
    positionOfHash = line.find('#')

    if positionOfHash != -1:
        line = line[0 : positionOfHash]
        line = line + "\n"

    outputFileHandler.write(line)

inputFileHandler.close()
outputFileHandler.close()

# printing a success message
print(outputFileName,"has been created without comments.")

######## END OF PROGRAM ########

# I should admit that this is not the best way to remove bits and pieces from a file. For starters, it does not handle the case in which the specified file does not exist; I am sure you will be able to do that once you go through Exception Handling in the next chapter. Another case that the program does not handle is that the pound/hash symbol may be part of a string such as 'Et#han', which will be troublesome if treated the same way as a comment. You can overcome this by using regular expressions. I implore you to research a bit on that.

# If you are running on an online environment such as repl.it, and don't have a sample file at hand, uncomment the following to create a sample file.

##fileHandler2 = open('dev_script.py', 'w+')
##fileHandler2.write('myInt = 4						# an integer variable\n')
##fileHandler2.write('myString = \"Ethan\"					# a string variable\n\n')
##fileHandler2.write('myList = [1, 2, 3, 4, 5, 6]				# a list variable\n')
##fileHandler2.write('mySet = set(myList)			# creating a set from a list\n')
##fileHandler2.write('myTuple = 2, 3				# creating a tuple without parentheses\n')
##fileHandler2.write('myDict = {1: \'one\', 2: \'two\'}		# creating a dictionary variable')
##fileHandler2.seek(0)
##print(fileHandler2.read())
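
The note above suggests regular expressions for the 'Et#han' case. Another option, sketched here under simplifying assumptions, is to scan each line character by character and remember whether we are currently inside a quoted string. The helper name strip_comment is hypothetical, and the sketch does not handle triple-quoted strings that span several lines.

```python
def strip_comment(line):
    """Return line without its trailing # comment, ignoring # inside quotes."""
    quote = None                      # the quote character we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == '\\':
                i += 1                # skip the character that follows a backslash
            elif ch == quote:
                quote = None          # the string ends here
        elif ch in ('"', "'"):
            quote = ch                # a string starts here
        elif ch == '#':
            return line[:i].rstrip()  # a real comment: drop it and the spaces before it
        i += 1
    return line.rstrip('\n')          # no comment found


print(strip_comment("myString = 'Et#han'    # a string variable\n"))
print(strip_comment("myInt = 4    # an integer variable\n"))
```

Unlike the program above, this helper returns the line without its newline character, so the caller decides whether to add one back before writing.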

Let's see how to perform elementary operations of reading from a file, and writing to a file.

Reading from a file

The built-in open() function is used for opening files. It returns a file object, which has self-explanatory methods such as read() and write(). When we are done manipulating the file, its close() method is called to close it.

# fileToBeReadFrom.txt
Python is an extremely versatile language.
It is not limited to desktop applications or applications on the web.
People who code in Python are often referred to as "Pythonistas" or "Pythoneers".

# readingFromAFile.py
# situated in the same directory as fileToBeReadFrom.txt
fileHandler = open('fileToBeReadFrom.txt') 		# or open('fileToBeReadFrom.txt', 'r')
contents = fileHandler.read()
print(contents)
fileHandler.close()

# OUTPUT in Shell
# Run Menu > Run Module
Python is an extremely versatile language.
It is not limited to desktop applications or applications on the web.
People who code in Python are often referred to as "Pythonistas" or "Pythoneers".

Writing to a file

# writingToAFile.py
fileHandler = open('fileToBeWrittenInto.txt', 'w') # The specified file need not be created prior to this statement
fileHandler.write("Line 1: I am learning how to write to a file using Python.")
fileHandler.write("\n")
fileHandler.write("Line 2: It seems like such an easy thing to do. \nLine 3: Cool!")
fileHandler.close()

# fileToBeWrittenInto.txt (in the same directory) upon executing above, will have the following contents. If the file already existed, then the original contents will be erased and replaced by the following.
Line 1: I am learning how to write to a file using Python.
Line 2: It seems like such an easy thing to do.
Line 3: Cool!
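
Both examples above call close() explicitly. A common alternative is the with statement, which closes the file for you when the block ends. The sketch below is only an illustration: it writes to a temporary directory (an assumption, so that it can run anywhere) and then reads the text back.

```python
import os
import tempfile

# A temporary directory keeps the sketch from touching your own files.
path = os.path.join(tempfile.mkdtemp(), 'fileToBeWrittenInto.txt')

with open(path, 'w') as fileHandler:
    fileHandler.write('Line 1: I am learning how to write to a file using Python.\n')
    fileHandler.write('Line 2: Cool!\n')
# The file is closed here, even if an exception was raised inside the block.

with open(path) as fileHandler:
    contents = fileHandler.read()

print(contents, end='')
print(fileHandler.closed)
```

The file object has __enter__ and __exit__ methods for exactly this purpose, so there is no close() call to forget.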

Different Access Modes

This chapter covers 12 values of the optional mode argument of the open() function, as listed below. The default value for this argument is 'r'. You can specify these modes as open('file', 'r') or open('file', mode='r') where 'r' can be replaced with any of the following modes.

  • r: Opens an existing file for reading only. The file cursor is placed at the beginning of the file. Raises FileNotFoundError if the specified file doesn't exist.
  • r+: Opens an existing file for both reading and writing. The file cursor is placed at the beginning of the file. Raises FileNotFoundError if the specified file doesn't exist.
  • rb: Opens an existing file for reading only in binary format. The file cursor is placed at the beginning of the file. Raises FileNotFoundError if the specified file doesn't exist.
  • rb+: Opens an existing file for both reading and writing in binary format. The file cursor is placed at the beginning of the file. Raises FileNotFoundError if the specified file doesn't exist.
  • w: Opens the file for writing only. Overwrites the contents of the specified file if it already exists. Creates the specified file if it doesn't exist.
  • w+: Opens the file for reading and writing. Overwrites the contents of the specified file if it already exists. Creates the specified file if it doesn't exist.
  • wb: Opens the file for writing only in binary format. Overwrites the contents of the specified file if it already exists. Creates the specified file if it doesn't exist.
  • wb+: Opens the file for writing and reading in binary format. Overwrites the contents of the specified file if it already exists. Creates the specified file if it doesn't exist.
  • a: Opens a file for appending, i.e. writing with the file cursor positioned at the end of the file. Raises io.UnsupportedOperation when an attempt to read from it is made. Creates a new file if it doesn't exist.
  • a+: Opens a file for appending (i.e. writing with the file cursor positioned at the end of the file) as well as reading. The file cursor is positioned at the end of the file. Creates a new file if it doesn't exist.
  • ab: Opens a file for appending (i.e. writing with the file cursor positioned at the end of the file) in binary format. Raises io.UnsupportedOperation when an attempt to read from it is made. Creates a new file if it doesn't exist.
  • ab+: Opens a file for appending (i.e. writing with the file cursor positioned at the end of the file) as well as reading in binary format. The file cursor is positioned at the end of the file. Creates a new file if it doesn't exist.
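
A few of the behaviours listed above can be checked with a short script. This is a rough sketch that only exercises the 'r', 'w' and 'a' modes; the file name modes_demo.txt and the temporary directory are assumptions chosen so that nothing of yours gets overwritten.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'r' refuses to open a file that does not exist.
try:
    open(path, 'r')
    missing_file_error = False
except FileNotFoundError:
    missing_file_error = True

# 'w' creates the file.
with open(path, 'w') as fh:
    fh.write('first\n')

# 'a' keeps the existing contents and writes at the end, but cannot read.
with open(path, 'a') as fh:
    fh.write('second\n')
    try:
        fh.read()
        read_in_append_error = False
    except io.UnsupportedOperation:
        read_in_append_error = True

with open(path, 'r') as fh:
    after_append = fh.read()

# 'w' on an existing file erases what was there before.
with open(path, 'w') as fh:
    fh.write('third\n')

with open(path, 'r') as fh:
    after_overwrite = fh.read()

print(missing_file_error, read_in_append_error)
print(repr(after_append), repr(after_overwrite))
```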

Methods and Attributes of a file object

The built-in open() function returns a file object. Just like any object in Python, the file object has certain methods and attributes. Let's do a quick dir() on a file handler.

>>> fH = open('fileToBeReadFrom.txt', 'r')
>>> dir(fH)
['_CHUNK_SIZE', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_finalizing', 'buffer', 'close', 'closed', 'detach', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'line_buffering', 'mode', 'name', 'newlines', 'read', 'readable', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']

# Note that the list of methods and attributes remains the same even when you change the access mode.
>>> fH1 = open('fileToBeReadFrom.txt', 'r')
>>> fH2 = open('fileToBeReadFrom.txt', 'w')
>>> fH3 = open('fileToBeReadFrom.txt', 'r+')
>>> fH4 = open('fileToBeReadFrom.txt', 'w+')
>>> dir(fH1) == dir(fH2) == dir(fH3) == dir(fH4)
True

## Back to the listing
>>> dir(fH)
['_CHUNK_SIZE', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_finalizing', 'buffer', 'close', 'closed', 'detach', 'encoding', 'errors', 'fileno', 'flush', 'isatty', 'line_buffering', 'mode', 'name', 'newlines', 'read', 'readable', 'readline', 'readlines', 'seek', 'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']
## Putting it in a readable form
>>> print(       "\n".join(  [element for element in dir(fH) if element.startswith('__')] )       )
__class__		__del__
__delattr__		__dict__
__dir__			__doc__
__enter__		__eq__
__exit__		__format__
__ge__			__getattribute__
__getstate__		__gt__
__hash__		__init__
__iter__		__le__
__lt__			__ne__
__new__		__next__
__reduce__		__reduce_ex__
__repr__		__setattr__
__sizeof__		__str__
__subclasshook__
>>> print(       "\n".join(  [element for element in dir(fH) if not element.startswith('__')] )       )
_CHUNK_SIZE		_checkClosed
_checkReadable	_checkSeekable
_checkWritable		_finalizing
buffer			close
closed			detach
encoding		errors
fileno 			flush
isatty 			line_buffering
mode			name
newlines 		read
readable		readline
readlines		seek
seekable		tell
truncate		writable
write			writelines

We will look at a few chosen ones; I'll leave the rest to you, as per your interest.

# close() method
# closes an opened file.

# closed attribute
# tells if the file has been closed or not
>>> fh = open('fileOne.txt', 'w')
>>> fh.closed
False
>>> fh.close()
>>> fh.closed
True

# flush() method
# The flush() method writes data from the internal buffer (the program buffer) to the operating system buffer. This means that if another process is reading from the same file, it will be able to read the data you just flushed. However, this does not necessarily mean that the data has been written to disk; it may or may not have been. To ensure this, call os.fsync(fileHandler), which copies the data from the operating system buffers to the disk.
# View File Buffering section below for further explanation and example code snippet.

# mode attribute
# gives the access mode with which the file has been opened.
>>> fh = open('fileToBeWrittenInto.txt', 'w+')
>>> fh.mode
'w+'

# name attribute
# gives the name of the file.
>>> fh = open('fileToBeWrittenInto.txt', 'w+')
>>> fh.name
'fileToBeWrittenInto.txt'

# read(size) method
#  reads from the file as many characters (or bytes, in binary mode) as specified in the size argument. If no argument is provided, reads till end-of-file.
# contents of coding.py, 1st line of which contains 23 characters in all.
# -*- coding: utf-8 -*-
variableOne = 'Ethan'
print(variableOne)

>>> fh = open('coding.py', 'r')
>>> fh.read(10)
'# -*- codi'
>>> fh.read()
"ng: utf-8 -*-\nvariableOne = 'Ethan'\nprint(variableOne)\n"
>>> fh.close()

# readline([size]) method
# reads a file by a line each time it is called. The trailing new line character '\n' is kept in the string.
# when the optional 'size' argument is present, it reads the line up to the size provided
# when the optional 'size' argument is not present, it reads the entire line.
# contents of coding.py, 1st line of which contains 23 characters in all.
# -*- coding: utf-8 -*-
variableOne = 'Ethan'
print(variableOne)

# when the optional 'size' argument is not present
>>> fh = open('coding.py', 'r')
>>> fh.readline()
'# -*- coding: utf-8 -*-\n'
>>> fh.readline()
"variableOne = 'Ethan'\n"
>>> fh.readline()
'print(variableOne)\n'
>>> fh.readline()
''

# when the optional 'size' argument is present
>>> fh = open('coding.py', 'r')
>>> fh.readline(15)
'# -*- coding: u'
>>> fh.readline(8)
'tf-8 -*-'
>>> fh.readline()
'\n'
>>> fh.readline()
"variableOne = 'Ethan'\n"
>>> fh.close()

# readlines([size]) method
# when the optional 'size' argument is not present, it returns a list of lines in the file.
# when the optional 'size' argument is present, it is treated as a hint: whole lines are read until the total number of characters read reaches 'size', and the line in which that happens is returned in full. The file cursor is left just after the last line returned, so the next time readline() or readlines() is called, it starts from the next line.

# contents of coding.py, 1st line of which contains 23 characters in all.
# -*- coding: utf-8 -*-
variableOne = 'Ethan'
print(variableOne)

# when the optional 'size' argument is not present
>>> fh = open('coding.py', 'r')
>>> fh.readlines()
['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n", 'print(variableOne)\n']
>>> fh.close()

# when the optional 'size' argument is present
>>> fh = open('coding.py', 'r')
>>> fh.readlines(27)			# reads whole lines until at least 27 characters have been read, i.e. the first two lines here; the cursor is then at the beginning of the next line, i.e. the 3rd line in this example.
['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
>>> fh.readlines(20)			# fetches lines until the line containing the 20th character
['print(variableOne)\n']
>>> fh.readlines(20)
[]
>>> fh.close()

>>> fh = open('coding.py', 'r')
>>> fh.readlines(27)
['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
>>> fh.readlines()					# fetches rest of the lines
['print(variableOne)\n']

# using readlines() and readline() in tandem
>>> fh = open('coding.py', 'r')
>>> fh.readlines(27)
['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
>>> fh.readline()
'print(variableOne)\n'
>>> fh.close()
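
readlines() without a size argument loads every line into a list at once. When you only need one line at a time, you can also loop over the file object itself, since it supports iteration (note __iter__ and __next__ in the dir() listing above). A minimal sketch, assuming a copy of coding.py created in a temporary directory:

```python
import os
import tempfile

# Recreate the sample file used in this section.
path = os.path.join(tempfile.mkdtemp(), 'coding.py')
with open(path, 'w') as fh:
    fh.write("# -*- coding: utf-8 -*-\nvariableOne = 'Ethan'\nprint(variableOne)\n")

line_lengths = []
with open(path, 'r') as fh:
    for line in fh:                     # one line per iteration, '\n' included
        line_lengths.append(len(line))

print(line_lengths)
```

The lengths also explain the readlines(27) result above: the first line has 24 characters including the newline, which is short of 27, so the second line is read as well.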

# readable() method
# tells us if the file is readable or not.
>>> fh = open('coding.py', 'r')
>>> fh.readable()
True
>>> fh.close()

>>> fh = open('coding.py', 'w')
>>> fh.readable()
False
>>> fh.close()

# seek() method
# The seek(offset [,from_where]) changes the file cursor position by as many characters as specified in the first argument, from the position specified in the second argument. The second argument has only 3 acceptable values: 0(default), 1, 2. 0 denotes the beginning of the file, 1 denotes the current position of the cursor, and 2 denotes the end of file.
# View the 'Ways to read a file' section for elaborate details and examples.

# tell() method
# The tell() method tells us the current position of the cursor in the file.
# View the 'Ways to read a file' section for elaborate details and examples.

# truncate([size]) method
# When the optional 'size' argument is not present, this method truncates (or deletes) the file contents from the current cursor position to the end of the file. The method returns the new size of the file, i.e. how much data remains in the file after truncating.
# If the optional 'size' argument is provided, the first 'size' bytes of the file are preserved, while anything that follows is deleted.

# So, whether you go to, say, the 9th position in the file using seek(9) and then call truncate(), or you call truncate(9), the result is the same: the contents up to the 9th position remain while the rest are deleted.

>>> fh = open('toBeWrittenInto4.txt', 'w+')
>>> fh.write('123456789')
9
>>> fh.flush()
>>> fh.tell()
9
>>> fh.truncate()		# does nothing because the file cursor is already at the end of the file
9
>>> fh.seek(0)
0
>>> fh.read()
'123456789'
>>> fh.seek(0)			# let's truncate again with cursor at the beginning
0
>>> fh.truncate()
0
>>> fh.seek(0)
0
>>> fh.read()
''
>>> fh.close()

# when the optional size argument is provided
>>> fh = open('toBeWrittenInto4.txt', 'w+')
>>> fh.write('123456789')
9
>>> fh.flush()
>>> fh.truncate(7)		# deletes everything from 8th character till end-of-file
7
>>> fh.seek(0)
0
>>> fh.read()
'1234567'
>>> fh.close()

# seek(size) then truncate() is as good as truncate(size)
>>> fh = open('toBeWrittenInto4.txt', 'w+')
>>> fh.write('123456789')
9
>>> fh.flush()
>>> fh.seek(3)
3
>>> fh.truncate()
3
>>> fh.seek(0)
0
>>> fh.read()
'123'
>>> fh.close()
>>>
>>> fh = open('toBeWrittenInto4.txt', 'w+')
>>> fh.write('123456789')
9
>>> fh.flush()
>>> fh.truncate(3)
3
>>> fh.seek(0)
0
>>> fh.read()
'123'
>>> fh.close()

# writable() method
# tells if the file can be written to or not.
>>> fh = open('toBeWrittenInto.txt', 'w')
>>> fh.writable
<built-in method writable of _io.TextIOWrapper object at 0x03091DB0>
>>> fh.writable()
True
>>> fh.close()

>>> fh = open('toBeWrittenInto.txt', 'r')
>>> fh.writable()
False
>>> fh.close()

# write(string) method
# writes the string to the file
>>> fh = open('toBeWrittenInto.txt', 'w')
>>> fh.write('Heidi.')
6
>>> fh.close()
>>> # contents of toBeWrittenInto.txt
Heidi.

# writelines() method
# writes a sequence of strings to the file. This sequence is typically a list of strings (such as the one produced by readlines() method), but can be any iterable object containing strings such as a tuple.
>>> fh = open('toBeWrittenInto2.txt', 'w')
>>> fh.writelines(('Line 1 \n', 'Line 2 \n', 'Line 3 \n'))			# a tuple of strings
>>> fh.close()

>>> fh = open('toBeWrittenInto2.txt', 'w')
>>> fh.writelines(['Line 1 \n', 'Line 2 \n', 'Line 3 \n'])		# a list of strings.
>>> fh.close()
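
One thing worth seeing in action: despite its name, writelines() does not add line endings, which is why every string in the examples above ends with '\n'. Here is a small sketch; the list of names and the file location are assumptions for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'toBeWrittenInto2.txt')
names = ['Heidi', 'Ethan', 'Python']

# Without newline characters the strings simply run together.
with open(path, 'w') as fh:
    fh.writelines(names)
with open(path, 'r') as fh:
    run_together = fh.read()

# Adding '\n' to each string gives one name per line.
with open(path, 'w') as fh:
    fh.writelines(name + '\n' for name in names)
with open(path, 'r') as fh:
    one_per_line = fh.read()

print(repr(run_together))
print(repr(one_per_line))
```

The second call passes a generator expression, which works because writelines() accepts any iterable of strings.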

You can check out the documentation for File Objects here.

File Buffering

As a disclaimer, I would like to state that this section might be a little too difficult to comprehend for absolute beginners. I have included it here for the sake of completeness. You can skip this section altogether if you like.
In the built-in open() function, there is an optional argument called buffering. This argument specifies the file's desired buffer size, as follows:

  • 1: line buffered
  • 0: unbuffered
  • any other positive value: a buffer of that size in bytes
  • negative value: use the system default, which is usually line buffered for tty (teletypewriter) devices and fully buffered for other files. This is the default value of the buffering argument.

We'll look at buffers in detail after this snippet.

>>> fh1 = open('coding.py', 'r', 1)
>>> fh1.line_buffering
True
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

# OUTPUT
b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', 0)
Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    fh1 = open('coding.py', 'r', 0)
ValueError: can't have unbuffered text I/O
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', 5)
>>> fh1.line_buffering
False
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()

>>> fh1 = open('coding.py', 'r', -1)
>>> fh1.line_buffering
False
>>> contents = fh1.buffer
>>> for line in contents:
        print(line)

b'# -*- coding: utf-8 -*-\r\n'
b"variableOne = 'Ethan'\r\n"
b'print(variableOne)\r\n'
>>> fh1.close()
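
The session above can be condensed into a script that records what each buffering value gives you. This is only a sketch: the sample file is created in a temporary directory (an assumption), and the type names printed are those of the CPython io module.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'coding.py')
with open(path, 'w') as fh:
    fh.write("variableOne = 'Ethan'\n")

# buffering=1 asks for line buffering (text mode only).
with open(path, 'r', buffering=1) as fh:
    line_buffered = fh.line_buffering

# buffering=0 is accepted in binary mode and yields a raw, unbuffered file object.
with open(path, 'rb', buffering=0) as fh:
    unbuffered_type = type(fh).__name__

# The default (-1) wraps the raw file in a buffered reader.
with open(path, 'rb') as fh:
    default_type = type(fh).__name__

# Text mode refuses buffering=0, as seen in the traceback above.
try:
    open(path, 'r', buffering=0)
    text_unbuffered_error = False
except ValueError:
    text_unbuffered_error = True

print(line_buffered, unbuffered_type, default_type, text_unbuffered_error)
```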

A buffer stores a chunk of data from the Operating System's file stream until it is consumed, at which point more data is brought into the buffer. The reason it is good practice to use buffers is that interacting with the raw stream might have high latency, i.e. considerable time is taken to fetch data from it and also to write to it. Let's take an example.

Let's say you want to read 100 characters from a file every 2 minutes over a network. Instead of trying to read from the raw file stream every 2 minutes, it is better to load a portion of the file into a buffer in memory, and then consume it when the time is right. Then, next portion of the file will be loaded in the buffer and so on.

Note that the size of the buffer will depend on the rate at which the data is being consumed. For the example above, 100 characters are required after 2 minutes. So, anything less than 100 will result in increase in latency, 100 itself will do just fine, and anything more than a hundred will be swell.

Another reason for using a buffer is that it can be used to read large files (or files of uncertain size) one chunk at a time. While dealing with files, there might be occasions when you are not sure of the size of the file that you are trying to read. Say, in an extremely unlikely scenario, the file were larger than the computer's memory; reading it all at once would cause a problem. Therefore, it is regarded as a safe option to proactively define the maximum size that can be read in one go. You can use several buffer instalments to read and manipulate the entire file, as demonstrated below:

# The following code snippet reads a file containing 196 bytes, with a buffer of 20 bytes, and writes to a file, 20 bytes at a time.
# A practical example will have large-scale values of buffer and file size.
buffersize = 20			                    # maximum number of bytes to be read in one instance
inputFile = open('fileToBeReadFrom.txt', 'r')
outputFile = open('fileToBeWrittenInto.txt', 'a')   # opening a file in append mode; creates a file if it doesn't exist
buffer = inputFile.read(buffersize)		    # buffer contains data till the specified cursor position

# Writing the contents of the buffer to another file, 20 bytes at a time
counter = 0		                            # a counter variable for us to see the instalments of 20 bytes
while len(buffer):
    counter = counter + 1
    outputFile.write(buffer)
    print( str(counter) + " ")
    buffer = inputFile.read(buffersize)		    # next set of 20 bytes from the input file

outputFile.close()
inputFile.close()
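
The same loop can be wrapped in a function so that it is easy to reuse and check. The sketch below is one possible arrangement, not the only one: the function name copy_in_chunks is made up, the target is opened in 'w' mode rather than 'a' so that repeated runs give the same result, and the 196-character sample file is generated in a temporary directory.

```python
import os
import tempfile


def copy_in_chunks(source_path, target_path, chunk_size=20):
    """Copy a text file chunk_size characters at a time; return the number of chunks."""
    chunks = 0
    with open(source_path, 'r') as source, open(target_path, 'w') as target:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:              # an empty string means end-of-file
                break
            target.write(chunk)
            chunks += 1
    return chunks


folder = tempfile.mkdtemp()
source_path = os.path.join(folder, 'fileToBeReadFrom.txt')
target_path = os.path.join(folder, 'fileToBeWrittenInto.txt')
with open(source_path, 'w') as fh:
    fh.write('x' * 196)                # 196 characters, as in the example above

number_of_chunks = copy_in_chunks(source_path, target_path)
print(number_of_chunks)
```

With 196 characters and a chunk size of 20, the loop runs nine full instalments plus a final one of 16 characters.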

In actuality, there are two types of buffers:

  • Internal Buffers
  • Operating System Buffers

The internal buffers are created by the language or runtime library that you are using, for the purpose of speeding things up by avoiding a system call for every write operation. So, when you write to a file, you write into its buffer, and whenever the buffer is brimming, so to speak, the data is written to the actual file using system calls. That said, due to the operating system buffers, this does not necessarily mean that the data is written to the file itself. It may mean that the data has been copied from the internal buffers into the Operating System buffers.

So, when you perform a write operation, the data may still be only in the buffer until the file is flushed or closed, and if your machine loses power before that, the data is not in the file. To help you with this, there are two functions in Python: fileHandler.flush() and os.fsync(fileHandler), where os is an imported module for performing operating system tasks.

The flush() method writes data from the internal buffer to the operating system buffer without having to close the file. This means that if another process is reading from the same file, it will be able to read the data you just flushed. However, this does not necessarily mean that the data has been written to disk; it may or may not have been. To ensure this, call os.fsync(fileHandler), which copies the data from the operating system buffers to the disk.

As I said leading into the topic, you might never have to use either of these functions. I read this piece on stackoverflow and thought of including it here because it will help you understand buffers better.

If you are uncertain whether what you are trying to write is actually being written when you think it is being written, you can use these function calls in the manner below.

>>> fh = open('fileToBeWrittenInto.txt', 'w+')
>>> fh.write('Output line # 1')
15
>>> fh.write('\n')
1
# open the file in a text editor, you will not see any data in it.
>>> fh.flush()
# re-open the file in a text editor, you will see the contents as below:
# Contents of fileToBeWrittenInto.txt
Output line # 1

# This data can now be read by any other process attempting to read it.
>>> fh.write('Output line # 2')
15
# open the file in a text editor, you will see the contents as below:
# Contents of fileToBeWrittenInto.txt
Output line # 1

>>> fh.flush()
# open the file in a text editor, you will see the contents as below:
# Contents of fileToBeWrittenInto.txt
Output line # 1
Output line # 2
>>> fh.close()
Do not be misled into thinking that write() doesn't actually write data to a file. It does, but the data may sit in the buffer until the buffer fills up, flush() is called, or the file is closed. In other words, the close() method flushes the data to the file before closing it. If you wish to push data to the file without having to close it, you can use the flush() method.
# Using the fsync(fileHandler) function.
>>> fh = open('fileToBeWrittenInto2.txt', 'w+')
>>> fh.write('Output line # 1')
15
>>> fh.write('\n')
1
# open file in text-editor, it will be empty.
>>> fh.flush()
# open file in text-editor, it will have the following contents
# Contents of fileToBeWrittenInto2.txt
Output line # 1

>>> fh.write('Output line # 2')
15
# check file contents, they will be unchanged as flush() hasn't been called yet
# Contents of fileToBeWrittenInto2.txt
Output line # 1

# Now let's use the fsync() function
>>> import os
>>> help(os.fsync)
Help on built-in function fsync in module nt:

fsync(...)
fsync(fildes)

force write of file with filedescriptor to disk.

>>> os.fsync(fh)
# check file contents, they will be unchanged. As we know, fsync() copies data from operating system buffers to file( i.e. in the disk). In this case, there is no pending data in the operating system buffers because flush has not been called. Once flush() is called, the data will be in operating system buffers which may or may not copy data to the file. If it is not copied, then fsync() will force the write to the file when it is called.
# Contents of fileToBeWrittenInto2.txt
Output line # 1

>>> fh.flush()
# check file contents
# Contents of fileToBeWrittenInto2.txt
Output line # 1
Output line # 2

>>> fh.close()

# In this interactive example, the data shows up in the file as soon as flush() is called, so we don't really feel the need for fsync() right now. But in a script containing hundreds of lines, it is not viable to check the contents of the file after each statement, so to err on the side of caution you can call flush() followed by os.fsync(fileHandler).
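
If you find yourself calling the two functions together often, they can be bundled into a small helper. This is a sketch under assumptions: the name write_durably is hypothetical, and the file lives in a temporary directory so the example can run anywhere.

```python
import os
import tempfile


def write_durably(file_handler, text):
    """Write text, then push it through both layers of buffering."""
    file_handler.write(text)
    file_handler.flush()                # internal buffer -> operating system buffer
    os.fsync(file_handler.fileno())     # operating system buffer -> disk


path = os.path.join(tempfile.mkdtemp(), 'fileToBeWrittenInto2.txt')
fh = open(path, 'w')
write_durably(fh, 'Output line # 1\n')

# A second handle can already see the data, although fh is still open.
with open(path, 'r') as reader:
    seen_by_reader = reader.read()

fh.close()
print(repr(seen_by_reader))
```

The order matters: fsync() can only push what has already reached the operating system, so flush() has to come first.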

 

Default Buffer Size

 
You can check the default buffer size of your platform by importing the io module and checking its DEFAULT_BUFFER_SIZE attribute. The returned size is in bytes.

>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192

Ways to read a file

We saw basic examples of reading and writing; now let's dig a little deeper into the reading operation. The write operation is quite fundamental, so we won't dwell much on it here.

  • read(), tell() & seek()
  • fH.read(size) reads some quantity of data and returns it as a string or bytes object, where fH is a file object obtained from calling the open() function. The size argument is optional, and when omitted, the entire contents of the file are read. When the size argument is specified, only that many bytes/characters will be read from the current cursor position of the file. Let's look at this practically.

    # fileToBeReadFrom.txt
    abcdefghijklmnopqrstuvwxyz
    
    # readingFromAFile.py (in the same directory as the above text file)
    fH = open('fileToBeReadFrom.txt', 'r')
    contents = fH.read()
    print(contents)
    
    # F5 / Run Menu > Run Module
    abcdefghijklmnopqrstuvwxyz
    

    The tell() method tells us the current position of the cursor in the file.
    The seek(offset [,from_where]) changes the file cursor position by as many characters as specified in the first argument, from the position specified in the second argument. The second argument has only 3 acceptable values: 0(default), 1, 2. 0 denotes the beginning of the file, 1 denotes the current position of the cursor, and 2 denotes the end of file.

    # fileToBeReadFrom.txt
    abcdefghijklmnopqrstuvwxyz
    
    # readingFromAFile.py (in the same directory as the above text file)
    fH = open('fileToBeReadFrom.txt', 'r')
    
    # F5 / Run Menu > Run Module
    >>> fH.tell()
    0
    >>> contents = fH.read(5)		# read 5 bytes/characters
    >>> contents
    'abcde'
    >>> fH.tell()
    5
    >>> contents2 = fH.read(6)		# next six bytes/characters
    >>> contents2
    'fghijk'
    >>> fH.tell()
    11
    >>> fH.seek(5)				# move the cursor to 5 characters/bytes from the beginning.
    5
    >>> fH.tell()
    5
    >>> fH.read(100)			# When the requested size exceeds what is left in the file, read() simply stops at end-of-file.
    'fghijklmnopqrstuvwxyz'
    >>> fH.tell()
    26
    

    Note that in text files opened without 'b' in access mode, Python only allows seeks relative to the beginning. The only exception to this is seeking to the end of the file using seek(0, 2).

    >>> fH.tell()
    26
    >>> fH.seek(0)
    0
    >>> fH.tell()
    0
    >>> fH.seek(5, 0)
    5
    >>> fH.tell()
    5
    >>> fH.seek(6, 1)          	# 6 bytes/characters from the current position i.e. 5
    Traceback (most recent call last):
      File "<pyshell#28>", line 1, in <module>
        fH.seek(6, 1)          	# 6 bytes/characters from the current position i.e. 5
    io.UnsupportedOperation: can't do nonzero cur-relative seeks
    >>> fH.tell()
    5
    >>> fH.seek(6, 2)		# 6 bytes/characters relative to the end of the file
    Traceback (most recent call last):
      File "<pyshell#30>", line 1, in <module>
        fH.seek(6, 2)		# 6 bytes/characters relative to the end of the file
    io.UnsupportedOperation: can't do nonzero end-relative seeks
    >>> fH.tell()
    5
    >>> fH.close()
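
The restriction above applies to text mode only. As a hedged sketch, here is the same 26-letter file opened in binary mode ('rb'), where seeks relative to the current position and to the end of the file are allowed. The temporary file path is an assumption made so that the example runs on its own; os.SEEK_CUR and os.SEEK_END are the named equivalents of 1 and 2.

```python
import os
import tempfile

# Build a throwaway copy of the sample file.
path = os.path.join(tempfile.mkdtemp(), 'fileToBeReadFrom.txt')
with open(path, 'w') as fH:
    fH.write('abcdefghijklmnopqrstuvwxyz')

with open(path, 'rb') as fH:               # 'b' = binary mode, read() returns bytes
    fH.seek(5)                             # absolute: 5 bytes from the beginning
    after_cur = fH.seek(6, os.SEEK_CUR)    # relative: 6 bytes forward from position 5
    middle = fH.read(3)
    after_end = fH.seek(-6, os.SEEK_END)   # relative: 6 bytes back from the end
    tail = fH.read()

print(after_cur, middle)   # 11 b'lmn'
print(after_end, tail)     # 20 b'uvwxyz'
```

A negative offset is used with os.SEEK_END because a positive one would place the cursor beyond the last byte of the file.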
    
    
  • readline() & readlines()
  • # readline([size]) method
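    # reads one line from the file each time it is called. The trailing newline character '\n' is kept in the string.
    # when the optional 'size' argument is present, at most 'size' characters of the current line are read; the next call continues from where this one stopped.
    # when the optional 'size' argument is not present, the entire line is read. An empty string '' means that the end of the file has been reached.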
    # contents of coding.py, 1st line of which contains 23 characters in all.
    # -*- coding: utf-8 -*-
    variableOne = 'Ethan'
    print(variableOne)
    
    # when the optional 'size' argument is not present
    >>> fh = open('coding.py', 'r')
    >>> fh.readline()
    '# -*- coding: utf-8 -*-\n'
    >>> fh.readline()
    "variableOne = 'Ethan'\n"
    >>> fh.readline()
    'print(variableOne)\n'
    >>> fh.readline()
    ''
    
    # when the optional 'size' argument is present
    >>> fh = open('coding.py', 'r')
    >>> fh.readline(15)
    '# -*- coding: u'
    >>> fh.readline(8)
    'tf-8 -*-'
    >>> fh.readline()
    '\n'
    >>> fh.readline()
    "variableOne = 'Ethan'\n"
    >>> fh.close()
    
    # readlines([size]) method
    # when the optional 'size' argument is not present, it returns a list of lines in the file.
    # when the optional 'size' argument is present, it is treated as a hint: whole lines are read until the total number of characters read exceeds 'size', and only those lines are returned. The cursor is left at the start of the next unread line, so the next time readline() or readlines() is called, it starts from that line.
    
    # contents of coding.py, 1st line of which contains 23 characters in all.
    # -*- coding: utf-8 -*-
    variableOne = 'Ethan'
    print(variableOne)
    
    # when the optional 'size' argument is not present
    >>> fh = open('coding.py', 'r')
    >>> fh.readlines()
    ['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n", 'print(variableOne)\n']
    >>> fh.close()
    
    # when the optional 'size' argument is present
    >>> fh = open('coding.py', 'r')
    >>> fh.readlines(27)			# the 1st line (24 characters including '\n') does not reach 27, so the 2nd line is read as well; the cursor is now at the beginning of the 3rd line.
    ['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
    >>> fh.readlines(20)			# the 3rd line is only 19 characters long, so reading continues until the end of the file
    ['print(variableOne)\n']
    >>> fh.readlines(20)
    []
    >>> fh.close()
    
    >>> fh = open('coding.py', 'r')
    >>> fh.readlines(27)
    ['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
    >>> fh.readlines()					# fetches rest of the lines
    ['print(variableOne)\n']
    
    # using readlines() and readline() in tandem
    >>> fh = open('coding.py', 'r')
    >>> fh.readlines(27)
    ['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
    >>> fh.readline()
    'print(variableOne)\n'
    >>> fh.close()
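
To make the role of the size argument of readlines() more concrete, here is a rough sketch that imitates it with a plain loop. The function readlines_with_hint is hypothetical and written for illustration only, and an in-memory io.StringIO object stands in for open('coding.py', 'r'). The case where the running total lands exactly on the hint is deliberately avoided, since I would not rely on its behaviour.

```python
import io


def readlines_with_hint(fh, hint):
    """Hypothetical imitation of fh.readlines(hint), for illustration only."""
    lines = []
    total = 0
    for line in fh:              # whole lines only, never a partial line
        lines.append(line)
        total += len(line)
        if total > hint:         # stop once the running total passes the hint
            break
    return lines


source = "# -*- coding: utf-8 -*-\nvariableOne = 'Ethan'\nprint(variableOne)\n"

fh = io.StringIO(source)         # in-memory stand-in for the file coding.py
first = readlines_with_hint(fh, 27)
rest = readlines_with_hint(fh, 20)
print(first)   # ['# -*- coding: utf-8 -*-\n', "variableOne = 'Ethan'\n"]
print(rest)    # ['print(variableOne)\n']
```

The first call stops after two lines because 24 + 22 = 46 characters is more than 27; the second call runs out of lines before reaching 20.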
    
  • a simple for loop with line variable
  • A file object is iterable: looping over it yields the lines of the file one at a time, and each line is in turn a sequence of characters. We can use this fact to display the contents of a file using a for loop.

    fileHandler = open('fileToBeReadFrom.txt')
    lineCounter = 0
    for line in fileHandler:
        lineCounter = lineCounter + 1
        # each line already ends with '\n', so print() is told not to add another one
        print(str(lineCounter) + ": " + line, end = '')
    fileHandler.close()
    
    # OUTPUT
    1: Python is an extremely versatile language.
    2: It is not limited to desktop applications or applications on the web.
    3: People who code in Python are often referred to as “Pythonistas” or “Pythoneers”.
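
The manual counter in the loop above can also be left to the built-in enumerate() function. This is a small sketch under the assumption that a two-line sample file is created in a temporary folder first; it also uses the with statement, which is covered later in this chapter.

```python
import os
import tempfile

# Create a small sample file so that the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), 'fileToBeReadFrom.txt')
with open(path, 'w') as fileHandler:
    fileHandler.write('Python is an extremely versatile language.\n'
                      'It is not limited to desktop applications or applications on the web.\n')

numbered = []
with open(path) as fileHandler:
    # enumerate() pairs every line with a counter that starts at 1
    for lineCounter, line in enumerate(fileHandler, start=1):
        numbered.append(str(lineCounter) + ": " + line)

print(''.join(numbered), end='')
```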
    
  • BufferedReader object returned by the ‘buffer’ attribute of a file object
  • Again, as a disclaimer, this is somewhat of an advanced topic, so skip it if it confuses you. Other methods will do just fine to read a file. This is placed here for the sake of completeness and enhanced understanding.

    Another way to read a file, albeit slightly more intricate, is to use the BufferedReader object that is returned by the buffer attribute of the file object in question. I say intricate because of the technicalities involved in the method, otherwise reading the contents is extremely simple.

    I will briefly go over the method and its details here.

    # Contents of 'toBeReadFrom4.txt'
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est. Morbi sed pretium purus, ac posuere nibh. Sed sit amet nunc eu metus viverra gravida ac vel dui. Fusce consectetur felis eu dolor feugiat, eu rhoncus ex faucibus. Donec quis consectetur leo. Cras sit amet ex in augue tincidunt convallis et ut neque. Sed varius mollis urna quis condimentum. Pellentesque a neque sed arcu condimentum vulputate sagittis ut urna.
    
    >>> fh = open('toBeReadFrom4.txt', 'r')
    >>> contents = fh.buffer
    >>> fh.buffer
    <_io.BufferedReader name='toBeReadFrom4.txt'>
    >>> for line in contents:
            print(line)
    
    b'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est. Morbi sed pretium purus, ac posuere nibh. Sed sit amet nunc eu metus viverra gravida ac vel dui. Fusce consectetur felis eu dolor feugiat, eu rhoncus ex faucibus. Donec quis consectetur leo. Cras sit amet ex in augue tincidunt convallis et ut neque. Sed varius mollis urna quis condimentum. Pellentesque a neque sed arcu condimentum vulputate sagittis ut urna.'
    

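    So what is a BufferedReader, which is returned by the buffer attribute of the file handler? It is a buffered binary stream that sits between the text-level file object and the file on disk: it fetches data from the file in large chunks and serves your reads from memory. The documentation of the io module describes it in detail.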

    You will notice two arguments in the documentation of BufferedReader: raw and buffer_size. The raw argument is the underlying raw binary stream from which data is transferred into the buffer (for a file opened with open(), this is the object available as fh.buffer.raw, not fh itself). The buffer_size argument is the amount of data transferred into the buffer in one go. Its default value is io.DEFAULT_BUFFER_SIZE; the open() function instead tries to use the block size of the underlying device (the st_blksize field reported by os.stat(), where available) and falls back on io.DEFAULT_BUFFER_SIZE. The st_size field, by contrast, gives the length of the file in bytes.

    So, the takeaway from this is that the buffer attribute of a file object returns a buffered binary stream over the same file. It does not hold the whole file at once; it refills itself chunk by chunk as you read. Its contents can be accessed using a for loop, which yields one line at a time as a bytes object.

    Example illustrating the DEFAULT_BUFFER_SIZE attribute of the io module and the st_size field returned by os.stat(), and confirming that iterating over the buffer visits every byte of the file.

    # Contents of sample file 'toBeWrittenInto4.txt', which is 104 bytes long.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    
    >>> import io
    >>> io.DEFAULT_BUFFER_SIZE
    8192
    
    >>> import os
    >>> os.stat('toBeWrittenInto4.txt')
    os.stat_result(st_mode=33206, st_ino=37436171902787844, st_dev=2229729095, st_nlink=1, st_uid=0, st_gid=0, st_size=104, st_atime=1473137894, st_mtime=1473155801, st_ctime=1473137894)
    >>> os.stat('toBeWrittenInto4.txt').st_size
    104
    
    >>> fh = open('toBeWrittenInto4.txt', 'r')
    >>> contents = fh.buffer
    >>> letterCounter = 0
    >>> for line in contents:
            for letter in line:		# iterating over a bytes object yields one integer per byte
                letterCounter = letterCounter + 1
    
    >>> print(letterCounter)
    104
    >>> fh.close()
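
To see how these pieces fit together, the following sketch prints the type of each layer behind a file opened in text mode. It assumes CPython's standard io classes, and the temporary file is created only so that the example is self-contained.

```python
import os
import tempfile

# Create a small sample file.
path = os.path.join(tempfile.mkdtemp(), 'toBeWrittenInto4.txt')
with open(path, 'w') as fh:
    fh.write('Lorem ipsum dolor sit amet')

fh = open(path, 'r')
# text layer -> buffered binary layer -> raw file layer
layers = (type(fh).__name__, type(fh.buffer).__name__, type(fh.buffer.raw).__name__)
print(layers)   # ('TextIOWrapper', 'BufferedReader', 'FileIO')

first_bytes = fh.buffer.read(5)   # reading through the buffer returns bytes, not str
print(first_bytes)                # b'Lorem'
fh.close()
```

The object named raw in the BufferedReader documentation corresponds to fh.buffer.raw here.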
    

    Another thing to note here is that the lines come back as bytes objects (hence the b prefix in the output) when you use the buffer attribute of the file object. To convert them to normal strings, you may use the decode() method, which takes the name of the encoding the file was written in, e.g. 'utf-8'.

    # Contents of sample file 'toBeWrittenInto4.txt'
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    
    >>> fh = open('toBeWrittenInto4.txt', 'r')
    >>> contents = fh.buffer
    >>> for line in contents:
            print(str(line))
    
    b'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.\r\n'
    b'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.\r\n'
    b'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.'
    >>> fh.close()
    
    >>> fh = open('toBeWrittenInto4.txt', 'r')
    >>> contents = fh.buffer
    >>> for line in contents:
            print(line.decode("utf-8"))
    
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    
    # There is a blank line after each line because the decoded line already ends with a newline, and print() by default adds another '\n' at the end of each call (>>> print('Hi') -> >>> print('Hello') gives us 'Hello' in a new line). You will not see these extra lines when you write the same data to another file, because you will be using write() and not print().
    # If you want to print the lines on to the console just like you read them from the file, you can alter the print statement in the for loop to something like: print(line.decode("utf-8"), end="")
    >>> fh = open('toBeWrittenInto4.txt', 'r')
    >>> contents = fh.buffer
    >>> for line in contents:
            print(line.decode("utf-8"), end="")
    
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.
    # If you wish to keep the \r\n characters intact you can escape new lines using the repr() function. Just alter the call to print function to the following: print(  repr (  line.decode( 'utf-8' )  )  )
    >>> fh = open('toBeWrittenInto4.txt', 'r')
    >>> contents = fh.buffer
    >>> for line in contents:
            print( repr (line.decode("utf-8")))
    
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.\r\n'
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.\r\n'
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec non dignissim diam, iaculis vehicula est.\r\n'
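
The \r\n endings seen above survive because the buffer hands over the raw bytes of the file. As a hedged sketch, the following compares three ways of reading the same data: raw bytes decoded by hand, ordinary text mode, and text mode with newline=''. The file name sample.txt and its contents are assumptions made for the example.

```python
import os
import tempfile

# Write raw bytes with Windows-style line endings.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as fh:
    fh.write('Hello\r\nthere\r\n'.encode('utf-8'))

with open(path, 'rb') as fh:
    raw = fh.read()                    # bytes, exactly as stored on disk
decoded = raw.decode('utf-8')          # str, line endings untouched

with open(path, 'r', encoding='utf-8') as fh:
    translated = fh.read()             # text mode turns '\r\n' into '\n' while reading
with open(path, 'r', encoding='utf-8', newline='') as fh:
    untranslated = fh.read()           # newline='' switches that translation off

print(repr(raw))           # b'Hello\r\nthere\r\n'
print(repr(decoded))       # 'Hello\r\nthere\r\n'
print(repr(translated))    # 'Hello\nthere\n'
print(repr(untranslated))  # 'Hello\r\nthere\r\n'
```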
    
    FYI, the buffer attribute is not present when you open a file with 'b' access modes e.g. 'wb', 'rb' etc. An AttributeError is encountered when an attempt is made to use it.

    Reading and Writing to the same file

    fh = open('fileToBeWrittenInto.txt', 'w+')          # w+ opens the file for reading as well as writing; creates the file if it doesn't exist, and empties it if it does
    fh.write('Perl Python Ruby')
    
    fh.seek(5)                                          # go to 5th byte from beginning
    print(fh.read(6))                                   # read 6 bytes from the current cursor position i.e. bytes 6-11 i.e. 'Python'
    print(fh.tell())                                    # tell the position of cursor i.e. 11
    fh.seek(12)                                         # go to 12th byte from beginning
    fh.write('Delphi')                                  # write 'Delphi' at position 12 i.e. positions 12-13-14-15-16-17 contains 'D-e-l-p-h-i', overwriting earlier content in these positions.
    fh.seek(0)                                          # reset cursor position to the beginning of the file
    content = fh.read()                                 # read the contents of the file
    print(content)                                      # 'Perl Python Delphi'
    
    fh.close()
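
A closely related access mode is 'r+', which also allows both reading and writing but keeps the existing contents of the file. The sketch below contrasts the two modes; the temporary file path is an assumption made so that the example can run anywhere.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'fileToBeWrittenInto.txt')
with open(path, 'w') as fh:
    fh.write('Perl Python Ruby')

with open(path, 'r+') as fh:      # r+ keeps the existing contents
    fh.seek(5)                    # move to the 'P' of 'Python'
    fh.write('Delphi')            # overwrite the 6 characters of 'Python' in place
    fh.seek(0)
    after_r_plus = fh.read()

with open(path, 'w+') as fh:      # w+ empties the file as soon as it is opened
    after_w_plus = fh.read()

print(repr(after_r_plus))   # 'Perl Delphi Ruby'
print(repr(after_w_plus))   # ''
```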
    

    Reading From And Writing To Binary Files

    Working with binary files is somewhat outside the scope of this course, since it involves encoding and decoding. Nevertheless, if you are interested, a primer on working with binary files might be of use to you.


    Cursor position in a file

    We can manipulate the cursor position inside a file while reading and writing, using the tell() & seek() methods, as seen in the section Ways to read a file, subheading manipulating the file cursor position.


    The 'with' keyword

    The with statement automatically releases a resource as soon as the code block following it finishes executing. It is considered good practice when dealing with connections, file objects etc. Its advantage is that the file is properly closed after the block finishes, even if an exception is raised along the way. It is also much shorter than writing the equivalent try-finally clauses.

    >>> with open('fileToBeReadFrom.txt', 'r') as fH:
            contents = fH.read()
    
    >>> fH.closed
    True
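
For comparison, here is a sketch of the try-finally form that the with statement replaces, followed by a check that with closes the file even when an exception interrupts the block. The temporary file and the deliberately raised ValueError are assumptions made for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'fileToBeReadFrom.txt')
with open(path, 'w') as fH:
    fH.write('abcdefghijklmnopqrstuvwxyz')

# The long form: close the file by hand, whatever happens in between.
fH = open(path, 'r')
try:
    contents = fH.read()
finally:
    fH.close()
closed_after_try = fH.closed

# The with form: the file is closed even though the block raises an exception.
try:
    with open(path, 'r') as fH2:
        fH2.read(5)
        raise ValueError('something went wrong mid-read')
except ValueError:
    pass
closed_after_error = fH2.closed

print(closed_after_try, closed_after_error)   # True True
```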
    

    The pickle, shelve & json modules

    Python offers three built-in modules which aid in serializing data to files and deserializing it at a later time. These modules are: pickle, shelve, json. In order to avoid overwhelming you with huge amounts of information, I have covered these modules in a separate post. So if you are interested in knowing more about serialization in Python, be sure to check that post out.
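
As a small taste of what serialization looks like, here is a hedged sketch that saves a dictionary to disk and loads it back, once with json and once with pickle. The record contents and file names are made up for the example; shelve is left out to keep it short.

```python
import json
import os
import pickle
import tempfile

record = {'name': 'Ethan', 'languages': ['Perl', 'Python', 'Ruby'], 'years': 4}
folder = tempfile.mkdtemp()

# json writes human-readable text, so the file is opened in text mode.
json_path = os.path.join(folder, 'record.json')
with open(json_path, 'w') as fh:
    json.dump(record, fh)
with open(json_path, 'r') as fh:
    from_json = json.load(fh)

# pickle writes bytes, so the file is opened in binary mode.
pickle_path = os.path.join(folder, 'record.pickle')
with open(pickle_path, 'wb') as fh:
    pickle.dump(record, fh)
with open(pickle_path, 'rb') as fh:
    from_pickle = pickle.load(fh)

print(from_json == record, from_pickle == record)   # True True
```

Note that pickle files should only be loaded when they come from a source you trust, since unpickling can execute arbitrary code.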


    A Few General Things

    repr() and str()

    In a nutshell, the __repr__() method of an object is defined to make it unambiguous, whereas the __str__() method is defined to make the object readable.

    The built-in repr() function returns a string containing the printable representation of an object. Every Python object, be it a list, a set, a tuple etc., has a magic method __repr__ which is called implicitly when repr() is called on it. Let's look at a few examples.

    >>> repr('string')
    "'string'"
    >>> repr(4)
    '4'
    >>> repr( set( [1, 2, 'three', 'four'] ) )
    "{1, 2, 'four', 'three'}"
    >>> repr( [1, 2, 'three', 'four'] )
    "[1, 2, 'three', 'four']"
    >>> repr( (1, 2, 'three', 'four') )
    "(1, 2, 'three', 'four')"
    >>>
    >>>
    >>>
    >>> str('string')
    'string'
    >>> str(4)
    '4'
    >>> str( set( [1, 2, 'three', 'four'] ) )
    "{1, 2, 'four', 'three'}"
    >>> str( [1, 2, 'three', 'four'] )
    "[1, 2, 'three', 'four']"
    >>> str( (1, 2, 'three', 'four') )
    "(1, 2, 'three', 'four')"
    

    For most of the inputs above, repr() and str() give the same output; the string is the exception, where repr() adds quotes around it. str(object) calls the __str__ magic method of the object, if defined, and returns the "informal" or nicely printable string representation of the object. If the object does not have a __str__ method, then str(object) returns the string returned by repr(object).

    One important thing to note about the repr() function is that, for strings, it shows escape sequences such as \n in their escaped form instead of as the characters they stand for, like we saw in the BufferedReader code. This is in contrast to the str() function, which returns the string unchanged.

    I reiterate, the __repr__() method of an object is defined to make it unambiguous, whereas the __str__() method is defined to make the object readable.

    >>>
    >>> myIntricateString = 'Hello\n'
    >>>
    >>> print(myIntricateString)
    Hello
    
    >>> print(str(myIntricateString))
    Hello
    
    >>> print(repr(myIntricateString))
    'Hello\n'
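
The difference becomes clearer with a class of your own. In this sketch, the classes Point and Bare are hypothetical examples: Point defines both magic methods, while Bare defines only __repr__, so str() falls back on it.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # unambiguous: looks like the code that would recreate the object
        return 'Point(x=' + repr(self.x) + ', y=' + repr(self.y) + ')'

    def __str__(self):
        # readable: what an end user would want to see
        return '(' + str(self.x) + ', ' + str(self.y) + ')'


class Bare:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return 'Bare(' + repr(self.value) + ')'


p = Point(1, 2)
print(str(p))            # (1, 2)
print(repr(p))           # Point(x=1, y=2)
print(str(Bare('hi')))   # Bare('hi')
```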
    

    The Java equivalent of this function is the toString() method, which when called, gives a string containing the printable representation of an object in Java.


    Overriding default behavior of print()

    The print() function, by default, separates each of its arguments with a space. To change this, Python provides us with the optional sep argument, which stands for 'separator'. We can specify the separator character/string here.

    >>> print("Hello","there","!")
    Hello there !
    >>> print("Hello","there","!", sep = "")
    Hellothere!
    >>> print("Hello","there","!", sep = "##")
    Hello##there##!
    

    Another default behavior of the print function is that consecutive calls print on consecutive lines of the output, because each call ends with a newline. We can change this using the optional end argument.

    >>> try:
            print("Hello")
            print("there.")
        except:
            pass
    
    Hello
    there.
    
    >>> try:
            print("Hello", end = "")
            print("there.")
        except:
            pass
    
    Hellothere.
    
    >>> try:
            print("Hello", end = "\t")
            print("there.")
        except:
            pass
    
    Hello	there.
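
The sep and end arguments can be combined in the same call. The sketch below also uses the optional file argument of print() to send the output to an in-memory io.StringIO object instead of the console, which is an assumption made purely so that the exact characters produced can be inspected.

```python
import io

buffer = io.StringIO()   # in-memory stand-in for the console or a file

print('Perl', 'Python', 'Ruby', sep=', ', end='.\n', file=buffer)
print('a', 'b', sep='', end='', file=buffer)   # no separator, no newline
print('c', file=buffer)                        # default end, so the line finishes here

captured = buffer.getvalue()
print(repr(captured))   # 'Perl, Python, Ruby.\nabc\n'
```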
    


    On the agenda in the next chapter

    That was a lot to take in! File Manipulation is important to know, and if you have understood every bit of this chapter, then well done! If not, take your time, practice and you will eventually get there. In the next and last chapter of this course, you will learn about how Python handles errors. Till next time!


    Exercises

    • In Unix, the head -n fileName command displays the first n lines of the file fileName. Write a program in Python to emulate the same functionality. [ Solution ]
    • Write a Python program to extract the longest word(s) out of a file.
      # contents of read.txt
      Donec volutpat rhoncus velit a tincidunt. Duis est ipsum, finibus nec molestie id, finibus eu nisi. Donec sit amet consectetur dolor, vitae vehicula enim.
      Nullam quis purus vestibulum, consequat eros eu, placerat justo. Pellentesque tempus commodo ex, eu mattis lectus tempor vitae.
      In felis orci, consectetur nec congue in, ullamcorper vel purus. Aliquam euismod erat ut venenatis placerat. Maecenas eget sodales magna, in interdum velit.
      
      ## EXPECTED OUTPUT ##
      Pellentesque
      

      [ Solution ]

    • Write a Python script which prints corresponding lines of two files together.
      # contents of textOne.txt
      Agilent Technologies
      Alcoa Corporation
      Aac Holdings Inc
      Aaron's Inc
      Advance Auto Parts Inc
      
      # contents of textTwo.txt
      A
      AA
      AAC
      AAN
      AAP
      
      ## EXPECTED OUTPUT ##
      Agilent Technologies : A
      Alcoa Corporation : AA
      Aac Holdings Inc : AAC
      Aaron's Inc : AAN
      Advance Auto Parts Inc : AAP
      

      [ Solution ]

    • Write Python code to pick a line randomly from a file. [ Solution ]
    • Write a Python program to create minified versions of files i.e. remove newline characters at end-of-line and bring all content in a single line.
      # contents of style.css
      body {
      color: blue;
      background: #000;
      text-align: center;
      }
      
      ## EXPECTED OUTPUT ##
      body {color: blue; background: #000; text-align: center; }
      

      [ Solution ]


     

