Converting a Word document to text using IronPython
The module below demonstrates how to convert a batch of Word documents to text.
When calling the module from the command line, you can pass the path of the directory containing the files to convert as an argument. If you call it without an argument, it uses whatever you have defined as test_path.
The doc_to_text function does the actual work of converting an individual Word document to text. Using COM Interop, it opens the Word document, loops through its paragraphs, and collects the text of each one. Each paragraph's text is passed to clean_text for cleanup. Internally, Word ends paragraphs with a carriage return (CR), which I replace with a carriage return plus line feed (CR+LF). Page breaks are represented by the form feed (FF) character, which I remove. The BEL character was prevalent in my Word documents; as far as I can tell, Word uses it to mark the end of a table cell, so I replace it with an empty string.
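To make the cleanup step concrete, here is a small plain-Python sketch of the same three replacements applied to a made-up string. The sample text and the name clean_word_text are illustrative assumptions, not part of the module; the sketch needs neither Word nor IronPython.

```python
FORM_FEED = "\x0c"  # page break marker in Word's paragraph text
BEL = "\x07"        # assumed here to be Word's end-of-cell marker
CR = "\r"           # Word's paragraph terminator


def clean_word_text(text):
    """Apply the three replacements described above (illustrative helper)."""
    text = text.replace(FORM_FEED, "")
    text = text.replace(BEL, "")
    text = text.replace(CR, "\r\n")
    return text


# Hypothetical sample: a paragraph, a page break, then a table cell.
sample = "First paragraph\r\x0cCell text\r\x07"
cleaned = clean_word_text(sample)
print(repr(cleaned))
```

Printing with repr makes the control characters visible, which is handy when checking what Word actually returned for a paragraph.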
The convert_files function gets a list of all of the Word documents in a directory, loops through that list converting each file, and saves each result as a text file.
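The batch loop itself does not depend on Word, so it can be tried out separately. The following is a minimal sketch in standard Python, assuming a hypothetical convert_directory helper that accepts any converter callable; fake_convert stands in for the real Word conversion. Unlike the module, this sketch writes each .txt file next to its source document.

```python
import tempfile
from pathlib import Path


def convert_directory(doc_dir, convert, pattern="*.doc"):
    """Convert every file matching pattern and write a .txt beside it.

    convert is any callable that takes a file path and returns text.
    """
    written = []
    for source in sorted(Path(doc_dir).glob(pattern)):
        target = source.with_suffix(".txt")
        # newline="" writes CR+LF pairs in the text exactly as given.
        with open(target, "w", encoding="utf-8", newline="") as handle:
            handle.write(convert(str(source)))
        written.append(target.name)
    return written


def fake_convert(path):
    # Stand-in for the Word conversion so the sketch runs anywhere.
    return "text of " + Path(path).name + "\r\n"


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.doc", "b.doc", "notes.md"):
        (Path(tmp) / name).write_bytes(b"")
    result = convert_directory(tmp, fake_convert)
    print(result)
```

Passing the converter in as a parameter keeps the directory handling testable without starting Word.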

__author__ = "Edward J. Stembler"
__date__ = "2009-01-09"
__module_name__ = "Converts a batch of Word documents, found in a directory, to text"
__version__ = "1.0"
version_info = (1, 0, 0)

import sys
import clr
import System
from System.Text import StringBuilder
from System.IO import DirectoryInfo, File, FileInfo, Path, StreamWriter

# Load the Word interop assembly before importing its namespace.
clr.AddReference("Microsoft.Office.Interop.Word")
import Microsoft.Office.Interop.Word as Word
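
def convert_files(doc_path):
    # Find every Word document in the given directory.
    directory = DirectoryInfo(doc_path)
    files = directory.GetFiles("*.doc")
    for file_info in files:
        # Pass the absolute path so Word does not resolve it against its own working directory.
        text = doc_to_text(file_info.FullName)
        # The .txt file is written to the current working directory.
        stream_writer = File.CreateText(Path.GetFileNameWithoutExtension(file_info.Name) + ".txt")
        try:
            stream_writer.Write(text)
        finally:
            stream_writer.Close()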

def doc_to_text(filename):
    # Start a hidden Word instance through COM Interop.
    word_application = Word.ApplicationClass()
    word_application.Visible = False
    try:
        document = word_application.Documents.Open(filename)
        try:
            result = StringBuilder()
            # Collect the cleaned text of each paragraph in document order.
            for p in document.Paragraphs:
                result.Append(clean_text(p.Range.Text))
        finally:
            document.Close()
    finally:
        # Quit Word even if opening or reading the document fails.
        word_application.Quit()
    return result.ToString()

def clean_text(text):
    text = text.replace("\x0c", "")    # FF (page break); the octal escape "\12" is LF, not FF
    text = text.replace("\x07", "")    # BEL
    text = text.replace("\r", "\r\n")  # CR -> CRLF
    return text

test_path = "C:\\test\\"

if __name__ == "__main__":
    # Use the directory given on the command line, or fall back to test_path.
    if len(sys.argv) == 2:
        convert_files(sys.argv[1])
    else:
        convert_files(test_path)