Converting a Word document to text using IronPython


The module below demonstrates how to convert a batch of Word documents to text.

When calling the module from the command line, you can pass the path of the directory to convert as an argument. If you call it without an argument, it uses whatever you have defined as test_path.

The doc_to_text method does the actual work of converting an individual Word document to text. Using COM Interop, it opens the Word document, loops through the paragraphs, and returns the paragraph text. The text is passed to clean_text for cleansing. Internally, Word documents use carriage returns (CR), which I replace with carriage returns plus line feeds (CR+LF). Page breaks are represented by the form feed (FF) character, which I strip out. I'm not sure what the BEL character is used for, but it was prevalent in my Word documents, so I replaced it with an empty string.
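If you want to check which control characters turn up in your own documents before deciding what to strip, a small counting helper is enough. The sketch below is plain Python and does not touch Word; count_control_chars, CONTROL_NAMES and the sample string are made up for illustration and are not part of the module.

```python
from collections import Counter

# Names for the control characters discussed above; anything else is shown as hex.
CONTROL_NAMES = {"\r": "CR", "\n": "LF", "\f": "FF", "\a": "BEL", "\t": "TAB", "\v": "VT"}

def count_control_chars(text):
    """Return a dict mapping control-character names to how often they occur in text."""
    counts = Counter(ch for ch in text if ord(ch) < 32)
    return dict((CONTROL_NAMES.get(ch, "0x%02X" % ord(ch)), n) for ch, n in counts.items())

# Hypothetical sample resembling raw paragraph text: CR ends each paragraph,
# BEL follows two of them, and FF marks a page break.
sample = "Title\rCell A\r\aCell B\r\a\fNext page\r"
print(count_control_chars(sample))
```

Running something like this over the raw paragraph text is one way to notice characters such as BEL that you did not expect.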

The convert_files method gets a list of all of the Word documents in a directory, loops through that list converting each file, and saves the result as a text file.
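The same list, convert, save loop can be sketched without .NET, which makes it easy to try out with a stand-in converter. This is a minimal sketch assuming Python 3; convert_directory, its converter parameter and the out_dir argument are hypothetical names and not part of the module, which writes its output to the current working directory instead.

```python
import glob
import os

def convert_directory(doc_dir, converter, out_dir):
    """Convert every *.doc file in doc_dir and write <name>.txt into out_dir.

    converter is any callable that takes a file path and returns text.
    Returns the list of text files written.
    """
    written = []
    for doc_file in sorted(glob.glob(os.path.join(doc_dir, "*.doc"))):
        text = converter(doc_file)
        base_name = os.path.splitext(os.path.basename(doc_file))[0]
        txt_file = os.path.join(out_dir, base_name + ".txt")
        # newline="" writes any CR+LF pairs in the text exactly as they are
        with open(txt_file, "w", newline="") as handle:
            handle.write(text)
        written.append(txt_file)
    return written
```

Passing the converter in as an argument keeps the directory handling separate from the Word-specific code, so the loop can be tested on a machine without Word installed.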

__author__ = "Edward J. Stembler"
__date__ = "2009-01-09"
__module_name__ = "Converts a batch of Word documents, found in a directory, to text"
__version__ = "1.0"
version_info = (1,0,0)

import sys
import clr
import System
from System.Text import StringBuilder
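from System.IO import DirectoryInfo, File, FileInfo, Path, StreamWriter

# Load the Word interop assembly before importing from it.
clr.AddReference("Microsoft.Office.Interop.Word")
import Microsoft.Office.Interop.Word as Word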

def convert_files(doc_path):

    directory = DirectoryInfo(doc_path)
    files = directory.GetFiles("*.doc")

    for file_info in files:
        text = doc_to_text(Path.Combine(doc_path, file_info.Name))

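        # Save the converted text as <name>.txt in the current working directory.
        stream_writer = File.CreateText(Path.GetFileNameWithoutExtension(file_info.Name) + ".txt")
        stream_writer.Write(text)
        stream_writer.Close()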


def doc_to_text(filename):

    word_application = Word.ApplicationClass()
    word_application.Visible = False

    document = word_application.Documents.Open(filename)

    result = StringBuilder()

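    # Collect the cleaned text of every paragraph.
    for p in document.Paragraphs:
        result.Append(clean_text(p.Range.Text))

    # Close the document and quit Word so no hidden instance is left running.
    document.Close()
    document = None

    word_application.Quit()
    word_application = None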

    return result.ToString()

def clean_text(text):

    text = text.replace("\f", "")     # FF (the original "\12" is octal for LF, not FF)
    text = text.replace("\07", "")    # BEL
    text = text.replace("\r", "\r\n") # CR -> CRLF

    return text

test_path = "C:\\test\\"

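if __name__ == "__main__":
    # Use the path given on the command line, or fall back to test_path.
    if len(sys.argv) == 2:
        convert_files(sys.argv[1])
    else:
        convert_files(test_path)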