[python] How to extract text from an existing docx file using python-docx

I'm trying to use python-docx module (pip install python-docx) but it seems to be very confusing as in github repo test sample they are using opendocx function but in readthedocs they are using Document class. Even they are only showing how to add text to a docx file not reading existing one?

1st one (opendocx) is not working, may be deprecated. For second case I was trying to use:

from docx import Document

document = Document('test_doc.docx')

print document.paragraphs

It returned a list of <docx.text.Paragraph object at 0x... >

Then I did:

for p in document.paragraphs:
    print p.text

It returned all text but there were few thing missing. All URLs (CTRL+CLICK to go to URL) were not present in text on console.

What is the issue? Why URLs are missing?

How could I get complete text without iterating over loop (something like open().read())

This question is related to python python-2.7 python-3.x python-docx

The answer is


you can try this

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Without Installing python-docx

docx is basically is a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from docx file, without the need to rely on python-docx and lxml the latter being sometimes hard to install:

http://etienned.github.io/posts/extract-text-from-word-docx-simply/


It seems that there is no official solution for this problem, but there is a workaround posted here https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649

just update this file "...\site-packages\docx\oxml_init_.py"

# add
import re
import sys

# add
def remove_hyperlink_tags(xml):
    if (sys.version_info > (3, 0)):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if (sys.version_info > (3, 0)):
        xml = xml.encode('utf-8')
    return xml
    
# update
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element

and of course don't forget to mention in the documentation that use are changing the official library


You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.


you can try this also

from docx import Document

document = Document('demo.docx')
for para in document.paragraphs:
    print(para.text)

Using python-docx, as @Chinmoy Panda 's answer shows:

for para in doc.paragraphs:
    fullText.append(para.text)

However, para.text will lost the text in w:smarttag (Corresponding github issue is here: https://github.com/python-openxml/python-docx/issues/328), you should use the following function instead:

def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks are supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE Hyperlink support was added subsequent to this answer.


I had a similar issue so I found a workaround (remove hyperlink tags thanks to regular expressions so that only a paragraph tag remains). I posted this solution on https://github.com/python-openxml/python-docx/issues/85 BP


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict

Examples related to python-3.x

Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation Replace specific text with a redacted version using Python Upgrade to python 3.8 using conda "Permission Denied" trying to run Python on Windows 10 Python: 'ModuleNotFoundError' when trying to import module from imported package What is the meaning of "Failed building wheel for X" in pip install? How to downgrade python from 3.7 to 3.6 I can't install pyaudio on Windows? How to solve "error: Microsoft Visual C++ 14.0 is required."? Iterating over arrays in Python 3 How to upgrade Python version to 3.7?

Examples related to python-docx

How to extract text from an existing docx file using python-docx