How to extract text from an existing docx file using python-docx

Question

I'm trying to use the python-docx module (pip install python-docx), but it seems very confusing: the test sample in the GitHub repo uses the opendocx function, while the readthedocs documentation uses the Document class. Even then, they only show how to add text to a docx file, not how to read an existing one.

The first one (opendocx) is not working; it may be deprecated. For the second case I tried:

    from docx import Document

    document = Document('test_doc.docx')
    print(document.paragraphs)

It returned a list of <docx.text.Paragraph object at 0x...> objects. Then I did:

    for p in document.paragraphs:
        print(p.text)

It returned all the text, but a few things were missing: none of the URLs (CTRL+CLICK to go to URL) were present in the text on the console.

What is the issue? Why are the URLs missing?

How could I get the complete text without iterating over a loop (something like open().read())?

User · Answer

You can also try this:

    from docx import Document

    document = Document('demo.docx')
    for para in document.paragraphs:
        print(para.text)

User · Answer

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions, and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version, and it lives in a separate repository.

The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.

Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it.

UPDATE: Hyperlink support was added subsequent to this answer.
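For readers who need the link targets regardless of which python-docx version is installed, here is a minimal sketch that uses only the standard library. It assumes the usual package layout, in which the body is stored at word/document.xml and external link targets are stored in word/_rels/document.xml.rels. The function name hyperlinks is hypothetical, not part of python-docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML packages.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlinks(docx_file):
    """Return (visible text, target) pairs for links that use a relationship id.

    Hypothetical helper: assumes the standard part names inside the package.
    """
    with zipfile.ZipFile(docx_file) as archive:
        body = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId1") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter("{%s}Relationship" % PKG_REL)
    }

    pairs = []
    for link in body.iter("{%s}hyperlink" % W):
        rid = link.get("{%s}id" % R)
        text = "".join(t.text or "" for t in link.iter("{%s}t" % W))
        if rid in targets:  # links to in-document anchors carry no r:id
            pairs.append((text, targets[rid]))
    return pairs
```

Calling hyperlinks('test_doc.docx') would return a list of tuples. Links inside headers and footers live in other parts of the package and are not covered by this sketch.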

User · Answer

Using python-docx, as @Chinmoy Panda's answer shows:

    for para in doc.paragraphs:
        fullText.append(para.text)

However, para.text will lose the text inside w:smartTag elements. The corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328

You should use the following function instead:

    def para2text(p):
        # Collect every w:t element under the paragraph, however deeply nested.
        rs = p._element.xpath('.//w:t')
        return u" ".join([r.text for r in rs])
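To see why a descendant search recovers the missing text, here is a small standard-library sketch. The paragraph XML is a made-up sample, and the "direct runs" query is only an approximation of reading the paragraph's own runs, not python-docx's actual implementation.

```python
import xml.etree.ElementTree as ET

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Made-up paragraph: the word "Paris" sits in a run wrapped by w:smartTag.
PARAGRAPH = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Meet me in </w:t></w:r>'
    '<w:smartTag w:element="place"><w:r><w:t>Paris</w:t></w:r></w:smartTag>'
    '<w:r><w:t>.</w:t></w:r>'
    '</w:p>' % NS["w"]
)

p = ET.fromstring(PARAGRAPH)

# Only runs that are direct children of the paragraph: the wrapped run is skipped.
direct_runs = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))

# Every w:t at any depth, the same idea as the './/w:t' XPath above.
all_text = "".join(t.text or "" for t in p.findall(".//w:t", NS))

print(repr(direct_runs))
print(repr(all_text))
```

This sketch joins the pieces with an empty string, whereas para2text joins them with a space; which one reads better depends on how the document splits its runs.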

User · Answer

You can try this:

    import docx

    def getText(filename):
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        return '\n'.join(fullText)

User · Answer

Without installing python-docx: a docx file is basically a zip file with several folders and files within it. At the link below you can find a simple function that extracts the text from a docx file without relying on python-docx and lxml, the latter sometimes being hard to install: http://etienned.github.io/posts/extract-text-from-word-docx-simply
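The zip-based idea can be sketched as follows. This is not the code from the linked post; it is a minimal illustration that assumes the body text lives in word/document.xml, and the function name docx_to_text is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Clark-notation prefix for the WordprocessingML namespace.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_file):
    """Return the body text of a .docx, one line per paragraph.

    Hypothetical helper: reads only word/document.xml, so headers,
    footers, footnotes, tabs and manual line breaks are ignored.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        # Join every text node under the paragraph, including text in hyperlinks.
        lines.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(lines)
```

Usage would be print(docx_to_text('demo.docx')). Because the text nodes are collected at any depth, text inside hyperlinks is included, although the link targets themselves are not.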

User · Answer

It seems that there is no official solution for this problem, but there is a workaround posted here: https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649

Just update this file: "...\site-packages\docx\oxml\__init__.py"

    # add
    import re
    import sys

    # add
    def remove_hyperlink_tags(xml):
        if (sys.version_info > (3, 0)):
            xml = xml.decode('utf-8')
        xml = xml.replace('</w:hyperlink>', '')
        xml = re.sub('<w:hyperlink[^>]*>', '', xml)
        if (sys.version_info > (3, 0)):
            xml = xml.encode('utf-8')
        return xml

    # update
    def parse_xml(xml):
        """
        Return root lxml element obtained by parsing XML character string in
        *xml*, which can be either a Python 2.x string or unicode. The custom
        parser is used, so custom element classes are produced for elements in
        *xml* that have them.
        """
        root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
        return root_element

And of course, don't forget to mention in your documentation that you are changing the official library.
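The effect of the two substitutions can be checked on a plain string without patching the library. The sketch below uses a made-up XML fragment, and strip_hyperlink_tags is a hypothetical stand-alone variant of the function above that works on str only.

```python
import re


def strip_hyperlink_tags(xml_text):
    """Drop <w:hyperlink ...> wrappers but keep the runs they contain."""
    xml_text = xml_text.replace("</w:hyperlink>", "")
    return re.sub(r"<w:hyperlink[^>]*>", "", xml_text)


# Made-up fragment: one run wrapped in a hyperlink element.
sample = (
    '<w:p><w:hyperlink r:id="rId4"><w:r><w:t>python-docx</w:t></w:r>'
    '</w:hyperlink></w:p>'
)

cleaned = strip_hyperlink_tags(sample)
print(cleaned)
```

After the wrapper is removed, the run is a direct child of the paragraph, which is why para.text then includes the link text. The link target itself is discarded by this approach.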

User · Answer

I had a similar issue, so I found a workaround: remove the hyperlink tags with regular expressions so that only a paragraph tag remains. I posted this solution on https://github.com/python-openxml/python-docx/issues/85

BP

User · Answer

You can use python-docx2txt, which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
