Read .doc file with Python

Question

I got a test for a job application. My task is to read some .doc files. Does anyone know a library to do this? I started with raw Python code:

    f = open('test.doc', 'r')
    f.read()

but this does not return a friendly string. I need to convert it to UTF-8.

Edit: I just want to get the text from this file.

User · Answer

One can use the textract library. It takes care of both .doc and .docx:

    import textract
    text = textract.process("path/to/file.extension")

You can even use antiword (sudo apt-get install antiword), convert the .doc into .docx first, and then read it through docx2txt:

    antiword filename.doc > filename.docx

Note that antiword writes plain text to standard output, so the redirected file holds plain text whatever its extension.

Ultimately, textract uses antiword in the backend.

User · Answer

The answer from Shivam Kotwalia works perfectly. However, the object is returned as bytes. Sometimes you may need it as a string, for example to run a regex on it.

I recommend the following code (the first two lines are from Shivam Kotwalia's answer):

    import textract
    text = textract.process("path/to/file.extension")
    text = text.decode("utf-8")

The last line converts the object text to a string.
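A strict decode raises UnicodeDecodeError if the extracted bytes are not valid UTF-8. The following is a minimal sketch of a more forgiving decode; the helper name to_text and the fallback policy are assumptions for illustration, not part of textract.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes to str, replacing undecodable bytes instead of failing.

    `to_text` is a hypothetical helper name, not a textract function.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to do
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Lossy fallback: each invalid byte sequence becomes U+FFFD
        return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(to_text(b"caf\xc3\xa9"))  # valid UTF-8
    print(to_text(b"caf\xe9"))      # Latin-1 byte, not valid UTF-8
```

Whether a lossy fallback is acceptable depends on the use case; for exact text it may be better to let the error surface and try another encoding.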

User · Answer

Prerequisites:

Install antiword: sudo apt-get install antiword
Install docx: pip install docx

    from subprocess import Popen, PIPE
    from docx import opendocx, getdocumenttext
    from cStringIO import StringIO  # Python 2 only

    def document_to_text(filename, file_path):
        # Run antiword on the file and capture what it prints
        cmd = ['antiword', file_path]
        p = Popen(cmd, stdout=PIPE)
        stdout, stderr = p.communicate()
        # Drop any non-ASCII bytes instead of raising an error
        return stdout.decode('ascii', 'ignore')

    print(document_to_text('your_file_name', 'your_file_path'))

Notice: New versions of python-docx removed these functions (opendocx, getdocumenttext). Make sure to pip install docx and not the new python-docx.
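For readers on Python 3, the same idea can be sketched with subprocess.run. The helper names run_extractor and doc_to_text are hypothetical, and decoding the output as UTF-8 is an assumption: the encoding antiword actually emits depends on how it is configured.

```python
import shutil
import subprocess
import sys


def run_extractor(cmd):
    """Run an external command and return its stdout as text.

    `run_extractor` is a hypothetical helper; `cmd` is an argument list
    such as ["antiword", "/path/to/file.doc"].
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # Surface the tool's own error message to the caller
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    # Assumption: the tool's output is UTF-8 compatible
    return result.stdout.decode("utf-8", errors="replace")


def doc_to_text(file_path):
    """Extract text from a .doc file with antiword, if it is installed."""
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword is not installed or not on PATH")
    return run_extractor(["antiword", file_path])


if __name__ == "__main__":
    # Demonstrate the wrapper with a command that is always available
    print(run_extractor([sys.executable, "-c", "print('hello')"]))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.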

User · Answer

You can use the python-docx2txt library to read text from Microsoft Word documents. It is an improvement over the python-docx library, as it can, in addition, extract text from links, headers and footers. It can even extract images.

You can install it by running:

    pip install docx2txt

Let's read a Microsoft Word document, saved here as test.docx:

    import docx2txt
    my_text = docx2txt.process("test.docx")
    print(my_text)

EDIT: This does NOT work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.
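For context on why .docx is easier to read than .doc: a .docx file is a ZIP archive whose main body is stored in word/document.xml. The following standard-library sketch reads body paragraphs only; it is not how docx2txt is implemented, it skips headers, footers, links and images, and the function name docx_body_text is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_body_text(source):
    """Return the body paragraphs of a .docx file as one string.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over <w:t> runs
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    # Build a tiny in-memory archive that mimics a .docx body
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>Hi</w:t>'
        "</w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    print(docx_body_text(buf))
```

A legacy .doc file is not a ZIP archive, so zipfile.ZipFile raises BadZipFile on it, which is consistent with the EDIT above.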

User · Answer

I was trying to do the same. I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:

    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.visible = False
    wb = word.Documents.Open("myfile.doc")
    doc = word.ActiveDocument
    print(doc.Range().Text)

User · Answer

I agree with Shivam's answer, except that textract doesn't exist for Windows. Also, for some reason antiword sometimes fails to read .doc files and gives an error:

    'filename.doc' is not a word document

This happens when the file wasn't generated by MS Office. E.g., web pages may be stored in .doc format offline.

So I've got the following workaround to extract the text:

    from bs4 import BeautifulSoup as bs

    soup = bs(open(filename).read(), "html.parser")
    # Remove <style> and <script> elements so only visible text remains
    for s in soup(['style', 'script']):
        s.extract()
    tmpText = soup.get_text()
    # Strip tabs and newlines, then encode the result as UTF-8
    text = "".join("".join(tmpText.split('\t')).split('\n')).encode('utf-8').strip()
    print(text)

This script will work with most kinds of files. Have fun!
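One way to decide which approach to try is to look at the first bytes of the file rather than its extension. The sketch below is a rough heuristic under stated assumptions: the helper names and return labels are hypothetical, the OLE2 signature is shared by other legacy Office formats (such as .xls), and any ZIP archive will be reported as "docx".

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container (.doc)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP archive (.docx)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_format(head):
    """Guess the real format from the first bytes of a file.

    Returns "doc", "docx", "html" or "unknown". Hypothetical helper;
    the HTML check is a rough heuristic, not a full detector.
    """
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]  # ignore a leading byte order mark
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def sniff_file(path):
    """Read the first bytes of `path` and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(512))
```

A file that sniffs as "html" despite its .doc extension is the case described above, where antiword fails and an HTML parser is the better tool.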
