Extracting text from HTML file using Python

Question

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

Related questions:

Filter out HTML tags and resolve entities in python
Convert XML/HTML Entities into Unicode String in Python

User · Answer

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

User · Answer

PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of the use of PyParsing (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entities issue: you can convert them before you run BeautifulSoup.

Good luck.
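
For the entity conversion mentioned above, a minimal standard-library sketch: `html.unescape` (Python 3.4+) turns named and numeric character references into the characters they stand for. The helper name `decode_entities` and the sample string are made up for illustration. One caveat worth keeping in mind: unescaping before parsing can turn `&lt;` into a literal `<` that a parser may then treat as markup, so applying it to already-extracted text may be the safer order.

```python
import html

def decode_entities(text):
    # html.unescape handles named (&amp;) and numeric (&#39;) references
    return html.unescape(text)

# Hypothetical sample containing both kinds of reference
sample = "It&#39;s 5 &lt; 6 &amp; &quot;done&quot;"
print(decode_entities(sample))
```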

User · Answer

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con, 'html.parser')
texts = soup.get_text()
print(texts)

User · Answer

html2text is a Python program that does a pretty good job at this.

User · Answer

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.error
import urllib.request

def processText(webpage):
    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        # webpage is a regex match object here; .group() yields the matched URL
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text=True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = ' '.join(item.text.split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

User · Answer

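Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    # Replace every tag with a space
    res = re.sub(r'<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    # Collapse repeated newlines and drop carriage returns
    res = re.sub(r'\n+', '\n', res)
    res = re.sub(r'\r+', '', res)
    # Collapse runs of tabs and spaces
    res = re.sub(r'[\t ]+', ' ', res)
    res = re.sub(r'\t+', '\t', res)
    res = re.sub(r'(\n )+', '\n ', res)
    return res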
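
To illustrate why this stays out of production, here is a small check of the same tag-stripping pattern against a made-up snippet: the JavaScript source survives and the character reference stays encoded. The function name `strip_tags` and the sample input are assumptions for the demo, not part of the answer above.

```python
import re

def strip_tags(html_source):
    # Same idea as above: anything that looks like a tag becomes a space
    return re.sub(r'<.*?>', ' ', html_source, flags=re.DOTALL)

# Hypothetical input with an entity and an inline script
sample = "<p>Tom&#39;s page</p><script>var x = 1;</script>"
result = strip_tags(sample)
print(repr(result))
```

The script body is ordinary text as far as the regex is concerned, which is exactly the problem described in the question.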

User · Answer

If you need more speed and less accuracy, you could use raw lxml.

import lxml.html as lh
# In recent lxml releases the cleaner lives in the separate lxml_html_clean package
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

User · Answer

In Python 3.x you can do it very easily by importing the imaplib and email packages. This is an older post, but maybe my answer can help newcomers.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1])  # or email.message_from_string(data[0][1])

# If the message is multipart we only want the text version of the body;
# this walks the message and gets the body.
if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the MIME transfer encoding (e.g. Base64, uuencode, quoted-printable)
            body = part.get_payload(decode=True)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format. If it is good enough for you, it would be nice to select it as the accepted answer.
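
The fragment above depends on an open IMAP connection (`self.imap`). A self-contained sketch of the same walk, assuming a message built locally with `EmailMessage` instead of fetched from a server, could look like this. The helper name `plain_text_body` is made up for illustration, and note that this only returns a text/plain part that already exists in the message; it does not convert the HTML part to text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical multipart message so the example needs no IMAP server
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("Hello in plain text")
msg.add_alternative("<p>Hello in <b>HTML</b></p>", subtype="html")

def plain_text_body(raw_bytes):
    """Return the first text/plain part of a raw RFC 822 message, or None."""
    parsed = message_from_bytes(raw_bytes, policy=policy.default)
    # walk() also yields the message itself, so non-multipart mail works too
    for part in parsed.walk():
        if part.get_content_type() == "text/plain":
            return part.get_content()
    return None

print(plain_text_body(msg.as_bytes()))
```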

User · Answer

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in article
Meta Description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor

User · Answer

Beautiful Soup does convert HTML entities. It's probably your best bet considering HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

# Written for BeautifulSoup 3 (Python 2)
import copy
import re
import BeautifulSoup

def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild. Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                        if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

User · Answer

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1': [], 'h2': [], 'strong': [], 'a': ['href']})
print(s)

User · Answer

NOTE: NLTK no longer supports the clean_html function. The original answer is below, and an alternative is in the comments section.

Use NLTK.

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK. It works magically.

# Python 2 code, kept as originally posted
import nltk
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

User · Answer

Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

# Python 2 only: htmllib was removed in Python 3
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter

p = HTMLParser(AbstractFormatter(DumbWriter()))
try:
    p.feed('hello<br>there')
    p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError:
    print ':('  # the html is badly malformed (or you found a bug)
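
Since htmllib is gone in Python 3, a rough modern counterpart of the "derived class" idea can be sketched with the standard `html.parser` module. The class and function names below (`TextExtractor`, `html_to_text`) are illustrative choices, not an established API, and the handling of block tags is deliberately minimal.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; or &#39;
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p"):
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._chunks.append("\n")

    def handle_data(self, data):
        # Only keep text that is outside script/style elements
        if not self._skip_depth:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks).strip()

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return parser.get_text()

print(html_to_text("hello<br>there<script>var x = 1;</script> &amp; more"))
```

As with the htmllib version, this is unlikely to behave reliably on badly malformed markup.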

User · Answer

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good.

User · Answer

Another example using BeautifulSoup4 in Python 2.7.9+.

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break into lines and remove leading and trailing space on each, then break multi-headlines into a line each: chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join(...), drop blank lines, and finally return the result encoded as UTF-8.

Notes:

Some systems this is run on will fail with https:// connections because of an SSL issue; you can turn off verification to work around it. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/

Python < 2.7.9 may have some issues running this.

text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.
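
The clean-up steps described under "Explained" can be tried on their own, without BeautifulSoup or a network connection. The sketch below assumes the split separator is two spaces (the separator is an assumption here), and the function name `tidy_text` and the sample string are made up for illustration.

```python
def tidy_text(text):
    # Strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so side-by-side fragments become separate lines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only the non-empty pieces, one per line
    return "\n".join(chunk for chunk in chunks if chunk)

# Hypothetical raw output, similar to what get_text() might return
sample = "  Title  Subtitle \n\n\n   Body text here.  \n"
print(tidy_text(sample))
```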

User · Answer

You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram.

User · Answer

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

User · Answer

Another non-Python solution: LibreOffice.

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats.
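
If the conversion needs to be driven from Python after all, a thin wrapper around the same command line might look like the sketch below. It assumes `soffice` is on the PATH and that the converted file is written as `<input stem>.txt` into the directory given by `--outdir`; the function names are made up, and the output encoding is left to the platform default, which may need adjusting.

```python
import shutil
import subprocess
from pathlib import Path

def soffice_command(html_path, out_dir="."):
    # Same flags as the command above, plus --outdir to choose where the .txt lands
    return ["soffice", "--headless", "--invisible",
            "--convert-to", "txt", "--outdir", str(out_dir), str(html_path)]

def html_file_to_text(html_path, out_dir="."):
    """Convert html_path with LibreOffice; return the text, or None if soffice is missing."""
    if shutil.which("soffice") is None:
        return None
    subprocess.run(soffice_command(html_path, out_dir), check=True)
    # Assumption: LibreOffice names the output after the input file's stem
    return (Path(out_dir) / (Path(html_path).stem + ".txt")).read_text()
```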

User · Answer

Another option is to run the HTML through a text-based web browser and dump it. For example, using Lynx:

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    # Send lynx's rendered text straight into the output file
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

User · Answer

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
# Python 2 code (HTMLParser, htmlentitydefs, unichr)
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&': '&amp;', "'": '&#39;', '"': '&quot;', '<': '&lt;', '>': '&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

User · Answer

PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip()):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

User · Answer

The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

User · Answer

I do it something like this:

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text  # the decoded response body, i.e. the HTML source

User · Answer

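This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g., google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import subprocess
import tempfile

# Write the HTML to a temporary file that links can read
# (os.tmpnam only returned a name and no longer exists in Python 3)
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name

proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()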

User · Answer

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

Hello world
I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm.

import re
import html

def html2text(htm):
    # Decode entities first, then map a few typographic characters to ASCII
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    # Turn block-level closing tags and line breaks into newlines
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

User · Answer

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run and deploy in a Docker container, and from there can be accessed via Python bindings.

User · Answer

Install html2text using

pip install html2text

then

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

User · Answer

I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to manually remove unnecessary whitespace. But it seems to work much faster than the BeautifulSoup solution.

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into a single space
    return text

User · Answer

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

User · Answer

Found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # in Python 2: from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            # Collapse internal whitespace to single spaces
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        # On any parse failure, log the traceback and fall back to the raw input
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

User · Answer

The best piece of code I found for extracting text without getting JavaScript or other unwanted things:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4

User · Answer

html2text is a Python program that does a pretty good job at this.

User · Answer

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and side bars as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

User · Answer

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break the text into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.

Notes:

Some systems this is run on will fail with https:// connections because of an SSL issue; you can turn off verification to fix that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.
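
The whitespace clean-up described above does not depend on BeautifulSoup or on the network, so it can be tried on any string. Here is a small Python 3 sketch of just that step; the function name tidy_text and the sample input are invented for the illustration.

```python
def tidy_text(text):
    """Hypothetical helper reproducing the line/chunk clean-up on any string."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces so run-together headlines separate.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only non-empty chunks, one per output line.
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Headline one  Headline two \n\n\n   Body text here.  \n"
print(tidy_text(raw))
```

Note that splitting on two spaces is a heuristic: any sentence that happens to contain a double space will also be broken onto separate lines.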

User · Answer

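Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    # Replace every tag with a space, then tidy up the leftover whitespace.
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res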

User · Answer

Another non-Python solution: LibreOffice.

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats.

User · Answer

I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to remove unnecessary whitespace manually. But it seems to work much faster than the BeautifulSoup solution.

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    # Drop script and style elements so their contents do not end up in the text.
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into one space
    return text

User · Answer

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

User · Answer

While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

Hello world
I love you

Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm:

import re
import html

def html2text(htm):
    # Decode entities first, then map a few typographic characters to ASCII.
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    # Turn block-level closing tags and line breaks into newlines.
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret
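
To see why the snippet maps code point 160 to a plain space, it may help to look at what html.unescape returns on its own. The sketch below uses only the standard library; decode_entities is a name invented for this example.

```python
import html


def decode_entities(fragment):
    """Hypothetical helper: decode entities and normalise non-breaking spaces."""
    decoded = html.unescape(fragment)
    # &nbsp; decodes to U+00A0, which looks like a space but is a different character.
    return decoded.replace("\u00a0", " ")


# repr() makes the non-breaking space visible as \xa0.
print(repr(html.unescape("hello&nbsp;world")))
print(decode_entities("Tom &amp; Jerry&#39;s hello&nbsp;world"))
```

Without that replacement, a later comparison or split on an ordinary space would silently miss the non-breaking one.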

User · Answer

If you need more speed and less accuracy, then you could use raw lxml.

import lxml.html as lh
# Note: recent lxml releases ship the cleaner as the separate lxml_html_clean package.
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

User · Answer

Install html2text using

pip install html2text

then

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

User · Answer

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box. Tika can be run as a server, is trivial to run and deploy in a Docker container, and from there can be accessed via Python bindings.

User · Answer

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.error
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        # webpage is expected to be a regex match object; .group() yields the URL string
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text=True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

User · Answer

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1])  # email.message_from_string(data[0][1])

# If the message is multipart we only want the text version of the body;
# this walks the message and gets the body.
if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True controls automatic email-style MIME decoding
            # (e.g. Base64, uuencode, quoted-printable)
            body = part.get_payload(decode=True)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plaintext format. If it is good enough for you then it would be nice to select it as the accepted answer.

User · Answer

In a simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text that start with '<' and end with '>' and replaces each one with an empty string.
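
A quick way to see what this pattern does and does not remove is to run it on a small string. The sample markup below is made up for the illustration; the pattern is the one from the answer.

```python
import re

# Invented sample: one paragraph with an entity, followed by an inline script.
sample = "<p>Tom &amp; Jerry</p><script>alert('hi');</script>"

# Same substitution as in the answer: delete anything between '<' and '>'.
stripped = re.sub(r"<(.*?)>", "", sample)
print(stripped)
```

The tags themselves are gone, but the script body and the undecoded &amp; entity remain, which is the behaviour the question was trying to avoid.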
