[python] Strip HTML from strings in Python

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print line

When printing a line in an HTML file, I'm trying to find a way to only show the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.com">some text</a>', it will only print 'some text', '<b>hello</b>' prints 'hello', etc. How would one go about doing this?

This question is related to python html

The answer is


I always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
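
A quick usage example (input string mine):

print(strip_tags('<a href="whatever.com">some text</a>'))
# prints: some text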

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
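
A quick check (input mine):

print(remove_html_markup('<a href="whatever.com">some text</a>'))
# prints: some text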

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)


Here is a simple solution that strips HTML tags and decodes HTML entities based on the amazingly fast lxml library:

from lxml import html

def strip_html(s):
    return str(html.fromstring(s).text_content())

strip_html('Ein <a href="">sch&ouml;ner</a> Text.')  # Output: Ein schöner Text.

For one project, I needed to strip not only HTML, but also CSS and JS. Thus, I made a variation of Eloff's answer:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag=="script":
            self.css = True
    def handle_endtag(self, tag):
        if tag=="style" or tag=="script":
            self.css=False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
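
For example (input mine), style and script contents are dropped along with the tags:

print(strip_tags('<style>p {color: red}</style><p>visible</p><script>alert(1)</script>'))
# prints: visible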

# This is a regex solution.
import re
def removeHtml(html):
    if not html:
        return html
    # Remove comments first
    innerText = re.compile(r'<!--[\s\S]*?-->').sub('', html)
    while innerText.find('>') >= 0:  # Loop through nested tags
        text = re.compile(r'<[^<>]+?>').sub('', innerText)
        if text == innerText:
            break
        innerText = text
    return innerText.strip()
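
For instance (input mine; note that, unlike the parser-based answers, HTML entities are left undecoded):

print(removeHtml('<!-- comment --><div><p>text &amp; more</p></div>'))
# prints: text &amp; more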

2020 Update

Use the Mozilla Bleach library. It lets you customize which tags and attributes to keep, and it can also filter out attributes based on their values.

Here are two cases to illustrate.

Take this sample raw text:

raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p> 
"""

1) Remove all HTML tags and attributes from raw text

# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 

2) Allow only the img tag with the srcset attribute

from bleach.sanitizer import Cleaner
# ALLOW ONLY img tags with srcset attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))

Output

<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform&#8217;s users. Also as part [&#8230;]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 

You can write your own function:

def StripTags(text):
    finished = False
    while not finished:
        finished = True
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = False
    return text

The solutions based on an HTML parser are all breakable if they run only once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is exactly what you intend to prevent. If you use an HTML parser, keep running it until no tags remain:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
       self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while must_filtered:
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).

from HTMLParser import HTMLParser
import htmlentitydefs

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(unichr(codepoint))

    def handle_entityref(self, name):
        codepoint = htmlentitydefs.name2codepoint[name]
        self.result.append(unichr(codepoint))

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = u'<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print repr(html_to_text(html))

Result:

u'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

Error handling:

  • Invalid HTML structure may cause an HTMLParseError.
  • Invalid named HTML entities (such as &apos;, which is valid in XML and XHTML, but not plain HTML) will cause a KeyError exception.
  • Numeric HTML entities specifying code points outside the Unicode range acceptable by Python (such as, on some systems, characters outside the Basic Multilingual Plane) will cause a ValueError exception.

Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text – that does not make the result safe to use in an HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is not hard: Any time you insert a plain-text string into HTML output, you should always HTML escape it (using cgi.escape(s, True)), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content).

(However, the OP asked about printing the result to the console, in which case no HTML escaping is needed.)
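
For instance, a sketch with the html_to_text above (Python 2, hence cgi.escape; input mine):

import cgi

stripped = html_to_text(u'&lt;script&gt;alert("Hello");&lt;/script&gt;')
# stripped == u'<script>alert("Hello");</script>' -- plain text, but not HTML-safe
print cgi.escape(stripped, True)
# prints: &lt;script&gt;alert(&quot;Hello&quot;);&lt;/script&gt;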

Python 3.4+ version: (with doctest!)

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Unrecognized named entities are included as-is. '&apos;' is recognized,
    despite being XML only.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

Note that HTMLParser has improved in Python 3 (meaning less code and better error handling).


If you want to strip all HTML tags the easiest way I found is using BeautifulSoup:

from bs4 import BeautifulSoup  # Or from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))

I tried the code of the accepted answer but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the above block of code.


Short version!

import re, cgi
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
ready_for_web = cgi.escape(no_tags)

Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

Why can't I just strip the tags and leave it?

It's one thing to keep people from <i>italicizing</i> things, without leaving <i>s floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle-brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.
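
To make that concrete, here is a sketch using the tag_re from the short version above (the template values are hypothetical):

import re
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

firstname = '<a'                          # neither value is a complete tag,
lastname = 'href="http://evil.com/">'     # so the stripper passes both through
print(tag_re.sub('', firstname) + ' ' + tag_re.sub('', lastname))
# prints: <a href="http://evil.com/">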

Django's strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!

To strip tags with HTMLParser, you have to run it multiple times.

It's easy to circumvent the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value
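
Django's real _strip_once lives in django.utils.html; a minimal stand-in for experimenting with the loop above might look like this (my sketch, not Django's actual code):

from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    # Hypothetical helper: collects text nodes and re-emits entity/char references.
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
    def handle_data(self, d):
        self.parts.append(d)
    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)
    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

def _strip_once(value):
    parser = _TextOnly()
    parser.feed(value)
    parser.close()
    return ''.join(parser.parts)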

Of course, none of this is an issue if you always escape the result of strip_tags().

Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained, you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."
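
A quick sketch of that rule with Bleach (tag list and input are mine; strip=True removes disallowed tags instead of escaping them):

import bleach

print(bleach.clean('<i>fine</i> <iframe src="http://evil.com/"></iframe>',
                   tags=['i'], strip=True))
# prints: <i>fine</i>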

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

sheepish note - The question itself is about printing to the console, but this is the top Google result for "python strip html from string", so that's why this answer is 99% about the web.


This method works flawlessly for me and requires no additional installations:

import re
import htmlentitydefs

def convertentity(m):
    if m.group(1) == '#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)

html = converthtml(html)
html = html.replace("&nbsp;", " ")  ## Get rid of the remnants of certain formatting (subscript, superscript, etc.)

If you need to preserve HTML entities (i.e. &amp;), I added a "handle_entityref" method to Eloff's answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
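
A quick check (input mine); the entity references survive untouched:

print html_to_text('4 &lt; 5 &amp; 6')
# prints: 4 &lt; 5 &amp; 6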

A Python 3 adaptation of søren-løvborg's answer:

from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaption of http://stackoverflow.com/a/7778368/196732 """
    def __init__(self):
        # convert_charrefs=False so the charref/entityref handlers below are called
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in ('x', 'X') else int(number)
        self.result.append(chr(codepoint))

    def handle_entityref(self, name):
        # html5 keys include a trailing semicolon, e.g. 'amp;'
        if name + ';' in html5:
            self.result.append(html5[name + ';'])

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

You can either use a different HTML parser (like lxml or Beautiful Soup) -- one that offers functions to extract just text -- or you can run a regex on your line string that strips out the tags. See the Python docs for more.


Simple code! This will remove all kinds of tags and the content inside them (everything between < and >).

def rm(s):
    start = False
    end = False
    s = ' ' + s
    for i in range(len(s) - 1):
        if i < len(s):
            if start is not False:
                if s[i] == '>':
                    end = i
                    s = s[:start] + s[end+1:]
                    start = end = False
            else:
                if s[i] == '<':
                    start = i
    if s.count('<') > 0:
        return rm(s)  # recurse until no '<' remains
    else:
        s = s.replace('&nbsp;', ' ')
        return s

But it won't give the full result if the text contains < or > symbols that aren't part of tags.


I have used Eloff's answer successfully for Python 3.1 [many thanks!].

I upgraded to Python 3.2.3, and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code:

def __init__(self):
    self.reset()
    self.fed = []

... in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

... and it will work for Python 3.2.3.

Again, thanks to Thomas K for the fix and for Eloff's original code provided above!


The Beautiful Soup package does this immediately for you.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.get_text()
print(text)

Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:

from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)

Using BeautifulSoup, html2text or the code from @Eloff, most of the time some HTML elements or JavaScript code remain.

So you can use a combination of these libraries and delete markdown formatting (Python 3):

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        for current in ["^[ #*]{2,30}", "^[ ]{0,30}\d\\\.", "^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile("[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

It works well for me but it can be enhanced, of course...


Here's my solution for Python 3.

import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>", txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt = txt.replace(tag, '')
    return txt

Not sure if it is perfect, but solved my use case and seems simple.


An lxml.html-based solution (lxml is a native library and can be more performant than a pure python solution).

To install the lxml module use pip install lxml

Remove ALL tags

from lxml import html


## from file-like object or URL
tree = html.parse(file_like_object_or_url)

## from string
tree = html.fromstring('safe <script>unsafe</script> safe')

print(tree.text_content().strip())

### OUTPUT: 'safe unsafe safe'

Remove ALL tags with pre-sanitizing HTML (dropping some tags)

from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<script>dangerous</script><span class="item-summary">
                            Detailed answers to any questions you might have
                        </span>""")

## text only
print(clean_html(tree).text_content().strip())

### OUTPUT: 'Detailed answers to any questions you might have'

Also see http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly the lxml.cleaner does.

If you need more control over what exactly is sanitized before converting to text then you might want to use the lxml Cleaner explicitly by passing the options you want in the constructor, e.g:

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )
sanitized_html = cleaner.clean_html(unsafe_html)

If you need more control over how plain text is generated then instead of text_content() you can use lxml.etree.tostring:

from lxml.etree import tostring

plain_bytes = tostring(tree, method='text', encoding='utf-8')
print(plain_bytes.decode('utf-8'))


I haven't thought much about the cases it will miss, but you can do a simple regex:

re.sub('<[^<]+?>', '', text)

For those that don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that aren't a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately with the ?. Without it, it will match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as the escape sequence &lt; anyway, so the ^< may be unnecessary.
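
A minimal demonstration (input mine):

import re

print(re.sub('<[^<]+?>', '', '<a href="whatever.com">some text</a>'))
# prints: some text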


You can use BeautifulSoup's get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text()) 
#or via attribute of Soup Object: print(soup.text)

It is advisable to explicitly specify the parser, for example as BeautifulSoup(html_str, features="html.parser"), for the output to be reproducible.


I'm parsing Github readmes and I find that the following really works well:

import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''

And then

readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />

            sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
            It uses the asynchronous `asyncio` framework, as well as many popular modules 
            and extensions.

            Most importantly, it aims for **next generation** web crawling where machine intelligence 
            is used to speed up the development/maintainance/reliability of crawling.

            It mainly does this by considering the user to be interested in content 
            from *domains*, not just a collection of *single pages*
            ([templating approach](#templating-approach))."""

strip_markdown(strip_html(readme))

Removes all markdown and html correctly.


This is a quick fix and can be even more optimized, but it will work fine. This code replaces all non-empty tags with "" and strips all HTML tags from a given input text. You can run it using ./file.py input output

#!/usr/bin/python
import sys

def replace(strng, replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng


lessThanPos = -1
count = 0
listOf = []

try:
    # write file
    writeto = open(sys.argv[2], 'w')

    # read file and store it in a list
    f = open(sys.argv[1], 'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    # remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line

        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp, line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt", "<")
        lineTemp = lineTemp.replace("&gt", ">")
        writeto.write(lineTemp)
    writeto.close()
    print "Write To --- >", sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ", sys.argv[0], " inputfile outputfile"