Python code to remove HTML tags from a string

Question

I have a text like this:

    text = """
    <div>
    <h1>Title</h1>
    <p>A long text... </p>
    <a href=""> a link </a>
    </div>
    """

Using pure Python, with no external module, I want to have this:

    >>> print remove_tags(text)
    Title A long text... a link

I know I can do it using lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using builtins or the standard library for 2.6+.

How can I do that?

User · Answer

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

    import re

    # Matches anything that looks like a tag: "<", one or more non-">" characters, ">"
    TAG_RE = re.compile(r'<[^>]+>')

    def remove_tags(text):
        return TAG_RE.sub('', text)

However, as lvc mentions, xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
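To make the caveat about the regex concrete, here is a small sketch that runs the same pattern on one plain input and on the quoted ">" case mentioned above. The sample strings `simple` and `tricky` are made up for illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    """Strip anything that looks like <...> from text."""
    return TAG_RE.sub('', text)

# Hypothetical inputs, chosen only to illustrate the behaviour.
simple = '<div><h1>Title</h1><p>A long text</p></div>'
tricky = '<a title=">">a link</a>'

simple_result = remove_tags(simple)   # tags removed as intended
tricky_result = remove_tags(tricky)   # the ">" inside the attribute ends the match early

print(repr(simple_result))
print(repr(tricky_result))
```

In the second case the match stops at the ">" inside the attribute value, so part of the tag leaks into the output. That is the breakage the answer warns about.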

User · Answer

There's a simple way to do this in any C-like language. The style is not Pythonic, but it works with pure Python:

    def remove_html_markup(s):
        tag = False     # True while we are inside <...>
        quote = False   # True while we are inside a quoted attribute value
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

        return out

The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here is the link: https://www.udacity.com/course/software-debugging--cs259. It's free!
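One possible refinement of this state machine, offered as a sketch rather than as the answer's own code: remember which quote character opened an attribute value, so that an apostrophe inside a double-quoted value does not toggle the quote state. The variable names `in_tag` and `quote_char` and the sample inputs are my own.

```python
def remove_html_markup(s):
    in_tag = False      # True while between "<" and the matching ">"
    quote_char = None   # quote character that opened the current attribute value, if any
    out = []            # collected characters that lie outside tags

    for c in s:
        if quote_char:
            # Inside a quoted attribute value: only the same quote character ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)

# Hypothetical inputs for illustration.
quoted_gt = remove_html_markup('<a title=">">a link</a>')
mixed_quotes = remove_html_markup('<a title="it\'s > here">x</a>')
plain_quotes = remove_html_markup('<p>it\'s "fine"</p>')
```

Collecting characters in a list and joining once at the end also avoids repeated string concatenation on long inputs.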

User · Answer

Using a regex

Using a regex, you can clean everything inside <>:

    import re

    def cleanhtml(raw_html):
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext

Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, then you might want to write the regex as

    cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text. You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers (much more robust than the default one, html.parser, i.e. the one available without an additional install):

    from bs4 import BeautifulSoup
    cleantext = BeautifulSoup(raw_html, "lxml").text

But this approach still requires external libraries, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.
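Note that the extended regex deletes entities rather than decoding them. If the decoded characters are wanted instead, one option is to strip the tags first and then decode with `html.unescape`. This is a sketch that assumes Python 3.4 or later (so it does not cover the 2.6 requirement in the question). The function name `cleanhtml_keep_entities` and the sample string are made up for illustration.

```python
import re
from html import unescape  # standard library, Python 3.4+

TAG_RE = re.compile(r'<.*?>')
TAG_OR_ENTITY_RE = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml_keep_entities(raw_html):
    """Remove tags, then turn entities such as &amp; into their characters."""
    return unescape(TAG_RE.sub('', raw_html))

sample = '<p>Fish &amp; chips&nbsp;&lt;3</p>'   # hypothetical input

decoded = cleanhtml_keep_entities(sample)        # entities become characters
stripped = TAG_OR_ENTITY_RE.sub('', sample)      # entities are simply deleted

print(repr(decoded))
print(repr(stripped))
```

Decoding after stripping matters here: decoding first would turn `&lt;` into a literal `<`, which the tag regex could then mistake for markup.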

User · Answer

Python has several XML modules built in. The simplest one for the case where you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
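One thing worth keeping in mind with this approach: xml.etree is an XML parser, so it expects well-formed markup with a single root element. The sketch below shows it working on a well-formed fragment and raising `ParseError` on an unclosed `<br>`. Both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    """Concatenate all text nodes of a well-formed XML/XHTML fragment."""
    return ''.join(ET.fromstring(text).itertext())

# Hypothetical inputs for illustration.
well_formed = '<div><h1>Title</h1><p>A long text</p><a href="">a link</a></div>'
not_well_formed = '<p>line one<br>line two</p>'   # <br> is never closed

result = remove_tags(well_formed)

try:
    remove_tags(not_well_formed)
    parsed = True
except ET.ParseError:
    # Typical real-world HTML is often not valid XML, so this branch is common.
    parsed = False
```

If the input is HTML from the wild rather than XHTML, catching `ParseError` and falling back to one of the other approaches on this page may be a reasonable design.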

User · Answer

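    temp = ''   # module-level accumulator; reset it to '' before each new top-level call
    s = ' '     # separator inserted before each extracted text fragment

    def remove_strings(text):
        global temp
        if text == '':
            return temp
        start = text.find('<')
        end = text.find('>')
        if start == -1 and end == -1:
            # No tags left: the remainder is plain text.
            temp = temp + text
            return temp
        # Skip past the current tag. Note: any text before the first tag is ignored.
        newstring = text[end + 1:]
        fresh_start = newstring.find('<')
        if newstring[:fresh_start] != '':
            temp += s + newstring[:fresh_start]
        remove_strings(newstring[fresh_start:])
        return temp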
