Decode HTML entities in Python string

Question

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<p>&pound;682m</p>")
    >>> text = soup.find("p").string
    >>> print text
    &pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

User · Answer

I had a similar encoding issue, and I used the normalize() method. I was getting a Unicode error using the pandas to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this and it worked:

    import unicodedata

The dataframe object can be whatever you like; let's call it table:

    table = pd.DataFrame(data, columns=['Name', 'Team', 'OVR', 'POT'])
    table.index += 1

Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish):

    # this is where the magic happens
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')
    # export normalized string to html file
    # (binary mode, because encode() returns bytes)
    file = open("templates/home.html", "wb")
    file.write(html_data)
    file.close()

Reference: unicodedata documentation
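To make the effect of the normalize() approach above concrete, here is a minimal sketch that leaves pandas out and works on plain strings. The helper name to_ascii is hypothetical, chosen only for illustration. It suggests that NFKD normalization followed by an ASCII encode with 'ignore' strips accents and silently drops characters that have no ASCII decomposition, and that it does not decode HTML entities at all.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters (NFKD), then drop anything that is not ASCII.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# The accent on the e is stripped; the pound sign has no ASCII
# decomposition, so it is dropped entirely.
print(to_ascii("caf\u00e9 \u00a3682m"))  # cafe 682m

# An HTML entity is already plain ASCII, so it passes through undecoded.
print(to_ascii("&pound;682m"))  # &pound;682m
```

So this technique avoids the Unicode error by discarding data, which may or may not be acceptable for your export.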

User · Answer

Python 3.4+

Use html.unescape():

    import html
    print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library. For Python 2.6-2.7 it's in HTMLParser; for Python 3 it's in html.parser:

    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m

You can also use the six compatibility library to simplify the import:

    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
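As a small illustration of the html.unescape() route on Python 3.4 or later, the sketch below wraps the call in a helper and feeds it the same pound sign written three ways. The name decode_entities is hypothetical; only html.unescape comes from the standard library.

```python
import html


def decode_entities(text):
    """Decode named, decimal, and hexadecimal HTML entities in text.

    Hypothetical wrapper; html.unescape does all the work.
    """
    return html.unescape(text)


# Named, decimal, and hexadecimal references to the same character.
for sample in ("&pound;682m", "&#163;682m", "&#xa3;682m"):
    print(decode_entities(sample))  # each line prints £682m
```

The wrapper adds nothing functionally; it only gives you one place to change if you later need a fallback for older interpreters.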

User · Answer

Beautiful Soup 4 allows you to set a formatter to your output:

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

    print(soup.prettify(formatter=None))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit <<Sacré bleu!>>
    #   </p>
    #  </body>
    # </html>

    link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    print(link_soup.a.encode(formatter=None))
    # <a href="http://example.com/?foo=val1&bar=val2">A link</a>

User · Answer

You can use replace_entities from the w3lib.html library:

    In [202]: from w3lib.html import replace_entities

    In [203]: replace_entities("&pound;682m")
    Out[203]: u'\xa3682m'

    In [204]: print replace_entities("&pound;682m")
    £682m

User · Answer

This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this. Assume document = page, and please forgive the sloppy code; if you have ideas as to how to make it better, I'm all ears. I'm new to this.

    import re
    import HTMLParser

    regexp = "&.+?;"
    list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
    for e in list_of_html:
        h = HTMLParser.HTMLParser()
        unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
        page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value
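One possible tidier variant of the loop above, assuming Python 3.4 or later, is to let re.sub call html.unescape on each match in a single pass. The function name decode_page and the narrower pattern are assumptions made for this sketch, not part of the original answer. In many cases calling html.unescape(page) directly on the whole string is enough, and the regular expression is only useful if you want to restrict which tokens are touched.

```python
import html
import re

# Matches named (&pound;), decimal (&#163;) and hexadecimal (&#xa3;)
# references. This pattern is an assumption chosen for the sketch.
ENTITY_RE = re.compile(r"&#?\w+;")


def decode_page(page):
    """Replace every entity-like token in page with its decoded character."""
    return ENTITY_RE.sub(lambda match: html.unescape(match.group(0)), page)


print(decode_page("&pound;682m &amp; &#163;5"))  # £682m & £5
```

Unlike the findall/replace loop, this walks the text once and never rescans text it has already decoded.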

User · Answer

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>

Beautiful Soup 4

    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>
