[python] How do I perform HTML decoding/encoding using Python/Django?

I have a string that is HTML encoded:

'''<img class="size-medium wp-image-113"\
 style="margin-left: 15px;" title="su1"\
 src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg"\
 alt="" width="300" height="194" />'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" /> 

I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup, it "scans" a web-page and gets certain content from it, then returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?

Related

This question is related to python django html-encode

The answer is


If anyone is looking for a simple way to do this via the django templates, you can always use filters like this:

<html>
{{ node.description|safe }}
</html>

I had some data coming from a vendor and everything I posted had html tags actually written on the rendered page as if you were looking at the source.


See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.


Use daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML :

http://www.crummy.com/software/BeautifulSoup/

for your question, there's an example in their documentation

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

For html encoding, there's cgi.escape from the standard library:

>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For html decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.


Daniel's comment as an answer:

"escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"


For html encoding, there's cgi.escape from the standard library:

>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For html decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.


In Python 3.4+:

import html

html.unescape(your_string)

I found a fine function at: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

I found this in the Cheetah source code (here)

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

not sure why they reverse the list, I think it has to do with the way they encode, so with you it may not need to be reversed. Also if I were you I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

i noticed your title asked for encode too, so here is Cheetah's encode function.

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

This is the easiest solution for this problem -

{% autoescape on %}
   {{ body }}
{% endautoescape %}

From this page.


I found a fine function at: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

Use daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML :

http://www.crummy.com/software/BeautifulSoup/

for your question, there's an example in their documentation

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

For html encoding, there's cgi.escape from the standard library:

>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For html decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.


If anyone is looking for a simple way to do this via the django templates, you can always use filters like this:

<html>
{{ node.description|safe }}
</html>

I had some data coming from a vendor and everything I posted had html tags actually written on the rendered page as if you were looking at the source.


Searching the simplest solution of this question in Django and Python I found you can use builtin theirs functions to escape/unescape html code.

Example

I saved your html code in scraped_html and clean_html:

scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)

Django

You need Django >= 1.0

unescape

To unescape your scraped html code you can use django.utils.text.unescape_entities which:

Convert all named and numeric character references to the corresponding unicode characters.

>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True

escape

To escape your clean html code you can use django.utils.html.escape which:

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True

Python

You need Python >= 3.4

unescape

To unescape your scraped html code you can use html.unescape which:

Convert all named and numeric character references (e.g. &gt;, &#62;, &x3e;) in the string s to the corresponding unicode characters.

>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True

escape

To escape your clean html code you can use html.escape which:

Convert the characters &, < and > in string s to HTML-safe sequences.

>>> from html import escape
>>> scraped_html == escape(clean_html)
True

With the standard library:

  • HTML Escape

    try:
        from html import escape  # python 3.x
    except ImportError:
        from cgi import escape  # python 2.x
    
    print(escape("<"))
    
  • HTML Unescape

    try:
        from html import unescape  # python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # python 3.x (<3.4)
        except ImportError:
            from HTMLParser import HTMLParser  # python 2.x
        unescape = HTMLParser().unescape
    
    print(unescape("&gt;"))
    

Even though this is a really old question, this may work.

Django 1.5.5

In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

You can also use django.utils.html.escape

from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])

I found this in the Cheetah source code (here)

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

not sure why they reverse the list, I think it has to do with the way they encode, so with you it may not need to be reversed. Also if I were you I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

i noticed your title asked for encode too, so here is Cheetah's encode function.

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

Searching the simplest solution of this question in Django and Python I found you can use builtin theirs functions to escape/unescape html code.

Example

I saved your html code in scraped_html and clean_html:

scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)

Django

You need Django >= 1.0

unescape

To unescape your scraped html code you can use django.utils.text.unescape_entities which:

Convert all named and numeric character references to the corresponding unicode characters.

>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True

escape

To escape your clean html code you can use django.utils.html.escape which:

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True

Python

You need Python >= 3.4

unescape

To unescape your scraped html code you can use html.unescape which:

Convert all named and numeric character references (e.g. &gt;, &#62;, &x3e;) in the string s to the corresponding unicode characters.

>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True

escape

To escape your clean html code you can use html.escape which:

Convert the characters &, < and > in string s to HTML-safe sequences.

>>> from html import escape
>>> scraped_html == escape(clean_html)
True

You can also use django.utils.html.escape

from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])

With the standard library:

  • HTML Escape

    try:
        from html import escape  # python 3.x
    except ImportError:
        from cgi import escape  # python 2.x
    
    print(escape("<"))
    
  • HTML Unescape

    try:
        from html import unescape  # python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # python 3.x (<3.4)
        except ImportError:
            from HTMLParser import HTMLParser  # python 2.x
        unescape = HTMLParser().unescape
    
    print(unescape("&gt;"))
    

Below is a python function that uses module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete and it assumes that all entities decode to one codepoint which is wrong for entities like &NotEqualTilde;:

http://www.w3.org/TR/html5/named-character-references.html

NotEqualTilde;     U+02242 U+00338    ??

With those caveats though, here's the code.

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#10;' -> unichr(10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#x10;' -> unichr(0x10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"

See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.


I found this in the Cheetah source code (here)

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

not sure why they reverse the list, I think it has to do with the way they encode, so with you it may not need to be reversed. Also if I were you I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

i noticed your title asked for encode too, so here is Cheetah's encode function.

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

Daniel's comment as an answer:

"escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"


I found this in the Cheetah source code (here)

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

not sure why they reverse the list, I think it has to do with the way they encode, so with you it may not need to be reversed. Also if I were you I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

i noticed your title asked for encode too, so here is Cheetah's encode function.

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.


This is the easiest solution for this problem -

{% autoescape on %}
   {{ body }}
{% endautoescape %}

From this page.


Even though this is a really old question, this may work.

Django 1.5.5

In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

In Python 3.4+:

import html

html.unescape(your_string)

Use daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML :

http://www.crummy.com/software/BeautifulSoup/

for your question, there's an example in their documentation

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

See at the bottom of this page at Python wiki, there are at least 2 options to "unescape" html.


For html encoding, there's cgi.escape from the standard library:

>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For html decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to django

How to fix error "ERROR: Command errored out with exit status 1: python." when trying to install django-heroku using pip Pylint "unresolved import" error in Visual Studio Code Is it better to use path() or url() in urls.py for django 2.0? Unable to import path from django.urls Error loading MySQLdb Module 'Did you install mysqlclient or MySQL-python?' ImportError: Couldn't import Django Django - Reverse for '' not found. '' is not a valid view function or pattern name Class has no objects member Getting TypeError: __init__() missing 1 required positional argument: 'on_delete' when trying to add parent table after child table with entries How to switch Python versions in Terminal?

Examples related to html-encode

Which characters need to be escaped in HTML? How to encode the plus (+) symbol in a URL Display encoded html with razor Transmitting newline character "\n" Html encode in PHP HtmlSpecialChars equivalent in Javascript? HtmlEncode from Class Library How to remove html special chars? How do I perform HTML decoding/encoding using Python/Django?