This Python reference offers programmers a quick way to learn Python and also serves as a source for reminders.
While the version documented here is Python 3.5.3, most of this is suitable for other versions of Python 3. Check your version for details.
The two functions in the html module,
html.escape() and html.unescape(), take a standard
Python string data type and convert characters with special meaning
to and from HTML.
The functionality of html works hand-in-hand with two submodules: html.parser, which parses HTML documents, and html.entities, which contains four dictionaries of HTML entity definitions.
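As a brief illustration of the four dictionaries in html.entities, here is a minimal sketch that looks up one entry in each. The variable names are only for illustration.

```python
import html.entities

# name2codepoint maps an entity name to its Unicode code point.
amp_code = html.entities.name2codepoint["amp"]

# codepoint2name is the reverse mapping, from code point to entity name.
amp_name = html.entities.codepoint2name[38]

# html5 maps HTML5 named character references to their characters.
lt_char = html.entities.html5["lt;"]

# entitydefs maps an entity name to its replacement text.
gt_text = html.entities.entitydefs["gt"]

print(amp_code, amp_name, lt_char, gt_text)
```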
Here we assume the most basic import scenario without aliasing. See our reference document on importing modules for more.
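A minimal sketch of that basic import, with the function reached through the module name. The sample string and the variable name result are only for illustration.

```python
import html  # plain import, no alias

# Functions are called through the module name.
result = html.escape("a < b")
print(result)
```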
The main purpose for assignment is to create a string object that can be processed by the two functions in the html module.
Below is syntax for creating a string object with assignment followed by examples.
Assignment to a string using the formal approach with the built-in str() function. If left blank, the default object='' creates an empty string object.
Assignment to a string using the shortcut, normally enclosed in single quote (') or double quote (") symbols.
Here we demonstrate how to create an HTML string object labeled
x using the long form and short form.
Note how in both examples a single quote (') or double quote (") is required.
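The two forms of assignment might look like the sketch below. The HTML content and the name empty are assumptions chosen for illustration.

```python
# Long form: the built-in str() function.
x = str('<p>Hello, World!</p>')
print(x)

# Short form: a string literal in single or double quotes.
x = "<p>Hello, World!</p>"
print(x)

# Calling str() with no argument gives an empty string.
empty = str()
print(repr(empty))
```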
html functions can
be called from Python scripts or the Python Interpreter.
The first function, html.escape(),
replaces three characters that have special meaning in HTML, specifically
&, < and >,
with the three HTML entities
&amp;, &lt; and
&gt;, respectively. It can also
convert single quotes and, more commonly, double quotes, which in
HTML are customarily used to enclose attribute values within tags.
The second function, html.unescape(),
converts HTML entities and numeric character references to
Unicode characters, meaning it converts back the other way, from escaped text to HTML.
The string is a required parameter and by default quote=True will also replace single quote (') and double quote (") symbols.
Converts named and numeric character references in string to Unicode characters. So it reverses the conversion.
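The sketch below exercises both signatures on a short sample string. The sample text and variable names are assumptions for illustration.

```python
import html

text = '<a href="page.html">Tom & Jerry\'s</a>'

# With the default quote=True, both quote characters are replaced too.
escaped = html.escape(text)
print(escaped)

# Named, decimal and hexadecimal references all convert back to characters.
named = html.unescape("&gt;")
decimal = html.unescape("&#62;")
hexadecimal = html.unescape("&#x3e;")
print(named, decimal, hexadecimal)
```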
A typical use case for the
html.escape() function is to display
markup that you don't want the browser to render. For example, you may want
to show a tag literally on the page
instead of having the browser interpret it.
For the two examples below, let's create a very basic HTML document
to work with, and save it in the current working directory as
As a side note, when I created this web page I used the
html.escape() function because you are
reading this as HTML and I didn't want the browser to interpret the
tags as HTML, so I converted it to use HTML entities instead.
Now, let's open that HTML file and read it as a string object,
which is required for the
html.escape() function to work. After that we'll perform the actual escape.
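A minimal sketch of these steps follows. The file name example.html and the document content are hypothetical. To keep the sketch self-contained it first writes the file to a temporary directory rather than the current working directory.

```python
import html
import os
import tempfile

# Hypothetical document content and file name, written to a temporary
# directory so the sketch runs on its own.
html_doc = (
    '<html>\n'
    '  <body>\n'
    '    <p class="intro">Hello & welcome</p>\n'
    '  </body>\n'
    '</html>\n'
)
path = os.path.join(tempfile.mkdtemp(), "example.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html_doc)

# Read the whole file into a single string object.
with open(path, encoding="utf-8") as f:
    html_original = f.read()

# Perform the escape; quote=True is the default.
html_escaped = html.escape(html_original)
print(html_escaped)
```

Typing html_escaped at the interpreter prompt, instead of calling print(), shows the string's representation with \n in place of each line break.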
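A few things to note here. First, notice the
\n characters at the ends of lines. These are the newline characters already present in the file, which the interpreter displays as \n when it echoes the string; html.escape() leaves them unchanged.
Also, spaces used when indenting the HTML document were preserved.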
In the example below we use the same function but set the optional
quote parameter to False, thereby not
escaping the double quotes.
If you look closely, the double quotes are retained in the two locations where they were used in the original document.
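The effect of the quote parameter can be seen in a small side-by-side sketch. The sample string and variable names are assumptions for illustration.

```python
import html

snippet = '<p class="intro">It\'s "quoted"</p>'

# Default behaviour: quote characters are replaced as well.
with_quotes = html.escape(snippet)
print(with_quotes)

# quote=False: only &, < and > are replaced; quotes are left alone.
without_quotes = html.escape(snippet, quote=False)
print(without_quotes)
```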
Granted, this is a very short HTML document, but since
real-world pages are often far longer, this illustrates the
power of the function.
If you're curious about the mental gymnastics required to get these
HTML entities to show up properly in an HTML document,
I suggest reviewing the page
source in your browser.
The second function converts the named and numeric characters in HTML to Unicode.
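A round trip might look like the sketch below. The document content is a hypothetical stand-in, and html_unescaped is a name chosen for illustration.

```python
import html

# Hypothetical stand-in for the original HTML document.
html_original = '<p class="intro">Fish & chips</p>\n'

# Escape the original, then unescape the result.
html_escaped = html.escape(html_original)
html_unescaped = html.unescape(html_escaped)

print(html_escaped)
print(html_unescaped)
print(html_unescaped == html_original)
```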
This may seem confusing at first. Here we converted the original HTML
with html.escape(), naming it
html_escaped. Then we converted that
resulting string object back to the original form using the
html.unescape() function.
A use case here is if someone sends you an escaped HTML document and you need to create a working HTML document from it.
Keep in mind, there are no methods in the html module
itself; however, behind the scenes, html.escape() relies on the
replace() method that can be applied to any string object
(in CPython, html.unescape() uses a regular-expression substitution instead).
The method syntax examples below assume that the string object has
A common use case with HTML strings, especially in search contexts, is to transform all uppercase characters to lowercase using this string method.
The old and new parameters are required and are supplied as strings (' ', " ", ''' ''', or """ """). The optional parameter count will make that number of replacements, from start to finish.
These methods, while not part of the html module, are mentioned here so you can gain a better understanding of string objects in Python, and as background for our next example.
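Both string methods can be tried on a short sample, as in this sketch. The string and variable names are assumptions for illustration.

```python
x = "<P>A & B & C</P>"

# lower() returns a copy with all uppercase characters in lowercase.
lowered = x.lower()

# replace(old, new) replaces every occurrence of old with new.
replaced_all = x.replace("&", "&amp;")

# With count=1, only the first occurrence is replaced.
replaced_once = x.replace("&", "&amp;", 1)

print(lowered)
print(replaced_all)
print(replaced_once)
```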
Here we create an instance of an HTML string object with tags and
text capitalized. Using the
lower() method for string objects we can transform it to lowercase before
converting it with the
function described above.
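A minimal sketch of this sequence, assuming the conversion meant is html.escape(). The sample string and variable names are hypothetical.

```python
import html

# Hypothetical HTML string with tags and text in capitals.
x = "<H1>PYTHON HTML MODULE</H1>"

# Lowercase first, then escape (html.escape() is an assumption here).
x_lower = x.lower()
x_escaped = html.escape(x_lower)

print(x_lower)
print(x_escaped)
```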
A use case for this approach would be for a search application where text is changed to lowercase before it is indexed.
To access local help on the Python html module type
help('html') at the Python
Interpreter. Output for Python 3.5.3 looks like this.
In the end, the html module offers two convenient functions and a bridge to two related submodules in The Standard Library. First, the html.parser module which offers a way to parse HTML documents. Second is the html.entities module which includes the definitions of html entities and Unicode strings we saw earlier.
As you move forward to more advanced use cases like web scraping involving parsing HTML, other popular parsers come into view, like lxml, html5lib and the BeautifulSoup bs4 module.
Hopefully, this coverage of the html module has piqued your interest.