FactorPad
Faster Learning Tutorials

Syntax of the Python html module with examples

This simple module has just two functions, and it is worth learning because it sits alongside two very useful submodules: html.parser and html.entities.
  1. About - Review the purpose, rules and location of the html module.
  2. Assignment - Construct a string object with examples.
  3. Functions - Learn html function syntax with examples.
  4. Methods - See how string methods can help with html.
  5. Help - Find additional information locally.
Paul Alan Davis, CFA, December 6, 2018
Updated: December 6, 2018
This convenient module takes a string and converts the characters that have special meaning in HTML. See how it beats hand-editing.



A Guide to Working with HTML documents in Python

Beginner

Python Reference

This Python reference offers programmers a quick way to learn Python and also serves as a source for reminders.

While the version documented here is Python 3.5.3, most of this is suitable for other versions of Python 3. Check your version for details.

Outline

  1. About the Python html Module
    1. How to access the html module
  2. Python html string Assignment
    1. Syntax
    2. Examples
  3. Python html Functions
    1. Syntax
    2. Examples
  4. Python html string Methods
    1. Syntax
    2. Examples
  5. Find Local Help on Python html

1. About the Python html Module

The two functions in the html module, html.escape() and html.unescape(), take a standard Python string and convert characters with special meaning in HTML to and from their entity form.

The html module works hand-in-hand with two submodules: html.parser, which parses HTML documents, and html.entities, which contains four dictionaries of HTML entity definitions (name2codepoint, codepoint2name, entitydefs and html5).
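As a quick illustration of the four dictionaries in html.entities, the sketch below looks up one entry in each. The variable names are only for illustration.

```python
import html.entities

# name2codepoint maps an entity name to its Unicode code point.
amp_codepoint = html.entities.name2codepoint["amp"]

# codepoint2name is the reverse mapping, from code point to entity name.
amp_name = html.entities.codepoint2name[38]

# entitydefs maps an entity name to its replacement character.
lt_char = html.entities.entitydefs["lt"]

# html5 maps HTML5 named character references to characters;
# its keys normally include the trailing semicolon.
gt_char = html.entities.html5["gt;"]

print(amp_codepoint, amp_name, lt_char, gt_char)   # 38 amp < >
```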

a. How to access the html module
  • Module name - html
  • Local source code (Linux) - /usr/lib/python3.5/html/__init__.py
  • How to import module - import html
  • Explicit function call - html.escape()

Here we assume the most basic import scenario without aliasing. See our reference document on importing modules for more.

2. Python html string Assignment

The main purpose for assignment is to create a string object that can be processed by the two functions in the html module.

Below is syntax for creating a string object with assignment followed by examples.

a. Assignment syntax
Syntax: x = str(object='')
Priority: High
Assignment to a string using the formal approach with the built-in str() function. If left blank, the default object='' will create an empty string object.

Syntax: x = 'string'
Priority: High
Assignment to a string using the shortcut, normally enclosed in single quote (') or double quote (") symbols.

b. Assignment examples

Here we demonstrate how to create an HTML string object labeled x using the long form and short form.

# An HTML string using the built-in str() function
>>> x = str("<html><head><title>MyHTML</title></head></html>")
>>> type(x)
<class 'str'>

# An HTML string using the '' shortcut, could also use ""
>>> x = '<html><head><title>MyHTML</title></head></html>'
>>> type(x)
<class 'str'>

Note that both forms require the text to be enclosed in a matching pair of single quote (') or double quote (") symbols.
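The choice of quote style starts to matter once the HTML itself contains quotes, as attribute values usually do. Here is a small sketch of the options, with illustrative variable names a, b and c.

```python
# Single quotes outside let double quotes appear inside as-is.
a = '<html lang="en"></html>'

# Double quotes outside need a backslash before each inner double quote.
b = "<html lang=\"en\"></html>"

# Triple quotes allow an HTML string that spans several lines.
c = """<html lang="en">
</html>"""

print(a == b)           # True
print(c.splitlines())   # ['<html lang="en">', '</html>']
```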

3. Python html Functions

The Python html functions can be called from Python scripts or the Python Interpreter.

The first function, html.escape(), replaces three characters that have special meaning in HTML, specifically &, < and >, with the HTML entities &amp;, &lt; and &gt;, respectively. By default it also converts double quotes and single quotes, which HTML customarily uses to delimit attribute values within tags.

The second function, html.unescape(), goes the other way: it converts named entities and numeric character references back to the Unicode characters they stand for, for example &lt; back to <.
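To make the two directions concrete, here is a minimal sketch. The sample text is made up for illustration.

```python
import html

text = 'Tom & Jerry <"quoted"> it\'s'

# escape() turns the special characters into entities.
escaped = html.escape(text)
print(escaped)
# Tom &amp; Jerry &lt;&quot;quoted&quot;&gt; it&#x27;s

# unescape() accepts named, decimal and hexadecimal references.
decoded = html.unescape("&gt; &#62; &#x3e;")
print(decoded)   # > > >

# Unescaping the escaped text gives back the original string.
print(html.unescape(escaped) == text)   # True
```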

a. Function syntax
Syntax: html.escape(string, quote=True)
Priority: High
The string is a required parameter. By default, quote=True will also replace the double quote (") and single quote (') symbols.

Syntax: html.unescape(string)
Priority: Mid
Converts named and numeric character references in string to Unicode characters, reversing the conversion made by html.escape().

A typical use case for html.escape() is output that you want the browser to display rather than render. For example, you may want to show the tag <br> itself instead of having the browser interpret it as a line break.
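A minimal sketch of that use case follows, assuming a hypothetical sentence that should appear literally inside a paragraph on a page.

```python
import html

# Hypothetical text that should be shown, not interpreted.
sample = "Use <br> to force a line break."

# Escape the text first, then place it inside the trusted markup.
paragraph = "<p>{}</p>".format(html.escape(sample))
print(paragraph)
# <p>Use &lt;br&gt; to force a line break.</p>
```

Only the inserted text is escaped here; the surrounding p tags are left alone so the browser still treats them as markup.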

b. Function examples

For the two examples below, let's create a very basic HTML document to work with, and save it in the current working directory as template.html.

<!DOCTYPE html>
<html lang="en">
 <head>
 <meta charset="UTF-8">
 <title>MyHTML</title>
 </head>
</html>

As a side note, I used html.escape() when creating this web page. Because you are reading this as HTML, the tags above had to be converted to HTML entities so the browser would display them instead of interpreting them.

The html.escape() function

Now let's open that HTML file and read it into a string object, which html.escape() requires. After that we'll perform the actual escape.

>>> import html
>>> html_file = open("template.html")   # opens file object
>>> type(html_file)
<class '_io.TextIOWrapper'>
>>> html_string = html_file.read()      # create a string object
>>> type(html_string)
<class 'str'>
>>> html_file.close()                   # close the file
>>> html_string                         # display the string
'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="UTF-8">\n <title>MyHTML</title>\n </head>\n</html>\n'
>>> html.escape(html_string)            # escape HTML tags and quotes
'&lt;!DOCTYPE html&gt;\n&lt;html lang=&quot;en&quot;&gt;\n &lt;head&gt;\n &lt;meta charset=&quot;UTF-8&quot;&gt;\n &lt;title&gt;MyHTML&lt;/title&gt;\n &lt;/head&gt;\n&lt;/html&gt;\n'

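A few things to note here. First, the \n sequences are the newline characters read from the file. They appear in that form because the interpreter displays the string's representation, and html.escape() leaves them untouched. Also, the spaces used to indent the HTML document were preserved.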
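The open, read and close steps above can also be written with a with statement, which closes the file automatically. The sketch below writes a shortened stand-in for template.html to a temporary directory so it runs anywhere. The file contents and variable names are assumptions made for illustration.

```python
import html
import os
import tempfile

# Shortened stand-in for the template.html file used above.
source = '<!DOCTYPE html>\n<html lang="en">\n</html>\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "template.html")
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(source)

    # The file is closed as soon as the with block ends.
    with open(path, encoding="utf-8") as html_file:
        html_string = html_file.read()

print(repr(html_string))          # repr() shows newlines as \n
print(html.escape(html_string))   # print() shows real line breaks
```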

In the example below we use the same function but modify the optional parameter to False, thereby not escaping the double quotes.

>>> html_string                         # same string as above
'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="UTF-8">\n <title>MyHTML</title>\n </head>\n</html>\n'
>>> html.escape(html_string, False)     # escape HTML tags not quotes
'&lt;!DOCTYPE html&gt;\n&lt;html lang="en"&gt;\n &lt;head&gt;\n &lt;meta charset="UTF-8"&gt;\n &lt;title&gt;MyHTML&lt;/title&gt;\n &lt;/head&gt;\n&lt;/html&gt;\n'

Look closely and you'll see that the double quotes in the original document, around "en" and "UTF-8", were retained.
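Whether to escape quotes matters most when the text ends up inside an attribute value. The following sketch uses a hypothetical value containing a double quote to compare the two settings.

```python
import html

# Hypothetical value that will be placed inside an attribute.
value = 'x" onmouseover="alert(1)'

# Default quote=True: the double quotes become &quot;.
safe = '<input value="{}">'.format(html.escape(value))

# quote=False: the double quote ends the attribute early.
unsafe = '<input value="{}">'.format(html.escape(value, quote=False))

print(safe)     # <input value="x&quot; onmouseover=&quot;alert(1)">
print(unsafe)   # <input value="x" onmouseover="alert(1)">
```

Under these assumptions, leaving quote at its default is the more cautious choice whenever the destination might be an attribute.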

Granted, this is a very short HTML document, but real pages often run to hundreds or thousands of lines, and that is where html.escape() saves the most hand-editing.

If you're curious about the mental gymnastics required to get these HTML entities to show up properly in an HTML document, I suggest reviewing the page source in your browser.

The html.unescape() function

The second function converts the named and numeric character references in HTML to Unicode characters.

>>> html_string                                  # same string as above
'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="UTF-8">\n <title>MyHTML</title>\n </head>\n</html>\n'
>>> html_escaped = html.escape(html_string)      # create escaped object
>>> html_escaped
'&lt;!DOCTYPE html&gt;\n&lt;html lang=&quot;en&quot;&gt;\n &lt;head&gt;\n &lt;meta charset=&quot;UTF-8&quot;&gt;\n &lt;title&gt;MyHTML&lt;/title&gt;\n &lt;/head&gt;\n&lt;/html&gt;\n'
>>> html.unescape(html_escaped)                  # unescape the escaped object
'<!DOCTYPE html>\n<html lang="en">\n <head>\n <meta charset="UTF-8">\n <title>MyHTML</title>\n </head>\n</html>\n'

This may seem confusing at first. Here we converted the original HTML string using html.escape(), naming it html_escaped. Then we converted that resulting string object back to the original form using the html.unescape() function.

A use case here is if someone sends you an escaped HTML document and you need to create a working HTML document from it.
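For that use case, one pass of html.unescape() is normally what you want. The sketch below uses a hypothetical received string to show that a second pass would also decode entities that belong in the working document.

```python
import html

# Hypothetical escaped fragment received from someone else.
received = "&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;"

# One pass restores the working HTML, including its own &amp; entity.
once = html.unescape(received)
print(once)    # <p>Fish &amp; chips</p>

# A second pass goes too far and decodes that entity as well.
twice = html.unescape(once)
print(twice)   # <p>Fish & chips</p>
```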

4. Python html Methods

Keep in mind, there are no methods in the html module itself. Behind the scenes, however, html.escape() relies on the replace() method, which can be applied to any string object. (In CPython, html.unescape() uses a regular expression substitution instead.)
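To see how chained replace() calls can produce this kind of escaping, here is a rough sketch. The name escape_sketch is hypothetical, and the code approximates the idea rather than copying the library source.

```python
import html


def escape_sketch(s, quote=True):
    """Approximate html.escape() with chained str.replace() calls."""
    # The ampersand goes first; otherwise the & inside the entities
    # produced below would itself be escaped a second time.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
        s = s.replace("'", "&#x27;")
    return s


sample = '<a href="x">Q&A</a>'
print(escape_sketch(sample))
# &lt;a href=&quot;x&quot;&gt;Q&amp;A&lt;/a&gt;
print(escape_sketch(sample) == html.escape(sample))   # True
```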

a. Method syntax

The method syntax examples below assume that the string object has been named x.

Syntax: x.lower()
Priority: Mid
A common use case with HTML strings, especially in search contexts, is to transform all uppercase characters to lowercase using this string method.

Syntax: x.replace(old, new[, count])
Priority: Mid
The old and new parameters are required and are supplied as strings (' ', " ", ''' ''', or """ """). The optional parameter count limits the replacements to the first count occurrences, working from the start of the string.

These methods, while not part of the html module, are mentioned here so you can gain a better understanding of string objects in Python, and as background for our next example.
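Here is a brief sketch of both methods on a made-up string, including the effect of the optional count parameter.

```python
x = "<BR><BR><BR>"

lowered = x.lower()
replaced_all = x.replace("<BR>", "<br>")
replaced_one = x.replace("<BR>", "<br>", 1)   # count=1: first match only

print(lowered)        # <br><br><br>
print(replaced_all)   # <br><br><br>
print(replaced_one)   # <br><BR><BR>
print(x)              # <BR><BR><BR>  (the original string is unchanged)
```

Both methods return a new string, since string objects are immutable.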

b. Method examples

Here we create an instance of an HTML string object with tags and text capitalized. Using the lower() method for string objects we can transform it to lowercase before converting it with the html.escape() function described above.

# An HTML string with capitalized tags and text
>>> x = '<HTML><HEAD><TITLE>MyHTML</TITLE></HEAD></HTML>'
>>> type(x)
<class 'str'>

# Change all text to lowercase using the lower() string method
>>> y = x.lower()
>>> y
'<html><head><title>myhtml</title></head></html>'

A use case for this approach would be for a search application where text is changed to lowercase before it is indexed.
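A toy version of that search use case might look like the sketch below. The documents dictionary, the index and the search() helper are all hypothetical and stand in for a real indexing system.

```python
import html

# Hypothetical documents to index, keyed by an illustrative id.
documents = {
    1: "<TITLE>MyHTML</TITLE>",
    2: "<title>Other Page</title>",
}

# Lowercase each document once, at indexing time.
index = {doc_id: text.lower() for doc_id, text in documents.items()}


def search(term):
    """Return the ids of documents containing term, ignoring case."""
    needle = term.lower()
    return sorted(doc_id for doc_id, text in index.items() if needle in text)


print(search("MYHTML"))   # [1]
print(search("Title"))    # [1, 2]

# Escape a hit before displaying it inside another HTML page.
print(html.escape(index[1]))   # &lt;title&gt;myhtml&lt;/title&gt;
```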

5. Find Local Help on Python html

To access local help on the Python html module, type help('html') at the Python Interpreter. Output for Python 3.5.3 looks like this.

Help on package html:

NAME
    html - General functions for HTML manipulation.

MODULE REFERENCE
    https://docs.python.org/3.5/library/html

    The following documentation is automatically generated from the Python
    source file. It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations. When in doubt, consult the module reference at the
    location listed above.

PACKAGE CONTENTS
    entities
    parser

FUNCTIONS
    escape(s, quote=True)
        Replace special characters "&", "<" and ">" to HTML-safe sequences.
        If the optional flag quote is true (the default), the quotation mark
        characters, both double quote (") and single quote (') characters are
        translated.

    unescape(s)
        Convert all named and numeric references (e.g. &gt;, &#62;, &x3e;)
        in the string s to the corresponding unicode characters.
        This function uses the rules defined by the HTML 5 standard
        for both valid and invalid character references, and the list of
        HTML 5 named character references defined in html.entities.html5.

DATA
    __all__ = ['escape', 'unescape']

FILE
    /usr/lib/python3.5/html/__init__.py
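The same information can also be reached from code. This short sketch reads the attributes that help() draws on; the exact docstring text may differ between Python versions.

```python
import html

# The public names exported by the module.
public_names = html.__all__
print(public_names)          # ['escape', 'unescape']

# Module and function docstrings, as shown by help().
print(html.__doc__)
print(html.escape.__doc__)

# Location of the source file on this machine.
print(html.__file__)
```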

Summary

In the end, the html module offers two convenient functions and a bridge to two related submodules in the Standard Library. The first is html.parser, which offers a way to parse HTML documents. The second is html.entities, which holds the definitions of the HTML entities and the Unicode characters they map to, like the ones we saw earlier.
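As a taste of html.parser, the sketch below subclasses HTMLParser to pull the title text out of the sample string used earlier. The TitleParser class is a hypothetical name, and this is one simple way to use the parser rather than a recommended design.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append each one.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed("<html><head><title>MyHTML</title></head></html>")
print(parser.title)   # MyHTML
```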

As you move forward to more advanced use cases like web scraping involving parsing HTML, other popular parsers come into view, like lxml, html5lib and the BeautifulSoup bs4 module.

Hopefully, this coverage of the html module has piqued your interest.

