
February 3, 2011

Python "'ascii' codec can't decode byte" explained and how to solve it

In a previous post entitled Why you benefit from using UTF-8 Unicode everywhere in your web applications, I explained the benefits of using UTF-8 Unicode encoding everywhere in your applications, including a deep look at how character encodings work and the fragmented approaches that still exist to this day.

If you've worked with Python and processed any non-English characters, there's a high probability you've seen the error "'ascii' codec can't decode byte". In this post I'll explain why this error is so common and how to solve it.


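The first thing you need to know is that Python 2 (the version used throughout this post) uses ASCII as its default encoding. That's right, even with all the goodness Unicode/UTF-8 brings to the table, whenever Python has to convert a byte string to Unicode without being told which encoding to use, it assumes a meager 128 characters.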

This means that whenever an attempt is made to manipulate something that includes things like a British pound sign £, a French word with a cedilla ç or a Spanish word with accents á, é, í, ó, ú, you're likely to get the error "'ascii' codec can't decode byte".

After all, ASCII only defines 128 characters, so it has no way to represent these, and you get this nice and informative (*gasp*) error: "'ascii' codec can't decode byte". (The post Why you benefit from using UTF-8 Unicode everywhere in your web applications contains more details on ASCII and its limitations.)
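As a minimal sketch of that limit, the snippet below decodes the two UTF-8 bytes of a pound sign with the ASCII codec explicitly, which is the same conversion Python 2 attempts implicitly. The variable names are illustrative only, and the b'' prefix is there so the same lines behave identically on Python 2.6+ and Python 3.

```python
# UTF-8 encodes the pound sign (U+00A3) as two bytes, both above 127.
pound_bytes = b'\xc2\xa3'

try:
    pound_bytes.decode('ascii')
    decode_failed = False
except UnicodeDecodeError as exc:
    # ASCII only defines byte values 0-127, so 0xc2 is rejected.
    decode_failed = True
    reason = exc.reason

# Bytes that stay within the ASCII range decode without any trouble.
plain_text = b'pound'.decode('ascii')
```

The reason attribute carries the same "ordinal not in range(128)" text that appears at the end of the error message.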

To confirm the default encoding for yourself, open up a Python interpreter and type in the following commands:

Listing 1.1 Python defaults to ASCII encoding
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Now that you've confirmed this, I'll explain where this default can be changed so you can avoid all those pesky "'ascii' codec can't decode byte" errors, and also why you shouldn't change it.

The default encoding in Python is configured in the site.py file that ships with the Python standard library. On Linux/Unix systems this would be under a directory like /usr/lib/python<version>/ and on Windows systems under a directory like C:\Python<version>\Lib\. Inside this file you'll find a function called setencoding and a variable defined as encoding = "ascii".
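If you are unsure where site.py lives on a particular machine, one way to find it (a small sketch, not the only approach) is to ask the site module itself. The names site_path and stdlib_dir are illustrative, and getattr is used as a guard on the assumption that some builds may not set __file__.

```python
import os
import site

# site.__file__ points at the site module the running interpreter loaded
# (on Python 2 this may be the compiled site.pyc next to site.py).
site_path = getattr(site, '__file__', None)
stdlib_dir = os.path.dirname(site_path) if site_path else None

print(site_path)
print(stdlib_dir)
```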

Great, so all you need to do is call sys.setdefaultencoding every time you're about to process some special characters. Go ahead and try it:

Listing 1.2 Python doesn't allow calling setdefaultencoding at run-time
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'

So what happened here? It turns out you can't set a default encoding at run-time in Python. This leaves you with two choices. The quick choice is to modify the encoding value inside site.py to 'utf-8' or whatever other encoding you expect to process. The lengthier choice is to address this ASCII encoding issue head on.
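The AttributeError comes from site.py itself: in Python 2 it deletes sys.setdefaultencoding from the sys module once start-up has finished, so the function is gone by the time your code runs. The check below is a small sketch of how to confirm that from a normal session. On current Python 3 releases the function does not exist at all, so the check gives the same answer there.

```python
import sys

# In a normal session the attribute is not reachable, which is why the
# call in Listing 1.2 fails with AttributeError.
can_change_default = hasattr(sys, 'setdefaultencoding')

print(sys.getdefaultencoding())
print(can_change_default)
```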

There are many explanations of why setdefaultencoding is not available in Python and why Python uses ASCII as its default encoding.

You can enjoy reading the technical merits and criticisms of these default choices if you like, but frankly, with Guido van Rossum and the core Python developers having more than 200 years of combined experience developing programming languages, you can safely assume there is a very good reason for these defaults and that they are no oversight.

So how do you solve this "'ascii' codec can't decode byte" error without modifying site.py? You'll need to manually convert the characters that can't be handled in ASCII. Here's a walk-through of this conversion process.

Let's assume you have some content in a file or database encoded with Unicode/UTF-8 and you want to do some processing with it in a Python environment:

Listing 1.3. Unicode/UTF-8 file or database string with special characters
Art. 1º.  En los Estados Unidos Mexicanos todo individuo gozará de las garantías que otorga
 esta Constitución, las cuales no podrán restringirse, ni suspenderse, sino en los casos y con
 las condiciones que ella misma establece.

So you read the content into Python and place it in a variable called content, which you then use in a third-party Python library like BeautifulSoup or a framework like Django, both of which work with Unicode by default. We can replicate this behaviour by using Python's built-in unicode function, as illustrated next.

Listing 1.4. Converting content with special characters to Unicode with no prior decoding.
>>> unicode(content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 6: ordinal not in range(128)
# Let's see what the variable is actually holding
>>> content
'Art. 1\xc2\xba.  En los Estados Unidos Mexicanos todo individuo gozar\xc3\xa1 de las
 garant\xc3\xadas que otorga esta Constituci\xc3\xb3n, las cuales no podr\xc3\xa1n 
restringirse, ni suspenderse, sino en los casos y con las condiciones que ella misma 
establece.\n'
# And let's confirm the print output
>>> print content
Art. 1º.  En los Estados Unidos Mexicanos todo individuo gozará de las garantías que otorga 
esta Constitución, las cuales no podrán restringirse, ni suspenderse, sino en los casos y con 
las condiciones que ella misma establece.

So what's happening here? For one thing, the special characters are still there, as you can see by executing print on the content variable. But the most important aspect is what the variable actually stores: bytes, not characters. Look at how º is stored as the two bytes \xc2\xba and á as \xc3\xa1, and how the error message points at 0xc2, the first of the two bytes that represent º. In addition, the last part of the message, "ordinal not in range(128)", reflects the 128-character limit of ASCII (see Why you benefit from using UTF-8 Unicode everywhere in your web applications for more details on this limit).
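The pieces of that error message can also be inspected programmatically. The sketch below uses a shortened copy of the byte string from Listing 1.4 (the truncation and the variable names are mine) and reads the attributes that UnicodeDecodeError exposes.

```python
# Shortened copy of the byte string shown in Listing 1.4.
content = b'Art. 1\xc2\xba.  En los Estados Unidos Mexicanos'

# The ordinal indicator is one character (U+00BA) but two UTF-8 bytes.
ordinal_bytes = u'\xba'.encode('utf-8')

try:
    content.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed and where in the byte string.
failed_codec = error.encoding
failed_at = error.start
offending_byte = error.object[error.start:error.start + 1]
```

Here failed_at is 6 and offending_byte is the single byte 0xc2, matching "byte 0xc2 in position 6" in the traceback.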

Here it doesn't matter that your input was Unicode/UTF-8. The variable holds a plain byte string containing UTF-8 bytes, and since Python defaults to ASCII, it tries to interpret those bytes as ASCII. So on any attempt to convert them to unicode the interpreter balks, since ASCII doesn't define any byte value above 127.

Of course, if you changed the encoding value in site.py to 'utf-8' you would get rid of the issue, since Python would then expect UTF-8 and you would only need to provide it with UTF-8 input.

But how do you make it work with Python's default configuration? It's simple: since you already know which encoding the content is in, you just need to decode it, as shown in the next snippet.

Listing 1.5. Decoding content
# Decode the content using the encoding you know beforehand
>>> hasslefreecontent = content.decode('utf-8')
# NOTE the u' prefix, which indicates a unicode string, and the changed special character representations
>>> hasslefreecontent
u'Art. 1\xba.  En los Estados Unidos Mexicanos todo individuo gozar\xe1 de las garant\xedas que otorga
esta Constituci\xf3n, las cuales no podr\xe1n restringirse, ni suspenderse, sino en los casos y con
las condiciones que ella misma establece.\n'
# unicode() no longer raises an error (output omitted: it is the same u'...' string shown above)
>>> unicode(hasslefreecontent)
# And let's confirm the print output
>>> print hasslefreecontent
Art. 1º.  En los Estados Unidos Mexicanos todo individuo gozará de las garantías que otorga
esta Constitución, las cuales no podrán restringirse, ni suspenderse, sino en los casos y con
las condiciones que ella misma establece.
>>> type(hasslefreecontent)
<type 'unicode'>

Here you can see that a call to unicode() on the new variable -- which holds the decoded content -- works! No more "'ascii' codec can't decode byte". In addition, note that a call to print works as expected and the new variable is of the type unicode.
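When the content comes from a file, the same decoding step can happen while reading. The following is a minimal sketch that assumes the file is UTF-8. It uses io.open, which accepts an encoding argument on Python 2.6+ and Python 3, and a temporary file with made-up sample text standing in for the real content.

```python
import io
import os
import tempfile

# Hypothetical sample text standing in for the file or database content.
sample = u'gozar\xe1 de las garant\xedas'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # Write the text out as UTF-8 bytes.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(sample)

    # Binary mode returns the raw bytes, like the content variable above.
    with io.open(path, 'rb') as handle:
        raw = handle.read()

    # Text mode with an explicit encoding decodes while reading.
    with io.open(path, 'r', encoding='utf-8') as handle:
        text = handle.read()
finally:
    os.remove(path)

# Decoding the raw bytes by hand gives the same unicode string.
decoded_by_hand = raw.decode('utf-8')
```

Either route ends with a unicode string. The point is that the encoding is stated explicitly at the boundary instead of being left to the ASCII default.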

If for some reason you wanted to switch back to an ASCII representation, you could do it just as easily by encoding, which is the opposite of the decoding you just did. Since encoding is prone to the same difficulty of not being able to represent special characters, there are a series of error-handling options you can use, illustrated in the following example:

Listing 1.6. (Re) Encoding content
>>> hasslefreecontent.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position 6: ordinal not in range(128)
# Oops, now ascii didn't know what to do with those special unicode characters!
# Let's try some options
>>> hasslefreecontent.encode('ascii','ignore')
'Art. 1.  En los Estados Unidos Mexicanos todo individuo gozar de las garantas que otorga 
esta Constitucin, las cuales no podrn restringirse, ni suspenderse, sino en los casos y con 
las condiciones que ella misma establece.\n'
>>> hasslefreecontent.encode('ascii','replace')
'Art. 1?.  En los Estados Unidos Mexicanos todo individuo gozar? de las garant?as que otorga
 esta Constituci?n, las cuales no podr?n restringirse, ni suspenderse, sino en los casos y con 
las condiciones que ella misma establece.\n'
>>> hasslefreecontent.encode('ascii','xmlcharrefreplace')
'Art. 1&#186;.  En los Estados Unidos Mexicanos todo individuo gozar&#225; de las garant&#237;as que otorga 
esta Constituci&#243;n, las cuales no podr&#225;n restringirse, ni suspenderse, sino en los casos y con 
las condiciones que ella misma establece.\n'
>>> 

As you can see, if you try to encode Unicode to ASCII, Python also doesn't know what to do with special characters that ASCII doesn't support. As alternatives, you can opt to ignore these special characters -- in which case they are dropped from the output -- replace them -- in which case each one appears as a question mark -- or use the XML numeric character reference for each character -- in which case you'll see sequences like &#225;, which keep the content perfectly functional in browsers and XML parsers.
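The difference between the three handlers is easier to see on a single word. The sketch below applies each one to "garantías" (the variable names are illustrative) and then shows that only the character-reference form keeps enough information to get the original text back, here with a hand-written replace for this one character.

```python
word = u'garant\xedas'  # "garantias" with an accented i (U+00ED)

dropped = word.encode('ascii', 'ignore')
marked = word.encode('ascii', 'replace')
escaped = word.encode('ascii', 'xmlcharrefreplace')

# Only the numeric character reference can be turned back into the
# original text; the other two handlers lose information.
restored = escaped.decode('ascii').replace(u'&#237;', u'\xed')
```

With 'ignore' and 'replace' the accented character is gone for good, so they fit best where an approximate ASCII rendering is acceptable.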

So there you have it, two alternatives for solving the "'ascii' codec can't decode byte" insanity in Python.

You may also want to read Why you benefit from using UTF-8 Unicode everywhere in your web applications for more on encoding issues in general.

Posted by Daniel at February 3, 2011 11:47 AM