Primer on Character Encoding Settings with HTML/PHP

Authors:
  • Eric Goldman

8 May 2010   ::   Web Development   ::   #utf-8 #iso-8859-1 #mojibake #ascii #encoding #character set

 

Introduction: What are Character Sets and Character Encoding?

While humans perceive words as collections of letters, or as glyphs that represent whole words (e.g., Chinese), computers store everything, including letters, in binary. Every character is represented as a numeric value that is constant within a character set, and a character encoding defines how those values are written out as bytes. The most familiar simple character set is ASCII, which covers the standard English letters, digits, and common punctuation. Of course, other letters exist, as do diacritic marks, accents, and symbols (e.g., β, ©, Ü, ŏ). When text is saved to memory or storage, it is usually saved using one specific character encoding. Since the saved data is just bytes, it is possible to render text saved with one encoding as if it had been saved in another. This is usually unintended and results in corrupted display (mojibake); however, because the byte values of some character encodings overlap, the mistake may not always be noticeable.
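
As a minimal sketch of that mismatch, the Python snippet below encodes a short string as UTF-8 and then decodes the same bytes as ISO-8859-1. The sample strings are illustrative choices, not taken from any particular site. It also shows why the problem can go unnoticed: plain ASCII text is stored as the same bytes in both encodings.

```python
# Text containing one non-ASCII character.
text = "é"

# Save the text as UTF-8: the character becomes two bytes.
utf8_bytes = text.encode("utf-8")

# Read the same bytes back as if they were ISO-8859-1.
# Each byte is now treated as a separate character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # Ã©

# Pure ASCII text is stored as the same bytes in both encodings,
# so the wrong setting causes no visible damage here.
ascii_text = "plain English"
same = ascii_text.encode("utf-8").decode("iso-8859-1")
print(same == ascii_text)  # True
```

A page that is mostly English can therefore look fine for a long time and only break when the first accented character appears.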

ASCII is insufficient to cover all of the world's alphabets and glyphs. Unicode provides a set of standards that includes characters from many languages and allows for multi-byte character representations. In ASCII, all characters are of a uniform length: it is a 7-bit code, normally stored in one 8-bit byte. That leaves far too few possible values to store all the characters needed for global languages. Of the Unicode encodings, the most commonly used is UTF-8, which is the most frequently used Unicode character encoding on UNIX systems and the Internet.
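
The variable-length nature of UTF-8 can be illustrated with a short Python sketch. The sample characters and the helper name `fits_in_ascii` are chosen here for illustration only.

```python
def fits_in_ascii(character):
    """Return True if the character can be encoded as ASCII."""
    try:
        character.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


samples = ["A", "Ü", "β", "©", "中"]

# Number of bytes each character needs when saved as UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    print(ch, byte_lengths[ch], "byte(s) in UTF-8,",
          "ASCII" if fits_in_ascii(ch) else "not representable in ASCII")
```

Only the unaccented letter fits in ASCII. The others need two or three bytes each in UTF-8.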

Generally, your best bet is to render all pages in UTF-8 and to store all data in UTF-8. This involves making sure the proper settings are configured on your web server, on your database server, and in the HTML pages you render. It is an unfortunate but quite common scenario to have data saved in one character set and then displayed in another. You will notice this on many improperly configured websites, where you see errant characters that do not look like they belong. In some configurations the default encoding is ISO-8859-1, so make sure you explicitly set UTF-8.

Browser Considerations

By default, web browsers may not render your web pages in the encoding you desire. With most current and previous-generation browsers, dating back through the early 2000s, you are not locked in to one specific encoding. If you do not properly set the encoding on your web page, though, the browser may use its default setting or may simply try to guess, and it may of course guess wrong. It is also possible that everything is set up correctly on the client side (browser), but a server misconfiguration results in display errors. In almost all cases, if the page has been authored correctly and the server settings are correct, no special action should be necessary on the client side.

Web Page Considerations

On the actual page itself, you should ensure that the proper encoding has been declared. Without manually specifying the encoding, you are leaving things up to chance; it is much better to be explicit if you want to ensure proper rendering. For a manually created page, you can enter this in the head element yourself. Most CMSs also provide functionality to control the default encoding, so you can set it for each page either in the web GUI or through a configuration file. In the case of Joomla! 1.5, the default character encoding is UTF-8 (see the API here). For reference, the head element should contain a meta element like the following: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />. Setting this <meta /> element becomes increasingly important as your page is repurposed or converted by automatic translation services like Google Page Translation.
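
One way to check that a page declares what you expect is to parse it and read the declaration back. The following is a rough sketch using Python's standard `html.parser` module; the class name `MetaCharsetFinder` and the function `declared_charset` are hypothetical helpers written for this example, and the sketch handles only the two common `<meta>` forms.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first character encoding declared in a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first declaration counts; ignore everything else.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if "charset" in attributes:
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif attributes.get("http-equiv", "").lower() == "content-type":
            # Long form: content="text/html; charset=UTF-8"
            for part in attributes.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the encoding declared in the page, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head><body></body></html>')
print(declared_charset(page))
print(declared_charset("<html><head></head></html>"))
```

A result of `None` means the page leaves the choice to the server header or to the browser's guess.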

Web Server Considerations

You can ensure that all of your pages are served as UTF-8 using .htaccess. Be sure to place the .htaccess file at the correct level of the file hierarchy in case, for some reason, you need other encodings elsewhere on the site. The correct directive to set in your .htaccess file is: AddDefaultCharset UTF-8. You can find more details about the setting and what it means by visiting the Apache Documentation. This setting is sent in the HTTP header (e.g., Content-Type: text/html; charset=iso-8859-1), which is used to transmit the HTML page. While you may set the character encoding ("charset") in the HTML head element (e.g., <meta http-equiv="Content-Type" content="text/html; charset=utf-8">), the charset sent in the HTTP header generally takes precedence over that setting. For more information, please read this explanation from the W3C.
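
The precedence described above can be sketched in a few lines of Python. This is a simplified model, not a browser's actual algorithm: the function names are hypothetical, and the `fallback` value stands in for whatever a given browser would default to or guess.

```python
def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    # Parameters follow the media type, separated by semicolons.
    for parameter in header_value.split(";")[1:]:
        name, _, value = parameter.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip('"').lower()
    return None


def effective_charset(http_content_type, meta_charset, fallback="iso-8859-1"):
    """Pick the encoding: HTTP header first, then <meta>, then a fallback.

    The fallback is an assumption for this sketch; real browsers may
    use a locale-dependent default or try to detect the encoding.
    """
    header_charset = charset_from_content_type(http_content_type)
    if header_charset:
        return header_charset
    if meta_charset:
        return meta_charset.lower()
    return fallback


# The server header wins even though the page itself says UTF-8.
print(effective_charset("text/html; charset=iso-8859-1", "utf-8"))
# Without a charset in the header, the <meta> declaration is used.
print(effective_charset("text/html", "UTF-8"))
```

The first call models the common failure case: a correctly authored UTF-8 page served by a server that still announces ISO-8859-1.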

Database (MySQL) Considerations

In most common web CMS and application scenarios, you will be dealing with dynamic content or content stored in a database. The information stored in the database should of course be stored as UTF-8 to prevent any oddities later during rendering. Though it is possible to create content in one encoding, save it in another, and then display it properly in the original encoding, it is highly recommended to keep everything aligned. A mismatched encoding may also affect the behavior of some string functions, such as those that measure length or compare text. A common problem is moving from a development database server with one encoding to a server with a different encoding, which often results in data corruption. In MySQL the encoding can be set at the server level, the database level, the table level, and even at the column level. You should also take care with the "collation" settings in MySQL; a collation is a set of rules for comparing characters. In most cases, if you set the encoding to UTF-8 at the server level, it will also be the default at all the lower levels. These settings can be set in configuration files, on the MySQL command line, or through some web interfaces such as the most recent version of phpMyAdmin. Character encoding and collation settings are available in MySQL 4.1 and later, including MySQL 5. For more detailed information, you can visit the MySQL Manual.
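
The corruption that can occur during such a move is often "double encoding": UTF-8 bytes are misread as ISO-8859-1 and then converted to UTF-8 a second time. The Python sketch below simulates that sequence with plain byte strings (no database is involved, and the sample word is illustrative). It also shows that this particular damage can be reversed if it is caught before further changes are made.

```python
original = "café"

# Step 1: the application saves the text as UTF-8 bytes.
stored = original.encode("utf-8")

# Step 2: a misconfigured connection or import treats those bytes as
# ISO-8859-1 and "helpfully" converts them to UTF-8 again.
double_encoded = stored.decode("iso-8859-1").encode("utf-8")

# What a correctly configured UTF-8 page now displays.
print(double_encoded.decode("utf-8"))  # cafÃ©

# Character count and byte count no longer match what was typed.
print(len(original), len(stored), len(double_encoded))  # 4 5 7

# Reversing the exact steps recovers the original text.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

The differing lengths also hint at why length-based string functions can misbehave when the encoding is not what the code assumes.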

Conclusion

Don’t let all the hard work you put into writing your code and designing your pages go to waste. Business users and web page visitors do not see or care what is going on under the hood. Clients may think their browser is going crazy or that the page was hacked if they see funny characters, or they may complain that the site is not ready for prime time. End users and visitors may think the site is unprofessional or has a virus if they see odd characters. In some cases, encoding errors can also break the design or make the page content very difficult to read. Also, remember that Googlebot and the Bing spiders are just computer applications, and your keywords and SEO can be hurt if they cannot understand your page’s encoding. You should ensure that all your server and page settings are correct, and always test on multiple browsers and platforms.

Resources