Sunday, November 18, 2007

character encoding

Why worry about character encoding?

The character encoding associated with a Web page determines how a Web browser turns the page's bytes into the text it renders. To understand the concept, it helps to distinguish between a character encoding and a character set.

Dictionary.com defines a character set as a particular mapping between characters and byte strings (i.e., a set of characters required for a certain language). It is the combination of a particular coded character set and a particular character encoding. A coded character set is a set of characters in which each character has been assigned a unique number. A character encoding is how those numbers are mapped to bytes for storage and manipulation in a computer. To sum it up, the character encoding tells the Web browser how to convert the bytes it receives back into characters. Here are several reasons you should specify character encoding:

  • You should worry about character encoding since its declaration became a requirement with the HTML 4.01 specification.
  • If a character encoding is not specified in a Web page, the browser will guess at what encoding should be used to render Web page content. This guesswork can result in the wrong encoding scheme being used.
  • Browsers allow users to choose a default character encoding. This choice may not match the setting for a Web page.
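To make the guessing problem concrete, here is a minimal Python sketch of what happens when bytes written in one encoding are read in another. The sample word and variable names are my own choices for illustration; a real browser uses more elaborate heuristics than a single decode call.

```python
# The text the page author intended, stored on the server as UTF-8 bytes.
original = "caf\u00e9"  # "café"
raw_bytes = original.encode("utf-8")

# A browser that guesses correctly recovers the original text.
right_guess = raw_bytes.decode("utf-8")

# A browser that wrongly guesses ISO 8859-1 reads the two-byte "é" as two characters.
wrong_guess = raw_bytes.decode("iso-8859-1")

print(raw_bytes)    # b'caf\xc3\xa9'
print(right_guess)  # café
print(wrong_guess)  # cafÃ©
```

The bytes never change; only the interpretation does, which is why an explicit declaration matters.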

A Web page’s character encoding is declared either in the HTTP headers sent with the page or near the top of the page itself, as described later in this post.

What is available?

HTML uses Unicode as its document character set. Unicode aims to cover every writing system and has room for more than a million characters, including accented characters. Each character is assigned a unique number called a code point; how many bytes that number occupies depends on the encoding chosen, not on Unicode itself. This contrasts with the ASCII encoding popular in the United States, which defines only 128 characters and stores each one in a single byte.
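As a rough illustration, the Python sketch below prints the Unicode code point of a few characters and checks whether each number fits in 7-bit ASCII or in 16 bits. The helper name describe and the sample characters are hypothetical, picked only for this example.

```python
def describe(ch):
    """Return (code point label, fits in 7-bit ASCII, fits in 16 bits)."""
    cp = ord(ch)  # ord() gives the number Unicode assigns to the character
    return (f"U+{cp:04X}", cp <= 0x7F, cp <= 0xFFFF)

# Letter A, e with acute accent, euro sign, and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(describe(ch))

# ('U+0041', True, True)
# ('U+00E9', False, True)
# ('U+20AC', False, True)
# ('U+1F600', False, False)
```

The last sample shows that a code point is not limited to two bytes; the encoding decides how many bytes are actually written.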

Here is a sampling of available character encodings:

  • ISO 8859-1: This is the standard encoding of the Latin alphabet. Also known as Latin-1, it covers most Western European languages.
  • UTF-8 (8-bit UCS/Unicode Transformation Format): This character encoding is able to represent any character in the Unicode standard. A key feature is that its first 128 characters are encoded exactly as they are in ASCII, so UTF-8 is backwards compatible with ASCII.
  • UTF-16 (16-bit Unicode Transformation Format): This is a variable-length character encoding for Unicode that is capable of encoding every Unicode character.
  • US-ASCII: This 7-bit encoding covers the 128 characters of the ASCII standard. It is a subset of UTF-8, so every US-ASCII file is also a valid UTF-8 file.
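The differences between these encodings are easier to see side by side. The sketch below encodes one sample word with each of them using Python's built-in codecs; the word and the helper name encode_or_none are my own, and utf-16-le is used (little-endian, no byte order mark) so the output is the same on every machine.

```python
def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

text = "na\u00efve"  # "naïve"
for name in ["us-ascii", "iso-8859-1", "utf-8", "utf-16-le"]:
    print(name, encode_or_none(text, name))

# us-ascii None
# iso-8859-1 b'na\xefve'
# utf-8 b'na\xc3\xafve'
# utf-16-le b'n\x00a\x00\xef\x00v\x00e\x00'

# Plain ASCII text produces identical bytes in US-ASCII and UTF-8.
print("abc".encode("us-ascii") == "abc".encode("utf-8"))  # True
```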

A full listing of character encoding options is available online, but UTF-8 is the recommended and most popular encoding scheme used today.

Choosing a character encoding

The main issue when selecting a character encoding is to pick one that covers all the languages and requirements of the intended audience. This is critical for multilingual applications, whose languages may traditionally use different character encoding schemes.

When choosing a character encoding scheme, you must be aware of the characters that you will be using, along with the character encoding supported by the browser and any other applications that may be used to work with the files. The standards UTF-8 (which I stick with for my work) and US-ASCII are widely supported by browsers. You should do your research when working with standards other than these two.
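One way to check whether a candidate encoding covers your content is to try encoding a representative sample and list the characters that fail. This is a minimal sketch; the function name unsupported_characters and the sample sentence are hypothetical.

```python
def unsupported_characters(text, encoding):
    """Return the distinct characters in text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

sample = "Price: 10 \u20ac for the cr\u00e8me br\u00fbl\u00e9e"
for enc in ["us-ascii", "iso-8859-1", "utf-8"]:
    print(enc, unsupported_characters(sample, enc))

# us-ascii ['€', 'è', 'û', 'é']
# iso-8859-1 ['€']
# utf-8 []
```

An empty list for UTF-8 is expected for any text, which is one reason it is a safe default.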

Using a character encoding

When accessing a Web application, a Web browser checks the following sources, in order, to determine the page's character encoding:

  • The HTTP Content-Type header sent by the server is the default way to define character encoding. This is the preferred method, and it takes precedence over other items in this list. Here is an example of the Content-Type line sent as part of the HTTP header:
Content-Type: text/html; charset=utf-8

Web developers may specify the Content-Type header for a page via the syntax available to the developer. For example, an ASP.NET developer may use the following line:

<%Response.charset="utf-8"%>

A PHP developer may use this line:

header('Content-type: text/html; charset=utf-8');
  • XHTML documents may use the XML declaration in the first line of the page to specify character encoding. Here is one example:
<?xml version="1.0" encoding="UTF-8"?>
  • You can use the HTML/XHTML meta content-type element. It is placed inside the head section of the page, with the character encoding specified by the charset parameter of its content attribute.
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
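The order of checks in the list above can be sketched in Python as follows. This is a simplified illustration, not the algorithm any particular browser implements: it ignores byte order marks and user overrides, and the function name declared_encoding and its return labels are my own.

```python
import re

def declared_encoding(content_type_header, body):
    """Return (encoding, source), checking HTTP header, XML declaration, then meta.

    content_type_header: value of the Content-Type header, or None if absent.
    body: the raw bytes of the document.
    """
    if content_type_header:
        m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type_header, re.I)
        if m:
            return m.group(1).lower(), "http-header"

    # The declarations are plain ASCII, so ISO 8859-1 is a safe way to scan the bytes.
    head = body[:1024].decode("iso-8859-1")

    m = re.match(r'\s*<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']', head, re.I)
    if m:
        return m.group(1).lower(), "xml-declaration"

    m = re.search(r'<meta[^>]*\bcharset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).lower(), "meta-element"

    return None, "undeclared"

page = b'<html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head></html>'
print(declared_encoding("text/html; charset=iso-8859-1", page))  # ('iso-8859-1', 'http-header')
print(declared_encoding(None, page))                             # ('utf-8', 'meta-element')
```

The first call shows the HTTP header winning over a conflicting meta element, which is why the server configuration and the page markup should agree.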

CSS considerations

You may declare the encoding of external CSS style sheets. This step is not necessary with CSS embedded in a page, as the page’s character encoding takes care of it. You may designate the character encoding for a CSS file by adding an @charset rule as the very first line of the file, before any other content. The following syntax is used:

@charset "utf-8";

In addition, the charset attribute of the link element may be used.
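If you want to check what a style sheet declares, a small script can read the rule from the start of the file. The sketch below assumes the rule is written exactly as shown above, at the very beginning of the file; the function name css_declared_charset is hypothetical.

```python
import re

def css_declared_charset(css_bytes):
    """Return the encoding named by a leading @charset rule, or None."""
    # The rule only counts when it is the first thing in the file.
    m = re.match(rb'@charset "([A-Za-z0-9._:-]+)";', css_bytes)
    return m.group(1).decode("ascii").lower() if m else None

print(css_declared_charset(b'@charset "utf-8";\nbody { margin: 0; }'))  # utf-8
print(css_declared_charset(b'body { margin: 0; }\n@charset "utf-8";'))  # None
```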


