What you need to know about PHP’s internal character encoding

Dealing with strings is an essential part of any web application and PHP is no different. Every developer, though, should be aware of character encoding.

Character encoding, in the context of a web page, determines what characters are available and how they are stored. The number of characters available in an encoding is determined by the number of bits/bytes used to store each character.

Encoding

There are many encodings available. An early Latin-based one is ASCII, which uses just 7 bits per character, giving 128 available characters (2 ^ 7 = 128).

In the early days of the Internet, character sets such as the Western Latin ISO/IEC 8859-1, also known as latin1, were common. These use 8 bits (a single byte) per character, giving 256 available characters (2 ^ 8 = 256).

256 characters might sound like a lot, but these include control characters such as new lines and tabs. This can become limiting, especially when dealing with international character sets. One way round this is to use HTML entities for characters not included in your encoding. However, life is easier with an encoding that covers far more than 256 characters.
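As a rough illustration of the HTML entity workaround, the sketch below converts a pound sign into its named entity and back again using PHP's htmlentities and html_entity_decode. The variable names are only for illustration, and the "\u{00A3}" escape (PHP 7 or later) is used so the result does not depend on how the source file is saved.

```php
<?php
// "\u{00A3}" is the pound sign, written as a Unicode escape (PHP 7+).
$price = "\u{00A3}100";

// Replace characters that have a named HTML entity with that entity.
$encoded = htmlentities($price, ENT_QUOTES, 'UTF-8');
echo $encoded, PHP_EOL; // &pound;100

// Decoding restores the original UTF-8 string.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $price); // bool(true)
```

The entity form is pure ASCII, so it survives in a page served with a single-byte encoding, at the cost of being harder to read and process.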

Unicode/UTF

Unicode was developed to create a standard to represent most characters in the world’s writing systems. Unicode itself is not an encoding. Rather, it is implemented in the UTF-8, UTF-16 and UTF-32 encodings.

UTF-8 and UTF-16 use a variable byte width to store characters (1–4 bytes and 2 or 4 bytes respectively), whereas UTF-32 always uses four bytes.
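The byte widths can be checked with a short script. This is a minimal sketch that assumes the mbstring extension is enabled; the sample characters and array names are my own choices, and the big-endian variants (UTF-16BE, UTF-32BE) are used so that no byte order mark affects the count.

```php
<?php
// Requires the mbstring extension. The "\u{...}" escapes (PHP 7+) produce
// UTF-8 bytes regardless of how this file itself is saved.
$samples = [
    'pound sign U+00A3' => "\u{00A3}",
    'emoji U+1F600'     => "\u{1F600}",
];

$widths = [];
foreach ($samples as $label => $char) {
    foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
        // Re-encode the single character and count the bytes it occupies.
        $encoded = mb_convert_encoding($char, $encoding, 'UTF-8');
        $widths[$label][$encoding] = strlen($encoded);
        printf("%-18s %-8s %d byte(s)\n", $label, $encoding, strlen($encoded));
    }
}
```

The pound sign takes 2, 2 and 4 bytes in UTF-8, UTF-16 and UTF-32 respectively, while the emoji takes 4 bytes in all three.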

UTF-8 has become the most widespread encoding on websites and is the one I recommend you use. ASCII characters are stored in just one byte, and accented Latin letters and several other common alphabets such as Greek and Cyrillic take two bytes per character. Most other characters take three bytes, and a few (emoji, for example) take four.
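To get a feel for how UTF-8 sizes different characters, the following sketch counts the bytes in a handful of them. The characters chosen and the variable names are illustrative only.

```php
<?php
// Each value is a single character; the "\u{...}" escapes need PHP 7+.
$characters = [
    'letter A' => 'A',          // ASCII
    'pound'    => "\u{00A3}",   // Latin-1 Supplement
    'euro'     => "\u{20AC}",   // Currency Symbols
    'emoji'    => "\u{1F600}",  // outside the Basic Multilingual Plane
];

$byteCounts = [];
foreach ($characters as $name => $char) {
    // strlen counts bytes, which is exactly what we want to see here.
    $byteCounts[$name] = strlen($char);
    printf("%-8s %d byte(s)\n", $name, $byteCounts[$name]);
}
```

This prints 1, 2, 3 and 4 bytes in turn.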

Why you should know about encoding and PHP

PHP’s standard set of string functions (the ones beginning with str) assume a string contains one byte per character. If you’re using an encoding with variable byte size like UTF-8 this can be problematic when you perform a string function on a string that contains a character stored using more than one byte.

echo strlen('£100');

This should output 4 but outputs 5 because the pound sign takes two bytes in UTF-8 (assuming the source file is saved as UTF-8). Imagine you are checking that a password meets a minimum length: if the user includes a character that takes more than one byte, the byte count overstates the length and you will accept a password that is too short.

Thankfully, provided the mbstring extension is enabled, PHP comes with a series of multibyte string functions that count characters in the context of a specific encoding. They work the same as their single-byte counterparts, only they are prefixed with mb_ and accept an optional encoding parameter.
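To see where the extra byte comes from, you can dump the string's bytes in hexadecimal. This is only a small sketch; the pound sign is written as a "\u{00A3}" escape (PHP 7 or later) so the bytes are UTF-8 however the file is saved.

```php
<?php
$price = "\u{00A3}100";

// Two hex digits per byte: c2 a3 is the pound sign, 31 30 30 is "100".
$hex = bin2hex($price);
echo $hex, PHP_EOL;            // c2a3313030

// strlen simply counts those five bytes.
echo strlen($price), PHP_EOL;  // 5
```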

Consider the above example again:

echo strlen('£100'); // Outputs 5, incorrect
echo '<br />';
echo mb_strlen('£100', 'UTF-8'); // Outputs 4, correct

Now it reports the string length correctly. It can be a pain passing the encoding each time, so you can set the internal encoding for multibyte functions globally:

mb_internal_encoding('UTF-8');

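Of course, this relies on any third-party libraries you use not changing the internal encoding to something different, so bear that in mind. Another thing to note is that not all str functions have an mb_ equivalent, and in some cases they don't need one. For example, there is no mb_str_replace: str_replace compares bytes, and in UTF-8 the bytes of one character never appear inside another character, so it works correctly on UTF-8 strings. Don't assume this of every function, though. strrev, for instance, has no mb_ equivalent and will scramble any multibyte characters it reverses.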
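If you would rather not depend on the global setting, one option is to pass the encoding explicitly wherever it matters. The sketch below applies that to the password check mentioned earlier. isPasswordLongEnough is a hypothetical helper invented for this example, not a PHP function, and the minimum length of 8 is an arbitrary assumption.

```php
<?php
/**
 * Hypothetical helper: true when $password has at least $minLength characters.
 */
function isPasswordLongEnough(string $password, int $minLength = 8): bool
{
    // Pass the encoding explicitly so the result does not depend on
    // whatever mb_internal_encoding() happens to be set to at call time.
    return mb_strlen($password, 'UTF-8') >= $minLength;
}

// Four pound signs: four characters, but eight bytes in UTF-8.
$password = "\u{00A3}\u{00A3}\u{00A3}\u{00A3}";

var_dump(strlen($password) >= 8);          // bool(true): the byte count passes
var_dump(isPasswordLongEnough($password)); // bool(false): only four characters
```

The explicit argument costs a few keystrokes but keeps the check correct even if something else in the request changes the internal encoding.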

Binary Data

If you’re working with binary strings then you shouldn’t use the multibyte string functions. Consider this example:

echo strlen(sha1('www.texelate.co.uk', true)); // Outputs 20, correct
echo '<br />';
echo mb_strlen(sha1('www.texelate.co.uk', true), 'UTF-8'); // Outputs 19, incorrect

strlen counts bytes. A SHA1 hash is always 160 bits (20 bytes) long; if you convert it to hexadecimal it doubles in size, because a hexadecimal character can only represent four bits. mb_strlen, on the other hand, tries to interpret the bytes as characters in the given encoding. UTF-8 is variable-width, so the reported length of a binary SHA1 hash varies depending on how its bytes happen to line up with UTF-8 sequences. I've used SHA1 as an example because, like other PHP hashing functions, it can return the hash in binary. This principle applies to all binary data, however.
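The sizes mentioned above are easy to confirm. This is a minimal sketch, assuming the mbstring extension is enabled for the last two checks; the variable names and the sample bytes in $notText are my own.

```php
<?php
$raw = sha1('www.texelate.co.uk', true); // raw binary digest
$hex = bin2hex($raw);                    // the same digest as hexadecimal text

echo strlen($raw), PHP_EOL; // 20 bytes (160 bits)
echo strlen($hex), PHP_EOL; // 40 characters, two per byte

// If a multibyte function is unavoidable, the '8bit' encoding
// treats every byte as one character.
echo mb_strlen($raw, '8bit'), PHP_EOL; // 20

// Arbitrary bytes are often not valid UTF-8 at all;
// the byte 0xFF can never appear in a UTF-8 string.
$notText = "\xFF\xFE\x00\x01";
var_dump(mb_check_encoding($notText, 'UTF-8')); // bool(false)
```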

Conclusion

For flexibility, use UTF-8 as your encoding throughout
For non-binary strings use mb_ functions where available
Use standard string functions for binary data

Tim Bennett is a Leeds-based web designer from Yorkshire. He has a First Class Honours degree in Computing from Leeds Metropolitan University and currently runs his own one-man web design company, Texelate.