Helping WordPress Deal with Exotic Text Encoding

In my ongoing quest to bring flawless multi-language (multi-encoding to be more accurate) support to WordPress, I just had a blindingly simple, yet highly efficient, idea for an improvement.

If your blog is accustomed to receiving trackbacks or comments containing non-standard characters (accents, kanji etc.), then you have probably noticed that a fair share end up getting mangled in the process. WP is not really at fault here, since this is caused by some browsers’ failure to respect the encoding set in a page when sending form content (e.g. submitting a comment). No need to tell you which poor excuse for a browser so shamelessly ignores proper web standards. This is of little comfort anyway, since in the end, all that matters is that WordPress is getting toh-mah-toh when it is expecting toh-may-toh, and pretty much ends up displaying poh-tah-to to everybody else.
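To make the mangling concrete, here is a minimal sketch, assuming the mbstring module is available. The byte values are my own illustration, not taken from an actual comment: an accented letter sent as ISO-8859-1 is simply not valid UTF-8, and it only comes out right once it is converted from the encoding the browser really used.

```php
<?php
// "é" as a browser would send it in ISO-8859-1: a single byte.
$latin1 = "\xE9";

// Read as UTF-8, that lone byte is not a valid sequence, hence the mangling.
$valid_as_utf8 = mb_check_encoding($latin1, 'UTF-8');

// Converting from the encoding that was really used repairs it.
$repaired = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($valid_as_utf8);        // bool(false)
echo bin2hex($repaired), "\n";   // c3a9
```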

The fix, as I was saying, is ridiculously easy. And to the best of my knowledge it won’t break anything in your current WP install. The worst that could happen is that it won’t fix your problem, but it won’t break your blog.

OK. Let’s start with the How To (gory details afterward):

  1. If you have an .htaccess file in your blog folder, open it.
  2. If you do not have one (or if you don’t have the faintest idea what it is, which means you probably do not have one), create a new file called .htaccess (mind the period at the beginning of the filename) and upload it to your blog root folder.
  3. Append the following lines to your existing or newly created .htaccess file:
    php_value mbstring.internal_encoding UTF-8
    php_flag mbstring.encoding_translation on
  4. You are done!

As an optional extra step, you can also edit the wp-settings.php file in your WordPress install and insert the following two lines at the very end, just before the closing “?>” tag:

if (function_exists("mb_internal_encoding"))

This optional step is only really necessary if your blog uses an encoding different from the standard UTF-8 encoding recommended by WordPress and sane people the world over. If once again you have no idea what this is all about, you can safely assume your blog is using UTF-8 (as it should be) and skip this step. If you are not using UTF-8, then: 1) you probably should not be complaining about any encoding issues you brought upon yourself by refusing to use UTF-8 in the first place; 2) the htaccess hack won’t work without adding the two lines above and might still not work afterward (though I think it should).

What this all does [aka the boring stuff]:

The mbstring.encoding_translation line of the htaccess hack ensures that, if multi-byte support is available (namely the mbstring PHP module), it gets used to automagically translate any input to whatever you ask it to use. That’s damn convenient and done everywhere in the code once you enable this, meaning that even the most hidden functions benefit from the change and will always deal with the right encoding (if they perform encoding translations of their own, they shouldn’t be affected negatively). The mbstring.internal_encoding line sets the encoding to be used internally (and therefore translated to, whenever performing automatic translation) as UTF-8.
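If you want to check whether the two directives actually took effect on your server, a small throwaway script along these lines should do. This is only a sketch: the mbstring_status() helper is a name I made up for the illustration, and the script has to be served from the same folder as the .htaccess file for the result to mean anything.

```php
<?php
// Hypothetical helper: report whether mbstring is loaded and which values are in effect.
function mbstring_status() {
    if (!extension_loaded('mbstring')) {
        // No mbstring module: the directives have nothing to act on.
        return array('available' => false);
    }
    return array(
        'available'            => true,
        'internal_encoding'    => mb_internal_encoding(),
        'encoding_translation' => ini_get('mbstring.encoding_translation'),
    );
}

$status = mbstring_status();
print_r($status);
```

With the hack in place, I would expect the internal encoding to read UTF-8 and the translation flag to be switched on.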

The optional second part of the hack asks PHP to translate its output from UTF-8 to whatever the blog encoding setting might be, thus ensuring your blog will appear consistent with what the blog meta-tags announce. Since that content has already been translated and stored as UTF-8, this is superfluous for people who have their blog set to use UTF-8. But it is probably a good thing to have in order to ensure flawless code compatibility (i.e. it won’t break if you decide to change the settings).
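Here is a rough sketch of what that output translation amounts to, assuming mbstring is available. The $blog_charset variable is a hypothetical stand-in for the charset stored in the WordPress settings, and the conversion is done by hand on a single string just to show the effect. As far as I know, mb_http_output() on its own only declares the target encoding, and the conversion of everything a page echoes is performed by the mb_output_handler output buffer callback.

```php
<?php
// Hypothetical stand-in for the charset stored in the WordPress settings.
$blog_charset = 'ISO-8859-1';

if (function_exists('mb_internal_encoding')) {
    mb_internal_encoding('UTF-8');     // what the code works with internally
    mb_http_output($blog_charset);     // what visitors should receive

    // In a real page, ob_start('mb_output_handler') would apply the
    // conversion to the whole output. Here one string is converted by hand.
    $stored = "caf\xC3\xA9";           // "café" stored as UTF-8
    $sent   = mb_convert_encoding($stored, mb_http_output(), mb_internal_encoding());

    echo bin2hex($sent), "\n";         // 636166e9
}
```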

The reason why we cannot directly set the internal encoding to whatever the blog needs to output is twofold:

  • It can’t be done in the htaccess file, since we need to access WP’s settings through PHP to know what encoding to use.
  • Most importantly, not all encodings are fit to be used by PHP as an internal encoding (while pretty much any encoding can be used for output). One more good reason to use UTF-8 for everything.
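One commonly cited reason for the second point, sketched below under the assumption that mbstring is available: in an encoding such as Shift_JIS, some multi-byte characters contain a byte that is also a plain ASCII character, which is a recipe for trouble in any code that handles strings byte by byte. The kanji used here is my own choice of example.

```php
<?php
// The kanji U+8868, written as explicit UTF-8 bytes so the file encoding does not matter.
$utf8 = "\xE8\xA1\xA8";

// The same character once converted to Shift_JIS.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
echo bin2hex($sjis), "\n";   // 955c

// The second byte, 0x5C, is the ASCII backslash.
$has_backslash_byte = strpos($sjis, "\\") !== false;
var_dump($has_backslash_byte);   // bool(true)
```

Nothing of the sort happens in UTF-8, where bytes of multi-byte characters never collide with ASCII.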

Unfortunately, it is not possible to circumvent the htaccess step and call ini_set("mbstring.encoding_translation", "on") from wp-settings.php as we do for mb_http_output(), since this setting is PHP_INI_PERDIR level: it can only be set in php.ini, httpd.conf or an .htaccess file, never from within a script.
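You can see the restriction for yourself with a two-minute test. This is a minimal sketch, not something taken from the WordPress code: ini_set() returns false when it is not allowed to change a setting, which is what happens here.

```php
<?php
// mbstring.encoding_translation is PHP_INI_PERDIR, so a script may not change it.
$result = ini_set('mbstring.encoding_translation', '1');
var_dump($result);   // bool(false)

// By contrast, the internal encoding can be switched at runtime through its own function.
if (function_exists('mb_internal_encoding')) {
    mb_internal_encoding('UTF-8');
    echo mb_internal_encoding(), "\n";
}
```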

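Before you ask: this hack won’t break your install, even if you do not have the PHP mbstring module installed: htaccess directives will simply be ignored and the function_exists() check will prevent the wp-settings.php hack from being executed. Therefore nothing will be changed for people who do not have mbstring enabled and the overhead is absolutely inconsequential. One caveat: the php_value and php_flag directives are only understood when PHP runs as an Apache module. If your host runs PHP as CGI, Apache will answer with a server error instead, in which case you should remove the two lines or wrap them in an <IfModule> block for the PHP module.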

I do not think there is any viable way to support encoding conversions without an install of PHP that has the mbstring module enabled. If your server does not offer it and you really want to be able to support “non-English” characters, you should consider switching hosting companies. All the US hosting solutions I have used to this day (that’s many) had mbstring enabled (though not necessarily with the correct settings, but that’s what this hack will rectify), so really it’s not that hard to find.

I guess it would be great to see this implemented in one form or another in the main code base, although I do realize some people might not be enthused over the idea of shipping WordPress with a preset .htaccess file.

Any thoughts or proposed improvements/corrections on the matter are greatly appreciated.
