Converting a Movable Type blog from ISO-8859-1 to UTF-8

Let me tell you that moving an MT blog from one server to another is not necessarily a piece of cake, especially if you have the odd idea of switching the charset from ISO-8859-1 to UTF-8 (Unicode). It took me an awful amount of time and trial and error, and I'm documenting the process in the hope that it will save someone else some time.

The key thing to keep in mind is that switching from one charset to another, with existing content, is not a matter of changing one setting here or there. Modifying AddDefaultCharset in Apache or PublishCharset in MT does not mean you are set to go. The charset must be consistent all the way through, from the content to the receiving end. This means that the content itself has to be in UTF-8 (possibly converted from another charset), then stored in the database, manipulated by your blog software and served by the web server as UTF-8. It's this consistency that can be hard to achieve, and a source of trouble when it is missing.
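To make the consistency problem concrete, here is a minimal sketch (not part of the original migration) showing that the same text is a different byte sequence in each charset. It only assumes a POSIX shell with iconv and od available; the variable names are made up for the illustration.

```shell
# \347 is the octal escape for byte 0xE7, which is "ç" in ISO-8859-1.
# Dump the bytes of "François" as stored in ISO-8859-1...
latin1_hex=$(printf 'Fran\347ois' | od -An -tx1 | tr -d ' \n')

# ...and the bytes of the same text once converted to UTF-8.
utf8_hex=$(printf 'Fran\347ois' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO-8859-1: $latin1_hex"
echo "UTF-8:      $utf8_hex"
```

The "ç" is the single byte e7 in the first line and the two bytes c3 a7 in the second, so any layer that assumes the wrong charset will misread it.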

First, I transferred all the static files from the source server to the destination one (this is where you may want to upgrade your copy of MT). Then I transferred the database, using mysqldump, connecting from the new server directly to the source (you may not be able to do so, in which case you would have to run the dump on the source server and transfer the resulting file):

mysqldump --default-character-set=latin1 -C -u username -p -h host --opt --skip-add-locks --skip-extended-insert database > mtdump.sql

My source database uses ISO-8859-1. The trap here is to forget to override the charset, because old versions of the mysql client (like the one on my source server) have ISO-8859-1 as the default while new ones (starting with 4.1 I think) use UTF-8.
Another trap may be to use --opt alone, which is supposed to be the best option (it's on by default in recent mysql versions). However, you may face two problems:
- you may not have permission to LOCK tables, hence the --skip-add-locks option
- if you cannot change the max_allowed_packet variable on your destination server, you may get the following error during the import: ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes. Hence the --skip-extended-insert option, which produces a bigger file but with smaller INSERT statements.
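As a rough way to anticipate the max_allowed_packet problem, the sketch below measures the longest line of a dump file, on the assumption that each statement in the dump sits on a single line. The helper name longest_line_bytes and the sample file are hypothetical, invented for this illustration; they are not part of MySQL.

```shell
# Hypothetical helper (not part of MySQL): print the length in bytes of the
# longest line in a file. LC_ALL=C makes awk count bytes, not characters.
longest_line_bytes() {
  LC_ALL=C awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$1"
}

# Tiny stand-in for a real dump, so the sketch runs on its own:
# one single-row INSERT and one extended (multi-row) INSERT.
sample=$(mktemp)
printf 'INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (1),(2),(3);\n' > "$sample"

longest=$(longest_line_bytes "$sample")
echo "longest statement line: $longest bytes"
rm -f "$sample"
```

Run against the real dump (longest_line_bytes mtdump.sql), the result can be compared with the value the destination server reports for SHOW VARIABLES LIKE 'max_allowed_packet'. This is only an approximation of the packet size, not an exact measure.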

Then I converted the SQL dump from ISO-8859-1 to UTF-8:

iconv -f iso-8859-15 -t utf-8 mtdump.sql > mtutf8.sql

[Note: I use ISO-8859-15 here, because of the euro (€) sign. See this tutorial on charsets for more information on the ISO-8859/latin1 family.]
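The difference the note refers to can be seen on a single byte. This is a small sketch assuming a POSIX shell with iconv and od; the variable names are arbitrary. Byte 0xA4 (octal 244) is the generic currency sign in ISO-8859-1 but the euro sign in ISO-8859-15.

```shell
# Convert the lone byte 0xA4 to UTF-8 under each source charset
# and show the resulting bytes in hexadecimal.
as_latin1=$(printf '\244' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
as_latin9=$(printf '\244' | iconv -f ISO-8859-15 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "0xA4 read as ISO-8859-1:  $as_latin1"   # c2a4   = U+00A4, currency sign
echo "0xA4 read as ISO-8859-15: $as_latin9"   # e282ac = U+20AC, euro sign
```

Which source charset is right depends on how the bytes were written in the first place, so it may be worth checking a few euro signs in the converted file.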

Then I used mysql, setting the proper input charset, to feed the destination database:

mysql -u username -p --default-character-set=utf8 database < mtutf8.sql
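Before running the import, it may be worth sanity-checking the converted file. The sketch below is not from the original procedure. Its two helper names are hypothetical, and it assumes iconv and grep are available. The first helper checks that a file decodes cleanly as UTF-8. The second looks for the byte sequence C3 83 C2, which is what an accented letter in the U+00C0 to U+00FF range becomes if text that was already UTF-8 gets converted from ISO-8859-1 a second time.

```shell
# Hypothetical helper: succeeds (exit status 0) when the file is valid UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print the number of lines that contain the bytes
# C3 83 C2, a typical trace of double-encoded accented letters.
count_double_encoded_lines() {
  LC_ALL=C grep -c -- "$(printf '\303\203\302')" "$1"
}
```

For example, is_valid_utf8 mtutf8.sql should succeed and count_double_encoded_lines mtutf8.sql should print 0. A non-zero count suggests the dump was already UTF-8 before the iconv step, which is what may happen if the charset override on mysqldump is forgotten.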

Once this is done, you'll need to configure your blogs with the proper settings for the destination (notably the path names). You'll also have to rebuild the templates that are linked to files and that use accented characters and, of course, rebuild the whole site.

Now the really unfunny traps...

First, supposing that you have a funny first name like François and the highly stupid idea of writing it as it should be written (with its bells and whistles, accents, cedillas, etc.) in an MT login name, you may have to resort to some trickery to log in to your new MT installation. Never use accents in a login name, especially with a product developed in the U.S.! (BTW, TypeKey is quite broken in this regard too, always trying to scramble my first name.)

Along the same lines, if like me your writing goes beyond ASCII, and if you are using the dirify attribute to create category folders and file names, you will find that the dirify function is broken (apparently since MT 3.1). It works with ISO-8859-1 but not with UTF-8, and it will turn all your accents into 'a'. This is quite problematic since UTF-8 is supposed to have been the default charset since MT 3.0! To overcome this, you can either use the dirify for Unicode plugin and write dirify_unicode="1" wherever you would use dirify="1", or (this is what I did) copy the entire my %HighASCII = (...) hash table from sub convert_high_unicode in this plugin and use it to replace the one in lib/MT/Util.pm. I prefer the latter, since I hope that Six Apart will eventually fix the bug in dirify.
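To illustrate the kind of behaviour at stake, here is a shell sketch, loosely modelled on what a dirify-style filter does. It is not MT's Perl code, the two function names are hypothetical, and the transliteration table is deliberately tiny. The second function rests on an assumption rather than on anything confirmed in MT's source: that the accents-to-'a' symptom comes from a table applied byte by byte. In UTF-8, every letter in the U+00C0 to U+00FF range starts with byte 0xC3, which read on its own as ISO-8859-1 is an accented 'A'.

```shell
# Hypothetical sketch: transliterate whole UTF-8 characters (each sed pattern
# is a multi-byte literal), lowercase, turn spaces into underscores and drop
# everything else. The table is incomplete on purpose.
slug_utf8() {
  printf '%s\n' "$1" \
    | sed -e 's/à/a/g' -e 's/â/a/g' -e 's/ç/c/g' -e 's/é/e/g' \
          -e 's/è/e/g' -e 's/ê/e/g' -e 's/î/i/g' -e 's/ô/o/g' -e 's/ù/u/g' \
    | LC_ALL=C tr 'A-Z' 'a-z' \
    | LC_ALL=C tr ' ' '_' \
    | LC_ALL=C tr -cd 'a-z0-9_'
}

# Caricature of a byte-oriented table (an assumption about the failure mode):
# the lead byte 0xC3 becomes 'a' and the continuation byte is thrown away.
slug_bytewise() {
  printf '%s\n' "$1" \
    | LC_ALL=C tr '\303' 'a' \
    | LC_ALL=C tr 'A-Z' 'a-z' \
    | LC_ALL=C tr ' ' '_' \
    | LC_ALL=C tr -cd 'a-z0-9_'
}

echo "character-aware: $(slug_utf8 'François')"
echo "byte-oriented:   $(slug_bytewise 'François')"
```

With UTF-8 input the first call yields francois while the second yields franaois, which at least resembles the symptom described above.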

That's all for tonight, I hope I didn't forget anything big. If you see this post, you've reached the new server. If you notice anything strange, please let me know!

P.S.: 1. I first considered using TypeMover, which makes the attractive promise of handling both the transfer from one MT installation to another and the charset conversion. But my attempt ran into one major problem: it converts all the accents into HTML entities, which is a big no-no for me (would you like to edit content where half of the words are scrambled with ugly &blah; blocks?)

2. This tutorial assumes that you are moving your blog between installations of the same major version of MT. If this is not the case, then you'll have to get your content, convert it to UTF-8 and follow the MT upgrade steps. The order is not that important.

Comments (6)

You can plan these things (migrations) as long as you like and there will _always_ be something else to do during the process.

I'm glad the switch went well for you. Was the upgraded mysql version one of the motivations for changing host?

Matt, you can't imagine how many traps I fell into during the move, but I hope the worst is behind me now.

I had a number of reasons, notably the fact that my previous host hasn't upgraded its platform in years, lagging several versions behind on everything LAMP, and was lacking certain Perl extensions that are required by lots of MT 3.x plugins. TextDrive allows me much more control over my site and they're even more responsive since they've got people working in at least three time zones (US, Europe and Australia).

tehu:

Conclusion for the non-advanced user: use TypePad?

> Conclusion for the non-advanced user: use TypePad?

Mmh, you don't have the choice of charset on TypePad, the charset is what the provider decides (UTF-8 as far as I can see on my TypePad account, which is good).

No, better advice would be: don't change the charset of an existing blog if you don't need to ;-).

[I guess the next question will be: are you a masochist? ;-)]

Padawan,

the dirify-unicode plug-in doesn't work too well with MT 3.2.

"Ăă Îî Ţţ Şş Ââ Iñtërnâţiônàlizætiønş" will output "aeae-aa-aa-initeirnaitiioinailizatiansi". Something's wrong here...

Probably the issue has nothing to do with the MT versions and it's just a bug in the plug-in table. I'll try to update it with the latest version of my conversion table
http://www.timbru.com/files/textpattern/accented-latin-characters-conversion-ascii.txt

PS: I created the conversion table for the version 0.2 of the plug-in too.

For those of you using postgres the commands are:

$ pg_dump -f my_mt.dmp mt
$ iconv -f ISO8859-1 -t UTF-8 my_mt.dmp > utf.dmp
$ dropdb mt
DROP DATABASE
$ createdb -T template0 mt
CREATE DATABASE
$ psql mt < utf.dmp

About

This page contains a single entry from the blog posted on December 28, 2004 1:45 AM.