December 2004 Archives

Let me tell you that moving an MT blog from one server to another is not necessarily a piece of cake, especially if you have the odd idea of switching the charset from ISO-8859-1 to UTF-8 (Unicode). It took me an awful amount of time and trial and error, and I'm documenting the process in the hope that it will save time for someone else.

The key thing to keep in mind is that switching from one charset to another, with existing content, is not a matter of changing one setting here or there. Modifying AddDefaultCharset in Apache or PublishCharset in MT does not mean you are set to go. The charset must be consistent all the way through, from the content to the receiving end. This means that the content itself has to be in UTF-8 (possibly converted from another charset), stored in the database, manipulated by your blog software and served by the web server in UTF-8. It's this consistency that can be problematic to achieve, and a source of trouble when it is missing.
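To make the consistency problem concrete, here is a minimal sketch in Python (not part of my migration, just an illustration). It shows what a reader sees when UTF-8 bytes are interpreted as ISO-8859-1, and a small helper, is_valid_utf8, whose name is my own invention, for checking whether a chunk of bytes is really UTF-8.

```python
# Illustration only: the same name, encoded two ways, then read back wrongly.
name = "François"

utf8_bytes = name.encode("utf-8")         # what a UTF-8 pipeline stores
latin1_bytes = name.encode("iso-8859-1")  # what an ISO-8859-1 pipeline stores

# UTF-8 bytes served or read as ISO-8859-1: the "ç" becomes two junk characters.
misread = utf8_bytes.decode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(misread)
print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

If any single stage (file, database, MT, Apache header) disagrees with the others, this is the kind of garbage that ends up on the page.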

First, I transferred all the static files from the source server to the destination one (this is where you may want to upgrade your copy of MT). Then I transferred the database using mysqldump, connecting from the new server directly to the source (you may not be able to do so, in which case you would have to transfer the resulting file):

mysqldump --default-character-set=latin1 -C -u username -p -h host --opt --skip-add-locks --skip-extended-insert database > mtdump.sql

My source database is using ISO-8859-1. The trap here is to forget to override the charset, because old versions of the mysql client (like the one on my source server) have ISO-8859-1 as the default while new ones (starting with 4.1 I think) use UTF-8.
Another trap may be to use --opt alone, which is supposed to be the best option (it's on by default on recent mysql versions). However, you may face two problems:
- you don't have permission to LOCK tables, therefore the --skip-add-locks option
- if you cannot change the max_allowed_packet variable on your destination server, you may get the following error during the import: ERROR 1153 (08S01): Got a packet bigger than 'max_allowed_packet' bytes, therefore the --skip-extended-insert option which produces a bigger file but with smaller chunks of INSERTs.
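Before importing, it can help to know how big the largest statement in the dump is. The following is a rough sketch in Python, under the assumption that each statement sits on its own line (which is roughly what --skip-extended-insert gives you); longest_line and ASSUMED_LIMIT are names and values I made up for the example, and line length only approximates the packet size the server will see.

```python
import io
from typing import BinaryIO

# Assumed value for illustration only; check max_allowed_packet on your own server.
ASSUMED_LIMIT = 1024 * 1024


def longest_line(dump: BinaryIO) -> int:
    """Return the length in bytes of the longest line in a dump opened in binary mode."""
    return max((len(line) for line in dump), default=0)


# Tiny in-memory stand-in for a dump file, so the sketch runs on its own.
sample = io.BytesIO(b"INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (2, 'longer');\n")
size = longest_line(sample)
print(size, "bytes;", "ok" if size < ASSUMED_LIMIT else "may exceed the limit")
```

On a real dump you would pass open("mtdump.sql", "rb") instead of the in-memory sample.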

Then I converted the SQL dump from ISO-8859-1 to UTF-8:

iconv -f iso-8859-15 -t utf-8 mtdump.sql > mtutf8.sql

[Note: I use ISO-8859-15 here, because of the euro (€) sign. See this tutorial on charsets for more information on the ISO-8859/latin1 family.]
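The difference between the two charsets is easy to see on the euro byte. This is a small Python sketch, not the command I ran: it decodes byte 0xA4 both ways and then does the same kind of conversion as iconv in memory. The function name convert_to_utf8 is mine.

```python
EURO_BYTE = b"\xa4"

# Same byte, two charsets: generic currency sign in ISO-8859-1, euro in ISO-8859-15.
as_latin1 = EURO_BYTE.decode("iso-8859-1")
as_latin9 = EURO_BYTE.decode("iso-8859-15")


def convert_to_utf8(data: bytes, source: str = "iso-8859-15") -> bytes:
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    return data.decode(source).encode("utf-8")


print(as_latin1, as_latin9)
print(convert_to_utf8(b"only 73\xa4"))
```

Picking the wrong source charset does not raise any error here; you just silently get a different character, which is why it is worth checking a few known strings after the conversion.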

Then I used mysql, setting the proper input charset, to feed the destination database:

mysql -u username -p --default-character-set=utf8 database < mtutf8.sql

Once this is done, you'll need to configure your blogs with the proper settings at the destination (notably the path names), rebuild the templates that are linked to files and that use accented characters and, of course, rebuild the whole site.

Now the really unfunny traps...

First, supposing that you have a funny first name like François and the highly stupid idea of writing it as it should be written (with its bells and whistles, accents, cedillas, etc.) in an MT login name, you may have to resort to some trickery to log in to your new MT installation. Never use accents in a login name, especially with a product developed in the U.S.! (BTW, TypeKey is quite broken in this regard too, always trying to scramble my first name.)

Along the same lines, if like me your writings go beyond ASCII, and if you are using the dirify attribute to create category folders and file names, you will find that the dirify function is broken (apparently since MT 3.1). It works with ISO-8859-1 but not with UTF-8, and it will turn all your accents into 'a'. This is quite problematic since UTF-8 is supposed to be the default charset since MT 3.0! To overcome this, you can either resort to using the dirify for Unicode plugin and change to dirify_unicode="1" in the same way one would use dirify="1", or (this is what I did) grab from this plugin the entire my %HighASCII = (...) hash table that sits in sub convert_high_unicode and use it to replace the one in lib/MT/Util.pm. I prefer the latter way, since I hope that Six Apart will eventually fix the bug in dirify.
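For readers who want to see the idea rather than the Perl, here is a loose sketch in Python of a Unicode-aware "dirify". It is not MT's code nor the plugin's: it relies on Unicode decomposition instead of a lookup table, and the name dirify_sketch and the exact cleanup rules are my assumptions.

```python
import re
import unicodedata


def dirify_sketch(text: str) -> str:
    """Turn a title into an ASCII, file-name-friendly string (illustration only)."""
    # Split accented letters into base letter + combining mark (e.g. "ç" -> "c" + cedilla).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = stripped.lower()
    # Remove anything that is not a lowercase ASCII letter, digit, underscore, hyphen or whitespace.
    cleaned = re.sub(r"[^a-z0-9_\s-]", "", lowered)
    # Collapse runs of whitespace into single underscores.
    return re.sub(r"\s+", "_", cleaned.strip())


print(dirify_sketch("François à Noël"))
```

Letters that have no decomposition (ø, ß, æ and friends) are simply dropped by this sketch, which is exactly the gap a mapping table such as %HighASCII is there to fill.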

That's all for tonight, I hope I didn't forget anything big. If you see this post, you've reached the new server. If you note anything strange, please let me know!

P.S.: 1. I first considered using TypeMover, which makes the attractive promise of handling both the transfer from one MT installation to another and the conversion of charset. But my attempt ended up with one major problem: it converts all the accents into HTML entities, which is a big no-no for me (would you like to edit content where half of the words are scrambled with ugly &blah; blocks?)

2. This tutorial assumes that you are moving your blog to the same major version of MT. If this is not the case, then you'll have to get your content, convert it to UTF-8, then follow the MT upgrade steps. The order is not that important.

Xmas geek gift

I've decided to give myself (and this blog) a treat by signing up for TextDrive's Holiday special plan. One year of hosting, chock-full of features, en bonne compagnie, for $99 (i.e. only 73€ at today's rate). I owe special thanks to Damelon: whoever takes the TGV first, the beer and meal are on me. The deal ends tomorrow, should you be looking for a place for your site (they're very blog-friendly, even for MT, despite all the rage).

I shall be moving there soon, possibly (if not preferably) breaking things along the way (if breaking things more than they already are is at all possible). Sorry in advance. Anyway, you should be having some holiday fun rather than hanging around this insignificant blog ;-).

Joyeux Noël (again).

Merry is better

Happy Hol... Merry Christmas!

Merry is better. It means drunk.

It's 100% Champagne here tonight. I guess that 13 bottles for five will be enough (we had 14 but we couldn't resist popping one earlier, and we're not superstitious ;-). And for those of you who always fall for the big French marketing brands such as Veuve Clicquot, try some that offer much better value for money, such as Ruinart and Deutz; you'll be surprised.

And as we say here, in the land of political incorrectness, joyeux Noël !

Yes! Those who, with their ugly hidden agenda, wanted to impose software patents with neither a discussion nor a democratic majority in the parliament have been rebuffed one more time. Proof that public pressure has an effect on lobbies. I truly hope this directive will never see the light of day, and that Europe will remain free of software patents. With the US slowed down by its broken patent system ("my portfolio is bigger than yours, don't you dare sue me!"), this could be one of the biggest competitive advantages we can gain from doing nothing to align with the US demands.

Thanks Poland!

The Chesnot and Malbrunot families have received the best Christmas gift they could possibly expect this year. After 124 days held hostage in Iraq, I hope the two journalists will be able to return home safely as quickly as humanly possible.

Yahoo! Search's most prominent (and unofficial) blogger, Jeremy Zawodny, writes about comment spam. And he's not shy about the key role that search engines have in fighting against this plague:

Then a partial solution is fairly clear. I've heard and seen others discuss it over the past few months. The search engines need to be smarter about reading and indexing content.

When folks like Tim build software that classifies pages, the software needs to be able to recognize the difference between links produced by the blog owner(s) and those contributed by readers and spambots.

Once you can identify the difference between those two types of links, you simply stop using the second type of link when calculating rank. Sure, you can still count them for the purpose of providing link counts--just don't factor them into the ranking.

How's that for removing the incentive?

There are already several proposals on how to do that. My favorite is a simple pair of comments that act as a wrapper around any content that you don't want the search engines to index, and it's my favorite because it's the simplest I've seen so far and it gives me control over what gets indexed or not within a page, not just links. Others involve a fairly comprehensive qualification of link relationships that paves the way to lots of very interesting applications but has about zero chance of being effective until it's built into web editing tools with a very simple GUI, so that Joe Average starts using it. There might already be more solutions around than blog spammers, and I'm sure even Google and Yahoo! have their own ideas. The most important thing is that they agree on the same one, call it the industry standard and tell world + dog to happily use it on their web sites.

Brad Choate objects that the fix is already out there: just use a redirection service for links. However, this has three main drawbacks: 1) it kills the referrer information, 2) it wastes resources on handling a simple link, 3) it prevents link-mining services such as Technorati and the like from mapping connections between sites. Ironically, it should be pointed out that Movable Type does provide a redirection of sorts, except that it doesn't work in comment bodies, making it fairly useless. But the killer reason why this is not the horse I'd bet on is that nothing prevents the search engines' bots from eventually being smart enough to follow the redirections until they reach the destination, and handle it exactly as if those redirections did not exist. The redirection that Brad mentions is actually a hack based on the current inability of those engines to follow some redirection mechanisms.

After reading MT Plus Comment Spam Equals Dead Site and news that TextDrive has a huge problem with spam attacks on MT's comment script, I started to write a post about MT and "splog" (as in "spam a blog"). It quickly turned into venting my frustration about the lack of good anti-spam protection on a stock install of MT. It happens that I have a client site on this server, a client for which I was considering MT as the solution. He knows very well what comment spam is; imagine the embarrassment. And even if his site runs another solution, he will still be impacted by other attacked MT installations on the same host (probably mine included, should I move there).

Then I calmed down, stopped writing, waited a bit, read the comments around, found Comment Spamalot by Ben Hammersley and finally reminded myself that Mark Pilgrim was just being realistic.

Venting and screaming seem a natural reaction to blog spam. After all, those scumbags are invading and defacing our personal space; who would not lose their temper? And that's the point: the trouble is with the spammers, who will go as far as they can to burn more land for the sake of making a buck (especially in those countries where it is immoral to let the idiots keep their money). But it's not a very good way to think constructively, if not objectively, about a problem, is it?

As Anil quite rightly puts it in TypeKey-Blacklist Games:

I know it's fun and/or fashionable to give us grief about MT but I can't help but think that part of the reason people linked to your post is not because they wanted to give useful feedback, as you've done with this post, but because they like to pile on in classic blogosphere fashion. We're listening to all the helpful feedback, and I'm hoping that people stay focused on information that helps us all combat this problem.

So, let's try to keep a cool head and think about the state of affairs:

  • Spammers want to increase the PageRank of their sites. They hit blogs (among many other public places) because they can place links easily and at industrial scale by using automated scripts.
  • Spammers don't care about tools, they are not hitting a weblogging tool in particular, they are just after high PageRank sites. The trigger seems to be a PageRank equal to or greater than 6.
  • Six Apart is between a rock and a hard place because MT is a very popular blogging tool and spammers are after the most popular sites. But other popular tools, such as WordPress, are being hit too and face the same problem. People thinking that switching tools will solve their problem for good are just fooling themselves.
  • Any sort of ban based on IP is completely useless. Spammers are faking IPs, using legitimate ones. This will ultimately end with your giant IP blacklist blocking legitimate users (it happened to me already).
  • Comment moderation, at least à la MT, while being efficient at preventing spam from appearing on a weblog, breaks the liveliness of conversations and doesn't reduce the amount of work for the blog owner (quite the contrary in fact).
  • TypeKey is not the miracle solution. While Anil claims he has zero spam thanks to it, others say that it drastically reduces the amount of legitimate comments because it's a real barrier to many. I don't comment on Blogger blogs because you have to register with Blogger first. I don't want to leave my name and email address at yet another place on the web, because I can't keep track of them and I don't know what they're doing with it. And, dare I say, I have some doubts that TypeKey is compliant with the European directive on data privacy protection.
  • TrackBack spam is an even bigger problem, on the verge of exploding.

One more comment about TypeKey: the idea of checking the validity of an email address is good. However it is not the same as registering someone. I would like to have the ability, built in MT or via a plugin, to send a confirmation by email to each commenter. The commenters would have to click a link to confirm their own comments (a sort of auto-moderation). At first sight this should have the same advantage as TypeKey (the obligation to use a valid email address), would not require registration, but also would show the same drawbacks (spammers can create disposable email addresses and automate the process).
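To give an idea of how small such a plugin could be, here is a hedged sketch in Python of the confirmation link part only. Nothing here is MT or TypeKey code: SECRET_KEY, make_token and confirm are hypothetical names, and sending the actual email is left out. The token is an HMAC over the comment id and the email address, so the server can verify a clicked link without storing anything extra.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in a real setup it would live in the blog's config.
SECRET_KEY = b"change-me"


def make_token(comment_id: int, email: str) -> str:
    """Build the token that would be embedded in the confirmation link."""
    message = f"{comment_id}:{email}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def confirm(comment_id: int, email: str, token: str) -> bool:
    """Check a token coming back from a clicked link, using a constant-time comparison."""
    return hmac.compare_digest(make_token(comment_id, email), token)


token = make_token(42, "commenter@example.org")
print(confirm(42, "commenter@example.org", token))
print(confirm(43, "commenter@example.org", token))
```

The comment would stay unpublished until confirm returns True for its id. This proves the address receives mail, nothing more, so the drawback about disposable addresses still applies.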

In the end, I can't help but think that the solution is actually in the hands of just two companies: Google and Yahoo!. And it's a very easy one. They could just agree on an extension of the robots exclusion protocol, with the granularity of a zone within a web page, not just a site. The idea is to agree on a simple comment string that would mark, in a page, zones that the search engine should not index. It's not a new idea; many search engines already offer this feature, and here is how it's implemented by HtDig, a popular open-source one:

<!--htdig_noindex-->
...this zone will not get into the index...
<!--/htdig_noindex-->

After all, PageRank is based on the assumption that when a site links to another, it's a sign of interest. It should only be logical, then, to trust the publishers to refine further within their publications what they want the search engines to index or not. As for comment spam, I would then place all untrusted comments within those "no index" tags and voilà, there is no incentive for spammers to game the search engines anymore. How many hours to do that compared to the endless arms race against automated blog spam?
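On the search engine side, honoring such zones does not need to be complicated. Here is a minimal sketch in Python of what a crawler could do before counting links; it is not HtDig's implementation, the regular expressions are deliberately naive, and indexable_links is a name I picked for the example.

```python
import re
from typing import List

# Everything between the two marker comments, including the markers themselves.
NOINDEX_ZONE = re.compile(r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->", re.DOTALL)
# Naive link extraction, good enough for an illustration.
LINK = re.compile(r'href="([^"]+)"')


def indexable_links(html: str) -> List[str]:
    """Return the links that remain once the no-index zones are removed."""
    visible = NOINDEX_ZONE.sub("", html)
    return LINK.findall(visible)


page = (
    '<p>My post links to <a href="http://example.org/article">an article</a>.</p>\n'
    "<!--htdig_noindex-->\n"
    '<p>Comment: <a href="http://spam.example/pills">cheap pills</a></p>\n'
    "<!--/htdig_noindex-->\n"
)
print(indexable_links(page))
```

Only the link written by the blog owner survives, so a link dropped in an untrusted comment would bring nothing to the spammer's ranking.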

Blog spam is just an instrument in gaming the search engines. The keys to end it are in the hands of Google and Yahoo!, not Six Apart nor any other blog software developers.

This is too funny to deserve just a link in the list. Go check this: Dell Tech Force, "Dell Dollies! Dell, Gates and Ellison, with strings attached!" [via Alec Muffett]

MarsEdit

Brent Simmons, well known for his remarkable NetNewsWire aggregator, has just announced the launch of MarsEdit, a weblog editor "that makes weblog writing like writing email". If you think that browsers and HTML were a giant step backward in terms of Human-to-Computer Interface, MarsEdit will bring you back all the comfort and benefits of a desktop application running on a modern operating system, in addition to the style and simplicity that characterize Brent and his productions.

Needless to say, this entry was posted using MarsEdit. Kudos et bonjour from Paris, Brent!

All Internet experts know that the net cannot work without a little bit of magic, so it won't come as a surprise to see Hermione Granger, representing Hogwarts, listed as the first attendee at the last ICANN conference in Cape Town.

The Register wonders if ICANN needs some teaching in magical arts, maybe some lessons in the Dark Forces against the evil Voldemort aka Yoshio Utsumi, secretary-general of the ITU. It also speculates about CEO Paul Twomey being Harry Potter. "It's the only logical explanation. How else could ICANN have survived until now?" says ICANN watcher Michael Koomfrin. Further investigation revealed that ICANN VP Paul Verhoef was Ron Weasley, but ICANN has refused to comment on rumours that chairman Vint Cerf was actually Draco Malfoy.

I wonder who Joi Ito is, really.