Blog spam: look at Google and Yahoo!

After reading MT Plus Comment Spam Equals Dead Site and news that TextDrive has a huge problem with spam attacks on MT's comment script, I started to write a post about MT and "splog" (as in "spam a blog"). It quickly turned into venting my frustration about the lack of good anti-spam protection in a stock install of MT. It happens that I have a client site on this server, a client for whom I was considering MT as the solution. He knows very well what comment spam is; imagine the embarrassment. And even if his site runs another solution, he will still be impacted by attacks on other MT installations on the same host (probably mine included, should I move there).

Then I calmed down, stopped writing, waited a bit, read the comments around, found Comment Spamalot by Ben Hammersley, and finally reminded myself that Mark Pilgrim was just being realistic.

Venting and screaming seem a natural reaction to blog spam. After all, those scumbags are invading and defacing our personal space; who would not lose their temper? And that's the point: the trouble is with the spammers, who will go as far as they can, scorching the earth for the sake of making a buck (especially in those countries where it is immoral to let the idiots keep their money). But it's not a very good way to think constructively, let alone objectively, about a problem, is it?

As Anil quite rightly puts it in TypeKey-Blacklist Games:

I know it's fun and/or fashionable to give us grief about MT but I can't help but think that part of the reason people linked to your post is not because they wanted to give useful feedback, as you've done with this post, but because they like to pile on in classic blogosphere fashion. We're listening to all the helpful feedback, and I'm hoping that people stay focused on information that helps us all combat this problem.

So, let's try to keep a cold head and think about the state of affairs:

  • Spammers want to increase the PageRank of their sites. They hit blogs (among many other public places) because they can place links easily and at industrial scale by using automated scripts.
  • Spammers don't care about tools; they are not targeting one weblogging tool in particular, they are just after high-PageRank sites. The trigger seems to be a PageRank equal to or greater than 6.
  • Six Apart is between a rock and a hard place because MT is a very popular blogging tool and spammers are after the most popular sites. But other popular tools, such as WordPress, are being hit too and face the same problem. People thinking that switching tools will solve their problem for good are just fooling themselves.
  • Any sort of IP-based ban is completely useless. Spammers fake IPs or use legitimate ones, so your giant IP blacklist will ultimately end up blocking legitimate users (it has already happened to me).
  • Comment moderation, at least à la MT, while efficient at preventing spam from appearing on a weblog, breaks the liveliness of conversations and doesn't reduce the amount of work for the blog owner (quite the contrary, in fact).
  • TypeKey is not the miracle solution. While Anil claims he has zero spam thanks to it, others say that it drastically reduces the amount of legitimate comments because it's a real barrier to many. I don't comment on Blogger blogs because you have to register with Blogger first. I don't want to leave my name and email address at yet another place on the web, because I can't keep track of them all and I don't know what those sites are doing with my data. And, dare I say, I have some doubts that TypeKey is compliant with the European directive on data privacy protection.
  • TrackBack spam is an even bigger problem, on the verge of explosion.

One more comment about TypeKey: the idea of checking the validity of an email address is good. However, it is not the same as registering someone. I would like to have the ability, built into MT or via a plugin, to send a confirmation by email to each commenter. The commenters would have to click a link to confirm their own comments (a sort of auto-moderation). At first sight this would have the same advantage as TypeKey (the obligation to use a valid email address) and would not require registration, but it would also share the same drawback (spammers can create disposable email addresses and automate the process).
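The confirmation flow described above could be sketched in a few lines. This is a hypothetical illustration, not an MT plugin: the secret key, function names, and URL layout are all assumptions; the only real requirement is that the token be unguessable and bound to both the pending comment and the email address it was sent to.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret

def make_confirmation_token(comment_id: str, email: str) -> str:
    """Derive an unguessable token binding a pending comment to an email address."""
    msg = f"{comment_id}:{email}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def confirmation_link(base_url: str, comment_id: str, email: str) -> str:
    """Build the link emailed to the commenter for self-moderation."""
    token = make_confirmation_token(comment_id, email)
    return f"{base_url}/confirm?comment={comment_id}&token={token}"

def confirm(comment_id: str, email: str, token: str) -> bool:
    """Publish the held comment only if the emailed token matches."""
    expected = make_confirmation_token(comment_id, email)
    return hmac.compare_digest(expected, token)
```

The comment stays in a holding queue until `confirm` returns true, which is exactly the auto-moderation described above: the blog owner does nothing, the commenter does the one click.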

In the end, I can't help but think that the solution is actually in the hands of just two companies: Google and Yahoo!. And it's a very easy one. They could simply agree on an extension of the robots exclusion protocol with a granularity finer than a whole site or page: a simple, agreed-upon comment string that would mark, within a page, zones that the search engine should not index. It's not a new idea; many search engines already offer this feature, and here is how it's implemented by HtDig, a popular open-source one:

<!--htdig_noindex-->
...this zone will not get into the index...
<!--/htdig_noindex-->

After all, PageRank is based on the assumption that when a site links to another, it's a sign of interest. It should only be logical, then, to trust the publishers to refine further, within their publications, what they want the search engines to index or not. As for comment spam, I would then place all untrusted comments within those "no index" tags and voilà, there is no incentive for spammers to game the search engines anymore. How many hours to do that, compared to the endless arms race against automated blog spam?
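Applying the idea above in a blog template would amount to one small function. A minimal sketch, assuming the HtDig-style markers shown earlier and a hypothetical `trusted` flag kept by the blog software:

```python
def wrap_noindex(comment_html: str, trusted: bool = False) -> str:
    """Wrap untrusted comment HTML in HtDig-style no-index markers.

    Trusted commenters (e.g. previously approved ones) are published
    as-is and still contribute to the search engine index.
    """
    if trusted:
        return comment_html
    return f"<!--htdig_noindex-->\n{comment_html}\n<!--/htdig_noindex-->"
```

A spammer's link placed this way would be invisible to any crawler honoring the markers, so it earns no PageRank and the incentive disappears.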

Blog spam is just an instrument in gaming the search engines. The keys to end it are in the hands of Google and Yahoo!, not Six Apart nor any other blog software developers.

2 TrackBacks

Links ball. "I'm close to declaring the form of personal web design dead" (sort of via hicksdesign) 42 (via Airbag's Longboard "Blog spam is just an instrument in gaming the search engines. The keys to end it are in the hands of Google and Ya... Read More

Fighting Spam from Dr Dave's Blog on December 18, 2004 10:19 AM

Having spent a sizable share of the past few weeks writing and supporting an anti-spam plugin for Wordpress , I have been extensively reading and cogitating on the issue. While I am ashamed to say that I have not come up with any magic answer to the pr... Read More

4 Comments

This is great feedback, and thanks. I've passed your post on to Jay Allen, and he'll be responding to this soon. We *really* appreciate your help and feedback, and I do agree there's not enough protection in MT out of the box. I think we can fix that.

While you make very valid points, and I agree MT can do more to help users handle this scourge (especially when it comes to TrackBack), the solutions you put forth seem equally flawed.

For instance, if the htdig scheme were implemented, what if a spammer posted something like:

<!--htdig_noindex-->
... valid comments
<!--/htdig_noindex-->
spam comment
<!--htdig_noindex-->
... more valid comments
<!--/htdig_noindex-->

Obviously stripping those markers out of comments would fix that, but you are still excluding all comments from being indexed by the search engines. Personally, some of the best knowledge/content I post to the Internet is in the comments sections of weblogs. To exclude that from the collective knowledge of these search engines seems rather unfortunate.

I've felt that not enough exploration has been done into centralizing one's comments and publishing them remotely. TrackBack could provide the means of delivery, but it would need to mature and be implemented differently than it has been in the past.

Tim, very good points indeed.

Still I'm convinced that this happens only because search engines can be gamed, and they should help in fighting this plague.

Also, what's wrong with the idea of having the power to not endorse a link to another site (in the PageRank + spam era, I mean)?

Tim, I would object first that as soon as you allow strangers to inject content into your site, you become subject to certain risks. The flaw you mention is no different from other code injection techniques (cross-site scripting, SQL injection, whatever) and, of course, one would have to strip out the code to prevent this from happening. This can be handled by the software exactly as today, where I can have it strip out all or some HTML tags, for example.
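The stripping mentioned above is a one-line filter. A minimal sketch, assuming the HtDig-style markers from the post; it removes any marker a commenter tries to inject, so the software's own wrapper is the only one that survives:

```python
import re

# Match opening or closing HtDig no-index markers, tolerating stray whitespace.
NOINDEX_MARKER = re.compile(r"<!--\s*/?\s*htdig_noindex\s*-->", re.IGNORECASE)

def sanitize_comment(text: str) -> str:
    """Strip injected no-index markers from submitted comment text."""
    return NOINDEX_MARKER.sub("", text)
```

Run this on every submission before storage, exactly where the software already strips forbidden HTML tags, and the escape trick in the example above no longer works.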

Secondly, where did I say that I would screen comments out? I want to prevent the links from benefiting from my search engine rank, that's all. And again, exactly as MT rewrites comment URLs to place a redirection, I would have it wrap all links in comments with the "noindex" wrapper. The comments themselves would still be indexed.
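That refinement, wrapping only the hyperlinks rather than the whole comment, could look like this. A hypothetical sketch (the regex is a simplification; real blog software would do this while rendering its parsed comment markup):

```python
import re

# Capture each anchor element so it can be wrapped individually.
A_TAG = re.compile(r"(<a\b[^>]*>.*?</a>)", re.IGNORECASE | re.DOTALL)

def wrap_links_only(comment_html: str) -> str:
    """Hide only the hyperlinks from the indexer; the comment text stays indexed."""
    return A_TAG.sub(r"<!--htdig_noindex-->\1<!--/htdig_noindex-->", comment_html)
```

This answers Tim's objection directly: the knowledge in the comment remains part of the search engines' index, while the links it carries earn the spammer nothing.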

After all, I have the ability to keep an entire page out of search engines; why wouldn't I be able to refine this down to parts of a page? Including navigation and HTML code that have zero content value for visitors. I've got the feeling that this would show more benefits than drawbacks in the long term.

As for centralizing one's comments, that would be fine for the "blogeois" that we are. But what about people who don't have a site? This is as limiting as telling them that if they want to comment on your posts, they should do it on their own blogs.
