Blog spam: look at Google and Yahoo!

After reading MT Plus Comment Spam Equals Dead Site and news that TextDrive has a huge problem with spam attacks on MT's comment script, I started to write a post about MT and "splog" (as in "spam a blog"). It quickly turned into venting my frustration about the lack of good anti-spam protection in a stock install of MT. It happens that I have a client site on this server, a client for whom I was considering MT as the solution. He knows very well what comment spam is; imagine the embarrassment. And even if his site runs another solution, he will still be impacted by attacks on other MT installations on the same host (probably mine included, should I move there).

Then I calmed down, stopped writing, waited a bit, read the comments around, found Comment Spamalot by Ben Hammersley, and finally reminded myself that Mark Pilgrim was just being realistic.

Venting and screaming seem a natural reaction to blog spam. After all, those scumbags are invading and defacing our personal space; who would not lose their temper? And that's the point: the trouble is with the spammers, who will go as far as they can, scorching the earth for the sake of making a buck (especially in those countries where it is immoral to let the idiots keep their money). But it's not a very good way to think constructively, let alone objectively, about a problem, is it?

As Anil quite rightly puts it in TypeKey-Blacklist Games:

I know it's fun and/or fashionable to give us grief about MT but I can't help but think that part of the reason people linked to your post is not because they wanted to give useful feedback, as you've done with this post, but because they like to pile on in classic blogosphere fashion. We're listening to all the helpful feedback, and I'm hoping that people stay focused on information that helps us all combat this problem.

So, let's try to keep a cold head and think about the state of affairs:

  • Spammers want to increase the PageRank of their sites. They hit blogs (among many other public places) because they can place links easily and at industrial scale by using automated scripts.
  • Spammers don't care about tools; they are not targeting one weblogging tool in particular, they are just after high-PageRank sites. The trigger seems to be a PageRank equal to or greater than 6.
  • Six Apart is between a rock and a hard place because MT is a very popular blogging tool and spammers are after the most popular sites. But other popular tools, such as WordPress, are being hit too and face the same problem. People thinking that switching tools will solve their problem for good are just fooling themselves.
  • Any sort of IP-based ban is completely useless. Spammers fake IPs or use legitimate ones, so your giant IP blacklist will ultimately end up blocking legitimate users (it has already happened to me).
  • Comment moderation, at least à la MT, while efficient at preventing spam from appearing on a weblog, breaks the liveliness of conversations and doesn't reduce the amount of work for the blog owner (quite the contrary, in fact).
  • TypeKey is not the miracle solution. While Anil claims he has zero spam thanks to it, others say that it drastically reduces the amount of legitimate comments because it's a real barrier to many. I don't comment on Blogger blogs because you have to register with Blogger first. I don't want to leave my name and email address at yet another place on the web, because I can't keep track of them all and I don't know what those sites are doing with my data. And, dare I say, I have some doubts that TypeKey is compliant with the European directive on data privacy protection.
  • TrackBack spam is an even bigger problem, on the verge of explosion.

One more comment about TypeKey: the idea of checking the validity of an email address is good. However, it is not the same as registering someone. I would like to have the ability, built into MT or via a plugin, to send a confirmation by email to each commenter. The commenters would have to click a link to confirm their own comments (a sort of auto-moderation). At first sight this would have the same advantage as TypeKey (the obligation to use a valid email address) and would not require registration, but it would also share the same drawback (spammers can create disposable email addresses and automate the process).
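The confirmation flow described above could be sketched in a few lines. This is a hypothetical illustration, not an MT plugin: the secret key, function names, and URL layout are all assumptions; the only real requirement is that the token be unguessable and bound to both the pending comment and the email address it was sent to.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret

def make_confirmation_token(comment_id: str, email: str) -> str:
    """Derive an unguessable token binding a pending comment to an email address."""
    msg = f"{comment_id}:{email}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def confirmation_link(base_url: str, comment_id: str, email: str) -> str:
    """Build the link emailed to the commenter for self-moderation."""
    token = make_confirmation_token(comment_id, email)
    return f"{base_url}/confirm?comment={comment_id}&token={token}"

def confirm(comment_id: str, email: str, token: str) -> bool:
    """Publish the held comment only if the emailed token matches."""
    expected = make_confirmation_token(comment_id, email)
    return hmac.compare_digest(expected, token)
```

The comment stays in a holding queue until `confirm` returns true, which is exactly the auto-moderation described above: the blog owner does nothing, the commenter does the one click.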

In the end, I can't help but think that the solution is actually in the hands of just two companies: Google and Yahoo!. And it's a very easy one. They could simply agree on an extension of the robots exclusion protocol with a granularity finer than a whole site or page: a simple, agreed-upon comment string that would mark, within a page, zones that the search engine should not index. It's not a new idea; many search engines already offer this feature, and here is how it's implemented by HtDig, a popular open-source one:

<!--htdig_noindex-->
...this zone will not get into the index...
<!--/htdig_noindex-->

After all, PageRank is based on the assumption that when a site links to another, it's a sign of interest. It should only be logical, then, to trust the publishers to refine further, within their publications, what they want the search engines to index or not. As for comment spam, I would then place all untrusted comments within those "no index" tags and voilà, there is no incentive for spammers to game the search engines anymore. How many hours to do that, compared to the endless arms race against automated blog spam?
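Applying the idea above in a blog template would amount to one small function. A minimal sketch, assuming the HtDig-style markers shown earlier and a hypothetical `trusted` flag kept by the blog software:

```python
def wrap_noindex(comment_html: str, trusted: bool = False) -> str:
    """Wrap untrusted comment HTML in HtDig-style no-index markers.

    Trusted commenters (e.g. previously approved ones) are published
    as-is and still contribute to the search engine index.
    """
    if trusted:
        return comment_html
    return f"<!--htdig_noindex-->\n{comment_html}\n<!--/htdig_noindex-->"
```

A spammer's link placed this way would be invisible to any crawler honoring the markers, so it earns no PageRank and the incentive disappears.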

Blog spam is just an instrument in gaming the search engines. The keys to end it are in the hands of Google and Yahoo!, not Six Apart nor any other blog software developers.

2 TrackBacks

Links ball. "I'm close to declaring the form of personal web design dead" (sort of via hicksdesign) 42 (via Airbag's Longboard "Blog spam is just an instrument in gaming the search engines. The keys to end it are in the hands of Google and Ya... Read More

Fighting Spam from Dr Dave's Blog on December 18, 2004 10:19 AM

Having spent a sizable share of the past few weeks writing and supporting an anti-spam plugin for Wordpress , I have been extensively reading and cogitating on the issue. While I am ashamed to say that I have not come up with any magic answer to the pr... Read More

4 Comments

This is great feedback, and thanks. I've passed your post on to Jay Allen, and he'll be responding to this soon. We *really* appreciate your help and feedback, and I do agree there's not enough protection in MT out of the box. I think we can fix that.

While you make very valid points, and I agree MT can do more to help users handle this scourge (especially when it comes to TrackBack), the solutions you put forth seem equally flawed.

For instance, if the htdig scheme were implemented, what if a spammer posted something like:

<!--htdig_noindex-->
... valid comments
<!--/htdig_noindex-->
spam comment
<!--htdig_noindex-->
... more valid comments
<!--/htdig_noindex-->

Obviously stripping those markers out of comments would fix that, but you are still excluding all comments from being indexed by the search engines. Personally, some of the best knowledge/content I post to the Internet is in the comments sections of weblogs. To exclude that from the collective knowledge of these search engines seems rather unfortunate.

I've felt that not enough exploration has been done into centralizing one's comments and publishing them remotely. TrackBack could provide the means of delivery, but it would need to mature and be implemented differently than it has been in the past.

Tim, very good points indeed.

Still I'm convinced that this happens only because search engines can be gamed, and they should help in fighting this plague.

Also, what's wrong with the idea of having the power to not endorse a link to another site (in the PageRank + spam era, I mean)?

Tim, I would object first that as soon as you allow strangers to inject content into your site, you become subject to certain risks. The flaw you mention is no different from other code injection techniques (cross-site scripting, SQL injection, whatever) and, of course, one would have to strip out the code to prevent this from happening. This can be handled by the software exactly as today, where I can have it strip out all or some HTML tags, for example.
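The stripping mentioned above is a one-line filter. A minimal sketch, assuming the HtDig-style markers from the post; it removes any marker a commenter tries to inject, so the software's own wrapper is the only one that survives:

```python
import re

# Match opening or closing HtDig no-index markers, tolerating stray whitespace.
NOINDEX_MARKER = re.compile(r"<!--\s*/?\s*htdig_noindex\s*-->", re.IGNORECASE)

def sanitize_comment(text: str) -> str:
    """Strip injected no-index markers from submitted comment text."""
    return NOINDEX_MARKER.sub("", text)
```

Run this on every submission before storage, exactly where the software already strips forbidden HTML tags, and the escape trick in the example above no longer works.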

Secondly, where did I say that I would screen comments out? I want to prevent the links from benefiting from my search engine rank, that's all. And again, exactly as MT rewrites comment URLs to place a redirection, I would have it wrap all links in comments with the "noindex" wrapper. The comments themselves would still be indexed.
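That refinement, wrapping only the hyperlinks rather than the whole comment, could look like this. A hypothetical sketch (the regex is a simplification; real blog software would do this while rendering its parsed comment markup):

```python
import re

# Capture each anchor element so it can be wrapped individually.
A_TAG = re.compile(r"(<a\b[^>]*>.*?</a>)", re.IGNORECASE | re.DOTALL)

def wrap_links_only(comment_html: str) -> str:
    """Hide only the hyperlinks from the indexer; the comment text stays indexed."""
    return A_TAG.sub(r"<!--htdig_noindex-->\1<!--/htdig_noindex-->", comment_html)
```

This answers Tim's objection directly: the knowledge in the comment remains part of the search engines' index, while the links it carries earn the spammer nothing.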

After all, I have the ability to keep an entire page out of search engines; why wouldn't I be able to refine this down to parts of a page? Including navigation and HTML code that have zero content value for visitors. I've got the feeling that this would show more benefits than drawbacks in the long term.

As for centralizing one's comments, that would be fine for the "blogeois" that we are. But what about people who don't have a site? This is as limiting as telling them that if they want to comment on your posts, they should do it on their own blogs.
