Fighting Spam

Having spent a sizable share of the past few weeks writing and supporting an anti-spam plugin for WordPress, I have been extensively reading and cogitating on the issue.

While I am ashamed to say that I have not come up with any magic answer to the problem (and don’t think anybody ever will), I nonetheless have lots of remarks and ideas I’d love to share. And given the appallingly low level of certain discussions on spam I have read lately, I figure it couldn’t hurt.

The Root of All Evil

As we all know, the root of this dark threat lies in the incentive offered by Google to spammers, through its PageRank system. If Google stopped indexing comment URLs, spammers would have no reason to spam and would stop spamming.

So why doesn’t Google do something about it?

Some well-meaning bloggers seem to think there’s a clear connection between Google’s own blogging platform and their reluctance to alter their algorithm, blissfully ignoring a few obvious refutations of that conspiracy theory:

First, there is no easy and foolproof way for Google to skip URLs in a blog comment area only and parse everything else correctly. It would require detecting that the page is a blog, then figuring out which part of that page contains comments in order to skip the URLs. No need to be a genius to see that this would not be a minor alteration to the ostensibly complex parsing algorithm in use by Google, assuming it could be done at all. And if you answered something along the lines of “but it would just have to look for [insert your blogging platform-specific tag here]”, I’ll let you figure out why this is not even worth considering.

Second, it has been clearly established by now that Google’s approach is to intervene as little as possible. This means they do not skip even the most repugnant websites, do not censor (ok, let’s assume you do not live in China for the sake of argument) and overall do not bend over backward to satisfy any company’s specific needs or desires. And while debatable, this policy is considered a good thing by most, and probably what made them so popular in the first place.

Hence, thinking that your pretty little blog, or the blogosphere as a whole, for that matter, warrants a change of policy from Google that could potentially do much more harm than good is both shortsighted and incredibly self-important.

More importantly, if you think Google is solely to blame for this problem, why don’t you simply block crawlers in your robots.txt file (or add a noindex meta tag to your pages) and get it over with?

Speaking of Google’s own blogging platform, may I kindly remind the whining ones that Blogger itself doesn’t get any special treatment from the Googlebot, yet it is entirely free of spam.

How do they do it?

Well, by using the one simple and foolproof anti-spam technique that’s been available all along and is still, to this day, not implemented by any other major platform:

URL Redirection

If you remove direct links from comments, you remove the PageRank boost; if you remove the PageRank boost, you take away any incentive for blog spamming. Provided you also slightly alter the comment mechanism of your platform, so that existing spambots need at least a minimal amount of work to keep spamming it, you will most likely hit the threshold where the reward is no longer worth the pain for insidious spambot writers the world over.

And, unlike any other spam filter you could ever come up with, this is a lasting victory.

Yep, that’s true, there is a method, and it is:

  • Easy to implement
  • Perfectly transparent for users (the inconvenience of not having the actual URL displayed in the status bar can be resolved in a number of ways)
  • Absolutely 100% effective

Of course, for this method to have any point, it needs to be adopted at the platform level. That is: tools like MT or WP need to come with this feature enabled by default (and a strong warning against disabling it). Anything short of that is pointless.
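
To make the idea concrete, here is a minimal sketch of what link redirection could look like when a comment is rendered. It is only an illustration: the `/go?url=` endpoint, the `example.org` host and the function name are assumptions of mine, not the way Blogger or any existing platform does it, and the regular expression only handles plain quoted `href` attributes.

```python
import html
import re
from urllib.parse import quote, urlsplit

# Hypothetical local endpoint that answers with an HTTP redirect to the target.
REDIRECT_PREFIX = "/go?url="

# Matches href="http://..." or href='https://...' inside comment markup.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(https?://.*?)\1', re.IGNORECASE)


def redirect_comment_links(comment_html: str, own_host: str = "example.org") -> str:
    """Rewrite external links so they point at the local redirect endpoint."""

    def swap(match: re.Match) -> str:
        quote_char = match.group(1)
        url = html.unescape(match.group(2))
        try:
            host = urlsplit(url).hostname
        except ValueError:
            host = None  # malformed URL: treat it as external
        if host == own_host:
            return match.group(0)  # links to our own site stay direct
        target = quote(url, safe="")  # percent-encode the whole target URL
        return f"href={quote_char}{REDIRECT_PREFIX}{target}{quote_char}"

    return HREF_RE.sub(swap, comment_html)
```

For the trick to remove the PageRank incentive, the redirect endpoint itself would also have to be kept away from crawlers, for instance by disallowing its path in robots.txt.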

The main argument against URL redirection is that it “breaks the web”: it uses link tags in a way that was not intended originally.

To which I first ask: what about the millions of PG-13 links pervading every corner of the web where they are absolutely not welcome? Isn’t that breaking the web too?

In addition, there is plenty of room for debate about whether blogs themselves are not already breaking the web to a certain extent: redundant linking of websites through the commenting system, for no other reason than participation in a discussion, is not exactly helping to build a meaningful semantic web.

More to the point, while I am not particularly enthusiastic about such a trick, it is quite obviously the only end-all, catch-all solution. If you consider spam to be your number one priority, then you shouldn’t even hesitate.

Yet, I can bet a few bucks that we will never see it implemented by any major existing platform, and if you must know, I do not think standard-compliance is the real reason why (I’ll let you guess for yourself).

But at any rate, please stop running around fretting about the end of the [blogging] world: there is a solution, and the day people are really tired of spam, they will use it. Until then, shut up or do something about it.

Wasting Time and Effort

Assuming we forget about URL redirection for now, there are new plugins and hacks being pushed every day, promising easy and durable solutions to the problem in just a few steps (cf. comment form changes, hidden fields, changing file names, .htaccess IP deny lists, etc.), yet they usually give in to the flow of spam after a few days, or weeks at the most. Ironically, the most braindead protection tricks, thanks to their very limited adoption, tend to last longer, which is not to say they are not eventually broken.

Now here is a tip for anybody devising a way to fight spam: if it takes 5 extra lines in a spambot script to work around your protection, you are utterly wasting your time and, incidentally, the time of everybody you recommend that trick to. As it stands, a good 90% of all the anti-spam plugins/hacks I have read about could be broken by a spambot written from scratch in an hour (no, I am not planning to write one to prove my point, you’ll have to take my word for it).

Half-assed protections are not only pointless, they also do a great disservice to the community as a whole: they divert energy from effective spam fighting and lead many inexperienced bloggers to give up on the whole effort after growing weary of all the “instant spam-fighting” solutions that only seem to work for 3 days and let thousands of spams through on the fourth.

Which is not to say that all these simple filters are not worth keeping (something as braindead as a check for entities in URLs can weed out a lot of spam easily); it’s just that, obviously, they need to be aggregated.

It still beats me why, when a developer releases a new plugin, he is so convinced of its standalone value that he doesn’t deem it necessary to implement any other side checks.
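
As an illustration of how little code such a “braindead” check needs, here is a rough sketch that flags comments whose links contain numeric HTML entities. The function name and the patterns are mine, and this is not the actual check used by WordPress or any plugin; it simply assumes that legitimate commenters have no reason to entity-encode the characters of a link.

```python
import re

# Any quoted href attribute value in the comment markup.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)

# Numeric entities such as &#99; or &#x63; (named ones like &amp; are left alone).
NUMERIC_ENTITY_RE = re.compile(r"&#(?:[0-9]+|x[0-9a-f]+);?", re.IGNORECASE)


def has_entity_encoded_link(comment_html: str) -> bool:
    """Return True if any link target in the comment hides characters behind numeric entities."""
    return any(
        NUMERIC_ENTITY_RE.search(match.group(2))
        for match in HREF_RE.finditer(comment_html)
    )
```

On its own this only yields a yes/no answer; its value grows once the result is fed into other checks rather than used as a standalone verdict.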

I shouldn’t even have to defend the point of aggregating filters, seeing how the benefits are obvious, but here are a few examples:

  • If you use any kind of banning/blacklist system, what a lesser filter catches can be reused to block any further attempts.
  • Many filters that will stop zero spam when used by themselves will catch most if combined with a single other filter. For example, IP bans are pointless, unless you also use some sort of encrypted payload in the form to prevent IP spoofing.
  • Combining spam checks together allows much greater accuracy, preventing both false positives and excessive moderation.

For these reasons too, fragmenting the protection by installing many uncoordinated plugins is highly counterproductive: remember that thing about a chain and the weakest link… Here the weakest plugin can be either the one that wrongly flags legit comments, or the one that approves them and short-circuits the other checks. This holds particularly true for built-in checks released with the blogging platform: unless they are effective at stopping all spam, everything they stop is potentially useful blacklisting data that is kept from the plugins.

This is, for example, why WP 1.2.2’s improvements actually weaken the protection brought by Spam Karma: it adds a basic filter for HTML entities (&#123;, etc.), which is a very safe way to catch many spams, but also stands in the way of advanced data harvesting by more complete plugins. A typical example of a good idea that ends up being detrimental to the effort as a whole because of its restricted scope.
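
The coordination argument can be sketched as a single pipeline in which every check contributes to a shared score and whatever gets caught feeds a shared blacklist. This is only a toy model under my own assumptions (the class names, weights and threshold are made up) and not the actual design of Spam Karma or Spaminator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Comment:
    author_ip: str
    body: str


# A check returns penalty points; 0.0 means "nothing suspicious".
Check = Callable[[Comment], float]


def many_links(comment: Comment) -> float:
    return 1.5 if comment.body.lower().count("http://") > 2 else 0.0


def numeric_entities(comment: Comment) -> float:
    return 1.0 if "&#" in comment.body else 0.0


@dataclass
class SpamPipeline:
    checks: List[Check]
    threshold: float = 2.0  # arbitrary illustrative value
    blacklist: Set[str] = field(default_factory=set)

    def is_spam(self, comment: Comment) -> bool:
        # A previously caught IP starts at the threshold; otherwise start at zero.
        score = self.threshold if comment.author_ip in self.blacklist else 0.0
        # Every check runs and adds to the score: no single check decides alone.
        score += sum(check(comment) for check in self.checks)
        spam = score >= self.threshold
        if spam:
            self.blacklist.add(comment.author_ip)  # reuse the catch later
        return spam
```

With these made-up weights, a comment that trips only one weak check gets through, one that trips both is caught, and any later comment from the same address is blocked even if it looks harmless.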

There is absolutely nothing revolutionary about the way plugins like Spam Karma or Spaminator work: it’s all obvious stuff, and most of it had already been done in one way or another somewhere else. Yet they seem to be doing a much better job, at the moment, than the bulk of other spam-fighting plugins.

Mind you: I have not an ounce of doubt they can, and will, be broken by spammers at some point. I just think the threshold is exponentially higher than for single-filter plugins, merely because they do not put all their eggs in the same basket. This is something that could easily be done by other plugins.

Failing that, if they do not want to invest the time, plugin developers should probably just save themselves the pain and not waste everybody’s time on yet another “hidden field in the comment form” hack…

(To be continued…)

Next Episode: Building an Efficient Cooperative Blacklisting Network


4 comments

  1. I think they should just hang all spammers up by their pink bits and let them be publicly flogged by those who hate spam with the heat of a nova... or spammers should be forced to share their profits with those who have had to pay lots of money to counteract the negative effects of spam. Also, anybody who has supported a spammer by purchase of their goods, purchasing email lists or developed spamming software should also be severely spanked. ggrrr

  2. hey man, just wanted to say thanks for being out there doing and thinking this shit. I’m in the process of moving from mt to wp1.2 and my TEST installation that wasn’t even open for business got hundreds of spam before i even used it. Glad to know someone is out there fighting for the community.
