The State of Spam [Karma]

First blog update on Spam Karma, WordPress development and Spam in many months, and a crucial one at that. Being notoriously verbose to the point of irrelevance, yet with lots to say today, I have tried to provide a telegraphic sum-up below. Feel free to skip it and go straight to the parts you may care about (hint for the busy ones: the plot thickens mostly around parts 6 and 7).

1. How well is SK2 stopping spam currently?

Pretty damn well, thank you.

2. What’s wrong in the peaceful Kingdom of SpamKarmia then?

A new breed of Evil has been summoned and is threatening to breach in.

3. How evil?

Very Evil… and powerful.

4. Won’t anybody show up and save the day?

Doubtful…

5. Is there really nothing you can do?

Of course there is.

6. Then why aren’t you busy doing it, you lazy bastard?

Here is why: …

7. You wouldn’t leave us to die here, would you?

Watch me.


And now for the details:

1. How well is SK2 stopping spam currently?

If you’ve been using SK2 for a while, you know it’s working pretty damn well. Over the past year, on the different blogs I manage (some of which receive a steady stream of both legit and spam comments, TBs and PBs), over 99% of spam was caught, with a false-positive rate under 0.1% (pretty much zero, actually).

The only spam comments that made it through were usually posted manually: that is, a human would browse to the site, maybe even read the post, and post a topical comment looking nearly like ham, save for a blatantly “commercial” site linked in the URL field. These were nearly impossible to stop, as SK2 works 90% on detecting spambots and relies only moderately on blacklisting (which helps keep its false-positive rate extremely low).

These “manual” spams, though, never were much of an issue, as the essence of spam is automation, without which it loses all its appeal: assuming it takes a few seconds for an admin to manually moderate a spam, and given the numbers of bloggers vs. spammers, anything under hundreds of spams per second is just not worth a spammer’s time.

Another important thing to understand is that SK2 learns and improves: flagging the spams it let through helps stop the next ones. It is fairly normal for a fresh install to let a few spams through at the beginning, but flagging them, and thus allowing SK2 to build its blacklists and pattern lists, should immediately improve the catch rate dramatically.
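
A minimal sketch of that learn-by-flagging loop, for illustration only. It is written in Python rather than SK2’s PHP, and the `Comment` and `LearningBlacklist` names, the fields and the hit counting are all assumptions made for the example, not SK2’s actual code:

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")


@dataclass
class Comment:
    """Hypothetical comment record; field names are illustrative."""
    author: str
    author_url: str
    ip: str
    content: str


def domains_of(comment):
    """Collect the host names linked from the URL field and the comment body."""
    found = set()
    for url in [comment.author_url] + URL_RE.findall(comment.content):
        host = urlparse(url).hostname  # None when the URL has no scheme/host
        if host:
            host = host.lower()
            if host.startswith("www."):
                host = host[4:]
            found.add(host)
    return found


@dataclass
class LearningBlacklist:
    """Grows every time the admin flags a spam that slipped through."""
    bad_domains: set = field(default_factory=set)
    bad_ips: set = field(default_factory=set)

    def flag_as_spam(self, comment):
        # Remember everything this spam advertised and where it came from.
        self.bad_domains |= domains_of(comment)
        self.bad_ips.add(comment.ip)

    def hits(self, comment):
        # Count how many known-bad domains and IPs the comment matches.
        count = len(domains_of(comment) & self.bad_domains)
        if comment.ip in self.bad_ips:
            count += 1
        return count
```

The first spam for a given site scores zero hits and gets through; once it is flagged, a later comment advertising the same domain from a different IP is matched. That is the whole point of flagging conscientiously.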

2. Then why have I seen so much spam going through lately?

Unfortunately, as some of you might have noticed, SK2’s performance, as seen from the outside, seems to have dropped suddenly over the past few days. While the bulk of the spam still remains at the door, a meaningful percentage now manages to fly right through SK2’s basic filters. And given the numbers involved, even 1% of all spam attempts is a lot to deal with. There again: SK2’s blacklists learn, and conscientiously flagging each uncaught spam should help keep things under control, but this is still a major quality drop from SK2’s usual performance.

The reason for this sudden burst is a new breed of spam, or more likely, of spambots. It is now confirmed that some spammers have gotten hold of much more efficient spamming tools. Ones that bypass some of SK2’s strongest filters without trouble.

Also of note is the fact that Trackbacks and Pingbacks are absolutely unaffected by this issue (although a small unrelated bug was fixed in the latest SK2.1 releases and you may want to upgrade again from the site: more on this later).

3. How does this new spambot generation work?

This is a very difficult question, since it involves lots of guessing and detective work. Pretty much like in a war, we do not have access to the enemy’s weapons designs. A very uneven war, actually, since the enemy does have access to ours.

There are ways, though, to gather information about what spambots do, and try guessing how they do it.

[long and uselessly detailed technical droning: you probably want to skip that if you aren’t an anti-spam plugin developer yourself:]

First of all, these spams do not present most of the idiotic traits of their lower colleagues: they do not try cramming hundreds of URLs or inserting hundreds of easily spotted junk keywords in the comment content. Instead, they use only the dedicated name and homepage fields to sneak in spam URLs and keywords. The comment content is often perfectly innocuous, sometimes even topical (by copying parts of another comment or a trackbacking post). All in all, these spams could easily be missed by a human moderator who wouldn’t look carefully at the contact name and URL.

When dissected in the http server logs, the spam looks strikingly human-generated: queries for all the files (pictures, css, favicon and javascript included), sometimes a valid referrer URL is provided, links are followed (e.g. from the frontpage to a specific post), and the user-agent, of course, is valid and claims to be a regular browser. Timestamps generated by a single spamming IP even seem to point to a typically human, erratic way of browsing. Most importantly, the spam bypasses SK2’s Javascript filter, which indicates an ability to parse javascript.
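
For readers unfamiliar with the kind of check being bypassed here, a Javascript filter can be sketched roughly as follows. This is an illustrative Python sketch, not SK2’s implementation: the challenge format, the `make_challenge` and `passes_js_check` names and the timing thresholds are all assumptions. The server signs two numbers, a script in the page multiplies them and writes the result into a hidden form field, and the server verifies the answer when the comment is submitted.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-site secret; a real plugin would persist it between requests.
SECRET_KEY = secrets.token_bytes(32)


def _sign(a, b, issued_at):
    """Sign the challenge so the server does not have to store it."""
    msg = f"{a}:{b}:{issued_at}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


def make_challenge(now):
    """Build the values embedded in the page; `now` is a Unix timestamp
    (for example int(time.time()))."""
    a = secrets.randbelow(900) + 100
    b = secrets.randbelow(900) + 100
    return {"a": a, "b": b, "issued_at": now, "sig": _sign(a, b, now)}


def passes_js_check(challenge, answer, now, min_age=3, max_age=3600):
    """True if the hidden field holds a*b, the challenge is untampered,
    and the form was neither submitted instantly nor long after loading."""
    expected_sig = _sign(challenge["a"], challenge["b"], challenge["issued_at"])
    if not hmac.compare_digest(challenge["sig"], expected_sig):
        return False
    age = now - challenge["issued_at"]
    if not (min_age <= age <= max_age):
        return False
    return answer == challenge["a"] * challenge["b"]
```

A bot that only scrapes the HTML form never produces the answer. A bot that actually runs the page’s scripts and waits a few seconds passes every test in this sketch, which is consistent with what the logs suggest.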

However, looking closer at timestamps and a host of other small details, I am fairly certain these aren’t posted by a human, but are indeed a new breed of spambots. There are many ways I can think of to make such a spambot with javascript-parsing ability and other “mimicking” skills… In fact, I’m just surprised it hadn’t been done before. But this new development is also worrying, as it seems to indicate that spammers have finally gotten hold of real coders to do the job: whereas previous spambots could have been the work of any random script kiddie with half a brain and a vague knowledge of scripting, these seem a bit more thought out in their design and their implementation. This is particularly worrying as I do not currently know of any anti-spam system that I, or a similarly skilled coder (that is: not that incredibly skilled), couldn’t eventually force a way through.

So far, the overall dumbness of spambot programmers gave anti-spam plugins a very easy edge. Things will change if real coders start taking an interest in this no-doubt very lucrative market and start churning out efficient spambot programs for the spam monkeys. And do not doubt for a second that there are, or will be, such black hat developers in this market (the same way there are in other domains of internet spam)… Even if Mark Pilgrim was slightly off the mark in his apocalyptic sum-up of the situation, he was certainly right on one point: there is huge money involved, certainly enough to pay the hourly services of a decent professional coder… perhaps even [cue ominous strings on the soundtrack] a coder already involved in the blogging community.

No, not me (unless I’ve been sleepcoding again).

4. Will any other anti-spam tool fare better than SK2 with this particular spam (or spam in general)?

First off, SK2 is hardly out of the game: even as it is, and with a few tweaks, it can easily be brought back to a satisfying, if not perfect, level of protection. Not to mention a possible harder, faster and better successor to SK2 (more on that later).

As for the rest.

You’ll have to believe me when I say I truly wished for a better offer in anti-spam tools. Far from seeing it as some sort of “competition” (to what? a product I am neither selling nor making any revenue off?), I consider diversity in spam-fighting tools the most efficient way to fight spam. The same way bio-diversity is your guarantee against viruses and germs, presenting a wide array of defense tools to spammers means they can less easily focus their attention on one in particular and try to break it.

What we really do not need, however, is yet another blissfully ignorant moron releasing some stupid 5-line, 3-year outdated, kiddie trick that will not fool a single spammer and waste hours of users’ time. Unfortunately there are a lot of these. So let me go through a quick roundup of what worked, works, and never worked. I’ll skip the details for today, so you’ll have to take my word when I say that:

  • Captchas: work. Despite the ultra-theoretical “captcha breaking” scheme urban legend, spammers aren’t about to break a captcha on your blog. The big downside of Captchas, is that they are extremely user-unfriendly, intrusive and most of all: hurt accessibility (how do blind users do?).
  • Pretty much any other plugin won’t work. Blacklists, “spam words”, stupid script renaming tricks and all: all pretty useless taken one by one. Some used to work years ago; all have been successfully broken by spammers. Some are even dangerous because of the number of false positives they yield. Just save your time and skip them. Javascript payloads also likely won’t be working (I’d love to hear from anybody currently using such a type of plugin, but I’m pretty sure of this one).
  • Bad Behavior will not stop these specific spammers. For the simple reason that BB is not designed to filter spam. It is only meant to stop the 70% stupid bots that do stupid things. Unfortunately bots are getting smarter, and the ones you wanna worry about are in the top percent of these 30%, thus far out of reach of BB.
  • Akismet works. Roughly with the same result rates as SK2. Possibly a slightly higher catching rate, but also a higher false positive rate (which is a big no-no, in my opinion, but that’s up to you). Other concerns generally thrown around include privacy, reliability and terms of use (it is free, but you are entirely dependent on a third party server). My personal issue is that I am doubtful of the long-term resilience of a monolithic DB such as Akismet’s when confronted with both Denial of Service attempts and data poisoning. There is some breathing room until spammers turn their unbridled attention to these weaknesses, but the fact that Akismet is now bundled with WP will only accelerate things.

As you can tell, there is scant little out there, only a few plugins that all fare somewhat on a par with SK2, all with their pros and cons. Most important of all, there is currently nothing I wouldn’t feel confident breaking through, were I to start in the business of spamming tomorrow…

Just wire the amount to my swiss account.

I kid.

6. Is there really nothing you can do?

Of course there is.

I have a very fertile imagination, and still a couple of tricks to throw in the way of the spamming monkeys, ranging from small bits of tweaking all the way to major, insane and quite possibly breakthrough concepts. Very few in the middle actually. Problem being of course that the more potentially efficient tools would also tend to be the more time-consuming, hazardous ones.

Let me try to sum up the whole state of Spamdom such as I see it, with a tedious numerical analogy:

Say spam-protection goes from 1 to 100, where 1 is “sitting duck”, and 100 is “so protected that Houdini himself wouldn’t get a spam through”. Now let’s say most anti-spam plugins tend to hit somewhere in the 1-10 range, with a few, such as Akismet or SK2, hitting something like a 20 (perhaps also rising a bit as time and improvements went).
Simultaneously spamming techniques have also been adapting and improving, and it’s fair to say they are now approaching a 20, and steadily rising. Essentially, spammers are lazy (or pragmatic, depends on how you see it) and their target is to be just above the anti-spam barrier, not much higher.
Now, among the anti-spam tricks left in reserve, I’d say I got a few small ones that should without too much effort bump SK2 a few points up (with compounded effect, something like a 25), which is nice, but certainly won’t buy more than a few weeks/months.

Since they are also by far the easiest ones to implement, I am already working on them.

There are two other separate projects I’ve been toying with, testing and prototyping: a first one involving a somewhat novel approach to Naive Bayes filtering (definitely not on comment content), which would be a definite +10 on our SpamScale, and another, considerably more complex and difficult to explain in detail, that could be crudely summed up as a P2P Blacklisting system.

That last idea I have been thinking through for a looong time now. I have some confidence that it may hold the key to the End of Blog Spam as We Know It… A definite +50 on our scale…
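
To make the first of those two projects a little more concrete, here is a minimal Naive Bayes sketch in Python. It is purely illustrative: the post only says the filter would not work on comment content, so the choice of features here (words from the name field and pieces of the URL host) is an assumption, as are the `metadata_tokens` and `NaiveBayes` names. It is not the approach being prototyped for SK2.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def metadata_tokens(author, author_url):
    """Turn the name and URL fields (not the comment body) into tokens.
    The feature choice is a guess made for this example."""
    tokens = ["name:" + w for w in re.findall(r"[a-z0-9]+", author.lower())]
    host = urlparse(author_url).hostname or ""
    tokens += ["host:" + w for w in re.findall(r"[a-z0-9]+", host)]
    return tokens


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over two classes."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def spam_probability(self, tokens):
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Smoothed class prior.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            # The extra +1 leaves room for tokens never seen in training.
            denom = total_tokens + len(vocab) + 1
            for tok in tokens:
                log_p += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = log_p
        # Normalise in log space to avoid underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])
```

Trained on a handful of flagged and approved comments, such a classifier scores a new comment from its name and URL fields alone, which are exactly the fields the new spambots abuse.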

Of course, these last two are also the ones that will take a serious time investment before I can even figure out whether I can do something with them… Which takes us to the one and only question you all care about:

7. Why aren’t you busy working on the next anti-spam solution before this spam thing becomes out of control?

Well, because as I said above, it is a lot of work. Work that would add on top of the already heavy SK2-related workload I deal with daily. Don’t get me wrong, as I’ve stated previously: I love developing, I love developing SK2 and most of the time I love hearing from you (even if sometimes I get irrepressible urges to ram online manuals down some throats). But being a fully human carbon-based entity with little photosynthesis ability, I happen to need food near-daily…

Also, due to recent life changes, I am now a tad busier (being a full-time student) and much poorer (being a full-time student) than before. Hence the regrettable need to prioritize work that either feeds me or keeps my university peers and professors content.

Can you tell where this is going? No? OK:

To make it short, I am launching a Fund Drive.

The idea is simple: if you use SK2, if you like it, if you’d like to see more of it in the future, if you’d like this future to be sooner than never, if you’d like to help fund the crack habit of a starving student who also happens to dedicate way too much of his free time to eradicating spam, if you think this is worth a few cents, hell even a few dollars, if you can afford to spend this money without robbing your kid or your cat of their next birthday present… Consider donating:

$2.00
$5.00
$10.00
$20.00
$30.00
$50.00
$666.00

There are currently a few thousand of you actively using SK2 (yep, crazy huh?)… I figure if we weed out the cheapos and those who honestly can’t afford it, plus those who consider their small use of SK2 not worth a monetary contribution (hey, I don’t pay for all my shareware… I’m nobody to cast the first stone), that might still leave a few dozen of you? If each one contributes a few bucks, that should be enough for me to justify spending a few weeks working on SK3 rather than flipping burgers to pay for booze (and occasionally food and rent).

Non-monetary donations of any sort are all gladly accepted: food specialties from where you live (especially if it’s distilled and drinkable, but the solid kind is cool too), postcards and anything else that won’t cause a police raid on my place at 6 in the morning… Note that due to recent health regulations, I can no longer accept your first-born child in payment for services, but thanks for offering.

If, like me, you are a starving student who cannot afford to divert any of your drug money to pay for my costly addictions, then consider donating some time. There will be need for it: mostly in doc writing (FAQs, user guide, maybe even a support forum at some point, since the whole 2-hours-of-emailing-a-day thing is becoming a bit tedious). Just put your name in and my people will get in touch with your people when the time comes.

If making a donation, please provide a nickname (if you don’t want your full name to be used) and your blog’s address, as I will probably make a donation page to list all those (if any) who donated.

8. Would you seriously stop developing SK if you don’t get money?

Of course not.

But it is unfortunately true that I will have to lower my involvement with anti-spam dev in favour of more, err, survival-oriented activities. Obviously, I’d much rather be paid for something I love doing (like squashing spam and spammers) than any random job… But it isn’t much of a choice.

I guess I should set some sort of imaginary milestones in terms of funding and how far/fast it would take me on the SK3 development trail, but I’d rather not look like a complete moron when only a fraction of it has trickled in at the end of the month… So I’ll just give you my word that I’ll do my best with what I get, and probably with what I don’t get either…

No matter what happens, I will be releasing SK2.2 (with minor tweaks and bug-fixing) at some point… Hopefully within a week… The two bigger components will honestly depend on how much interest they raise and the time I can afford to spend on them (we are talking at least month-long projects)…

Oh, and let me remind you that donations are not, I repeat: not, mandatory in any way whatsoever.
This is not a change in licensing: SK2 is and will remain free for all non-commercial use and redistribution (note that you can still use SK2 on a commercial blog, the only restriction is on packaging and distributing or otherwise selling SK2 for profit: in which case I ask that you contact me for permission first).

I also wanted to take the occasion to thank very sincerely all those who have already donated money, time or simply kind words through email: you have made my day on many occasions, and helped make it worth it so far.

Thanks a lot and do not hesitate to spread the word!

Filed under: WordPress

69 comments

  1. Hello,
    Comment spam is really a rising issue nowadays, and we need to take serious steps against it. I am doing a project on filtering comment spam, for which I need some comment spam samples, but I am unable to get any. Could you please help me by sending some to my email?
    It would be of great help to me.

  2. SK2 sounds very nice. I am contemplating using wordpress here soon to try it out. One tool you might want to look at, if you haven’t already heard about it, is the Pivot Blacklist (http://www.i-marco.nl/wiki/pivot-blacklist) which the Pivot weblog tool (www.pivotlog.net) uses. It uses OSA, HashCash, and a “SillyQuestion” type of quiz. So far, this is the most effective anti-spam tool i have ever seen, and that is with just 1 of the options enabled. I can look at my log and see every spam comment blocked, including the “friendly” spam comments. An example of what I see:

    March 23, 2006, 9:20 pm 60.56.229.13 blocked hashcash violation: (Your site is amaizing. Can I share some
    March 23, 2006, 9:35 pm 200.242.249.70 blocked hashcash violation: (It looks like you really had a nice time
    March 23, 2006, 10:39 pm 68.87.76.148 blocked hashcash violation: (I like your website alot…its lots of f

    And it just keeps on going. Check it out though. Maybe there is something from it that you can add, or vice versa. Marco, the author, also makes WP plugins.

  3. Let me jump in, being the author of Pivot Blacklist. HashCash is originally a WP plugin created by Elliott Back. While it still works rather well I believe it’s a dead end. As Dave indicates, bots are getting too smart. If a bot runs the javascript, nothing’s gonna stop it.

    What’s still quite effective though is the ‘spam quiz’ idea. A bot will have a seriously hard time answering trivial questions, especially if each blog uses a different one. It defeats all spam except for manual spam of course. If your blog happens to be in a non-english language the protection is even better because manual spammers will need to understand the language the question was written in 😉

    I must admit though, I haven’t seen much of the really smart bots yet. My own blog doesn’t have any protection enabled at all at the moment. I use a custom AJAX comment submission scheme. There’s no form action to be scraped. Instead the form submission is hidden inside the javascript. A bot which executes the javascript would be able to beat this but… so far this has never happened on my weblog. When it does I’ll need to throw in the spam quiz thing.

    If you combine the spamquiz thing with a cookie to remember the answer I think it’s as unobtrusive as possible. Unlike captcha’s, blind people can use it and it can’t be beaten with any script (except for hardcore AI maybe). It’s a ridiculously lo-fi solution compared to SK2, Akismet, Hashcash or AJAX stuff but it works INCREDIBLY well.

  4. “Captchas: work. Despite the ultra-theoretical “captcha breaking” scheme urban legend, spammers aren’t about to break a captcha on your blog. The big downside of Captchas, is that they are extremely user-unfriendly, intrusive and most of all: hurt accessibility (how do blind users do?).”

    How do blind users do what? Some Captcha implementations also offer to have the characters spoken through a generated wave file (click on the picture to hear the character sequence).
    A little algorithm can add random patterns of noise to the audio, so that the spoken character is still audible enough to be recognised by the human ear while fooling the Fourier-based algorithms that would try to analyse the audio and extract the character from it.

    Blocking IPs is another thing. I know there are a lot of open proxy servers: why can’t they be blacklisted using a globally deployed detection protocol, similar to the one used for open-relay mail servers?

    There are various sites that publish open (or anonymous) proxy servers, 2 hints:
    http://tools.rosinstrument.com/proxy/
    http://www.atomintersoft.com/products/alive-proxy/proxy-list/

    You can make use of those sites by updating your ip-blocking list.
    I know it’s not friendly to block proxy IPs, but I call it stupid of server hosts to leave their servers wide open in the first place.

    To prevent nagging from victims of a false positive, avoid accusatory wording and instead inform the user that his/her submission was denied because one of the following situations applies:
    -User uses an IP from an open proxy server that is blacklisted
    -User uses a (dynamic) IP that has previously been registered as a source of spam.
    -Submitted URLs are listed as spam sites or contain information irrelevant to the contents of this site
    -Too many spam-related keywords were detected in the submitted text.

    It sounds formal yet will bring a lot more understanding than shouting an accusing phrase (“You are a dirty spammer!” or whatever similar a legitimate poster is being confronted with).

    Regards,

    Vince.

  5. Thanks champ – I was about to go insane after about a week of ridiculously high and increasing levels of blog spam (besides the normal email spam I get in my 10+ mailboxes) … so you’ve returned sanity to at least one part of my life.

    Sorry for the stingy donation – but we can’t have millionaire WordPress plugin developers now, can we 🙂

  6. Thank you, thank you, thank you. Seriously. I know I didn’t donate much cause, well… I don’t have much money. I do, however, write documentation at work for web applications and stuff, so if you need help with documentation or FAQ’s have your people call my people and uh… yeah.

  7. Hi. I use email as a primary communication tool, and I am a low capability computer user. Is your program appropriate for me, or is a blog site a different kind of application. I would appreciate any help or advice you would have time to share, and I would be happy to donate/pay. I am a medical researcher, but receive spam from a multitude of sources. Thank you for any advice or help. John Tarvin

  8. Thanks for this great plugin. I plan to switch to it from Akismet (no clue how the third party server it relies on will remain fast and last in the long term).

  9. Thanks for helping us keep up, in the anti-spam arms race.

    I have noticed a fair number of people (or robots) finding my site by searching for the SK2 “spams eaten” footer. Perhaps those are bots targeting SK2 protected sites specifically.

  10. Anti-Spam is a joke to be honest. I hate spam but please let me tell you a story.

    I created an application to send text messages using http://www.vodafone.ie. Now this you say is no impressive task but let me continue.

    Their system is setup much like an anti spam system. It checks the time it took you to input a text, the time it took you to log on, the time it took you to take a cup of tea and everything else it can.

    Wanna know how I got around it? Simple… I created an application which basically mimics a user. First it opens http://www.vodafone.ie, it then waits 4 seconds and then inputs the user and password and clicks submit. It then waits a couple of seconds and clicks “SMS Messages” and then allows you to enter as many characters as you want. Their version only allows 160 so what mine does is it splits the text into 154 length chunks and adds a … to the end and beginning of each text. It waits about 5 seconds between each text to make the server think it’s an actual person typing it in. And know what? I know for a fact there is absolutely no way to tell the difference between it and an actual person without causing more harm than good.

    I know it’s not pretty but face facts o.O

    Replies to email only.

  11. Hi,
    I love SK better than Akismet dr dave ;-). Unfortunately, I, too, am a student so money is pretty tight for me. So in the mean time, I can only help you with a dirt and mortar help such as writing FAQ or attending forums a couple of hours every day (or week?). Just let me know if you need any help.
    thanks.
