I keep this blog mostly as a personal journal, so I won't forget what I worked on, and to keep my writing skills up. I never thought anyone would be seriously interested in what I wrote, so I've paid little attention to the comments on my posts. To keep the steady flow of spam to a minimum, I set up Spam Karma 2 and stopped worrying about it.
Just now, when I entered the admin area for a routine upgrade, I noticed that besides the 2500 spam comments I also had about 30 not marked as spam, and I became curious. Are these real comments, or have they bypassed the spam filter?
I've blogged about some "now" topics, like time-lapse and Android tablets, and it seems these are a kind of sweet spot for those who just want to sell their junk, be it non-prescription medicine or fashionable cheap sunglasses.
The reason I think this deserves its own post is that, reading through some of the comments, I found that several different spam engines are in use, and their comments vary in how relevant they are to the post. There is always a link given as the comment author's own site, which is the page the SEO is for. There is also an e-mail address that is valid, but usually machine-generated.
The simplest of all is just an elaborately worded congratulation on the article, or a promise to distribute it on other channels like reddit. It's just social engineering: it's all about flattering the author, so he will show it among his other comments. These kinds of comments use a template and are so bland they can be used on basically any post. The fixed template is their main weakness: even the simplest spam filters can be trained to detect it. How did they get through? Probably my spam template database didn't yet contain the exact template when they arrived.
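To make the fixed-template weakness concrete, here is a minimal sketch of how such a check might look. It is not how Spam Karma works internally (I haven't looked). The template list, the `matches_template` name and the 0.8 threshold are all my own assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical templates; a real filter would collect these from reported spam.
KNOWN_TEMPLATES = [
    "great article i will share this on reddit with my friends",
    "thanks for this wonderful post i have bookmarked your blog",
]


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def matches_template(comment, threshold=0.8):
    """Return True if the comment is nearly identical to a known template.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    cleaned = normalize(comment)
    return any(
        SequenceMatcher(None, cleaned, template).ratio() >= threshold
        for template in KNOWN_TEMPLATES
    )
```

A comment only slips through a check like this when its template is not in the list yet, which is probably what happened on my blog.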
The next type is the machine-generated, lorem ipsum type of spam. There are subtle differences among these as well. The simplest is the web-scraper-based commenter, which takes a (possibly completely random) part of another page and publishes it as a comment. Some of these are so crude that they cut the first and last words in half. On its own such a comment is quite difficult to detect, as the text in it is coherent. But since the same texts are posted on several sites, the spam filter can use the comments reported by others as templates and detect them.
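Matching against comments reported by others could be sketched like this. It compares overlapping word triples ("shingles"), so a fragment with its first and last words cut in half still matches most of the reported text. The function names and the choice of three-word shingles are my assumptions, not something taken from an actual filter.

```python
def shingles(text, size=3):
    """Return the set of overlapping word tuples of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(comment, reported, size=3):
    """Fraction of the comment's shingles that also occur in a reported text.

    Returns 0.0 when the comment is too short to produce any shingle.
    """
    own = shingles(comment, size)
    if not own:
        return 0.0
    return len(own & shingles(reported, size)) / len(own)
```

A high value means the comment is mostly a copy of something already reported elsewhere. Where to draw the line is a tuning question I won't pretend to answer here.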
A more interesting approach is when several key phrases are used as seeds and put in random context by a lorem ipsum type generator. These generators vary in complexity: some produce alphabet soup, but some have punctuation and capitalization as if the output were proper text. The idea behind this type of spam is that, by posting the keywords, the page pointing at the target becomes even more relevant, so the target scores higher in SEO. How are these handled in the spam filter? Well, the text is usually generated using words from the very same page or blog, so the text alone doesn't help. The link that is the target of the SEO is what gives it away.
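Since the link is the giveaway, the check can ignore the text entirely and look only at the author's site. A rough sketch, assuming the filter keeps a set of domains already reported as spam targets. The domains and function names below are made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; a real filter would share this between blogs.
REPORTED_DOMAINS = {"cheap-sunglasses.example", "pills.example"}


def author_domain(url):
    """Extract the lowercased host from an author URL, without a leading 'www.'."""
    # Commenters often omit the scheme, so add one before parsing.
    host = urlparse(url if "://" in url else "http://" + url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def link_is_reported(author_url):
    """Return True if the author's site is on the reported-domain list."""
    return author_domain(author_url) in REPORTED_DOMAINS
```

This only looks at the exact host, so a spammer rotating subdomains would get around it. It's meant to show the idea, nothing more.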
The "best" ones I found were both readable and almost relevant to the article itself. They were surely written by humans, and some were so relevant I almost accepted them. If my blog had more traffic and commenters, they would almost certainly have passed as valid comments. I strongly suspect that these were written by actual people, and that the most relevant boilerplate comment (they are always positive and supportive) is selected based on an analysis of the actual article. I think these templates are collected from various forums by people, then categorized and regularly changed to bypass the filters.
In the end I became so paranoid that I marked all my comments as spam. If there was anyone whose actual comment I inadvertently removed, I'm sorry!
NB: I intend to use this article as a kind of honeytrap. All comments on it (that pass Spam Karma) will be preserved and allowed, so as to prove my point.
If an actual human is about to share his thoughts, he's welcome, but please state "I'm an actual human", just to avoid confusion. :)