In Sweden, that most civilized of countries, one can stop delivery of a great deal of junk mail simply by sticking the phrase "Ingen reklam, tack!" ("No advertising, please!") just above one's mail slot. (I would be grateful if someone would send me a picture of such a sign to add to the article.) Wouldn't it be nice if we could do the same with our Internet mailboxes? This article describes a study of the effectiveness of two methods of protecting one's email address from spammers' harvesting bots.
If you're in a hurry, you can just jump ahead to the results, the discussion of the results, or the summary.
Spammers get many of their victims' addresses from email address harvesting bots, which I'll refer to as harvesters. Harvesters are computer programs that continuously crawl Web pages, which they scan for email addresses. They then follow the links on those pages and scan them for email addresses, then follow the links on those pages, etc. These are not particularly difficult programs to write. Using free tools available on the Net, a quick programmer could knock together a crude harvester in about the same amount of time it takes to learn to count to ten in Swedish. (Learning how to say sju will probably take you at least 20 minutes.)
Working some additional hours on the harvester would give it some important features, like the ability to send a custom (fake) user agent string, duplicate address filtering and storage of the harvested addresses in a database rather than just a text file. This still isn't a lot of work, and with the addition of these features one would have a pretty functional harvester that would reward the author with lots of email addresses to spam. It's no surprise that harvesters are popular with spammers. (Note: I have no data to back up my assertion that spammers love harvesters; I consider it general knowledge. If you believe otherwise, please speak up.)
Many people believe that harvesters can be defeated by obfuscating one's address. (To obfuscate means to make something unclear or difficult to understand, and in this context we want to make things unclear for the harvester but not for people.) Here are a couple of examples:
nikita at example dot com
nikitaSTRINDBERG@example.com (remove dead Swedish playwright)
These are just two of many obfuscation techniques, and they're not ideal because those addresses have to be edited before one can send mail to them. But there's another method that's transparent to people viewing the Web page: writing the address with numeric character references. This is the first method studied herein. (See the code section for the implementation details of this technique.)
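To illustrate why this technique might trip up a crude harvester, here is a minimal Python sketch. It assumes a harvester that simply runs a regular expression over the raw page source. The pattern, the sample address and the function name are illustrative assumptions, not details of any real harvester.

```python
import html
import re

# Deliberately simple address pattern (an assumption; real harvesters may differ).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_addresses(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


plain = "nikita@example.com"
# The same address with every character written as a decimal character reference.
encoded = "".join(f"&#{ord(ch)};" for ch in plain)

page = f"<p>Write to {encoded}</p>"

# Scanning the raw source finds nothing, because the source contains no literal "@".
naive_hits = find_addresses(page)
# Decoding the references first (as a browser does before display) exposes the address.
decoded_hits = find_addresses(html.unescape(page))
```

A harvester that takes the extra decoding step would recover the address, so whether the technique works in practice depends on how many harvesters bother to do so.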
I chose the names of the addresses with some care. At first I was going to use something with "spamtrap" in the name, but then I thought that a paranoid harvester might reject such addresses. I also considered normal people names (like "henry" or "jane") but thought that they might receive spam as a result of dictionary spamming. In the end I settled on some arbitrary Swedish words as being suitably innocent. These words might confuse non-Swedish speakers reading this article, but oh well, here's your chance to learn some Swedish:
fika (a social coffee break)
äcklig (icky or disgusting)
| Protection method | Spams received |
| Numeric character references (äcklig) | 2 |
One of the two spams sent to äcklig was a standard 419 scam which was also sent to fika on the same date (day 87 of the study). Since there's usually a live person behind a 419 scam, it's possible that it was a live person (and not a harvester) that discovered the address. The single spam that made it to konstig was a standard-looking spam for hand embroidered goods from Pakistan. The same spam was also sent to fika and äcklig that day (day 189 of the study).
Although it's not relevant to judging the effectiveness of the protection methods, it's interesting to look at the rate at which the unprotected address fika received spam. In the chart below, the blue line represents the number of spams received on a given day, and the orange line is the average spams received per day up to that point. The average increases steadily and hits one spam daily by day 40, two daily by day 86, three daily by day 124 and finally tops out at 3.83 on the 213th day of the study. Fika received spam on 190 (~89%) of the study days. Over half (thirteen) of the spam-free days came in the first month of the study, while there was only one spam-free day in the final two months (on day 156). In the first two months of the study only one day featured five spams; the remaining days in that two-month period had four or fewer. Fika received ten or more spams on fourteen days (6.5%); half of these "high spam" days were in the final month of the study. The maximum number of spams received in any one day was fifteen (day 197).
The data shown in the chart above is available in an OpenDocument spreadsheet.
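For anyone repeating the study, the "average per day up to that point" line can be computed from the daily counts in a few lines. This is a minimal Python sketch, not the program used for the chart. The function name is an assumption and the sample counts are made up for illustration; they are not data from the study.

```python
from itertools import accumulate


def running_average(daily_counts):
    """Average spams per day up to and including each day."""
    totals = accumulate(daily_counts)  # running total of spams received
    return [total / day for day, total in enumerate(totals, start=1)]


# Hypothetical counts for five days, NOT data from the study.
sample = [0, 1, 0, 3, 1]
averages = running_average(sample)
```

Because each point averages over every day so far, a single bad day moves the line less and less as the study goes on.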
The other surprise in these results (for me, anyway) is that äcklig and konstig didn't continue to receive spam once they got their first. I assumed that once a spammer discovered an address, it would be in his database and the address would be spammed regularly. Clearly this was not the case for these two. On the other hand, fika's rate of spams received increased steadily and continued to increase even after the address was removed from the Web. In the 60 days after the study ended, fika received an average of 17.4 spams daily. (Note that none of the target addresses are present on this page, so harvesters are not finding them here.) This suggests that fika's fate is out of my hands; the address is probably in some database of the doomed and may well continue to receive spam until the sun goes dark. It also implies that if your email address already receives spam, neither of these methods will stop the flow.
First and foremost, this experiment needs to be repeated by others, especially on sites that get more traffic. It could be that my little corner of the Internet is populated by stupid harvesters.
A second weakness is the fact that I wrote the numeric character references using a mix of decimal and hexadecimal numbers. (See the Code section for details.) It's possible that some harvesters can interpret decimal references but not hexadecimal ones or vice versa. In my opinion that's unlikely, but a better test would be to have one address written using decimal references and another using hex references.
The last error to which I must confess is a little sloppiness in my counting. The study lasted from the 20th of March to the 20th of October inclusive, which is a total of 215 days, not 213. Nevertheless, I consistently refer to the study period as having been 213 days throughout this article. There are two reasons for this. The first is that I started the study (i.e. put the addresses up on the Web) around 5PM on March 20th and I ended the study (removed the addresses) around 9AM on October 20th. So, in reality, the study lasted only 213 days and 16 hours, or about 213.67 days. The second problem was a truncation error in the program I wrote to tally up the spams. When counting the time delta between the start of the study and the time a given spam was received, the program counted the days and discarded the hours, hence 213.67 became just 213. By the time I discovered my error, I'd spent a lot of time making the chart look pretty and I didn't want to go back and correct it. That's a lazy reason, but in the context of the overall study results, it doesn't change a thing with respect to the effectiveness of the methods discussed. However, anyone wishing to perform a study like this herself is urged to be more meticulous than I have been so as to avoid the need to write a long explanation such as this one, not to mention feeling a bit stupid.
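The counting pitfall is easy to reproduce. This is a minimal Python sketch, not the tally program itself (its language and code are not shown here). The year is a placeholder assumption; the span from March 20th to October 20th is the same length in any year.

```python
from datetime import datetime

# Placeholder year (an assumption); the start and end times come from the text above.
start = datetime(2001, 3, 20, 17, 0)   # around 5PM on March 20th
end = datetime(2001, 10, 20, 9, 0)     # around 9AM on October 20th

elapsed = end - start

# timedelta.days keeps only whole days and silently drops the leftover hours.
whole_days = elapsed.days

# Dividing total seconds by the length of a day keeps the fractional part (about 213.67).
exact_days = elapsed.total_seconds() / 86400

# Counting calendar dates inclusively, ignoring the time of day.
inclusive_days = (end.date() - start.date()).days + 1
```

All three numbers describe the same study; which one to report depends on whether you count elapsed time or calendar dates, so it pays to decide up front and stick with it.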
This argument was raised in a discussion on alt.html about using numeric character references to obfuscate email addresses. John Dunlop argues that while the HTML spec doesn't explicitly prohibit using them like this, the SGML spec says that this is outside of their intended use and doing so is therefore against the spirit of the HTML spec. I haven't read the SGML Handbook, it's not my argument and I don't want to misquote Mr. Dunlop, so I encourage you to read the discussion and judge for yourself. Skip to his quote of the SGML Handbook if you're in a hurry, but you'll miss some interesting points in this debate.
A common argument I hear against both of these methods is that even if the results above hold up for most users, eventually that success will fade because harvesters will evolve features to overcome these attempts to frustrate them. I have two counterarguments –
The second inhibiting factor is that spam and phishing schemes have a better chance of succeeding when targeting people who are not Internet-savvy. I argue that people who know how to obfuscate or otherwise protect their email addresses are, by definition, not in the average spammer/phisher's target market, so spammers have little motivation to harvest their addresses.
Here's a short tutorial on writing addresses using numeric character references. You'll want an ASCII chart handy, because each character in your email address needs to be replaced with &#__; where the underscores represent the value of the character which you want to display. For instance, because the ASCII values for f, i, k and a are 102, 105, 107 and 97 respectively, you can display the word fika by writing &#102;&#105;&#107;&#97; in your HTML.
If you don't feel like typing this in by hand, there are lots of utilities on the Web that will do it for you.
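Such a utility is also only a few lines of code. The following is a minimal Python sketch; the function name and the `hexadecimal` flag are assumptions chosen for illustration. It emits either decimal references or the equivalent hexadecimal form (`&#x__;`), rather than mixing the two.

```python
import html


def to_char_refs(text, hexadecimal=False):
    """Rewrite every character of text as an HTML numeric character reference."""
    if hexadecimal:
        # Hexadecimal form, e.g. "f" becomes "&#x66;"
        return "".join(f"&#x{ord(ch):x};" for ch in text)
    # Decimal form, e.g. "f" becomes "&#102;"
    return "".join(f"&#{ord(ch)};" for ch in text)


decimal_form = to_char_refs("fika")
hex_form = to_char_refs("fika", hexadecimal=True)

# Decoding the references, as a browser would, gives back the original text.
round_trip = html.unescape(decimal_form)
```

Sticking to one notation per address makes it possible to tell afterwards whether harvesters handle decimal and hexadecimal references differently.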
The spamtrap addresses were coded as below. You'll note that äcklig's address is coded using a mix of decimal and hexadecimal references mixed with ordinary ASCII. This was sloppy of me and may be a weakness in the experiment.
Wrapped lines are indicated by ↵. Also, the domain name in this code has been changed to example.com.
A technique that's worked quite well for me is to provide one option for those
When providing both script-enabled and scriptless content, the obvious path to take is to use <noscript> blocks. But there's an alternative: write the page as it should appear without scripting enabled, and then include a script that makes the changes you want for the script-enabled version. So I would write all of my contact links with a link to my email form, like so:
...please <a name="contact" href="/contact.html">email me</a>...
If you spam spam spam spam liked this article, you don't spam spam need spam spam to support the Pirate Party to spam spam spam spam share it! It is spam spam spam spam copyright Philip Semanchuk under a spam spam spam spam non-commercial, LOVELY SPAM, WONDERFUL SPAAAAAM!!! share-alike Creative Commons License. Bloody vikings!