Nikita the Spider

Ingen Reklam, Tack (No Spam, Please)

In Sweden, that most civilized of countries, one can stop delivery of a great deal of junk mail simply by sticking the phrase "Ingen reklam, tack!" ("No advertising, please!") just above one's mail slot. (I would be grateful if someone would send me a picture of such a sign to add to this article.) Wouldn't it be nice if we could do the same with our Internet mailboxes? This article describes a study of the effectiveness of two methods of protecting one's email address from spammers' harvesting bots.

If you're in a hurry, you can just jump ahead to the results, the discussion of the results, or the summary.

The Enemy

Spammers get many of their victims' addresses from email address harvesting bots, which I'll refer to as harvesters. Harvesters are computer programs that continuously crawl Web pages and scan them for email addresses. They then follow the links on those pages and scan the linked pages for email addresses, then follow the links on those pages, etc. These are not particularly difficult programs to write. Using free tools available on the Net, a quick programmer could knock together a crude harvester in about the same amount of time it takes to learn to count to ten in Swedish. (Learning how to say sju will probably take you at least 20 minutes.)

Working some additional hours on the harvester would give it some important features, like the ability to send a custom (fake) user agent string, duplicate address filtering and storage of the harvested addresses in a database rather than just a text file. This still isn't a lot of work, and with the addition of these features one would have a pretty functional harvester that would reward the author with lots of email addresses to spam. It's no surprise that harvesters are popular with spammers. (Note: I have no data to back up my assertion that spammers love harvesters; I consider it general knowledge. If you believe otherwise, please speak up.)

The First Defense – Obfuscation using Numeric Character References

Many people believe that harvesters can be defeated by obfuscating one's address. (To obfuscate means to make something unclear or difficult to understand, and in this context we want to make things unclear for the harvester but not for people.) Here are a couple of examples:

   nikita at example dot com

or this:

   nikitaSTRINDBERG@example.com (remove dead Swedish playwright)

These are just two of many obfuscation techniques, and they're not ideal because those addresses have to be edited before one can send mail to them. But there's another method that's transparent to people viewing the Web page – writing the address with numeric character references (like &#110; for the letter n). This is the first method studied herein. (See the code section for the implementation details of this technique.)

The Second Defense – Invisibility Via JavaScript

Another popular technique for protecting one's email is to write the address using JavaScript. The theory here is that most harvesters don't execute JavaScript and thus won't see the address. (See the code section for the implementation details of this technique.)
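To make the theory behind both defenses concrete, here is a minimal sketch of why they might work. It assumes a deliberately naive harvester that matches a regular expression against the raw page source without decoding character references or running scripts. The pattern and the names (NAIVE_ADDRESS, scanRawSource) are my own illustration, not code from any real harvester, and a more capable harvester could of course decode the references first.

```javascript
// Assumption: a naive harvester scans raw page source with a pattern like
// this one. No entity decoding, no script execution.
var NAIVE_ADDRESS = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function scanRawSource(html) {
  // match() with a global pattern returns all matches, or null if none.
  return html.match(NAIVE_ADDRESS) || [];
}

// An unprotected address, as it appears in the page source.
var plain = '<a href="mailto:fika@example.com">Mr. Fika</a>';

// acklig@example.com written entirely with decimal character references.
var encoded = '<a href="mailto:&#97;&#99;&#107;&#108;&#105;&#103;&#64;' +
              '&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;' +
              '&#99;&#111;&#109;">Ms. Acklig</a>';

// konstig@example.com assembled by a script from character codes.
var scripted = 'var a = [107, 111, 110, 115, 116, 105, 103, 64, 101, 120, ' +
               '97, 109, 112, 108, 101, 46, 99, 111, 109]; ' +
               'document.write(String.fromCharCode.apply(null, a));';

console.log(scanRawSource(plain));    // [ 'fika@example.com' ]
console.log(scanRawSource(encoded));  // []
console.log(scanRawSource(scripted)); // []
```

A browser decodes the references and runs the script, so a human visitor sees a normal address in both cases; only the raw source hides it.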

Originality Disclaimer

I should point out that I saw the character reference trick mentioned in a few places before I ever used it, and although I invented the JavaScript technique myself, I soon realized that many, many others had already done so before me. In other words, these ideas aren't new and my implementations are probably not unique. All I've done is study their effectiveness.

Methodology

In the spring of 2006 I created three dedicated email addresses and placed each on the home page of my family's Web site. The site has been online since the late 1990s so it is well established although not heavily trafficked. The home page got only about 20 hits per day over the course of the test. (The low traffic might be a weak spot in this study, because a page that receives so little traffic might not be exposed to the full variety of harvesters on the Net.) The first address was the experiment's control; it was naked and undefended against harvesters. The second address was written with numeric character references and the third address was written by JavaScript. The study ran from March 20 to October 20, 2006, a span of 213 days. The study began the day I put the email addresses up on the Web site and ended the day I removed them. These email addresses have not been on the Web before or since.

I chose the names of the addresses with some care. At first I was going to use something like spamtrap1/2/3 but then I thought that a paranoid harvester might reject addresses with "spamtrap" in the name. I also considered normal people names (like "henry" or "jane") but thought that they might receive spam as a result of dictionary spamming. In the end I settled on some arbitrary Swedish words as being suitably innocent. These words might confuse non-Swedish speakers reading this article, but oh well, here's your chance to learn some Swedish —

  1. fika was the unprotected (control) address. It is both a verb and a noun and it means (to take) a social coffee break.
  2. äcklig (which was spelled acklig in order to be ASCII-friendly) was the address written with numeric character references. It means icky or disgusting.
  3. konstig was the JavaScript-protected address. It means strange.

The Results

Over the course of this experiment, the control address fika received 815 spams which is an average of about 3.8 per day. The character reference address äcklig received just two, and the JavaScript address konstig only one.

   Name                            Protection                      Spams
   fika (a social coffee break)    None                              815
   äcklig (icky or disgusting)     Numeric character references        2
   konstig (strange)               JavaScript                          1

One of the two spams sent to äcklig was a standard 419 scam which was also sent to fika on the same date (day 87 of the study). Since there's usually a live person behind a 419 scam, it's possible that it was a live person (and not a harvester) that discovered the address. The single spam that made it to konstig was a standard-looking spam for hand embroidered goods from Pakistan. The same spam was also sent to fika and äcklig that day (day 189 of the study).

Spams Over Time

Although it's not relevant to judging the effectiveness of the protection methods, it's interesting to look at the rate at which the unprotected address fika received spam. In the chart below, the blue line represents the number of spams received on a given day, and the orange line is the average spams received per day up to that point. The average increases steadily and hits one spam daily by day 40, two daily by day 86, three daily by day 124 and finally tops out at 3.83 on the 213th day of the study. Fika received spam on 190 (~89%) of the study days. Over half (thirteen) of the 23 spam-free days came in the first month of the study, while there was only one spam-free day in the final two months (on day 156). In the first two months of the study only one day featured five spams; the remainder of the days in that two-month period had four or fewer. Fika received ten or more spams on fourteen days (6.5%); half of these "high spam" days were in the final month of the study. The maximum number of spams received in any one day was fifteen (day 197).

This graph shows the spams received daily by fika during the course of the study as well as a daily average of spams received.

Raw(er) Data

The data shown in the chart above is available in an OpenDocument spreadsheet.

Discussion of Results

I was pleasantly surprised by the effectiveness of the methods tested. In short, these methods were extremely effective at preventing spam from being sent. Both writing the address with numeric character references and writing the address with JavaScript prevented almost all of the spam sent to the control address from reaching the protected addresses. However, this might only work with addresses that haven't already been exposed to spammers (see below).

The other surprise in these results (for me, anyway) is that äcklig and konstig didn't continue to receive spam once they got their first. I assumed that once a spammer discovered an address, it would be in his database and the address would be spammed regularly. Clearly this was not the case for these two. On the other hand, fika's rate of spams received increased steadily and continued to increase even after the address was removed from the Web. In the 60 days after the study ended, fika received an average of 17.4 spams daily. (Note that none of the target addresses are present on this page, so harvesters are not finding them here.) This suggests that fika's fate is out of my hands; the address is probably in some database of the doomed and may well continue to receive spam until the sun goes dark. It also implies that if your email address already receives spam, neither of these methods will stop the flow.

Summary and Practical Recommendations

In this experiment, both numeric character references and JavaScript did an outstanding job of preventing spam from being sent to new email addresses. If you'd like to use one but are unsure which to try, keep in mind that the numeric character reference technique has a couple of advantages over JavaScript – it's easier to implement and the references work in everyone's browser. JavaScript is a little more work because some users have JavaScript turned off and those users still need a way to get in touch with you. On this site I use a technique that provides a fallback to a contact form if JavaScript isn't available. See the code section for implementation details.

Weaknesses in the Study

First and foremost, this experiment needs to be repeated by others, especially on sites that get more traffic. It could be that my little corner of the Internet is populated by stupid harvesters.

A second weakness is the fact that I wrote the numeric character references using a mix of decimal and hexadecimal numbers. (See the Code section for details.) It's possible that some harvesters can interpret decimal references but not hexadecimal ones or vice versa. In my opinion that's unlikely, but a better test would be to have one address written using decimal references and another using hex references.

The last error to which I must confess is a little sloppiness in my counting. The study lasted from the 20th of March to the 20th of October inclusive, which is a total of 215 days, not 213. Nevertheless, I consistently refer to the study period as having been 213 days throughout this article. There are two reasons for this. The first is that I started the study (i.e. put the addresses up on the Web) around 5PM on March 20th and I ended the study (removed the addresses) around 9AM on October 20th. So, in reality, the study lasted only 213 days and 16 hours, or about 213.67 days. The second is a truncation error in the program I wrote to tally up the spams. When computing the time delta between the start of the study and the time a given spam was received, the program counted the days and discarded the hours, hence 213.67 became just 213. By the time I discovered my error, I'd spent a lot of time making the chart look pretty and I didn't want to go back and correct it. That's a lazy reason, but in the context of the overall study results, it doesn't change a thing with respect to the effectiveness of the methods discussed. However, anyone wishing to perform a study like this herself is urged to be more meticulous than I have been, so as to avoid the need to write a long explanation such as this one, not to mention feeling a bit stupid.
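For anyone who wants to check the day count, here is a small sketch of the arithmetic. It assumes the start and end times can be treated as UTC, which keeps daylight saving changes out of the calculation; the time zone is not stated above, so that is my simplification. The variable names are illustrative, and this is not the tallying program itself.

```javascript
// Assumption: both timestamps are treated as UTC (time zone not stated).
var MS_PER_HOUR = 60 * 60 * 1000;
var MS_PER_DAY = 24 * MS_PER_HOUR;

// Date.UTC months are zero-based: 2 is March, 9 is October.
var start = Date.UTC(2006, 2, 20, 17); // March 20, 2006, about 5PM
var end = Date.UTC(2006, 9, 20, 9);    // October 20, 2006, about 9AM

var elapsedMs = end - start;

// Counting whole days and discarding the remainder, as the tally did.
var wholeDays = Math.floor(elapsedMs / MS_PER_DAY);
var leftoverHours = (elapsedMs % MS_PER_DAY) / MS_PER_HOUR;

console.log(wholeDays, leftoverHours); // 213 16
```

Rounding to the nearest day instead of truncating would have given 214, so the choice matters for anyone repeating the study.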

Arguments Against These Methods

1) It's An Abuse of the HTML Standard

This argument was raised in a discussion on alt.html about using numeric character references to obfuscate email addresses. John Dunlop argues that while the HTML spec doesn't explicitly prohibit using them like this, the SGML spec says that this is outside of their intended use and doing so is therefore against the spirit of the HTML spec. I haven't read the SGML Handbook, it's not my argument and I don't want to misquote Mr. Dunlop, so I encourage you to read the discussion and judge for yourself. Skip to his quote of the SGML Handbook if you're in a hurry, but you'll miss some interesting points in this debate.

2) Not Everyone Has JavaScript Enabled

This is a valid argument against using the JavaScript-protected addresses. One must be careful to provide a method of contact for users who don't use JavaScript.

3) Sure, It Works Now, But What About the Future?

A common argument I hear against both of these methods is that even if the results above hold up for most users, eventually that success will fade because harvesters will evolve features to overcome these attempts to frustrate them. I have two counterarguments –

  1. First, I'm not convinced that many harvesters will evolve these features anytime soon, if ever. I see several inhibiting factors. The first is laziness. I think harvesters have enough low-hanging fruit to pick from in the form of unprotected email addresses, and new addresses are constantly showing up on the Internet to provide a fresh crop. It'll take a shortage of addresses to motivate harvesters to evolve these features, and I don't believe that shortage exists or is coming anytime soon.

    The second inhibiting factor is that spam and phishing schemes have a better chance of succeeding when targeting people who are not Internet-savvy. I argue that people who know how to obfuscate or otherwise protect their email addresses are, by definition, not in the average spammer/phisher's target market so they have little motivation to harvest their addresses.

    The third inhibiting factor I see applies only to JavaScript-protected email addresses. In order to read these addresses, spammers will have to embed a JavaScript interpreter in their harvesters. People are quick to point out that it's not difficult to do so, and they're right. But the difficulty of implementation isn't relevant when compared to the long-term costs for the harvester. A JavaScript interpreter would consume memory and CPU on the harvesting machine which would decrease throughput. In addition, if the interpreter is not properly sandboxed, it could expose the harvester's machine to exploitation. I believe it's these factors (combined with the above) that will discourage harvesters from interpreting JavaScript any time soon.

  2. My second counterargument to the prediction that the effectiveness of these methods will fade is a simple, "So what?" Assuming that the argument is true and that these methods will block less spam over time, they still provide a benefit now and will provide some in the future. In fact, the same "It won't work forever" argument was made against Bayesian filtering, and even though that technique is less effective today than it used to be, it's still valuable.

The Code

How to Write an Address Using Numeric Character References

Here's a short tutorial on writing addresses using numeric character references. You'll want an ASCII chart handy, because each character in your email address needs to be replaced with &#__; where the underscores represent the value of the character which you want to display. For instance, you can display the word fika like so because the ASCII values for f, i, k and a are 102, 105, 107 and 97 respectively:

   &#102;&#105;&#107;&#97;

If you don't feel like typing this in by hand, there are lots of utilities on the Web that will do it for you.
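If you would rather roll your own, the conversion is only a few lines. This is a minimal sketch that assumes a plain ASCII address and writes every character as a decimal reference; the function name toNumericCharRefs is made up for this example.

```javascript
// Hypothetical helper: replace each character with its decimal numeric
// character reference. Assumes the input is plain ASCII.
function toNumericCharRefs(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    out += "&#" + text.charCodeAt(i) + ";";
  }
  return out;
}

console.log(toNumericCharRefs("fika")); // &#102;&#105;&#107;&#97;
```

Paste the output into your HTML wherever the address would normally go, both in the href attribute after mailto: and in the link text if the address is shown there.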

The Addresses

The spamtrap addresses were coded as below. You'll note that äcklig's address is coded using a mix of decimal and hexadecimal references mixed with ordinary ASCII. This was sloppy of me and may be a weakness in the experiment.

Wrapped lines are indicated by ↵. Also, the domain name in this code has been changed to example.com.

<ul>
<li><a href="mailto:&#x61;&#x63;&#107;&#x6c;&#x69;g@   ↵
&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;   ↵
&#99;&#111;&#109;">Ms. Acklig</a> doesn't read her
mail very often.</li>
<li><a href="mailto:fika@example.com">Mr. Fika</a>  doesn't
read his mail very often.</li>
<li>
   <script type="text/javascript">
    var s = "<a href='mailto:";
    var a = [107, 111, 110, 115, 116, 105, 103, 64, 101, 120, ↵
97, 109, 112, 108, 101, 46, 99, 111, 109];

    for (var i = 0; i < a.length; i++)
       s += String.fromCharCode(a[i]);

    s += "'>Mr. Konstig<\/a> doesn't read his mail very often.";

    document.write(s);

    </script>
</li>
</ul>

Protecting Addresses with JavaScript -- A Better Implementation

One must be mindful of those who browse with JavaScript disabled. (Like myself – thank you NoScript!) Page authors must provide them with a method of contact that doesn't rely on JavaScript. An email form is the obvious solution, but it requires a little server-side programming which isn't an option for everyone. And would-be server script writers should be aware that improperly written mail form scripts can be exploited by spammers via SMTP header injection, so be careful!

A technique that's worked quite well for me is to provide one option for those that have JavaScript enabled and another for those that don't. When providing both script-enabled and scriptless content, the obvious path to take is to use <script> and <noscript> blocks. But there's a less cumbersome path, which is to write one's page assuming JavaScript is not enabled, and then include a script that makes the changes you want for the script-enabled version. So I would write all of my contact links with a link to my email form, like so:

   ...please <a name="contact" href="/contact.html">email me</a>...

At the bottom of each page, I'd then include JavaScript which finds each element in the page with a name of "contact" and sets the href attribute to my email address. I'd include this script at the end of the page to ensure that all of the elements that I want to change have been created when the script runs.
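The approach just described can be sketched roughly as follows. Treat this as an illustration under stated assumptions and not the exact script used on this site: the function name pointContactLinksAtEmail and its doc parameter are invented for the example, and the address is the konstig@example.com test address, built from character codes so that it never appears literally in the page source.

```javascript
// Hypothetical helper: rewrite every element named "contact" so that its
// href is a mailto: link. The doc parameter is the page's document object.
function pointContactLinksAtEmail(doc, charCodes) {
  // Assemble the address from character codes at run time.
  var href = "mailto:" + String.fromCharCode.apply(null, charCodes);

  // getElementsByName returns all elements whose name attribute matches.
  var links = doc.getElementsByName("contact");
  for (var i = 0; i < links.length; i++) {
    links[i].href = href;
  }
  return links.length; // how many links were rewritten
}

// Character codes for konstig@example.com (example address only).
var KONSTIG = [107, 111, 110, 115, 116, 105, 103, 64, 101, 120,
               97, 109, 112, 108, 101, 46, 99, 111, 109];

// Only runs where a document exists, i.e. in a browser. Put the script at
// the end of the page so the links exist by the time it executes.
if (typeof document !== "undefined") {
  pointContactLinksAtEmail(document, KONSTIG);
}
```

Visitors without JavaScript never run this code, so their links keep pointing at /contact.html and they still have a way to get in touch.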