Gotcha CAPTCHA!

The Challenge

SPAM!

No, not SPAM the food

Image source: capl@washjeff.edu.

SPAM, the stuff you get unsolicited in your email inbox!

SPAM, often generated by computers termed SPAMbots, can originate from many places on the web.  For instance, some SPAMmers trawl webpages looking for email addresses to add to their databases, while others use online forms to post unwanted comments on blogs or to generate unwanted form submissions.

CAPTCHA

One of the many security solutions developed to address this problem is called CAPTCHA, or Completely Automated Public Turing test to tell Computers and Humans Apart.  Wait, a completely automated what?  The Turing test, introduced in 1950 by Alan Turing, is considered to be an operational test of machine intelligence: can a computer hold a conversation well enough to be indistinguishable from a human (i.e. is a computer able to “think” in the same way as a human?).  In the Turing test, a computer tries to convince a human judge, through conversation alone, that it is human.  If a computer passes the Turing test it is deemed to be capable of thought.  CAPTCHA is actually a form of reverse Turing test, where a human is asked to prove to a computer that they are indeed human.

Image source: http://xkcd.com/329/

CAPTCHA is used as a SPAM prevention mechanism on countless forms, forums and polling websites across the web, and the number is growing every day.  This amounts to millions of people solving CAPTCHA challenges every day.  In fact, according to von Ahn et al. (2008), “humans around the world type more than 100 million CAPTCHAs every day …, in each case spending a few seconds typing the distorted characters. In aggregate, this amounts to hundreds of thousands of human hours per day.”
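The arithmetic behind that quote is easy to check.  Here is a rough sketch in Python; the 100 million figure comes from the quote, but the 10 seconds per CAPTCHA is an assumption (the quote only says “a few seconds”), and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "hundreds of thousands of human hours" claim.
CAPTCHAS_PER_DAY = 100_000_000  # figure quoted from von Ahn et al. (2008)
SECONDS_PER_CAPTCHA = 10        # assumption: the quote only says "a few seconds"


def human_hours_per_day(captchas_per_day: int, seconds_each: float) -> float:
    """Convert a daily CAPTCHA count and a per-CAPTCHA time into hours."""
    return captchas_per_day * seconds_each / 3600


hours = human_hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} human hours per day")  # prints "277,778 human hours per day"
```

Even at 5 seconds each the total stays above 100,000 hours a day, so “hundreds of thousands” holds for any reasonable guess.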

CrowdSourcing + CAPTCHA = reCAPTCHA

But what if all those completed CAPTCHAs could be put to good use?  What if every time someone on the internet completed a CAPTCHA it actually was part of something bigger?  What if you could CrowdSource with your CAPTCHA?

This is where reCAPTCHA comes in!  According to Google:

“reCAPTCHA is a free CAPTCHA service that protects your site against spam, malicious registrations and other forms of attacks where computers try to disguise themselves as a human; a CAPTCHA is a Completely Automated Public Turing test to tell Computers and Human Apart. reCAPTCHA comes in the form of a widget that you can easily add to your blog, forum, registration form, etc.

In addition to protecting your site, reCAPTCHA also helps us digitize old books and newspapers.”

Wait, did you just say ‘helps us digitize old books’?

Many people have noted that not all of the “words” in the reCAPTCHA challenge appear to actually be words.  This is partly true: all of them are words taken from scanned pages, but not all of them have been read correctly by the computer.  The reCAPTCHA challenge system works with book scanning projects to interpret words that book scanning software cannot recognize, whether due to typos in original book editions or variations in typesetting.  The reCAPTCHA system always presents you with one word that the computer already knows and another that the computer cannot interpret.  When you answer the known word correctly, your answer for the unknown word is registered at the same time.  The unknown word is presented to many people over time until there is high confidence in the answer.  The explanation here is actually quite good:  http://www.google.com/recaptcha/learnmore.
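The known-word / unknown-word idea described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not Google's actual implementation: the simple vote count, the threshold of 3 matching answers, and every name here (submit, accepted_reading, VOTES_NEEDED, the "scan-0001" id) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Optional

VOTES_NEEDED = 3  # hypothetical: matching answers required before a reading is accepted

# For each unknown word image, count every transcription users have submitted.
votes = defaultdict(Counter)


def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def submit(known_answer: str, typed_known: str,
           unknown_id: str, typed_unknown: str) -> bool:
    """Judge the user on the known word only.

    If they get it right, their reading of the unknown word is recorded
    as one vote. Returns True when the challenge is passed.
    """
    if normalize(typed_known) != normalize(known_answer):
        return False
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def accepted_reading(unknown_id: str) -> Optional[str]:
    """Return the most common reading once enough users agree, else None."""
    if not votes[unknown_id]:
        return None
    reading, count = votes[unknown_id].most_common(1)[0]
    return reading if count >= VOTES_NEEDED else None


# Four users pass the known word "overlooks"; three agree on the unknown word.
for typed in ["morning", "morning", "modern", "morning"]:
    submit("overlooks", "overlooks", "scan-0001", typed)

print(accepted_reading("scan-0001"))  # prints "morning"
```

Notice that the user is never told which word is the known one, which is why an honest attempt at both words is the easiest way to pass.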

reCAPTCHA was acquired by Google in 2009 and has been used by Google to assist in the digitization of older novels and the entire New York Times archive.

Want to know more about the science behind reCAPTCHA?

von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M. (2008). reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321, 1465-1468.  Retrieved March 21, 2011, from http://www.google.com/recaptcha/static/reCAPTCHA_Science.pdf.