Why would you need an automated CAPTCHA solver?

There are both legitimate and illegal reasons to use automated CAPTCHA solving. I’ll start with the illegal ones. Spammers want to harvest as many email addresses as possible because they are paid according to the volume of spam they generate, and CAPTCHAs get in their way, so they need a cost-effective way to defeat the protection. Another illegal use case is a party that wants to “skew” the result of an online poll whose data entry is protected by a CAPTCHA.

As for the legal ones, consider a new business partner that wants to automate access to a company’s service, where the service is protected by a CAPTCHA (to prevent abuse) and the service provider has yet to offer an Application Programming Interface (API), perhaps because of time or budget constraints. In this case the business partner has no choice but to automate the CAPTCHA solving.

Approaches to implement an automated CAPTCHA solver

There are two major approaches to implement an automated CAPTCHA solver:

1. Using a third party CAPTCHA solving service.
2. Creating a bot that uses Optical Character Recognition (OCR) to try solving the CAPTCHA characters.

There are several providers of third party CAPTCHA solving services at the moment, for example: Death by CAPTCHA (http://deathbyCAPTCHA.com), de-captcher (http://www.de-captcher.com/) and decaptcher2 (http://decaptcher2.com/). Most of these services rely on “human automation”, i.e. human operators recognize the CAPTCHA characters and the service sends the result back to you. The pros and cons of using a third party CAPTCHA solving service like these are:

The pros: the accuracy is higher than with an OCR approach because humans are inherently better than machines at recognizing CAPTCHAs, and the service providers usually give you an easy-to-use API to interface with their CAPTCHA solving service over the net.

The cons: the cost of solving a large number of CAPTCHAs adds up quickly and can become prohibitive, and there is the problem of latency, where the CAPTCHA is not solved fast enough to meet your “timeout” requirement. In that case the CAPTCHA is solved correctly, but it takes so long that the session for the page containing the CAPTCHA has already expired.

In my experience, CAPTCHA solving services tend to be better at solving CAPTCHAs than the OCR approach, but they have the aforementioned latency problem. The second approach is much more complex to implement than using a third party CAPTCHA solving service, it is less accurate, and in many situations it cannot solve complex CAPTCHAs at all. However, for rather trivial CAPTCHAs, the second approach is much more cost effective and more or less usable.

You might be surprised that in practice, trivial CAPTCHAs are still widely used, especially on websites for very specific services, such as mobile (cellphone) operators (usually prepaid ones, where subscribers can top up their account via the web) and online ticketing for events. These service providers don’t get lots of hits because only those wanting to use their services go to their websites. Perhaps that’s the reason why they don’t employ sophisticated CAPTCHAs, or maybe the present (trivial) CAPTCHA is good enough for them.

The focus of this article is the second approach, i.e. using OCR to defeat the CAPTCHA. Of course, this solution cannot solve even “simple” CAPTCHAs one hundred percent of the time. This article is only meant to be introductory material for understanding the architecture of such a solution. It’s not meant to be a guide to “fight” the CAPTCHAs used by the big boys like Google, Facebook or Twitter. That would require far more advanced CAPTCHA solving solutions.

Implementing our simple CAPTCHA solver

We are going to use a readily available OCR library to build our CAPTCHA solver bot. The tools for obtaining the CAPTCHA images are not explained here; the focus is only on building a small program that solves a CAPTCHA image that has already been downloaded. Nonetheless, this article explains the generic architecture of a complete CAPTCHA solver solution.

Prerequisites

This section assumes that you are proficient with a C/C++ Integrated Development Environment (IDE), or with using a C/C++ compiler directly from the command line. It also assumes that you know the basics of creating Windows DLLs and linking against them. If any of this is unfamiliar, use your favorite search engine to look for relevant articles on the subject.

The Big Picture

Now, let’s start with the big picture. The overall architecture of a CAPTCHA solver solution is shown in Figure 1. There are two main components: the web “scraper” and the CAPTCHA solver itself.

Figure 1: CAPTCHA Solver Solution Basic Architecture

The purpose of the web scraper is to scrape the target web page, i.e. to “browse” the target web page as a human would, extract the data required to process the page, and send “automated” feedback to the target web page. For example, if the target web page contains a web form, the web scraper extracts the form entries, fills them with the required data, and sends the “response” to the target web page, as if a human had entered the data and clicked the submit button. On a more complicated target web page, the data entry process is protected by a CAPTCHA. Therefore, the web scraper must call or implement a CAPTCHA solver to pass the CAPTCHA check. Let’s look at the solution in Figure 1 in more detail. These are the steps carried out in Figure 1:

1. The web scraper fetches the contents of the target web page.
2. The web scraper extracts the CAPTCHA image from the target web page.
3. The CAPTCHA image is sent to the CAPTCHA solver.
4. The CAPTCHA solver solves the CAPTCHA and emits the CAPTCHA string as the result.
5. The CAPTCHA string is sent back to the web scraper.
6. The web scraper sends the feedback, including the CAPTCHA string, to the target web page URL.
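The steps above can be sketched as a small pipeline in which each component is a pluggable function. This is only an illustration of the architecture, not part of the code presented later in this article: the CaptchaPipeline structure, its member names and run_pipeline() are hypothetical, and a real scraper would also have to carry cookies and form fields along.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Raw bytes of a downloaded CAPTCHA image.
typedef std::vector<unsigned char> ImageBytes;

// The pluggable pieces of the pipeline. All names here are hypothetical.
struct CaptchaPipeline {
    // Step 1: fetch the contents of the target web page.
    std::function<std::string(const std::string& url)> fetch_page;
    // Step 2: extract the CAPTCHA image from the page.
    std::function<ImageBytes(const std::string& html)> extract_captcha;
    // Steps 3 to 5: hand the image to the solver and get the string back.
    std::function<std::string(const ImageBytes& image)> solve_captcha;
    // Step 6: send the feedback, including the CAPTCHA string.
    std::function<bool(const std::string& url,
                       const std::string& captcha_text)> submit_response;
};

// Runs one pass over the steps; returns true if the response was submitted.
bool run_pipeline(const CaptchaPipeline& p, const std::string& url)
{
    assert(p.fetch_page && p.extract_captcha &&
           p.solve_captcha && p.submit_response);

    const std::string html = p.fetch_page(url);
    const ImageBytes image = p.extract_captcha(html);
    if (image.empty()) {
        return false;  // no CAPTCHA image found on the page
    }
    const std::string text = p.solve_captcha(image);
    if (text.empty()) {
        return false;  // the solver could not read the image
    }
    return p.submit_response(url, text);
}
```

Keeping the solver behind a single function makes it possible to swap an OCR-based solver for a third party solving service without touching the scraper.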

This article only focuses on the CAPTCHA solver component. As for the web scraper, it’s a completely different subject and it varies depending on the web site that’s being scraped.

Using open source OCR library to solve CAPTCHA

One of the ways to defeat a CAPTCHA automatically is to use an OCR library to recognize the string in the CAPTCHA. Contrary to what you might think, an OCR library recognizes a string not just by trying to recognize individual letters (and digits) but also by using context information. For example, if you know that the string you’re trying to recognize contains only letters, you can feed that information to the OCR library to boost the recognition accuracy. Similarly, if the target string contains only digits, you can instruct the library to recognize only digits, not letters. Another possible piece of context is the language of the string you’re trying to solve.

Now, let’s move to the concrete implementation. This article shows you how to implement the CAPTCHA solver by using the open source Tesseract OCR library. The library is available at https://code.google.com/p/tesseract-ocr/. Tesseract is written in C++. Therefore, the most natural way to use it is to write your CAPTCHA solver in C++ or C. You have to be aware, though, that C++ uses name mangling, i.e. the function name seen in the source code is not the same as the one in the compiled object file, DLL or executable produced by the compiler, which matters when code written in another language has to call the function.

Tesseract depends on Leptonica, another open source library that handles various image file formats. Therefore, you need to link to Leptonica as well as Tesseract in order to use Tesseract OCR for CAPTCHA solving. The implementation provided here is Windows specific. You can download the Visual Studio 2008 code for Tesseract from this link: https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02-vs2008.zip&can=2&q=. Additionally, you can download the Leptonica v1.68 dependency here: https://code.google.com/p/leptonica/downloads/list.
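To make the idea of a context hint concrete: Tesseract config files are plain text files containing one “parameter value” pair per line. A digits-only config could be as small as the line below. This is a sketch, assuming the tessedit_char_whitelist parameter; the digits file shipped with your Tesseract version may whitelist a few more characters, so check the file in the tessdata/configs directory before relying on it.

```text
tessedit_char_whitelist 0123456789
```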
For the sake of portability between different languages, the implementation here takes the form of a “plain C” Windows DLL that interfaces to the Tesseract DLL, and indirectly to the Leptonica DLL because Tesseract depends on Leptonica. I will also provide the code of a simple application to test the DLL. If this is still unclear, Figure 2 should clarify what I mean.

Figure 2: Our CAPTCHA Solver Implementation Architecture

It is clear from Figure 2 that we have to create two things: first the Windows DLL wrapper code, and second the test application that makes sure our DLL works as intended. The Windows DLL wrapper code consists of two files: CAPTCHA_solver_dll.h and CAPTCHA_solver_dll.cpp.

Figure 2 also shows the Tesseract “learning” data. If you install Tesseract on your machine, this data is placed in the tessdata directory inside the Tesseract installation directory. You don’t need to install Tesseract if you want to use it in your own program. However, you need to have the Tesseract “learning” data (the tessdata directory and its contents) somewhere on the machine that runs your program, and you must set the TESSDATA_PREFIX environment variable to the absolute path of the directory that contains the tessdata directory, not to the path of the tessdata directory itself. You can do that via Control Panel|System|Advanced System Settings|Environment Variables|System variables. After that, it’s highly advisable to log off and log on again, or to restart the machine, because sometimes the new environment variable does not take effect otherwise. Setting TESSDATA_PREFIX is needed because Tesseract uses this environment variable at run time to locate the “learning” data.

Now, let’s move to the details of using Tesseract in the CAPTCHA_solver_dll.cpp file. Using Tesseract is quite easy. These are the logical steps to solve a CAPTCHA image with Tesseract:

1. Initialize the Tesseract API object to be used.
2. Check whether the input file format is supported or not.
3. Process the input image file to obtain the CAPTCHA string.
4. Copy the result string to the output buffer. This is required because Tesseract uses an internal representation for a string which is not guaranteed to be compatible with the string format we want: a plain C string, i.e. a null-terminated string.
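Step 1 depends on the TESSDATA_PREFIX setting described before the list, and pointing the variable at the tessdata directory itself instead of its parent is an easy mistake to make. A minimal sketch of a sanity check that a caller could run before initializing Tesseract follows. The function names are hypothetical, the check is purely textual (it does not touch the file system), and it compares case-sensitively, which is a simplification on Windows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

static bool is_separator(char c) { return c == '/' || c == '\\'; }

// Joins a directory and a leaf name with exactly one separator between them.
std::string join_path(const std::string& dir, const std::string& leaf)
{
    assert(!dir.empty() && !leaf.empty());
    if (is_separator(dir[dir.size() - 1])) {
        return dir + leaf;
    }
    return dir + "/" + leaf;
}

// True if the prefix names the tessdata directory itself instead of the
// directory that contains it.
bool prefix_points_at_tessdata(const std::string& prefix)
{
    std::string p = prefix;
    while (!p.empty() && is_separator(p[p.size() - 1])) {
        p.erase(p.size() - 1);  // drop trailing separators
    }
    const std::string leaf = "tessdata";
    if (p.size() < leaf.size()) {
        return false;
    }
    const std::size_t start = p.size() - leaf.size();
    if (p.compare(start, leaf.size(), leaf) != 0) {
        return false;
    }
    return start == 0 || is_separator(p[start - 1]);
}

// Returns the directory where the data files are expected, or an empty
// string if the prefix is missing or looks wrong.
std::string expected_tessdata_dir(const char* prefix_value)
{
    if (prefix_value == NULL || prefix_value[0] == '\0') {
        return std::string();
    }
    const std::string prefix(prefix_value);
    if (prefix_points_at_tessdata(prefix)) {
        return std::string();
    }
    return join_path(prefix, "tessdata");
}

// Convenience wrapper that reads the real environment variable.
std::string tessdata_dir_from_environment()
{
    return expected_tessdata_dir(std::getenv("TESSDATA_PREFIX"));
}
```

If tessdata_dir_from_environment() returns an empty string, the variable is either unset or set to the tessdata directory itself, and initialization is unlikely to find the data files.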

Now that the algorithm to use Tesseract is clear, I’ll show you the C++ code that implements it. Listing 1 shows the solve_CAPTCHA() function, which invokes Tesseract to “solve” (read) the CAPTCHA string in the input image. This is the only function an application needs in order to use Tesseract via our Windows DLL wrapper. Its image_file_path input parameter contains the path to the CAPTCHA image to be solved. Listing 1 doesn’t show the entire code in CAPTCHA_solver_dll.cpp, only the parts that matter for implementing the very thin wrapper around Tesseract. The implementation of the steps above in Listing 1 is very straightforward.

Listing 1: solve_CAPTCHA() Function in CAPTCHA_solver_dll.cpp File

[c]
#include "stdafx.h"
#include "CAPTCHA_solver_dll.h"
…
// Variable to store the result of CAPTCHA processing
static char g_CAPTCHA_string[MAX_CAPTCHA_STRING_LENGTH + 1];
…
// This is an exported function.
///
/// This function invokes tesseract library function to solve the CAPTCHA image
/// in the image_file_path parameter.
///
/// image_file_path: Path of the CAPTCHA image file.
/// Returns: Pointer to string that will hold the CAPTCHA string result,
///          or NULL on failure.
///
CAPTCHA_SOLVER_DLL_API char* solve_CAPTCHA(const char* image_file_path)
{
    //
    // STEP 1: Initialize tesseract object to be used.
    //
    const char* lang = "eng";
    char* config_file_path = "digits";
    /* Hardcode the config file to be used to
       "$TESSDATA_PREFIX/tessdata/configs/digits".
       NOTE: As long as $TESSDATA_PREFIX has been exported as a Windows
       environment variable, using only the word "digits" here should work.
    */
    tesseract::TessBaseAPI api;
    api.Init(image_file_path /* datapath */,
             lang /* language */,
             tesseract::OEM_DEFAULT /* OcrEngineMode mode */,
             &config_file_path /* char **configs */,
             1 /* configs_size: only config_file_path */,
             NULL /* const GenericVector<STRING> *vars_vec */,
             NULL /* const GenericVector<STRING> *vars_values */,
             false /* bool set_only_non_debug_params */);

    tesseract::PageSegMode pagesegmode = tesseract::PSM_AUTO;
    if (api.GetPageSegMode() == tesseract::PSM_SINGLE_BLOCK)
        api.SetPageSegMode(pagesegmode);

    //
    // STEP 2: Check whether the input file format is supported or not
    //
    FILE* fin = fopen(image_file_path, "rb");
    if (fin == NULL) {
        return NULL;
    }
    fclose(fin);

    PIX* pixs;
    if ((pixs = pixRead(image_file_path)) == NULL) {
        return NULL;
    }
    pixDestroy(&pixs);

    //
    // STEP 3: Process the image.
    // The result is stored in the STRING object text_out below.
    //
    STRING text_out;
    if (!api.ProcessPages(image_file_path, NULL, 0, &text_out)) {
        return NULL;
    }

    //
    // STEP 4: Copy the result string to the output buffer
    // a. Use text_out.strdup() to get a pointer to a copy of the CAPTCHA solver result.
    // b. Copy at most sizeof(buffer) - 1 characters so the zero-filled buffer
    //    always stays null-terminated.
    // c. Free the heap consumed by the duplicate of the CAPTCHA string result
    //    (verify how strdup() allocates in your Tesseract version; if it uses
    //    new[], release the copy with delete[] instead of free()).
    //
    memset(g_CAPTCHA_string, '\0', sizeof(g_CAPTCHA_string));
    char* result = text_out.strdup();
    strncpy(g_CAPTCHA_string, result, sizeof(g_CAPTCHA_string) - 1);
    free(result);

    return g_CAPTCHA_string;
}
[/c]

The PIX object in Listing 1 is a Leptonica object. It holds the input image to be passed to Tesseract. Most of the image-related processing in Tesseract is handled by Leptonica. The CAPTCHA_SOLVER_DLL_API identifier in Listing 1 is a macro that defines the linkage type of the function. You can see the details of this identifier in Listing 2 (CAPTCHA_solver_dll.h). The CAPTCHA_SOLVER_DLL_API identifier in Listing 1 maps to __declspec(dllexport) because the CAPTCHA_SOLVER_DLL_EXPORTS constant is defined in the preprocessor settings of the Visual Studio project containing the CAPTCHA_solver_dll.cpp file.
As you can see in Listing 2, if the CAPTCHA_SOLVER_DLL_EXPORTS constant is defined, the CAPTCHA_SOLVER_DLL_API identifier resolves to __declspec(dllexport).

Listing 1 gives a “context” hint (a.k.a. heuristic) to Tesseract in the form of a language setting and a configuration file setting. The language is set to English and the configuration file is set to digits, i.e. Tesseract should interpret the input as digits only. This is done in step 1 of Listing 1. We can do this because we assume that a preliminary assessment of the target CAPTCHA has shown that it always consists of digits.

Listing 2: CAPTCHA_solver_dll.h File

[c]
#ifndef CAPTCHA_SOLVER_DLL_H
#define CAPTCHA_SOLVER_DLL_H

// The following ifdef block is the standard way of creating macros which make exporting
// from a DLL simpler. All files within this DLL are compiled with the
// CAPTCHA_SOLVER_DLL_EXPORTS symbol defined on the command line.
// This symbol should not be defined on any project that uses this DLL.
// This way any other project whose source files include this file see
// CAPTCHA_SOLVER_DLL_API functions as being imported from a DLL, whereas this DLL sees
// symbols defined with this macro as being exported.
#ifdef CAPTCHA_SOLVER_DLL_EXPORTS
#define CAPTCHA_SOLVER_DLL_API __declspec(dllexport)
#else
#define CAPTCHA_SOLVER_DLL_API __declspec(dllimport)
#endif

#ifndef MAX_CAPTCHA_STRING_LENGTH
#define MAX_CAPTCHA_STRING_LENGTH 256
#endif

#ifdef __cplusplus
extern "C" {
#endif

CAPTCHA_SOLVER_DLL_API char* solve_CAPTCHA(const char* image_file_path);

#ifdef __cplusplus
}
#endif

#endif // CAPTCHA_SOLVER_DLL_H
[/c]

With the Windows DLL wrapper completed, we can now move to the test application source code. Listing 3 shows the source code of the test application for our Tesseract wrapper library. This test application is, again, Windows-specific.
If you are using Visual Studio to compile the code in Listing 3, set the character set in the project settings to Multi-Byte Character Set (MBCS) via the “Project Properties”|Configuration Properties|Project Defaults|Character Set setting. This setting instructs Visual Studio to compile the project in MBCS mode, i.e. ANSI C-compatible mode, so the string handling in the code uses ANSI C strings. This is important because by default Visual Studio sets the character set to Unicode, which is not compatible with the output from the Tesseract wrapper library we built earlier.

Listing 3: Test Application (CAPTCHA_solver_dll_test_app) Linked to CAPTCHA_solver_dll.dll

[c]
// CAPTCHA_solver_dll_test_app.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "CAPTCHA_solver_dll.h"

int _tmain(int argc, _TCHAR* argv[])
{
    char CAPTCHA_string[MAX_CAPTCHA_STRING_LENGTH];

    /// Invocation rule: test_app [image_file_path]
    if (argc != 2) {
        printf("Error! Wrong input parameters\n");
        printf("Usage: %s [image_file_path]\n", argv[0]);
        return 0;
    }

    /// Step 1: solve CAPTCHA
    memset(CAPTCHA_string, '\0', sizeof(CAPTCHA_string));
    const char* result = solve_CAPTCHA(argv[1]);
    if (result == NULL) {
        // solve_CAPTCHA() returns NULL when the image cannot be read or processed
        printf("Error! Failed to solve the CAPTCHA image\n");
        return 0;
    }
    strncpy_s(CAPTCHA_string, sizeof(CAPTCHA_string), result, _TRUNCATE);

    /// Step 2: show CAPTCHA string
    printf("CAPTCHA string = %s\n", CAPTCHA_string);

    return 0;
}
[/c]

The code in Listing 3 is Windows-specific because it uses a Windows-specific string function, a secure version of the default C string function. The line in Listing 3 that invokes the solve_CAPTCHA() function in the wrapper DLL we built earlier is:

[c]
const char* result = solve_CAPTCHA(argv[1]);
[/c]

The result is checked for NULL, which solve_CAPTCHA() returns on failure, and then copied into the local buffer with strncpy_s(). You can look up the details of the strncpy_s() secure string copy function at: http://msdn.microsoft.com/en-us/library/5dae5d43(v=vs.80).aspx while the _TRUNCATE constant is explained here: http://msdn.microsoft.com/en-us/library/ms175769(v=vs.80).aspx.
This function is a secure version of the strncpy() function. As you can see, using the wrapper DLL involves only one function call in the code that uses the library. Of course, you have to link against the wrapper library in your Visual Studio project, or in whatever other IDE you use. Nothing is out of the ordinary in the code in Listing 3, so you should be able to grasp it right away.
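Step 4 of the solve_CAPTCHA() function copies the OCR result into a fixed-size, null-terminated buffer. A minimal, portable sketch of such a bounded copy is shown here. The copy_trimmed() helper is hypothetical and is not part of the wrapper DLL; it additionally drops trailing whitespace, on the assumption that OCR output often ends with newline characters that should not be submitted as part of the CAPTCHA string.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Copies src into dst (capacity dst_size bytes, including the terminator),
// dropping trailing whitespace. Always null-terminates dst and truncates if
// src does not fit. Returns the number of characters copied.
std::size_t copy_trimmed(char* dst, std::size_t dst_size, const char* src)
{
    assert(dst != NULL && dst_size > 0);

    if (src == NULL) {
        dst[0] = '\0';
        return 0;
    }

    std::size_t len = std::strlen(src);
    while (len > 0 && std::isspace(static_cast<unsigned char>(src[len - 1]))) {
        --len;  // ignore trailing spaces and newlines
    }
    if (len > dst_size - 1) {
        len = dst_size - 1;  // truncate, keeping room for the terminator
    }
    std::memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Unlike strncpy_s(), this helper relies only on the standard C++ library, so the same copy logic could be reused if the wrapper were ever built with a compiler other than Visual Studio.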

Testing our CAPTCHA solver application

At this point, the entire CAPTCHA solver solution is complete. It’s time to put it to the test. Figure 3 shows the CAPTCHAs I used to test the CAPTCHA solver solution explained in the previous sections.

Figure 3: CAPTCHA Samples Used for Testing (lumped together into one image)

Figure 4 shows how I invoke the test application to solve the CAPTCHA strings in the images 8.jpg and 9.jpg respectively. As you can see, the test application correctly reads the CAPTCHA strings.

Figure 4: Running the CAPTCHA Solver Test Application

As mentioned in the explanation of Listing 1, the Tesseract wrapper DLL hints to Tesseract that the input consists of digits and should be treated as English, not as another character set such as Chinese, Thai or Japanese. Table 1 shows the result of invoking our test application with the above input (CAPTCHA) files.

Table 1: CAPTCHA Solving Result

Table 1 shows that the precision of our CAPTCHA solving test application is 40% against ten input CAPTCHA images, i.e. four of the ten strings were read exactly. That’s not bad for a first try, is it? Moreover, another 40% are almost correct guesses, with only one character missed or one extra character. In several cases, it seems Tesseract mistook the digit six for the digit five.

Anyway, the automated CAPTCHA solver solution I presented here is very rudimentary. It doesn’t do any preprocessing of the input image, which could improve the accuracy, albeit maybe just a little. But with 40% near misses, turning those into correct reads could boost the accuracy to a whopping 80%.
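The bookkeeping behind these percentages can be sketched as follows. This is not the procedure used to produce the table above, only one possible way to classify results automatically: the names are hypothetical, and a “near miss” is assumed to mean that the OCR output differs from the expected string by exactly one substituted, missing or extra character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// True if a and b differ by exactly one substituted, missing or extra character.
bool is_near_miss(const std::string& a, const std::string& b)
{
    const std::string& s = (a.size() <= b.size()) ? a : b;  // shorter string
    const std::string& l = (a.size() <= b.size()) ? b : a;  // longer string
    if (l.size() - s.size() > 1) {
        return false;
    }
    std::size_t i = 0;
    while (i < s.size() && s[i] == l[i]) {
        ++i;  // skip the common prefix
    }
    if (i == s.size() && s.size() == l.size()) {
        return false;  // identical strings are exact hits, not near misses
    }
    // Same length: skip the differing character in both strings.
    // Different length: skip the extra character in the longer string only.
    const std::size_t si = (s.size() == l.size()) ? i + 1 : i;
    return s.compare(si, std::string::npos, l, i + 1, std::string::npos) == 0;
}

struct Tally {
    std::size_t exact;
    std::size_t near_miss;
    std::size_t wrong;
};

// Classifies each OCR result against the string a human read from the image.
Tally score(const std::vector<std::string>& expected,
            const std::vector<std::string>& actual)
{
    assert(expected.size() == actual.size());
    Tally t = {0, 0, 0};
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i] == actual[i]) {
            ++t.exact;
        } else if (is_near_miss(expected[i], actual[i])) {
            ++t.near_miss;
        } else {
            ++t.wrong;
        }
    }
    return t;
}

// Share of the total, in percent; 0 for an empty test set.
double percent(std::size_t part, std::size_t total)
{
    return total == 0 ? 0.0 : 100.0 * part / total;
}
```

With ten test images, four exact reads and four near misses correspond to percent(4, 10) each, and the 80% figure is percent(4 + 4, 10).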

Closing thoughts

There are several possible ways to improve the CAPTCHA solver accuracy. First, we could preprocess the CAPTCHA image to make it clearer. Second, we could add one more piece of “context” as a heuristic, such as giving Tesseract a hint that the input is always six characters long.

In the end, automated CAPTCHA solving is a gray area because its legality is unclear in many places. In Indonesia (where I live), it’s legal only due to the absence of regulation at the moment, because the basic premise in Indonesian law is that something not yet regulated is deemed legal. I hope that this article opens up a new understanding of how automated CAPTCHA solving might be carried out.