Scraping Google for fun and profit
SEO Content Machine downloads Google search results as part of the process of finding content for your keyword.
If you poke around in the install directory, you will notice a “phantomjs” folder.
SCM uses PhantomJS to run an invisible browser in the background that loads Google and requests data. As fancy as this was, it would cause a very disturbing .NET crash if run for a couple of hours. For a very long time I tried everything to fix the issue, but to no avail.
The problem was the .NET CLR crashing with a strange error code that was somehow tied to the garbage collector.
Today, I have finally put in the time to completely rewrite the Google download engine and move it off PhantomJS and back onto Windows Internet Explorer.
There is one large gotcha with this:
Your Internet Explorer settings will now affect SCM, so make sure they are at their out-of-the-box defaults. Disabling images, JavaScript, etc. will cause the Google scraper to fail or return wrong results.
Testing the new engine
With PhantomJS gone, I ran a big test on the new engine: scraping 100 keywords in Google and looking at the result.
I also have SCM captcha credits on the account, so you will see “Solving captcha” wherever SCM is solving a Google captcha request.
The important thing here is that in 100 requests, there were 11 requests for captchas.
The test used essentially the same keyword over and over with a counter appended. A longer wait time between requests and more varied keywords would likely change this number.
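To make the test setup concrete, here is a minimal Python sketch of how a URL list like the one in the log could be generated. This is not SCM's code; the function name build_test_urls is made up for illustration, and the only thing taken from the test is the URL pattern itself.

```python
from urllib.parse import urlencode


def build_test_urls(base_keyword, count):
    """Return Google search URLs for base_keyword with a counter appended.

    Hypothetical helper for illustration only, not part of SCM.
    """
    urls = []
    for i in range(count):
        # urlencode turns spaces into '+', giving q=red+alert+0 and so on
        query = urlencode({"q": f"{base_keyword} {i}"})
        urls.append(f"http://www.google.com/search?{query}")
    return urls


urls = build_test_urls("red alert", 100)
print(urls[0])
print(len(urls))
```

Because every query differs only by the trailing number, this is close to a worst case for triggering captchas.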
For the record, you can see the log of the download task below.
http://www.google.com/search?q=red+alert+0
http://www.google.com/search?q=red+alert+1
http://www.google.com/search?q=red+alert+2
http://www.google.com/search?q=red+alert+3
http://www.google.com/search?q=red+alert+4
http://www.google.com/search?q=red+alert+5
http://www.google.com/search?q=red+alert+6
Solving captcha
http://www.google.com/search?q=red+alert+7
http://www.google.com/search?q=red+alert+8
http://www.google.com/search?q=red+alert+9
http://www.google.com/search?q=red+alert+10
http://www.google.com/search?q=red+alert+11
http://www.google.com/search?q=red+alert+12
http://www.google.com/search?q=red+alert+13
http://www.google.com/search?q=red+alert+14
http://www.google.com/search?q=red+alert+15
http://www.google.com/search?q=red+alert+16
http://www.google.com/search?q=red+alert+17
http://www.google.com/search?q=red+alert+18
http://www.google.com/search?q=red+alert+19
http://www.google.com/search?q=red+alert+20
http://www.google.com/search?q=red+alert+21
Solving captcha
Exception thrown: 'System.Net.WebException' in System.dll
Retries left 4
Solving captcha
http://www.google.com/search?q=red+alert+22
Solving captcha
http://www.google.com/search?q=red+alert+23
http://www.google.com/search?q=red+alert+24
http://www.google.com/search?q=red+alert+25
http://www.google.com/search?q=red+alert+26
http://www.google.com/search?q=red+alert+27
http://www.google.com/search?q=red+alert+28
http://www.google.com/search?q=red+alert+29
http://www.google.com/search?q=red+alert+30
http://www.google.com/search?q=red+alert+31
http://www.google.com/search?q=red+alert+32
http://www.google.com/search?q=red+alert+33
http://www.google.com/search?q=red+alert+34
http://www.google.com/search?q=red+alert+35
http://www.google.com/search?q=red+alert+36
http://www.google.com/search?q=red+alert+37
Solving captcha
http://www.google.com/search?q=red+alert+38
http://www.google.com/search?q=red+alert+39
http://www.google.com/search?q=red+alert+40
http://www.google.com/search?q=red+alert+41
http://www.google.com/search?q=red+alert+42
http://www.google.com/search?q=red+alert+43
http://www.google.com/search?q=red+alert+44
http://www.google.com/search?q=red+alert+45
Solving captcha
http://www.google.com/search?q=red+alert+46
http://www.google.com/search?q=red+alert+47
http://www.google.com/search?q=red+alert+48
http://www.google.com/search?q=red+alert+49
http://www.google.com/search?q=red+alert+50
http://www.google.com/search?q=red+alert+51
http://www.google.com/search?q=red+alert+52
http://www.google.com/search?q=red+alert+53
Solving captcha
http://www.google.com/search?q=red+alert+54
http://www.google.com/search?q=red+alert+55
http://www.google.com/search?q=red+alert+56
http://www.google.com/search?q=red+alert+57
http://www.google.com/search?q=red+alert+58
http://www.google.com/search?q=red+alert+59
http://www.google.com/search?q=red+alert+60
http://www.google.com/search?q=red+alert+61
http://www.google.com/search?q=red+alert+62
http://www.google.com/search?q=red+alert+63
http://www.google.com/search?q=red+alert+64
http://www.google.com/search?q=red+alert+65
http://www.google.com/search?q=red+alert+66
http://www.google.com/search?q=red+alert+67
http://www.google.com/search?q=red+alert+68
Solving captcha
http://www.google.com/search?q=red+alert+69
Solving captcha
http://www.google.com/search?q=red+alert+70
http://www.google.com/search?q=red+alert+71
http://www.google.com/search?q=red+alert+72
http://www.google.com/search?q=red+alert+73
http://www.google.com/search?q=red+alert+74
http://www.google.com/search?q=red+alert+75
http://www.google.com/search?q=red+alert+76
http://www.google.com/search?q=red+alert+77
http://www.google.com/search?q=red+alert+78
http://www.google.com/search?q=red+alert+79
http://www.google.com/search?q=red+alert+80
http://www.google.com/search?q=red+alert+81
http://www.google.com/search?q=red+alert+82
http://www.google.com/search?q=red+alert+83
Solving captcha
http://www.google.com/search?q=red+alert+84
http://www.google.com/search?q=red+alert+85
http://www.google.com/search?q=red+alert+86
http://www.google.com/search?q=red+alert+87
http://www.google.com/search?q=red+alert+88
http://www.google.com/search?q=red+alert+89
http://www.google.com/search?q=red+alert+90
http://www.google.com/search?q=red+alert+91
http://www.google.com/search?q=red+alert+92
http://www.google.com/search?q=red+alert+93
Solving captcha
http://www.google.com/search?q=red+alert+94
Solving captcha
http://www.google.com/search?q=red+alert+95
http://www.google.com/search?q=red+alert+96
http://www.google.com/search?q=red+alert+97
http://www.google.com/search?q=red+alert+98
http://www.google.com/search?q=red+alert+99
I imagine that with search operators such as “site:” the results will vary.
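If you want to tally a log like the one above yourself, a small Python sketch along these lines would do it. It assumes the log format shown here (one search URL or one “Solving captcha” message per line); the function name summarize_log is hypothetical and the sample is just a four-line excerpt of the log.

```python
def summarize_log(log_text):
    """Count search requests and captcha solves in a download log.

    Assumes one entry per line, in the format shown in the log above.
    """
    searches = 0
    captchas = 0
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("http://www.google.com/search?"):
            searches += 1
        elif line == "Solving captcha":
            captchas += 1
    return searches, captchas


# A short excerpt from the log, used here only as sample input
sample = """\
http://www.google.com/search?q=red+alert+5
http://www.google.com/search?q=red+alert+6
Solving captcha
http://www.google.com/search?q=red+alert+7
"""

searches, captchas = summarize_log(sample)
print(f"{captchas} captcha solves in {searches} searches")
```

Note that a solve that is retried after an error appears to be logged again, so the number of “Solving captcha” lines can be slightly higher than the number of distinct captcha requests.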
Release schedule
For now the changes are still in internal testing, but they should be released to the public in the next few days.
Once released, you will get an update notification.
Specifically for Windows Server users, this should fix a long-standing problem with SCM freezing after running anything Google-related, e.g. Article Downloader or T1 Content Creator.