Using PHP CURL Library to Scrape the Internet
Have you ever thought about how much information there is in DMOZ? Your entire life wouldn’t be enough to collect and sort it.
Well, we had to do part of that. P.I.M. Team Bulgaria was involved in scraping the technology directories of DMOZ, Google, Yahoo and many more. We had a request to scrape several technology directories, map them into a master structure, collect all the company information in the directories and get the URLs of all these companies. We found this task amazing!
At the beginning
The first thing you need when you have to scrape the net is to know how to do it.
There are various technologies, but the most important thing is to know the basic steps of the process:
– screen scrape
– parse the input
– sort and fill in the output
– save the results
Screen scraping is a process in which you get the content of a website through a script. A good grabber script should be able to get the content of any site, regardless of whether it is static HTML or contains dynamically generated pages. At P.I.M. we work mostly with PHP, so this was our choice. PHP has a great supporting library called CURL (“Client URL Library”) which allows us to do that.
We created a simple grabber class which has a constructor doing the scraping job and a few methods for parsing the result:
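// Inside the grabber's constructor; $url is the page to fetch.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // give up after 60 seconds
$this->content = curl_exec($ch); // false if the request failed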
The input we receive when calling the grabber is the HTML (static or generated) of the page.
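The snippet above shows only the core of the constructor. A minimal standalone sketch of the same fetch with basic error handling might look like the following; the function name fetch_page(), the redirect settings and the "null on failure" convention are our assumptions for illustration, not part of the original grabber class.

```php
<?php
// Hypothetical helper (not the article's grabber class): fetch a page body
// with cURL and return null on a transport error or an HTTP status >= 400.
function fetch_page(string $url, int $timeout = 60): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_MAXREDIRS      => 5,        // but not forever
        CURLOPT_CONNECTTIMEOUT => 10,       // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

A caller would then write something like $html = fetch_page($url); and skip the page when the result is null, so that one dead link does not stop a long crawl.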
Parse the input
Can you imagine that each web page has its own soul? Some pages are coded ‘by hand’ by their authors, and each of them has a different style. Some are generated by software which, guess what, also has its own style. So the trick in parsing the input is to find specifics on the page which help us throw out the scrap and extract the useful information. Let’s take, for example, Alexa (www.alexa.com). We had to extract the technology directories only, so we pointed our grabber to http://www.alexa.com/browse/categories?catid=4. The result we got had a header and a footer, which we did not need, so we had to take them off. Easy: we just removed everything which is not between 2 HTML strings on the page: “<span class=”bodyBold” Browse”> and the unique string “Languages available for this subjec” near the end.
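Cutting a page down to the part between two marker strings can be sketched like this. The function name extract_core() and the START/END comment markers in the sample page are made up for the illustration; on a real page you would pass whatever unique strings surround the useful part.

```php
<?php
// Hypothetical helper: keep only the text between two marker strings.
// Returns null if either marker is missing, so a changed page layout is noticed.
function extract_core(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);          // skip past the start marker itself

    $end = strpos($html, $endMarker, $start); // look for the end marker after it
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

// Invented sample page with a header and a footer around the useful part.
$page = '<html><div>header</div><!--START--><a href="/a">A</a><!--END--><div>footer</div></html>';
echo extract_core($page, '<!--START-->', '<!--END-->'); // prints <a href="/a">A</a>
```

Returning null instead of an empty string makes it easy to tell "markers not found" apart from "nothing between the markers".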
Thus we have the core. What to do with it? Hey, that’s easy! Break all of it into rows:
... then go through the rows:
foreach ($rows as $row) {
    if (!strstr($row, "href") || strstr($row, "<b>")) continue; // keep only link rows without bold tags
    $name = cut("\">", "", $row); // get name
}
Why do we do this? Well, we noticed that all the rows we need contain ‘href’ and DO NOT contain any bold tags. So we got rid of all rows which did not meet these requirements. Then we had to parse the remaining rows a little; using our function ‘cut()’ that was very easy. Thus we got all the directories on the first level! But there are many levels, now what? Well, we had the URL of each directory. And guess what: all the directories, of course, have a similar page structure! So we only had to go through several cycles to get all of them. The entire Alexa was in our hands!
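Here is a rough sketch of the row filtering described above. Since the source of cut() is not shown here, this version swaps in a regular expression to pull out the link URL and name; the function name parse_directories(), the newline row separator and the sample rows are all assumptions made for the example.

```php
<?php
// Hypothetical sketch: split the core into rows, keep only rows that contain
// a link and no bold tag, then extract the URL and the link text.
// A regular expression stands in for the article's cut() helper.
function parse_directories(string $core): array
{
    $dirs = [];
    foreach (explode("\n", $core) as $row) {
        // Skip rows without 'href' and rows that contain a bold tag.
        if (stripos($row, 'href') === false || stripos($row, '<b>') !== false) {
            continue;
        }
        if (preg_match('/href="([^"]*)"[^>]*>(.*?)<\/a>/i', $row, $m)) {
            $dirs[] = ['name' => trim(strip_tags($m[2])), 'url' => $m[1]];
        }
    }
    return $dirs;
}

// Invented sample rows: one bold heading, one plain text row, two directory links.
$core = "<b><a href=\"/top\">Top</a></b>\n"
      . "<a href=\"/browse?catid=41\">Hardware</a>\n"
      . "plain text row\n"
      . "<a href=\"/browse?catid=42\">Software</a>";

print_r(parse_directories($core));
```

With the sample rows, only Hardware and Software survive the filter. Real pages will need their own row separator and their own "good row" rule, found by looking at the HTML.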
We added all the nested subdirectories as arrays which were parts of the main directory array.
Thus each directory in $dirs had its own element, $dir[subdirs]. Each one from $subdirs has its own $subdir[subdirs]. Thus we had everything structured already at the time of grabbing.
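The nested structure can be built with a small recursive function. This is only a sketch under our own assumptions: build_tree(), the $listSubdirs callback and the depth limit are hypothetical names, and a fixed array stands in for the real "fetch the page and parse its rows" step so the example runs without touching the network.

```php
<?php
// Hypothetical sketch: attach a 'subdirs' array to every directory, recursively.
// $listSubdirs is any callable that takes a directory URL and returns a list of
// ['name' => ..., 'url' => ...] entries found on that page.
// $maxDepth limits how many levels below the given directories are visited.
function build_tree(array $dirs, callable $listSubdirs, int $maxDepth): array
{
    foreach ($dirs as &$dir) {
        $dir['subdirs'] = $maxDepth > 0
            ? build_tree($listSubdirs($dir['url']), $listSubdirs, $maxDepth - 1)
            : [];
    }
    unset($dir); // break the reference left behind by foreach

    return $dirs;
}

// Stand-in for real fetching and parsing: a fixed map of URL => child directories.
$fakeSite = [
    '/tech'    => [
        ['name' => 'Hardware', 'url' => '/tech/hw'],
        ['name' => 'Software', 'url' => '/tech/sw'],
    ],
    '/tech/hw' => [
        ['name' => 'Printers', 'url' => '/tech/hw/printers'],
    ],
];
$lister = function (string $url) use ($fakeSite): array {
    return $fakeSite[$url] ?? [];
};

$tree = build_tree([['name' => 'Technology', 'url' => '/tech']], $lister, 3);
echo $tree[0]['subdirs'][0]['subdirs'][0]['name']; // prints Printers
```

The depth limit is there because directory sites often link back to parent or sibling categories, and without some limit a recursive crawl could loop forever.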
With Alexa everything was done. But we had other sites whose directories held tons of company names… Company names without URLs. Useless for our customer. We had to find all the company URLs! But how? How do you find the URL of any company? I don’t know. I know how I do it. I use Google. Nice, eh? Go to Google and type in some company name. If this company is listed in a big directory like Alexa or DMOZ, in most cases it will appear on the first page of Google’s results.
This of course is not enough. Our script checks the first 3 pages of Google results. For each result it opens the corresponding site. It checks for 2 things: whether the company name appears in the title AND in the page content. This turns out to be enough. More than 95% success. That was better than most people would do manually. We even had a better approach (classifying the site topic in a way similar to Google’s), but this made the script too slow. So we all agreed that more than 95% retrieved URLs is excellent.
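The "name in the title AND in the content" test could be sketched as below. The function name page_matches_company() is hypothetical, and treating "content" as the visible text outside the head, scripts and styles is our assumption; the original script may have compared the raw HTML.

```php
<?php
// Hypothetical check: does the company name appear both in the <title>
// and in the visible text of the page? The comparison is case-insensitive.
function page_matches_company(string $html, string $company): bool
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return false; // no title at all, so the first condition cannot hold
    }
    $title = html_entity_decode(strip_tags($m[1]));

    // Drop the head (which holds the title), scripts and styles, so the
    // content is checked on its own and a title match is not counted twice.
    $body = preg_replace('/<head\b.*?<\/head>/is', '', $html) ?? $html;
    $body = preg_replace('/<(script|style)\b.*?<\/\1>/is', '', $body) ?? $body;
    $text = html_entity_decode(strip_tags($body));

    return stripos($title, $company) !== false
        && stripos($text, $company) !== false;
}
```

Each candidate URL from the search results would be fetched and passed through this check, and the first page that passes is taken as the company site.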
Save the results
There is a lot of data in Google, DMOZ, Alexa, Yahoo etc. We had to save all of this. A HUGE database!
Well, fortunately we had to get only the technology directories. And we didn’t need all the info, but only the structure and the company URLs. So, we saved the info in 2 formats: XML files, which represented the directory structure very well, and a MySQL database. Well, for the human eye the relational database was a very inappropriate way to store a directory structure. But the truth is that it is the very best when all of that info is processed by machines. So, don’t think you should always store a tree in a tree structure. Wood is better transported with trucks.
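To show the two views of the same data, here is a small sketch that turns a nested $dirs array into XML and into flat rows with id and parent_id, which is one common way to keep a tree in a relational table. The element name dir, the column names and both function names are our assumptions; the real XML and MySQL schemas are not described in this article.

```php
<?php
// Hypothetical sketch: two views of the same nested directory tree.

// 1) Nested XML: one <dir> element per directory, children inside the parent.
function tree_to_xml(array $dirs, int $indent = 0): string
{
    $xml = '';
    $pad = str_repeat('  ', $indent);
    foreach ($dirs as $dir) {
        $name = htmlspecialchars($dir['name'], ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $url  = htmlspecialchars($dir['url'], ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $xml .= "$pad<dir name=\"$name\" url=\"$url\">\n"
              . tree_to_xml($dir['subdirs'] ?? [], $indent + 1)
              . "$pad</dir>\n";
    }
    return $xml;
}

// 2) Flat rows: every directory gets an id and points to its parent's id,
//    so each row can be inserted into a single table.
function flatten_tree(array $dirs, ?int $parentId = null, int &$nextId = 1): array
{
    $rows = [];
    foreach ($dirs as $dir) {
        $id = $nextId++;
        $rows[] = [
            'id'        => $id,
            'parent_id' => $parentId,
            'name'      => $dir['name'],
            'url'       => $dir['url'],
        ];
        $rows = array_merge($rows, flatten_tree($dir['subdirs'] ?? [], $id, $nextId));
    }
    return $rows;
}

// Invented sample tree.
$dirs = [
    ['name' => 'Technology', 'url' => '/tech', 'subdirs' => [
        ['name' => 'Hardware', 'url' => '/tech/hw', 'subdirs' => []],
        ['name' => 'R&D',      'url' => '/tech/rd', 'subdirs' => []],
    ]],
];

echo tree_to_xml($dirs);
print_r(flatten_tree($dirs));
```

The XML output still needs a single root element wrapped around it before it is written to a file. The flat rows are harder to read, but a machine can find all children of a directory with one simple query on parent_id.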
Grabbing the net is possible. We used PHP and the CURL library. You may prefer another technology.
The most important thing is how you are going to parse the HTML. Remember, each page has its soul. If you can talk with it, you can get what you need.
Bobby Handzhiev is a senior developer in PIM Team Bulgaria