PHP Scraping and Mining

Scraping web pages can be an effective way to analyze data, either at a single moment or over a period of time. Using regex (regular expressions) is one of the most effective ways to do this. This tutorial explains how to scrape one of my blogs by finding patterns in its source code. The basic code is set up so that it can be easily modified to scrape just about anything else.

Store Scrape
You can store data in a database and monitor the prices of certain items. Then, if a price drops, you can be alerted by email. This can be considered 'Smart Shopping'. With scraping, you can find deals that the website will not tell you about. For example, if you browse a desired item on a website, is there any built-in way to be alerted when it sells for half price? Probably not. The store wants you to come back and shop; it does not want to send alerts only when items are discounted.
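As a rough sketch of the price-watch idea, the snippet below pulls a price out of a chunk of HTML with a regex and compares it with the last price you stored. The sample markup, the extractPrice() and priceDropped() helpers, and the half-price threshold are all made up for illustration; a real store's source code will need its own pattern.

```php
<?php
// Hypothetical helper: pull the first price out of markup such as
// <span class="price">$19.99</span>. The class name and dollar format
// are assumptions; adjust the pattern to the real source code.
function extractPrice(string $html): ?float
{
    if (preg_match('/<span class="price">\$([0-9]+(?:\.[0-9]{2})?)<\/span>/', $html, $m)) {
        return (float) $m[1];
    }
    return null; // pattern not found
}

// Hypothetical helper: true when the new price is at or below
// the given fraction of the old price (half price by default).
function priceDropped(float $oldPrice, float $newPrice, float $fraction = 0.5): bool
{
    return $newPrice <= $oldPrice * $fraction;
}

$html = '<div class="item"><span class="price">$19.99</span></div>';
$newPrice = extractPrice($html);
$storedPrice = 40.00; // would normally come from your database

if ($newPrice !== null && priceDropped($storedPrice, $newPrice)) {
    // mail() could be called here to send the alert
    echo "Price dropped to " . $newPrice . "\n";
}
```

Keeping the extraction and the comparison in separate functions means only the regex has to change when you point the script at a different store.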

Auction Scrape
Another example where scraping can be valuable is monitoring auction items. You can monitor specific items at an auction site and send yourself an alert when the auction is in its final 10 minutes and the item is below a certain price. A simple cron job that runs your scraping page every 5 minutes can do this; the interval needs to be shorter than the 10-minute window, otherwise a run may miss it. This method lets your scraper find the item you want to buy, at the price you want to pay, when the auction is almost complete. Now, you only need to wait a few minutes to see if you can squeak in the last bid.
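The alert condition described above could be sketched as a small function. The name auctionNeedsAlert(), its parameters, and the sample numbers are hypothetical; how you obtain the end time and current price depends entirely on the auction site's source code.

```php
<?php
// Hypothetical helper: decide whether an auction deserves an alert.
// $endsAt and $now are Unix timestamps; $price and $maxPrice use the same currency.
function auctionNeedsAlert(int $endsAt, int $now, float $price, float $maxPrice, int $windowSeconds = 600): bool
{
    $secondsLeft = $endsAt - $now;

    // Alert only while the auction is still open, inside the final window, and under the limit
    return $secondsLeft > 0 && $secondsLeft <= $windowSeconds && $price < $maxPrice;
}

$now = 1700000000; // fixed so the example is repeatable; a live script would use time()

var_dump(auctionNeedsAlert($now + 300, $now, 42.50, 50.00));  // 5 minutes left, under the limit
var_dump(auctionNeedsAlert($now + 3600, $now, 42.50, 50.00)); // an hour left, too early
```

Whatever schedule you give the cron job, keep it shorter than the alert window so that at least one run always lands inside it.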

The key to scraping web sites like bookstores and auctions is to view the source code and find the patterns around the text you want to capture. Any site that outputs books about a certain subject, or products, will use a repeating pattern in its html code.

The code below was written to scrape links and page titles from one of my own blogs. I wanted to grab the title and author of each entry and display them together. This sounds easy, but here was the process. I grabbed the web page using the file_get_contents() function; curl would have worked too. Then, I made an array of two regexes: one for the title and one for the author. Then, I created a foreach loop to run each regex against the page. Although you may think that each match contains only one value, it actually contains two: the entire matched string and the actual text you are seeking.
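To see the two values per match, here is a minimal sketch using preg_match_all() with PREG_SET_ORDER. The sample markup and the class name "title" are invented for the example and are not the blog's real html.

```php
<?php
// Sample markup standing in for the fetched page; the real blog's html will differ.
$data = '<div class="title">First post</div>' . "\n"
      . '<div class="title">Second post</div>';

preg_match_all('/<div class="title">(.*)<\/div>/', $data, $posts, PREG_SET_ORDER);

// Each match is its own array: [0] is the whole matched tag, [1] is the captured text.
foreach ($posts as $post) {
    echo $post[0] . ' => ' . $post[1] . "\n";
}
```

Because every match carries both values, a page with 10 matching tags produces 20 strings for that one regex.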

For now, just remember that we have 40 items in the $posts_array: 10 blog entries, times 2 regexes, times 2 values per match. We just want 2 items per entry, the title and the author, and to match them up. The good news is that we can easily get what we want with a little math and custom sorting.
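The "little math" can be illustrated on a smaller array. The sketch below assumes the same layout as $posts_array (all title matches first, then all author matches, each stored as full tag followed by captured text), but with two made-up posts instead of ten.

```php
<?php
// A flat array shaped like $posts_array for two posts (the strings are made up).
$posts_array = array(
    '<div>Title A</div>', 'Title A',     // keys 0, 1
    '<div>Title B</div>', 'Title B',     // keys 2, 3
    '<div>Posted by: Ann</div>', 'Ann',  // keys 4, 5
    '<div>Posted by: Bob</div>', 'Bob',  // keys 6, 7
);

$half = intdiv(count($posts_array), 2); // 4 here, 20 in the tutorial

// Captured titles sit at the odd keys of the first half;
// the matching author is exactly $half positions further on.
$pairs = array();
for ($key = 1; $key < $half; $key += 2) {
    $pairs[$posts_array[$key]] = $posts_array[$key + $half];
}

print_r($pairs);
```

The same offset works for any number of posts, as long as both regexes find the same number of matches.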

$posts_array

The $posts_array is the main array we want to work with. We run the $posts_array through a foreach loop and skip the two kinds of items we do not want. To see what to skip, you need to create conditions based on the html source code. The code to skip the unwanted items is shown below.

if (strpos($value, 'display:none;') !== false || strpos($value, 'Posted by:') !== false) {
    continue;
}
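One detail worth knowing about strpos() checks like the one above: the function returns the position of the match, and that position can be 0. A strict comparison against false is the safe way to test for a match. The sample string below is made up to show the difference.

```php
<?php
$value = 'Posted by: Ann';

// strpos() returns 0 here because the match is at the very start of the string
$position = strpos($value, 'Posted by:');

$looseCheck  = ($position == true);    // 0 == true is false, so the match is missed
$strictCheck = ($position !== false);  // any position, including 0, counts as found

var_dump($position, $looseCheck, $strictCheck);
```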

Aside from filtering out what we don't want, we make a variable called $key2, which is the array key for the author that matches the $key for the article. Once the match is made, we have our value (the web page title) and the key of the matching author element. The code below shows how to match the $value for the page title with the other element from the array using $posts_array[$key2].

if (array_key_exists($key2, $posts_array)) {
    // Since the link was a relative url, we use the str_replace() function to make it an absolute url
    $value = str_replace("blog.php", "http://lampload.com/bookmarks/blog.php", $value);
    echo $value."-".$posts_array[$key2]."<br/>";
} else {
    echo "This should not display!";
}
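Here is the url rewrite on its own, as a minimal sketch. The link text and the "?id=7" query string are invented; only the blog address comes from the code above.

```php
<?php
// A captured title as it might appear in the source (made-up link)
$value = '<a href="blog.php?id=7">My first post</a>';

// Prefix the relative link with the site address so it works outside the blog
$value = str_replace('blog.php', 'http://lampload.com/bookmarks/blog.php', $value);

echo $value . "\n";
```

Note that str_replace() rewrites every occurrence of "blog.php", so this approach assumes the captured text only contains relative links.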

Entire Code

<?php
$data = file_get_contents('http://lampload.com/bookmarks/blog.php');
if ($data === false) {
    exit('Could not fetch the page.');
}

// NOTE: with PREG_SET_ORDER each match is an array containing two values.
// One value is the entire tag with the elements between the tags; the other is the precise match.
$regexs = array(
    '/<div style="font-size:18px; margin-left:5px; display:none;">(.*)<\/div>/', // title
    '/<div style="margin-left:5px; margin-top:0px;">Posted by:(.*)<\/div>/',     // author
);

// Flatten every match into one array: all title matches first, then all author matches
$posts_array = array();
foreach ($regexs as $regex) {
    preg_match_all($regex, $data, $posts, PREG_SET_ORDER);

    foreach ($posts as $match) {
        foreach ($match as $post) {
            // View the source code and customize any extra cleanup here,
            // e.g. str_replace('<p>', '', $post)
            $posts_array[] = trim($post);
        }
    }
}

echo "<br/><br/>";

// Titles sit in the first half of $posts_array and authors in the second half
$items = (int) (count($posts_array) / 2);

foreach ($posts_array as $key => $value) {
    // Make conditions here. Items containing 'display:none;' or 'Posted by:' are the
    // entire tags, which we don't want. Viewing the source code can make this obvious.
    if (strpos($value, 'display:none;') !== false || strpos($value, 'Posted by:') !== false) {
        continue;
    }

    // Items in the second half are author names; they are printed next to their title
    if ($key >= $items) {
        continue;
    }

    // $key2 is the key of the author's name for this exact article title. Since there are
    // 10 blog entries and 40 items in the $posts_array, we add 20 to get from title to author.
    $key2 = $key + $items;

    if ($value == '') {
        echo "empty";
    } elseif (array_key_exists($key2, $posts_array)) {
        // Since the link was a relative url, we use the str_replace() function to make it an absolute url
        $value = str_replace("blog.php", "http://lampload.com/bookmarks/blog.php", $value);
        echo $value."-".$posts_array[$key2]."<br/>";
    } else {
        echo "This should not display!";
    }
}
?>
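As an alternative to flattening everything into one array and filtering afterwards, the captured text can be read directly from each regex result and paired by position. This is only a sketch of that different approach, not the method used above. It reuses the two patterns from the script, but the sample page content, names, and query strings are made up so the example runs without a network request.

```php
<?php
// Stand-in for file_get_contents(); the markup mimics the two div patterns used above.
$data = <<<HTML
<div style="font-size:18px; margin-left:5px; display:none;"><a href="blog.php?id=1">First post</a></div>
<div style="margin-left:5px; margin-top:0px;">Posted by: Ann</div>
<div style="font-size:18px; margin-left:5px; display:none;"><a href="blog.php?id=2">Second post</a></div>
<div style="margin-left:5px; margin-top:0px;">Posted by: Bob</div>
HTML;

preg_match_all('/<div style="font-size:18px; margin-left:5px; display:none;">(.*)<\/div>/', $data, $titles);
preg_match_all('/<div style="margin-left:5px; margin-top:0px;">Posted by:(.*)<\/div>/', $data, $authors);

// With the default PREG_PATTERN_ORDER, index 1 holds every captured group in page order
$rows = array();
foreach ($titles[1] as $i => $title) {
    if (!isset($authors[1][$i])) {
        continue; // no matching author for this title
    }
    $rows[] = trim($title) . '-' . trim($authors[1][$i]);
}

print_r($rows);
```

Pairing by position removes the need for the strpos() filter and the key arithmetic, at the cost of assuming that titles and authors appear in the same order on the page.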