PHP Scraping Webpage Tutorial


PHP Scrape Webpage Lesson

Scraping a web page with PHP means fetching the HTML code of an external website and extracting specific pieces of data from it. Scraping can be a way to gather data for analysis when a site offers no RSS feed. Another reason to scrape a website is to analyze the page titles of highly ranked websites.

The code below shows two methods for capturing all text within h1 tags. Both examples use arrays to hold every block of matched text. Once the data is in an array, you can do whatever you like with it, from adding the content to a database to analyzing the text. Both examples use the preg_match_all() function to build the arrays from a particular website.

The second example is a little longer and uses the str_replace() and trim() functions within the foreach loop to clean each value before the array is reduced to unique entries.
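
To see how preg_match_all() fills an array before turning to the two samples, here is a minimal sketch. It assumes a hard-coded string of made-up markup ($html) instead of a live page, and its pattern skips tag attributes with [^>]*, which the samples in this lesson do not do.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<h1 class="title">First heading</h1><p>Intro</p><h1>Second heading</h1>';

// [^>]* skips any attributes; lazy (.*?) stops at the nearest closing tag.
// "i" ignores tag case and "s" lets a heading span several lines.
$regex = '/<h1[^>]*>(.*?)<\/h1>/is';

// With the default flag (PREG_PATTERN_ORDER), $matches[0] holds the full
// matches and $matches[1] holds the text captured by the parentheses.
$count = preg_match_all($regex, $html, $matches);

// Trim whitespace around each captured heading.
$headings = array_map('trim', $matches[1]);

print_r($headings);
```

The return value of preg_match_all() is the number of full matches found, which is handy for checking whether the page contained any h1 tags at all.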

Scraping can be controversial, since you are capturing somebody else's copyrighted material. Piggybacking on somebody else's hard work may not be taken lightly. If the website owner uses copyscape.com and finds such abuse, complaints, small or large, could follow. In extreme cases, Google could penalize or ban your website for using black hat SEO.

Sample #1
Let's go through the code. The file_get_contents() function gets the HTML code we want to scrape. The $regex variable describes the h1 tags and the data between them. Then, the preg_match_all() function fills the $posts array with the matches. The foreach loop runs through each value in $posts[0] and checks for '<h1'. If it exists, it outputs the value.

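// Download the HTML of the page we want to scrape.
$data = file_get_contents('http://example.com');

// Match each h1 element. The lazy (.*?) stops at the first closing tag,
// "s" lets a match span line breaks and "i" ignores the case of the tag.
$regex = '/<h1(.*?)<\/h1>/is';

// PREG_PATTERN_ORDER puts every full match in $posts[0].
preg_match_all($regex, $data, $posts, PREG_PATTERN_ORDER);

//print_r($posts);

echo '<br/>';

foreach ($posts[0] as $post) {
    if (strstr($post, '<h1')) {
        echo $post."<br/>";
    } else {
        // do something with data
    }
}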
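
The shape of the array that preg_match_all() fills depends on its flag argument, which decides how the loops have to walk it. A minimal sketch, assuming a hard-coded sample string in place of a downloaded page (the variable names $byPattern and $bySet are made up for illustration), compares the two orderings:

```php
<?php
// Hypothetical markup with two headings.
$html  = '<h1>One</h1><h1>Two</h1>';
$regex = '/<h1>(.*?)<\/h1>/';

// PREG_PATTERN_ORDER (the default): grouped by pattern part.
// [0] = all full matches, [1] = all captured groups.
preg_match_all($regex, $html, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: grouped by match.
// [0] = first match (full text, group), [1] = second match, and so on.
preg_match_all($regex, $html, $bySet, PREG_SET_ORDER);

print_r($byPattern);
print_r($bySet);
```

With PREG_PATTERN_ORDER, looping over index 0 visits every full match. With PREG_SET_ORDER, index 0 holds only the first match, so a nested loop is needed to reach the rest.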

Sample #2
Let's analyze the code. The file_get_contents() function acquires the HTML code we want to scrape. The $regex variable describes the h1 tags and the data between them. Then, the preg_match_all() function fills the $posts array with the h1 tags it finds in the downloaded page. After that, all posts are counted. What we end up with is a multidimensional array: with PREG_SET_ORDER, each entry of $posts is itself an array holding the full match and the captured group. Therefore, the for loop steps through the matches, and the foreach loop walks through the values stored for each match.

With the foreach loop, we can add str_replace() calls to remove blocks of text we do not want. If you open the page in a browser and view the source, you can see which markup to remove. In our example below, we used several str_replace() calls so that we ended up with just 'pure text'.

Inside the foreach loop, each cleaned value is appended to an array, which is displayed after the loops finish.

// Download the HTML of the page we want to scrape.
$data = file_get_contents('http://example.com');

// Lazy (.*?) stops at the first closing tag; "s" spans line breaks, "i" ignores case.
$regex = '/<h1(.*?)<\/h1>/is';

// PREG_SET_ORDER gives one sub-array per match: the full match, then the captured group.
preg_match_all($regex, $data, $posts, PREG_SET_ORDER);

//print_r($posts);

echo '<br/><br/>';

$cnt = count($posts);
//echo $cnt;

$posts_array = array(); // collects the cleaned text

for ($i = 0; $i < $cnt; $i++) {
    foreach ($posts[$i] as $post) {
        //view source code and customize
        $post = str_replace('<h1 style="margin-top: 0px; margin-bottom: 0px; padding-bottom: 0px; font-size: 22px; color: rgb(90, 89, 95); text-shadow: 0.07em 0.07em 0.07em rgb(251, 144, 40);">','',$post);
        $post = str_replace('style="margin-top: 0px; margin-bottom: 0px; padding-bottom: 0px; font-size: 22px; color: rgb(90, 89, 95); text-shadow: 0.07em 0.07em 0.07em rgb(251, 144, 40);">','',$post);
        $post = str_replace('</h1>','',$post);
        $post = trim($post);
        echo $post."<br/>";
        $posts_array[] = $post;
    }
}

// The full match and the captured group clean up to the same text,
// so array_unique() removes the duplicates.
echo "<br/>Array Details:<br/>";
$posts_final = array_unique($posts_array);
print_r($posts_final);
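
The str_replace() calls in Sample #2 only work while the page keeps that exact style attribute. As an alternative, and not the method used above, PHP's built-in strip_tags() function removes every tag regardless of its attributes. The sketch below assumes a hypothetical $blocks array of h1 matches rather than a live page:

```php
<?php
// Hypothetical matches, shaped like full h1 blocks captured from a page.
$blocks = [
    '<h1 style="color: red;">  Latest News </h1>',
    '<h1>Latest News</h1>',
    '<h1><a href="/about">About Us</a></h1>',
];

$titles = [];
foreach ($blocks as $block) {
    // strip_tags() drops all HTML tags; trim() removes leftover whitespace.
    $titles[] = trim(strip_tags($block));
}

// array_unique() keeps the first occurrence of each value and preserves
// the original keys, so array_values() is used to reindex from 0.
$titles = array_values(array_unique($titles));

print_r($titles);
```

The trade-off is control: strip_tags() removes nested tags such as links as well, whereas str_replace() removes only the strings you name.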