Scrape Website PHP

PHP Website Scraping

The following example shows how to scrape the text inside a page's <p> tags.

Sample
Let's analyze the code. The file_get_contents() function acquires the HTML we want to scrape. The $regex variable captures the data inside the <p></p> tags. The preg_match_all() function then fills the $posts array with every <p> match found in the page, and count() records how many matches there are. The result is a multidimensional array, which can be inspected with print_r($posts): because PREG_SET_ORDER is passed, each entry of $posts is itself an array holding the full match followed by the captured group. The for loop steps through the matches, and the foreach loop visits the strings inside each match.
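To make the shape of that array concrete, here is a minimal sketch that runs the same pattern against a hard-coded string instead of a live page. The $html sample and the $sets and $columns names are made up for illustration; the second call uses PREG_PATTERN_ORDER only to show how the layout differs.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<p class="intro">First</p>' . "\n" . '<p>Second</p>' . "\n" . '<p>Third</p>';

// Same pattern as the tutorial: group 1 captures everything after "<p".
$regex = '/<p(.*)<\/p>/';

// PREG_SET_ORDER groups the results by match.
preg_match_all($regex, $html, $sets, PREG_SET_ORDER);

// PREG_PATTERN_ORDER groups the results by capture group.
preg_match_all($regex, $html, $columns, PREG_PATTERN_ORDER);

echo count($sets), "\n";    // 3: one entry per matched paragraph
echo count($columns), "\n"; // 2: the full matches, then group 1

// Each set holds the full match at index 0 and the captured group at index 1.
print_r($sets[0]);
```

Note that the captured group still carries the tag's attributes and the closing ">", which is why the cleanup step is needed afterwards.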

Within the foreach loop, we call str_replace() repeatedly to remove blocks of text we do not want. If you open the scraped output in a browser and view the source, you can see the raw content of each match and write str_replace() calls tailored to it. In our example below, we chained several str_replace() calls until only the desired text remained.
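As a possible shortcut, str_replace() also accepts an array of search strings, so several removals can be done in one call. The sketch below assumes a made-up $post value and an $unwanted list chosen for illustration; the fragments you need depend on the page you scrape.

```php
<?php
// Hypothetical captured match, similar in shape to the tutorial's output.
$post = '<p class="art-page-footer"><a href="/"></a>Contact us</p>';

// Fragments to delete. The searches are applied in array order,
// so the longer, more specific strings are listed first.
$unwanted = [
    '<p class="art-page-footer"><a href="/"></a>',
    '<p>',
    '</p>',
    '</a>',
];

// A single replacement string ('') is used for every search entry.
$clean = trim(str_replace($unwanted, '', $post));

echo $clean, "\n";
```

Keeping the fragments in one array makes it easier to add or reorder them as you inspect the page source.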

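Within the foreach loop, each cleaned string is appended to $posts_array. Because the full match and the captured group can reduce to the same text after cleanup, array_unique() removes the duplicates before the result is displayed outside the loop.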

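$data = file_get_contents('http://example.com');
if ($data === false) {
	die('Could not fetch the page.'); // file_get_contents() returns false on failure
}

// Finds <p ...>...</p> on a single line. The pattern is greedy, so it runs to the
// last </p> on that line; group 1 holds everything after "<p", attributes included.
$regex = '/<p(.*)<\/p>/';

// PREG_SET_ORDER: $posts[$i][0] is the full match, $posts[$i][1] is group 1.
preg_match_all($regex, $data, $posts, PREG_SET_ORDER);

print_r($posts);
echo '<br/><br/>';

$cnt = count($posts); // number of matches found
echo $cnt;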
	
$posts_array = array(); // initialize so array_unique() always receives an array
for ($i = 0; $i < $cnt; $i++) {
	foreach ($posts[$i] as $post) {
		// View the page source and customize these replacements.
		$post = str_replace('<p class="art-page-footer"><a href="/"></a>', '', $post);
		$post = str_replace('class="art-page-footer"><a href="/"></a>', '', $post);
		$post = str_replace('><a href="http://www.macromedia.com/go/getflashplayer">', '', $post);
		$post = str_replace('<a href="http://www.macromedia.com/go/getflashplayer">', '', $post);
		$post = str_replace('<p>', '', $post);
		$post = str_replace('</p>', '', $post);
		$post = str_replace('<p', '', $post);
		$post = str_replace('</a>', '', $post);
		$post = trim($post);
		echo $post . "<br/>";
		$posts_array[] = $post; // collect the cleaned text for use after the loops
	}
}

$posts_final = array_unique($posts_array);
print_r($posts_final);
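
One detail worth knowing about the last step: array_unique() keeps the original keys, so the printed array can show gaps in its numbering. The sketch below uses made-up values and a hypothetical $reindexed variable to show how array_values() could be used to renumber the result if that matters for your output.

```php
<?php
// Hypothetical cleaned values, with a repeat like the ones the loop can produce.
$posts_array = ['Home', 'About', 'Home', 'Contact'];

// The first occurrence of each value is kept along with its key: 0, 1 and 3.
$posts_final = array_unique($posts_array);

// array_values() renumbers the keys consecutively: 0, 1 and 2.
$reindexed = array_values($posts_final);

print_r($posts_final);
print_r($reindexed);
```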