丁香婷婷中文字幕,特一级黄色片,91在线免费视频在线观看

當(dāng)前位置：首頁 > 范文|應(yīng)用文 > IT技術(shù)專欄 > 網(wǎng)絡(luò)編程

PHP抓取及分析網(wǎng)頁的方法詳解

來源：易賢網(wǎng) 閱讀：897 次日期：2016-08-25 15:44:32

溫馨提示：易賢網(wǎng)小編為您整理了“PHP抓取及分析網(wǎng)頁的方法詳解”,方便廣大網(wǎng)友查閱！

本文實例講述了PHP抓取及分析網(wǎng)頁的方法。分享給大家供大家參考，具體如下：

抓取和分析一個文件是非常簡單的事。這個教程將通過一個例子帶領(lǐng)你一步一步地去實現(xiàn)它。讓我們開始吧！

首先，我首必須決定我們將抓取的URL地址。可以通過在腳本中設(shè)定或通過$QUERY_STRING傳遞。為了簡單起見，讓我們將變量直接設(shè)在腳本中。

<?php

$url = 'http://www.php.net';

第二步，我們抓取指定文件，并且通過file()函數(shù)將它存在一個數(shù)組里。

<?php

$url = 'http://www.php.net';

$lines_array = file($url);

好了，現(xiàn)在在數(shù)組里已經(jīng)有了文件了。但是，我們想分析的文本可能不全在一行里面。為了解決這個文件，我們可以簡單地將數(shù)組$lines_array轉(zhuǎn)化成一個字符串。我們可以使用implode(x,y)函數(shù)來實現(xiàn)它。如果在后面你想用explode(將字符串變量數(shù)組)，將x設(shè)成"|"或"!"或其它類似的分隔符可能會更好。但是出于我們的目的，最好將x設(shè)成空格。y是另一個必要的參數(shù)，因為它是你想用implode()處理的數(shù)組。

<?php

$url = 'http://www.php.net';

$lines_array = file($url);

$lines_string = implode('', $lines_array);

現(xiàn)在，抓取工作就做完了，下面該進(jìn)行分析了。出于這個例子的目的，我們想得到在<head>到</head>之間的所有東西。為了分析出字符串，我們還需要叫做正規(guī)表達(dá)式的東西。

<?php

$url = 'http://www.php.net';

$lines_array = file($url);

$lines_string = implode('', $lines_array);

eregi("<head>(.*)</head>", $lines_string, $head);

讓我們看一下代碼。正如你所見，eregi()函數(shù)按下面的格式執(zhí)行：

eregi("<head>(.*)</head>", $lines_string, $head);

"(.*)"表示所有東西，可以解釋為，"分析在<head>和</head>間的所以東西"。$lines_string是我們正在分析的字符串，$head是分析后的結(jié)果存放的數(shù)組。

最后，我們可以輸數(shù)據(jù)。因為僅在<head>和</head>間存在一個實例，我們可以安全的假設(shè)數(shù)組中僅存在著一個元素，而且就是我們想要的。讓我們把它打印出來吧。

<?php

$url = 'http://www.php.net';

$lines_array = file($url);

$lines_string = implode('', $lines_array); eregi("<head>(.*)</head>", $lines_string, $head);

echo $head[0];

這就是全部的代碼了。

<?php

//獲取所有內(nèi)容url保存到文件

function get_index ( $save_file , $prefix = "index_" ){

$count = 68 ;

$i = 1 ;

if ( file_exists ( $save_file )) @ unlink ( $save_file );

$fp = fopen ( $save_file , "a+" ) or die( "Open " . $save_file . " failed" );

while( $i < $count ){

$url = $prefix . $i . ".htm" ;

echo "Get " . $url . "..." ;

$url_str = get_content_url ( get_url ( $url ));

echo " OK/n" ;

fwrite ( $fp , $url_str );

++ $i ;

}

fclose ( $fp );

}

//獲取目標(biāo)多媒體對象

function get_object ( $url_file , $save_file , $split = "|--:**:--|" ){

if (! file_exists ( $url_file )) die( $url_file . " not exist" );

$file_arr = file ( $url_file );

if (! is_array ( $file_arr ) || empty( $file_arr )) die( $url_file . " not content" );

$url_arr = array_unique ( $file_arr );

if ( file_exists ( $save_file )) @ unlink ( $save_file );

$fp = fopen ( $save_file , "a+" ) or die( "Open save file " . $save_file . " failed" );

foreach( $url_arr as $url ){

if (empty( $url )) continue;

echo "Get " . $url . "..." ;

$html_str = get_url ( $url );

echo $html_str ;

echo $url ;

exit;

$obj_str = get_content_object ( $html_str );

echo " OK/n" ;

fwrite ( $fp , $obj_str );

}

fclose ( $fp );

}

//遍歷目錄獲取文件內(nèi)容

function get_dir ( $save_file , $dir ){

$dp = opendir ( $dir );

if ( file_exists ( $save_file )) @ unlink ( $save_file );

$fp = fopen ( $save_file , "a+" ) or die( "Open save file " . $save_file . " failed" );

while(( $file = readdir ( $dp )) != false ){

if ( $file != "." && $file != ".." ){

echo "Read file " . $file . "..." ;

$file_content = file_get_contents ( $dir . $file );

$obj_str = get_content_object ( $file_content );

echo " OK/n" ;

fwrite ( $fp , $obj_str );

}

fclose ( $fp );

}

//獲取指定url內(nèi)容

function get_url ( $url ){

$reg = '/^http:////[^//].+$/' ;

if (! preg_match ( $reg , $url )) die( $url . " invalid" );

$fp = fopen ( $url , "r" ) or die( "Open url: " . $url . " failed." );

while( $fc = fread ( $fp , 8192 )){

$content .= $fc ;

}

fclose ( $fp );

if (empty( $content )){

die( "Get url: " . $url . " content failed." );

}

return $content ;

}

//使用socket獲取指定網(wǎng)頁

function get_content_by_socket ( $url , $host ){

$fp = fsockopen ( $host , 80 ) or die( "Open " . $url . " failed" );

$header = "GET /" . $url . " HTTP/1.1/r/n" ;

$header .= "Accept: */*/r/n" ;

$header .= "Accept-Language: zh-cn/r/n" ;

$header .= "Accept-Encoding: gzip, deflate/r/n" ;

$header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)/r/n" ;

$header .= "Host: " . $host . "/r/n" ;

$header .= "Connection: Keep-Alive/r/n" ;

//$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-/r/n/r/n";

$header .= "Connection: Close/r/n/r/n" ;

fwrite ( $fp , $header );

while (! feof ( $fp )) {

$contents .= fgets ( $fp , 8192 );

}

fclose ( $fp );

return $contents ;

}

//獲取指定內(nèi)容里的url

function get_content_url ( $host_url , $file_contents ){

//$reg = '/^(down.*?/.html|/d+_/d+/.htm.*?)$/i';

$rex = "/([hH][rR][eE][Ff])/s*=/s*['/"]*([^>'/"/s]+)[/"'>]*/s*/i" ;

$reg = '/^(down.*?/.html)$/i' ;

preg_match_all ( $rex , $file_contents , $r );

$result = "" ; //array();

foreach( $r as $c ){

if ( is_array ( $c )){

foreach( $c as $d ){

if ( preg_match ( $reg , $d )){ $result .= $host_url . $d . "/n" ; }

}

return $result ;

}

//獲取指定內(nèi)容中的多媒體文件

function get_content_object ( $str , $split = "|--:**:--|" ){

$regx = "/href/s*=/s*['/"]*([^>'/"/s]+)[/"'>]*/s*(.*?<//b>)/i" ;

preg_match_all ( $regx , $str , $result );

if ( count ( $result ) == 3 ){

$result [ 2 ] = str_replace ( "多媒體： " , "" , $result [ 2 ]);

$result [ 2 ] = str_replace ( " " , "" , $result [ 2 ]);

$result = $result [ 1 ][ 0 ] . $split . $result [ 2 ][ 0 ] . "/n" ;

}

return $result ;

}

希望本文所述對大家PHP程序設(shè)計有所幫助。

更多信息請查看網(wǎng)絡(luò)編程

上一篇：PHP調(diào)用存儲過程返回值不一致問題的解決方法分析

下一篇：深入淺析yii2-gii自定義模板的方法

易賢網(wǎng)手機(jī)網(wǎng)站地址：PHP抓取及分析網(wǎng)頁的方法詳解

由于各方面情況的不斷調(diào)整與變化，易賢網(wǎng)提供的所有考試信息和咨詢回復(fù)僅供參考，敬請考生以權(quán)威部門公布的正式信息和咨詢?yōu)闇?zhǔn)！

相關(guān)閱讀網(wǎng)絡(luò)編程

Shell中如何刪除文本比較長的行的實現(xiàn)方法10月30日

vue.js語法及常用指令10月30日

python 讀寫中文json的實例詳解10月30日

Objective-C Json 實例詳解10月30日

bootstrap table sum總數(shù)量統(tǒng)計實現(xiàn)方法10月30日

python生成二維碼的實例詳解10月30日

Python批量更改文件名的實現(xiàn)方法10月30日

解決出現(xiàn)Incorrect integer value的問題10月30日

jQuery實現(xiàn)切換隱藏與顯示同時切換圖標(biāo)功能10月30日

docker python api 安裝配置的詳解10月30日

javascript按鈕禁用和啟用的效果實例代碼10月30日

vue.js todolist實現(xiàn)代碼10月30日

vue.js 父向子組件傳參的實例代碼10月30日

apache 開啟重定向 rewrite的實現(xiàn)方法10月30日

Vue.js劃分組件的方法10月30日

python logging日志模塊的詳解10月30日

vue中的scope使用詳解10月30日

docker cgroup 資源監(jiān)控的詳解10月30日

使用Android Studio 開發(fā)自己的SDK教程10月23日

linux系統(tǒng)下MongoDB單節(jié)點(diǎn)安裝教程10月23日

易賢網(wǎng)移動網(wǎng)站

2026國考·省考課程試聽報名

報班類型
姓名
手機(jī)號
驗證碼