Getting Started with Web Scraping Application Development (Part 1)

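Abstract: A large share of the information on the Internet lives on the web. Web pages are designed to look good to people, which also makes them harder for machines to read. Many sites offer data in standard formats such as JSON or XML, but in many cases no such format is available, and we have to collect and reorganize the data by indirect means to get what we need. That is where a simple scraping tool, similar to a web crawler, comes in. Drawing on the author's experience, this article outlines how to develop web-collection tools in several environments.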

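Use cases: collecting information from a specific domain (for example, aggregating school news or microblog posts, or collecting weather data on a schedule), blog migration (collecting posts, comments, and so on), embedding web information in third-party pages (a tool that queries several systems in one place), and content format conversion (changing the encoding or layout).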

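The basic approach is to fetch the page text programmatically, record cookie information to get through login verification, extract specific content with regular expressions, consolidate and post-process the extracted content into the required format, and then output or save it (to a database, a file, a third party, and so on).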
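The cookie step is the least obvious part of this pipeline, so here is a minimal sketch of it. It assumes PHP's built-in http stream wrapper. The function names `parseSetCookies`, `cookieHeader`, and `fetchPage` are hypothetical and not taken from the author's code, and cookie attributes such as `Path` or `Expires` are deliberately ignored.

```php
<?php
// Hypothetical helpers for illustration; not part of the author's plugin.

/** Collect name => value pairs from raw "Set-Cookie:" response header lines. */
function parseSetCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Keep only the leading "name=value" pair; attributes are ignored.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

/** Build the value of a "Cookie:" request header from stored cookies. */
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

/**
 * Request $url (POST when $postFields is non-empty), sending the stored
 * cookies and merging any new ones into $cookies.
 * Returns the response body, or false when the request fails.
 */
function fetchPage(string $url, array &$cookies, array $postFields = [])
{
    $headers = [];
    if ($cookies) {
        $headers[] = 'Cookie: ' . cookieHeader($cookies);
    }
    $http = ['method' => 'GET', 'timeout' => 10];
    if ($postFields) {
        $http['method']  = 'POST';
        $http['content'] = http_build_query($postFields);
        $headers[]       = 'Content-Type: application/x-www-form-urlencoded';
    }
    $http['header'] = implode("\r\n", $headers);

    $fp = @fopen($url, 'r', false, stream_context_create(['http' => $http]));
    if ($fp === false) {
        return false;
    }
    // For the http wrapper, wrapper_data holds the raw response header lines.
    $meta = stream_get_meta_data($fp);
    $body = stream_get_contents($fp);
    fclose($fp);

    $cookies = array_replace($cookies, parseSetCookies($meta['wrapper_data'] ?? []));
    return $body;
}
```

A typical use would be to call `fetchPage()` once with the login form fields, then call it again for the protected pages with the same `$cookies` array, so that the session cookie set at login is sent back on later requests. The login URL and field names depend entirely on the target site.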

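Samples the author has already built: a client for a university's course-selection system (automated operations), a multi-site news aggregator with encoding conversion for a university (called by other sites), scheduled weather collection, and a student-information query client (mobile). If you need the source code for reference, you can request it by email.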

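Choose the language according to your needs: C# if you need a Windows client; PHP, JSP, or similar if the tool will be called by other sites; a scripting language such as Python if you need scheduled tasks; and the appropriate mobile development platform if you need a mobile client. In short, almost any programming language can do the job without much trouble. This article uses a blog-migration application written in PHP to illustrate the development approach.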

Final result: as shown in the figure below, articles are fetched and saved according to the configured rules.

1. Technologies Used

PHP, regular-expression matching, and database operations.

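2. Approach

(1) Obtain the configuration parameters: the list-page URL and the rule for deriving further list-page URLs, how to obtain article links, and the rules for extracting the article content, title, view count, and other information. Pagination within a single article is ignored for now. Pay attention to page encoding, image fetching, and image URL conversion.

(2) Traverse the list pages according to the list-page rule to obtain the article-page URLs.

(3) Fetch each article page and extract the author and other information by matching.

(4) Handle images and similar resources: fetch the images and replace the addresses that reference them.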
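The encoding and image-address concerns from steps (1) and (4) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `toUtf8`, `absoluteUrl`, and `rewriteImageUrls` are hypothetical, the source charset is assumed to be known in advance, relative paths containing `../` are not normalized, and actually downloading the image files is left out.

```php
<?php
// Hypothetical helpers for illustration; not part of the original plugin.

/** Convert a page to UTF-8 when the source site uses another charset (e.g. GBK). */
function toUtf8(string $html, string $fromCharset): string
{
    if (strcasecmp($fromCharset, 'UTF-8') === 0) {
        return $html;
    }
    $converted = @iconv($fromCharset, 'UTF-8', $html);
    return $converted === false ? $html : $converted; // fall back to the raw text
}

/** Resolve an image reference against the URL of the page it appeared on. */
function absoluteUrl(string $src, string $pageUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;                          // already absolute (http:, https:, data:)
    }
    $p      = parse_url($pageUrl);
    $scheme = $p['scheme'] ?? 'http';
    $host   = ($p['host'] ?? '') . (isset($p['port']) ? ':' . $p['port'] : '');
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;          // protocol-relative: //host/path
    }
    if (strpos($src, '/') === 0) {
        return "$scheme://$host$src";         // root-relative: /path
    }
    // Otherwise relative to the directory of the current page.
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return "$scheme://$host$dir$src";
}

/** Rewrite the quoted src attribute of every <img> tag to an absolute URL. */
function rewriteImageUrls(string $html, string $pageUrl): string
{
    $result = preg_replace_callback(
        '/(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2/i',
        function (array $m) use ($pageUrl) {
            return $m[1] . $m[2] . absoluteUrl($m[3], $pageUrl) . $m[2];
        },
        $html
    );
    return $result ?? $html;                  // preg error: keep the original HTML
}
```

Rewriting to absolute URLs keeps the images visible after migration even before they are copied. If the images are later saved locally, the same callback is the natural place to substitute the new local address instead.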

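3. Code Implementation

The implementation is a blog-migration plugin for the Chanzhi (禪知) enterprise portal system.

The method in the control class is: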

    // Handle the crawler settings form: walk the list pages and preview what the rules extract.
    public function setting()
    {
        $testResult = '';
        if($_POST)
        {
            $category       = $_POST['category'];
            $listLink       = $_POST['listLink'];
            $listLinkNum    = intval($_POST['listLinkNum']);
            $viewLinkPre    = $_POST['viewLinkPre'];
            $viewLinkRegex  = $_POST['viewLinkRegex'];
            $viewLinkFollow = $_POST['viewLinkFollow'];
            $titleRegex     = $_POST['titleRegex'];
            $contentRegex   = $_POST['contentRegex'];
            if($listLink != '') 
            {   
                for($i=1; $i<=$listLinkNum; $i++)
                {   
                    $currentListLink = $listLink . $i; 
                    $listContent = file_get_contents($currentListLink);
                    if($listContent === false) continue; // skip list pages that fail to download
                    preg_match_all($viewLinkRegex,$listContent, $res);
                    if(isset($res[1]))
                    {   
                        foreach($res[1] as $r) 
                        {   
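                            // Build the article URL: prefix + captured fragment + suffix.
                            $viewLink    = $viewLinkPre . $r . $viewLinkFollow;
                            $testResult  = $testResult . "\r\n" . $viewLink;
                            $viewContent = file_get_contents($viewLink);
                            if($viewContent === false) continue; // skip articles that fail to download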
                            preg_match_all($titleRegex, $viewContent, $titles);
                            if(isset($titles[1][0]))
                            { 
                                $title = $titles[1][0];
                                // Double quotes are required for \r\n to be a real line break.
                                $testResult .= "\r\n" . $title . "\r\n";
                            }
                            preg_match_all($contentRegex, $viewContent, $contents);
                            if(isset($contents[1][0]))
                            {
                                $content = 'Article content: ' . $contents[1][0];
                                $testResult = $testResult . $content;
                            }
                            // Save the extracted article content to the database
                        }
                    }
                }
            }
            $this->view->category       = $category;
            $this->view->listLink       = $listLink;
            $this->view->listLinkNum    = $listLinkNum;
            $this->view->viewLinkPre    = $viewLinkPre;
            $this->view->viewLinkRegex  = $viewLinkRegex;
            $this->view->viewLinkFollow = $viewLinkFollow;
            $this->view->titleRegex     = $titleRegex;
            $this->view->contentRegex   = $contentRegex;
            $this->view->testResult     = $testResult;
        }
        $this->display();
    }
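The controller passes the regular expressions from the form straight to `preg_match_all()`, so a mistyped pattern only produces a PHP warning and an empty preview. As a sketch of how the rules could be checked first, the helpers below validate a pattern and return capture group 1. The names `isValidRegex`, `extractAll`, and `extractFirst` are hypothetical, and the sample HTML, pattern, and `example.com` URLs are made up to show what the form fields might contain.

```php
<?php
// Hypothetical helpers and sample data for illustration only.

/** Return true when $pattern (delimiters included) compiles as a PCRE pattern. */
function isValidRegex(string $pattern): bool
{
    // preg_match() returns false and raises a warning (silenced here) on a bad pattern.
    return $pattern !== '' && @preg_match($pattern, '') !== false;
}

/** Return every value of capture group 1, or an empty array when nothing matches. */
function extractAll(string $pattern, string $html): array
{
    if (!isValidRegex($pattern) || !preg_match_all($pattern, $html, $res)) {
        return [];
    }
    return $res[1] ?? [];
}

/** Return the first value of capture group 1, or null when nothing matches. */
function extractFirst(string $pattern, string $html): ?string
{
    $all = extractAll($pattern, $html);
    return $all[0] ?? null;
}

// Made-up list page and rules, mirroring the viewLinkPre / viewLinkRegex /
// viewLinkFollow fields of the settings form.
$listHtml = '<ul><li><a href="/blog/101.html">First</a></li>'
          . '<li><a href="/blog/102.html">Second</a></li></ul>';
$viewLinkPre    = 'http://example.com/blog/';
$viewLinkRegex  = '#<a href="/blog/(\d+)\.html">#';
$viewLinkFollow = '.html';

$viewLinks = [];
foreach (extractAll($viewLinkRegex, $listHtml) as $id) {
    $viewLinks[] = $viewLinkPre . $id . $viewLinkFollow;
}
```

Each rule must include its own delimiters and exactly one capture group for the value to keep, which matches how the controller reads `$res[1]`, `$titles[1][0]`, and `$contents[1][0]`.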

View code:

<div class='panel'>
  <div class='panel-heading'>
    <strong><i class='icon-building'></i><?php echo $lang->crawler->crawler;?></strong>
  </div>
  <div class='panel-body'>
    <form method='post' >
      <table class='table table-form'>
        <tr>
          <th ><?php echo $lang->crawler->category;?>      </th>
          <td colspan='2'><?php echo html::input('category', isset($category) ? $category : '')?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->listLink;?>      </th>
          <td ><?php echo html::input('listLink', isset($listLink) ? $listLink : '');?></td>
          <td class='text-info'><?php echo $lang->crawler->listLinkInfo;?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->listLinkNum;?>   </th>
          <td colspan='2'><?php echo html::input('listLinkNum', isset($listLinkNum) ? $listLinkNum : '');?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->viewLinkPre;?>   </th>
          <td colspan='2'><?php echo html::input('viewLinkPre', isset($viewLinkPre) ? $viewLinkPre : '');?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->viewLinkRegex;?> </th>
          <td colspan='2'><?php echo html::input('viewLinkRegex', isset($viewLinkRegex) ? $viewLinkRegex : '');?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->viewLinkFollow;?></th>
          <td colspan='2'><?php echo html::input('viewLinkFollow', isset($viewLinkFollow) ? $viewLinkFollow : '');?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->titleRegex;?>    </th>
          <td colspan='2'><?php echo html::input('titleRegex', isset($titleRegex) ? $titleRegex : '');?></td>
        </tr>
        <tr>
          <th ><?php echo $lang->crawler->contentRegex;?>    </th>
          <td colspan='2'><?php echo html::input('contentRegex', isset($contentRegex) ? $contentRegex : '');?></td>
        </tr>
        <tr>
          <th> </th><td colspan='2'><?php echo html::submitButton();?></td>
        </tr>
        <tr>
          <th> </th><td colspan='2'><?php echo html::textarea('', isset($testResult) ? $testResult : '', 'height=100px');?></td>
        </tr>
      </table>
    </form>
  </div>
</div>

The full code is available on request by email: chujilu1991@163.com

Because of time constraints, only the general approach has been implemented here; the remaining code follows much the same pattern.