Scraping Taobao pages with selenium + php-webdriver

By minirplus on 2015-05-08 in DEV

Goal: scrape product information from a Taobao shop while getting around Taobao's anti-scraping measures.

Environment

Windows 8.1
selenium-2.44.0
ChromeDriver 2.15
php-webdriver
Chrome

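Installing selenium

Find the latest release at http://selenium-release.storage.googleapis.com/index.html and download the selenium-server-standalone-X.XX.X.jar file.

Installing ChromeDriver

Download the latest ChromeDriver archive from https://sites.google.com/a/chromium.org/chromedriver/downloads and extract it to get chromedriver.exe.

Installing php-webdriver

Visit https://github.com/facebook/php-webdriver. It can be installed through Composer, or downloaded and loaded manually.

Starting the selenium server

Place the selenium and ChromeDriver files from the steps above in the same folder, open cmd, and run the following command to start the server: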

java -jar D:\selenium\selenium-server-standalone-2.44.0.jar -port 8888 -Dwebdriver.chrome.driver="D:\selenium\chromedriver.exe"

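Here D:\selenium\selenium-server-standalone-2.44.0.jar is the location of the selenium jar, D:\selenium\chromedriver.exe is the location of the ChromeDriver executable, and -port 8888 is the port to listen on.

Once the command has run, the selenium server stays up and listens on that port. PHP then sends commands to port 8888, and selenium forwards them to ChromeDriver, which carries them out in the browser.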
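Before sending real commands it can help to confirm that the server is actually listening. The following is a minimal sketch, not part of the original setup. The helper names is_hub_ready and fetch_hub_status are hypothetical, the /wd/hub path is the one used later in this post, and the sketch assumes the server answers GET /wd/hub/status with a JSON body whose "status" field is 0 when healthy.

```php
<?php
// Hypothetical helpers (not part of selenium or php-webdriver).

// Decide whether a /status response body looks healthy.
// Assumption: a healthy server replies with JSON containing "status": 0.
function is_hub_ready($statusJson)
{
    if (!is_string($statusJson) || $statusJson === '') {
        return false;
    }
    $data = json_decode($statusJson, true);
    return is_array($data) && isset($data['status']) && $data['status'] === 0;
}

// Fetch the raw /status body, or null when the server cannot be reached.
function fetch_hub_status($hubUrl, $timeoutSeconds = 3)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    $body = @file_get_contents(rtrim($hubUrl, '/') . '/status', false, $context);
    return $body === false ? null : $body;
}

// Example use (requires the server to be running):
// var_dump(is_hub_ready(fetch_hub_status('http://localhost:8888/wd/hub')));
```

Keeping the JSON check separate from the network call means the check can be exercised without a running server.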

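Initializing php-webdriver

Before driving selenium from PHP you need php-webdriver, a selenium client library maintained by Facebook that lets PHP talk to the selenium server. Composer installation is not covered here; only manual loading is described.

First extract the archive downloaded earlier, then add one line to your PHP script to load the library manually: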

require_once "D:/selenium/php-webdriver-master/lib/__init__.php";

Here D:/selenium/php-webdriver-master is the path of the extracted folder.

Then add the following code below it to initialize the php-webdriver object:

$wd_host = 'http://localhost:8888/wd/hub';
$desired_capabilities = DesiredCapabilities::chrome();
$driver = RemoteWebDriver::create($wd_host, $desired_capabilities);

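Here 8888 is the listening port chosen when the server was started, and DesiredCapabilities::chrome() specifies which browser to open. $desired_capabilities accepts many other options; see the DesiredCapabilities documentation for details.

That completes the php-webdriver initialization and defines a webdriver object, $driver, which is used for all the operations that follow.

Driving the browser

Open the Taobao login page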

$driver->get('https://login.taobao.com/member/login.jhtml');

Log in automatically

$element = $driver->findElement(WebDriverBy::name('TPL_username'));
$element->clear(); // clear the field
$element->sendKeys(""); // fill in your Taobao username
$element = $driver->findElement(WebDriverBy::name('TPL_password'));
$element->clear(); // clear the field
$element->sendKeys(""); // fill in your Taobao password
$driver->findElement(WebDriverBy::id('J_SubmitStatic'))->click();

Get the page content

$driver->get($url);
$html_selenium = $driver->getPageSource();

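For more usage examples, see the php-webdriver project, or the API manual at http://facebook.github.io/php-webdriver/index.html

After logging in to Taobao, the browser keeps the cookies, which is what gets around Taobao's anti-scraping restrictions.

Parsing the content

Once php-webdriver has fetched the page, its own API could be used to analyze the content directly, but since I am used to php-simple-html-dom-parser, I use that here instead.

Install php-simple-html-dom-parser from https://github.com/sunra/php-simple-html-dom-parser

Load the library manually

include_once('D:/simplehtmldom/simple_html_dom.php');

Pass the content fetched by php-webdriver to php-simple-html-dom-parser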

$driver->get($url);
$html_selenium = $driver->getPageSource();
$html = str_get_html($html_selenium);

php-simple-html-dom-parser can build a document from a string: keep the content returned by php-webdriver as a string and pass it to str_get_html.
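One thing worth guarding against: in the versions of simple_html_dom I am aware of, str_get_html returns false rather than a document when the string is empty or longer than its MAX_FILE_SIZE constant (600000 bytes in those versions), and calling find() on false is a fatal error. The sketch below roughly mirrors that check so the page source can be tested before parsing. The helper name looks_parseable and the default limit are assumptions; check the constant in your own copy of the library.

```php
<?php
// Hypothetical pre-check (not part of simple_html_dom).
// Assumption: the parser rejects empty input and input longer than
// 600000 bytes; adjust $maxBytes to match your copy of the library.
function looks_parseable($html, $maxBytes = 600000)
{
    return is_string($html) && $html !== '' && strlen($html) <= $maxBytes;
}

// Example use with the variables from this post:
// if (looks_parseable($html_selenium)) {
//     $html = str_get_html($html_selenium);
// }
```

Checking the result of str_get_html against false before calling find() is a reasonable alternative to a pre-check like this.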

Running the analysis

$items = $html->find('.shop-hesper-bd .item');
$item_array = array(); // start with an empty result list
foreach ($items as $table) {
    if ($table->find('.item-name', 0)) {
        $item = array();
        $item['link'] = $table->find('.item-name', 0)->href;
        $item['name'] = $table->find('.item-name', 0)->plaintext;
        $item['id'] = $table->getAttribute('data-id');
        $img_src = $table->find('.photo img', 0)->getAttribute('src');
        // drop the last 12 characters of the image URL
        $item['img_src'] = substr($img_src, 0, strlen($img_src) - 12);
        $price = $table->find('.c-price', 0)->plaintext;
        $item['price'] = str_replace(' ', '', $price); // remove spaces
        $item_array[] = $item; // collect the parsed item
    }
}
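The two cleanup steps in the loop (cutting 12 characters off the image URL and removing spaces from the price) can be pulled into small helpers that are easier to test. This is a sketch under assumptions, not the method used above: it assumes the 12 characters being removed are a thumbnail suffix of the form _160x160.jpg, and that prices may contain non-breaking spaces as well as ordinary ones. The function names and the example URL are hypothetical.

```php
<?php
// Hypothetical helpers for the cleanup steps in the parsing loop.

// Remove a trailing thumbnail suffix such as "_160x160.jpg".
// Assumption: this is what the fixed 12-character cut is meant to strip.
// URLs without such a suffix are returned unchanged.
function strip_thumb_suffix($src)
{
    return preg_replace('/_\d+x\d+\.jpg$/i', '', $src);
}

// Remove every kind of whitespace from a price string, including
// non-breaking spaces and the literal "&nbsp;" entity.
function normalize_price($text)
{
    $text = str_replace('&nbsp;', '', $text);
    $clean = preg_replace('/[\s\x{00A0}]+/u', '', $text);
    // preg_replace returns null on invalid UTF-8; fall back to the input.
    return $clean === null ? $text : $clean;
}

// Example use inside the loop:
// $item['img_src'] = strip_thumb_suffix($img_src);
// $item['price'] = normalize_price($price);
```

Matching the suffix with a pattern instead of a fixed length avoids truncating URLs that carry a different thumbnail size or none at all.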

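The full php-simple-html-dom-parser API manual is at: http://simplehtmldom.sourceforge.net/manual.htm

Summary

selenium can do a great deal, and web scraping is only a small part of it, but a very practical one. Its distinguishing feature is that every operation is visible in a real browser, which makes possible many things curl cannot do, especially workflows where a person has to step in partway through, such as grabbing train tickets or sniping flash sales. With a little imagination, selenium is a web tool well worth learning.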

5 comments

米扑博客
• 8 years ago
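Very thorough write-up. I had previously automated selenium in Python for my 米扑代理 service. Today, following this post, I got PHP selenium automation working in my 米扑导航 site. Thanks for sharing! I have also written about PHP Selenium on my own blog.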

- minirplus • author
  • 8 years ago
  It has been a long time since I last used selenium.
  
laba
• 8 years ago
Too hard. I tried it on linux and could not get it working.

- minirplus • author
  • 8 years ago
  It really is hard. There are probably more convenient tools available by now.
  
  - 米扑科技
    • 8 years ago
    
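    The author explained it very clearly, and configuring it step by step is not hard. If you have some background and have done this before, it is quite simple.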
    