
First Steps with Scrapy

I've been learning Scrapy over the past few days. It has proven really useful; at the very least, it takes far less code than the crawlers I used to write from scratch.

Installation

Scrapy doesn't support Python 3.x yet, so you still have to use Python 2.x. Install it like this:

$ pip install scrapy

Remember to add the Python2x/Scripts path to your environment variables so the scrapy command is available.
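As a quick check that the install worked (a minimal sketch; the version tuple in the comment is just an example):

# run under Python 2: confirm that Scrapy imports and report its version
import scrapy
print(scrapy.version_info)   # e.g. (0, 24, 4)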

The First Project

$ scrapy startproject hrtencent

You can pick any name you like; usually it matches the site being crawled. The one above targets http://hr.tencent.com/position.php

Directory tree

hrtencent/
    scrapy.cfg
    hrtencent/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

First comes items.py, which holds the scraped data.

from scrapy.item import Item, Field

class TencentItem(Item):
    name = Field()          # position title
    type = Field()          # position category
    number = Field()        # number of openings
    location = Field()      # location of the position
    publish_date = Field()  # date the posting was published

Of course, you can define multiple Item classes to handle different kinds of data, for example as sketched below.
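If the detail pages carry different data, a second Item could look like this (a sketch; the class and field names are made up, not part of the project):

from scrapy.item import Item, Field

class TencentDetailItem(Item):   # hypothetical Item for the detail pages
    name = Field()               # position title
    duty = Field()               # job responsibilities
    requirement = Field()        # job requirements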

Next, create HRTencentSpider.py under spiders/:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from hrtencent.items import TencentItem

class TencentSpider(CrawlSpider):
    name = "hrtencent"
    allowed_domains = ["tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php"
    ]

    rules = (
        # list pages: follow the pagination links and parse each listing page
        Rule(
            LinkExtractor(
                allow=(r"/position\.php\?&start=\d0#a",),
                restrict_xpaths=('//a[contains(@href, "position.php?&start=")]',)
            ),
            callback="parse_positions",
            follow=True
        ),
        # detail pages: follow the links to individual positions
        Rule(
            LinkExtractor(
                allow=(r"/position_detail\.php\?id=\d+",),
                restrict_xpaths=('//a[contains(@href, "position_detail.php?id=")]',)
            ),
            callback="parse_position_details",  # definition not shown in this post
            follow=True
        )
    )

    def parse_positions(self, response):
        # every row in the position table is either .even or .odd
        positions = response.css(".even") + response.css(".odd")
        for position in positions:
            item = TencentItem()
            item['name'] = position.xpath("td[1]/a/text()").extract()[0]
            item['type'] = position.xpath("td[2]/text()").extract()[0]
            item['number'] = position.xpath("td[3]/text()").extract()[0]
            item['location'] = position.xpath("td[4]/text()").extract()[0]
            item['publish_date'] = position.xpath("td[5]/text()").extract()[0]
            yield item
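The two Rule/LinkExtractor pairs are what make the crawl recursive: every downloaded page is run through the extractors, matching links are scheduled, and the named callback parses the pages they lead to. If you want to sanity-check an allow pattern outside the spider, a rough sketch like this works (the HTML snippet is made up, and the imports assume the same Scrapy version as above):

from scrapy.http import HtmlResponse
from scrapy.contrib.linkextractors import LinkExtractor

# a made-up fragment of the list page, just to exercise the pattern
body = '<a href="position.php?&start=10#a">next page</a>'
response = HtmlResponse(url="http://hr.tencent.com/position.php", body=body)

extractor = LinkExtractor(allow=(r"position\.php\?&start=\d0",))
print([link.url for link in extractor.extract_links(response)])  # absolute URLs of the matching links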

This is what I use to save the scraped items as JSON. It's just my approach; pipelines.py is actually quite powerful, and it all depends on how you use it:

import json
import codecs
import os

from scrapy.exceptions import DropItem

class TencentPipeline(object):
    def process_item(self, item, spider):
        return item

class JsonWriterPipeline(object):

    def __init__(self):
        pass

    # called once when the spider is opened
    def open_spider(self, spider):
        self.file = codecs.open('items.json', 'wb', encoding='utf-8')  # create/open the file, utf-8 encoded
        self.file.write("[")  # the JSON array begins

    # called for every scraped item
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + ",\n"  # serialize to JSON, one item per line, comma-separated
        self.file.write(line.decode('unicode_escape'))  # unescape the \uXXXX sequences and write to the file
        return item

    def close_spider(self, spider):
        self.file.seek(-2, os.SEEK_END)  # seek back over the trailing ',\n'
        self.file.truncate()             # drop it so the output is valid JSON
        self.file.write("]")             # close the JSON array
        self.file.close()

To make pipelines.py take effect, you have to register it in settings.py:

ITEM_PIPELINES = {
    'hrtencent.pipelines.JsonWriterPipeline': 800,
}
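If several pipelines are enabled, the number (0-1000) decides the order in which they run; lower values run first. A sketch combining both classes defined above (module path assumed from the hrtencent project layout):

ITEM_PIPELINES = {
    'hrtencent.pipelines.TencentPipeline': 300,
    'hrtencent.pipelines.JsonWriterPipeline': 800,
}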

And with that, a recursive crawler is done. Run it with the command below; the name corresponds to spider.name:

$ scrapy crawl hrtencent
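Once the crawl finishes, a quick way to confirm that items.json came out as valid JSON (a minimal sketch, assuming the file name used in the pipeline above):

import io
import json

with io.open('items.json', encoding='utf-8') as f:
    items = json.load(f)

print(len(items))   # number of scraped positions
print(items[0])     # the first item as a dict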

Setting up different pipelines for different spiders

You can handle this in pipelines.py like so:

class JsonWriterPipeline(object):

    def __init__(self):
        pass

    def open_spider(self, spider):
        # only open the output file for the spider this pipeline targets
        if spider.name == 'hrtencent':
            self.file = codecs.open('items.json', 'wb', encoding="utf-8")
            self.file.write("[")

    def process_item(self, item, spider):
        # let items from other spiders pass through untouched
        if spider.name != 'hrtencent':
            return item

        line = json.dumps(dict(item)) + ",\n"
        self.file.write(line.decode('unicode_escape'))
        return item

    def close_spider(self, spider):
        if spider.name == 'hrtencent':
            self.file.seek(-2, os.SEEK_END)
            self.file.truncate()
            self.file.write("]")
            self.file.close()

Or check against several spiders at once:

if spider.name in ['hrtencent', 'hrtencent_details']:
    pass
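Taking that a step further, here is a rough sketch of a single pipeline that writes a separate file per spider. The class name is hypothetical and, unlike the pipeline above, it writes one JSON object per line instead of a JSON array:

import codecs
import json

class PerSpiderJsonPipeline(object):  # hypothetical name, not from the project
    handled = ['hrtencent', 'hrtencent_details']

    def open_spider(self, spider):
        self.files = {}
        if spider.name in self.handled:
            self.files[spider.name] = codecs.open('%s.json' % spider.name, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        f = self.files.get(spider.name)
        if f is not None:
            # one JSON object per line (JSON Lines), so there is no trailing comma to clean up
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        f = self.files.pop(spider.name, None)
        if f is not None:
            f.close()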

Using Request to crawl one level deeper:

from scrapy.http import Request

def parse_lists(self, response):
    tuigirls = response.css("#piclist .cl li")

    for tuigirl in tuigirls:
        item = LeisimaoItem()
        item['id'] = tuigirl.css('h3 a::text')[0].extract().strip()[15:18]

        title = tuigirl.css('h3 a::text')[0].extract().strip()
        if "/" in title:
            item['title'] = title[12:title.find('/')]
        else:
            item['title'] = title[12:]

        item['link'] = "http://bbs.leisimao.com/" + tuigirl.css('a::attr(href)')[0].extract()
        item['count'] = tuigirl.css('.idata .idc::text')[0].extract().strip()

        # pass the partly-filled item along in meta and hand the detail page to another callback
        request = Request(item['link'], meta={'item': item}, callback=self.parse_imgs)

        yield request

def parse_imgs(self, response):
    # pick the item back up and add the image URLs found on the detail page
    item = response.meta['item']
    item["imgs"] = response.css("#pic-list li img::attr(src)").extract()
    return item
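For reference, the LeisimaoItem used above only needs fields for the keys assigned in the two callbacks; a minimal sketch (the comments are my reading of the selectors):

from scrapy.item import Item, Field

class LeisimaoItem(Item):
    id = Field()      # numeric id sliced out of the thread title
    title = Field()   # thread title
    link = Field()    # absolute URL of the thread
    count = Field()   # the count shown in the list row
    imgs = Field()    # image URLs collected on the detail page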

Breakpoint debugging

Add the following wherever you need a breakpoint; it drops you into a Python shell with the current response available:

from scrapy.shell import inspect_response

inspect_response(response)  # newer Scrapy versions also take the spider: inspect_response(response, self)
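For example, inside parse_positions you might only drop into the shell when something looks off (a sketch; the empty-result condition is made up):

def parse_positions(self, response):
    positions = response.css(".even") + response.css(".odd")
    if not positions:                # made-up check: the rows we expect are missing
        inspect_response(response)   # interactive shell; exit it to let the crawl continue
    # ... build items as before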

Feel free to take a look at what I wrote too, though it's pretty rough, so take it with a grain of salt: https://github.com/youngdze/ScrapyExecise