1. Define the data you want to scrape in the Item:
movie_name is like a dictionary "key", and the scraped data is like the corresponding "value". These fields are used later in the class that inherits from BaseSpider:
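For reference, a minimal items.py along these lines might look as follows. This is only a sketch, not necessarily the author's exact file; the field names are the ones the spider at the end of this section actually fills in.

# -*- coding: utf-8 -*-
# tutorial/items.py -- a minimal sketch matching the spider code further down
from scrapy.item import Item, Field

class TutorialItem(Item):
    # each Field is a "key"; the scraped data becomes its "value"
    movie_name = Field()
    movie_director = Field()
    movie_writer = Field()
    movie_roles = Field()
    movie_description = Field()
    # add more Fields here (movie_language, movie_date, ...) if you extract them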
2. Then edit the Spider.py file under the spiders directory; just write it the way the getting-started tutorial above does. Here is my example, which matches the item defined above. The getting-started tutorial does not show the start_requests method; I will get to it shortly. Also, the code here was originally posted as screenshots; further down I show my code in a code panel so that anyone who wants it can copy it and experiment.
3. Edit the pipelines.py file; through it you can write the contents saved in TutorialItem to a database or to a file. The code example below writes to a file (if you want to write to a database instead, there is sample code for that here). A note on the json module's methods: dump and dumps generate JSON from Python, while load and loads parse JSON into Python data types; the only difference between dump and dumps is that dump writes to a file-like object while dumps returns a string, and likewise load and loads parse JSON from a file-like object and from a string, respectively. (Notes taken from http://www.jb51.net/article/52224.htm )
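As a sketch of what such a file-writing pipeline might look like (the original pipeline code is not reproduced here; the output file name douban.json and the class name TutorialPipeline are my assumptions):

# -*- coding: utf-8 -*-
# tutorial/pipelines.py -- a sketch of a pipeline that writes each item to a file
import json
import codecs

class TutorialPipeline(object):
    def __init__(self):
        # codecs keeps the output utf-8 so Chinese text is not mangled
        self.file = codecs.open('douban.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dumps turns the dict-like item into a JSON string;
        # dump would instead write directly to a file-like object
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

Remember that Scrapy only calls the pipeline if it is enabled via the ITEM_PIPELINES setting in settings.py.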
4. Start crawling. Once the three steps above are done (and only those three steps are needed), switch to the tutorial directory in the DOS prompt and type scrapy crawl douban to start crawling:
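Note that the spider shown below reads its search terms from a file called movie_name.txt, one movie title per line, in the directory you run the command from. The titles here are only illustrative:

肖申克的救赎
这个杀手不太冷
阿甘正传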
#coding=utf-8
import sys
reload(sys)
# Python's default encoding is ascii
sys.setdefaultencoding('utf-8')

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem
import re

class DoubanSpider(BaseSpider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = []

    def start_requests(self):
        # read the movie titles to search for, one per line
        file_object = open('movie_name.txt', 'r')
        try:
            url_head = 'http://movie.douban.com/subject_search?search_text='
            for line in file_object:
                self.start_urls.append(url_head + line.strip())
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        finally:
            file_object.close()
            #years_object.close()

    def parse(self, response):
        #open('test.html','wb').write(response.body)
        hxs = HtmlXPathSelector(response)
        #movie_name = hxs.select('//*[@id="content"]/div/div[1]/div[2]/table[1]/tr/td[1]/a/@title').extract()
        movie_link = hxs.select('//*[@id="content"]/div/div[1]/div[2]/table[1]/tr/td[1]/a/@href').extract()
        #movie_desc = hxs.select('//*[@id="content"]/div/div[1]/div[2]/table[1]/tr/td[2]/div/p/text()').extract()
        if movie_link:
            yield Request(movie_link[0], callback=self.parse_item)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        movie_name = hxs.select('//*[@id="content"]/h1/span[1]/text()').extract()
        movie_director = hxs.select('//*[@id="info"]/span[1]/span[2]/a/text()').extract()
        movie_writer = hxs.select('//*[@id="info"]/span[2]/span[2]/a/text()').extract()
        # the plot summary has to be extracted from within an already selected node
        movie_description_paths = hxs.select('//*[@id="link-report"]')
        movie_description = []
        for movie_description_path in movie_description_paths:
            movie_description = movie_description_path.select('.//*[@property="v:summary"]/text()').extract()
        # the cast likewise has to be extracted from an existing XPath selection
        movie_roles_paths = hxs.select('//*[@id="info"]/span[3]/span[2]')
        movie_roles = []
        for movie_roles_path in movie_roles_paths:
            movie_roles = movie_roles_path.select('.//*[@rel="v:starring"]/text()').extract()
        # grab the whole "info" block of the movie page
        movie_detail = hxs.select('//*[@id="info"]').extract()

        item = TutorialItem()
        item['movie_name'] = ''.join(movie_name).strip().replace(',', ';').replace("'", "\\'").replace(':', ';')
        #item['movie_link'] = movie_link[0]
        item['movie_director'] = movie_director[0].strip().replace(',', ';').replace("'", "\\'").replace(':', ';') if len(movie_director) > 0 else ''
        # commas are used to separate a movie's fields, so replace them; quotes also have to
        # be escaped, otherwise inserting into the database will break
        item['movie_description'] = movie_description[0].strip().replace(',', ';').replace("'", "\\'").replace(':', ';') if len(movie_description) > 0 else ''
        item['movie_writer'] = ';'.join(movie_writer).strip().replace(',', ';').replace("'", "\\'").replace(':', ';')
        item['movie_roles'] = ';'.join(movie_roles).strip().replace(',', ';').replace("'", "\\'").replace(':', ';')
        #item['movie_language'] = movie_language[0].strip() if len(movie_language) > 0 else ''
        #item['movie_date'] = ''.join(movie_date).strip()
        #item['movie_long'] = ''.join(movie_long).strip()

        # the full movie detail string
        movie_detail_str = ''.join(movie_detail).strip()
        #print movie_detail_str
        movie_language_str = '.*语言: (.+?)