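First, look at the page to be crawled: the Douban Movie category page at movie.douban.com/tag/#/. The "Load more" button at the bottom of that page fetches additional movie information via AJAX.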
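Open IDLE, the Python development tool, create a new file named 'pdouban.py', and write the following code to test the crawler:

    import urllib.request

    # Category page to fetch
    url = 'xxx://movie.douban.com/tag/#/'
    # Browser-like headers so the request looks like a normal page visit
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
        'Referer': 'xxx://movie.douban.com/',
        'Connection': 'keep-alive'}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    # Decode the response bytes as UTF-8 text
    content = res.read().decode('utf8')
    print(content)

Here xxx stands for the corresponding hypertext transfer protocol.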
Press F5 to run the code; the page content is printed successfully.
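Press F12 to open the browser's developer tools, click the "Load more" button, and look in the Network panel at the request that is sent while the data loads.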
The start parameter in the request URL increases by 20 with each request:

    Request URL: xxx://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0
    Request URL: xxx://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=20
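As a minimal sketch of that pattern, the request URL for any page can be built from its start offset. The names BASE_URL and build_page_url are hypothetical and are not part of the original code; the query parameters are copied from the URLs observed above, and xxx remains a placeholder for the protocol.

```python
from urllib.parse import urlencode

# "xxx" is the article's placeholder for the protocol; replace it before use.
BASE_URL = 'xxx://movie.douban.com/j/new_search_subjects'


def build_page_url(start, base_url=BASE_URL):
    """Return the request URL for the page that begins at offset `start`."""
    params = {'sort': 'U', 'range': '0,10', 'tags': '', 'start': start}
    # safe=',' leaves the comma in range=0,10 unescaped, matching the observed URL
    return base_url + '?' + urlencode(params, safe=',')


# Offsets of the first three pages: 0, 20, 40
for start in range(0, 60, 20):
    print(build_page_url(start))
```

Building the query with urlencode rather than by string concatenation keeps the parameters in one place, which helps if a tag or sort order is added later.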
The response body turns out to be a JSON-formatted string.
To print the movie titles, modify the code as follows:

    import urllib.request
    import json

    # AJAX endpoint found in the Network panel; start=0 requests the first page
    url = 'xxx://movie.douban.com/j/new_search_subjects?sort=U&range=0,10&tags=&start=0'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
        'Referer': 'xxx://movie.douban.com/',
        'Connection': 'keep-alive'}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    content = res.read().decode('utf8')
    # Parse the JSON string; the movie entries are under the 'data' key
    dcontent = json.loads(content)
    for item in dcontent['data']:
        print(item['title'])
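Press F5 to run the code; the movie titles are printed successfully. Wrapping the request in an outer loop that advances start by 20 each time prints the titles from multiple requests.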
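The outer loop could look something like the sketch below. It is not the author's code: the function names (fetch_json, crawl_titles), the early stop on an empty page, and the optional delay between requests are all assumptions added for illustration. As before, xxx must be replaced with the real protocol before the code can reach the site.

```python
import json
import time
import urllib.request

# "xxx" is the article's placeholder for the protocol; replace it before running.
URL_TEMPLATE = ('xxx://movie.douban.com/j/new_search_subjects'
                '?sort=U&range=0,10&tags=&start={start}')
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 '
                  '(KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
    'Referer': 'xxx://movie.douban.com/',
    'Connection': 'keep-alive',
}


def fetch_json(url):
    """Request one page and return the parsed JSON as a dict."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as res:
        return json.loads(res.read().decode('utf8'))


def crawl_titles(pages, step=20, fetch=fetch_json, delay=0):
    """Collect titles from up to `pages` requests, advancing start by `step`."""
    titles = []
    for start in range(0, pages * step, step):
        page = fetch(URL_TEMPLATE.format(start=start))
        items = page.get('data', [])
        if not items:
            # Assumption: an empty 'data' list means there is nothing more to load
            break
        titles.extend(item['title'] for item in items)
        time.sleep(delay)  # optional pause between requests
    return titles
```

For example, `for title in crawl_titles(3, delay=1): print(title)` would request start=0, 20 and 40 with a one-second pause after each page. Passing the fetch function as a parameter also allows the loop to be tried offline with a stand-in that returns canned data.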