Scrapy case study: 某爱读
Task: simulate logging in to http://www.woaige.net/
Command recap:
1. Create a project: scrapy startproject <project_name>
2. Generate a spider: scrapy genspider <spider_name> <domain>.com
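For this case the two commands might look like the following (the project and spider names are illustrative guesses, chosen so that genspider produces the DengSpider class used later in this section):
scrapy startproject woaige
cd woaige
scrapy genspider deng woaige.net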
Part 1: Crawling approaches
If we run the spider at this point, the page reports that the user is not logged in. Whichever approach we pick, the cookie has to be obtained before the URLs in start_urls are requested. By default, however, Scrapy builds those initial requests for us automatically.
Check the Scrapy source to see how the initial start_urls are turned into requests:
# The following is Scrapy source code
def start_requests(self):
    cls = self.__class__
    if not self.start_urls and hasattr(self, 'start_url'):
        raise AttributeError(
            "Crawling could not start: 'start_urls' not found "
            "or empty (but found 'start_url' attribute instead, "
            "did you miss an 's'?)")
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            # The core is just this one line: build a Request object.
            # We can do the same thing ourselves.
            yield Request(url, dont_filter=True)
We can override start_requests ourselves, which puts the creation of the very first request in our own hands:
def start_requests(self):  # Request here is scrapy.Request
    print("I am the root of all evil")  # proof that this runs before any crawling
    yield Request(
        url=self.start_urls[0],
        callback=self.parse
    )
- Plan 1: copy the cookie over from the browser directly
import scrapy

class DengSpider(scrapy.Spider):
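    # The notes break off here; what follows is a minimal sketch of Plan 1.
    # The spider name, cookie string, and field values are illustrative,
    # not from the original.
    name = "deng"
    allowed_domains = ["woaige.net"]
    start_urls = ["http://www.woaige.net/"]

    def start_requests(self):
        # Paste the Cookie request header from a logged-in browser session,
        # then split it into the dict form that scrapy.Request expects.
        cookie_str = "key1=value1; key2=value2"  # placeholder, not a real cookie
        cookies = dict(kv.split("=", 1) for kv in cookie_str.split("; "))
        for url in self.start_urls:
            yield scrapy.Request(url, cookies=cookies, dont_filter=True)

    def parse(self, response):
        # With a valid cookie, the logged-in username should show up here.
        print(response.text)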