Last updated: 2017-12-21
Introduction
Scrapy is an application framework for crawling web sites and extracting structured data.
Requests are scheduled and processed asynchronously.
Features
* Built-in support for selecting and extracting data from HTML/XML sources
using extended CSS selectors and XPath expressions (a selector sketch follows this list)
* An interactive shell console for trying out CSS and XPath expressions while writing or debugging spiders
* Wide range of built-in extensions
- cookies and session handling
- HTTP features like compression, authentication, caching
- Feed exports (JSON, JSON lines, CSV, XML)
* A Telnet console for hooking into a Python console running inside your Scrapy process
(useful for introspecting and debugging a running crawler)
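A minimal standalone sketch of the selector feature (the HTML snippet below is made up for illustration; it mirrors the quote markup used in the examples later on):

from scrapy.selector import Selector

html = '<div class="quote"><span class="text">Hello</span><small>Ann</small></div>'
sel = Selector(text=html)

# CSS selector with Scrapy's ::text extension
print(sel.css('span.text::text').extract_first())     # Hello
# equivalent extraction with an XPath expression
print(sel.xpath('//small/text()').extract_first())    # Ann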
Homepage
https://scrapy.org/
Components
- Scrapy Engine
- Scheduler
- Downloader
- Spiders
- Item Pipeline (processes the items once they have been extracted by the spiders, e.g. validating them or storing them in a database; a minimal pipeline sketch follows this list)
- Downloader middlewares (sit between the Engine and the Downloader)
- Spider middlewares (sit between the Engine and the Spiders)
- Event-driven networking (Scrapy is built on Twisted, an asynchronous event-driven networking framework)
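To illustrate the Item Pipeline component, here is a minimal pipeline sketch that writes every scraped item to a JSON Lines file (the items.jl file name and the tutorial.pipelines module path are assumptions for this example):

# pipelines.py
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

# enable it in settings.py, e.g.:
# ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}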
Install
apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
virtualenv scrapy
source scrapy/bin/activate
pip install scrapy
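A quick way to confirm the install from the same virtualenv:

scrapy version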
Example 1
quotes_spider.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    # identifies the Spider
    name = "quotes"

    # a shortcut to the start_requests method
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # callback method "parse":
    # handles the response downloaded for each of the requests made
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
# Run
# runspider: Scrapy looks for a Spider definition inside the given file and runs it through its crawler engine
scrapy runspider quotes_spider.py -o quotes.json
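The same run can also export JSON Lines, which is easier to append to than a single JSON array (the quotes.jl name is just an example; the format is inferred from the file extension):

scrapy runspider quotes_spider.py -o quotes.jl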
Settings
# environment variable: SCRAPY_SETTINGS_MODULE
# Command line options (-s | --set)
scrapy crawl myspider -s LOG_FILE=scrapy.log
# per-spider options: 'custom_settings = {}'
class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
# Project settings module: settings.py
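The project settings module is a plain Python file; a minimal sketch of what tutorial/settings.py might contain (the values below are illustrative assumptions, not required defaults):

# tutorial/settings.py
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# be polite to the target site
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5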
# Access Settings
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
Example 2 - Creating a project
# creates a tutorial directory with the project files
scrapy startproject tutorial
# a directory where you'll later put your spiders
Location: tutorial/tutorial/spiders/quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    # identifies the Spider
    name = "quotes"

    # must return an iterable of Requests
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # handles the response downloaded for each of the requests made
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
# go to the project’s top level directory
scrapy crawl quotes
Scrapy shell
scrapy shell 'http://quotes.toscrape.com/page/1/'
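Inside the shell the downloaded page is exposed as response; a few things to try (these selectors assume the quotes.toscrape.com markup used in the examples above):

>>> response.css('title::text').extract_first()
>>> response.css('div.quote span.text::text').extract()
>>> response.xpath('//span[@class="text"]/text()').extract_first()
>>> view(response)    # open the downloaded page in a browser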
Firefox add-ons
- XPath Checker
- Tamper Data
- Firecookie
Doc
https://docs.scrapy.org/en/latest/