Last updated: 2017-12-21



Scrapy is an application framework for crawling web sites and extracting structured data.

Requests are scheduled and processed asynchronously.


* Built-in support for selecting and extracting data from HTML/XML sources
  using extended CSS selectors and XPath expressions

* An interactive shell console

* Wide range of built-in extensions
  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - Feed exports (JSON, JSON lines, CSV, XML)

* A Telnet console for hooking into a Python console running inside your
  Scrapy process (to debug your crawler)
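A note on the "JSON lines" feed format listed above: it stores one JSON object per line, so a feed can be written and read incrementally instead of holding one big array in memory. A quick stdlib illustration (the items here are made up, not scraped):

```python
import json

# JSON Lines: one JSON object per line; each line parses independently.
# The items below are illustrative placeholders.
items = [{'text': 'a'}, {'text': 'b'}]
lines = '\n'.join(json.dumps(i) for i in items)
print(lines)
```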


  • Scrapy Engine
  • Scheduler
  • Downloader
  • Spiders
  • Item Pipeline (processes the items once they have been extracted by the spiders, e.g. storing them in a database)
  • Downloader middlewares (sit between the Engine and the Downloader)
  • Spider middlewares (sit between the Engine and the Spiders)
  • Event-driven networking (Scrapy is built on Twisted, an asynchronous, event-driven networking framework)
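To make the Item Pipeline component concrete: a pipeline is a plain class whose `process_item` method the engine calls for each scraped item. A minimal sketch modeled on the docs' JSON-writer example (the class name and the `items.jl` output file are illustrative choices, not fixed by Scrapy):

```python
import json

# Minimal item-pipeline sketch: Scrapy calls process_item() for every
# item a spider yields. JsonWriterPipeline / items.jl are illustrative.
class JsonWriterPipeline:

    def open_spider(self, spider):
        # called when the spider is opened
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        # write the item as one JSON line, then pass it on
        self.file.write(json.dumps(item) + '\n')
        return item
```

The pipeline is enabled per project via the `ITEM_PIPELINES` setting.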




apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

virtualenv scrapy

source scrapy/bin/activate

pip install scrapy


Example 1

import scrapy

class QuotesSpider(scrapy.Spider):

    # identifies the Spider
    name = "quotes"

    # a shortcut to the start_requests method
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    # callback method "parse" handles the response
    # downloaded for each of the requests made
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# Run

# runspider: Scrapy looks for a Spider definition inside the given file
# and runs it through its crawler engine
# (assuming the spider above was saved as quotes_spider.py)

scrapy runspider quotes_spider.py -o quotes.json




# environment variable: SCRAPY_SETTINGS_MODULE

# Command line options (-s | --set)

scrapy crawl myspider -s LOG_FILE=scrapy.log

# per-spider options: 'custom_settings = {}'

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

# Project settings module:


# Access Settings

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
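Besides `self.settings` inside a spider, other components (pipelines, middlewares, extensions) read settings through the crawler object passed to their `from_crawler` class method. A sketch of that pattern (`ExportPipeline` and the `EXPORT_PATH` setting name are hypothetical, for illustration only):

```python
# from_crawler pattern sketch: Scrapy calls this classmethod with the
# running crawler; crawler.settings supports dict-style lookups.
# ExportPipeline and EXPORT_PATH are made-up names for illustration.
class ExportPipeline:

    def __init__(self, export_path):
        self.export_path = export_path

    @classmethod
    def from_crawler(cls, crawler):
        return cls(export_path=crawler.settings.get('EXPORT_PATH', 'items.json'))
```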


Example 2 - Creating a project


# creates a tutorial directory and the project files

scrapy startproject tutorial

# a directory where you'll later put your spiders

Location: tutorial/tutorial/spiders/

import scrapy

class QuotesSpider(scrapy.Spider):
    # identifies the Spider
    name = "quotes"

    # must return an iterable of Requests
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # to handle the response downloaded for each of the requests made
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
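The parse method builds its filename from the URL's second-to-last path segment; the expression can be checked with plain Python:

```python
# url.split("/")[-2] picks the second-to-last path segment; for a
# tutorial-style URL that segment is the page number.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```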

# go to the project’s top level directory

scrapy crawl quotes


Scrapy shell


# fetch a URL and drop into an interactive shell, e.g.:

scrapy shell 'http://quotes.toscrape.com/page/1/'


Firefox add-ons


  • XPath Checker
  • Tamper Data
  • Firecookie