scrapy

Last updated: 2017-12-21

Introduction

Scrapy is an application framework for crawling websites and extracting structured data.

Requests are scheduled and processed asynchronously.

Features

* Built-in support for selecting and extracting data from HTML/XML sources
  using extended CSS selectors and XPath expressions

* An interactive shell console

* Wide range of built-in extensions:
  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - feed exports (JSON, JSON lines, CSV, XML)

* A Telnet console for hooking into a Python console running inside your
  Scrapy process (to debug your crawler)

HomePage

https://scrapy.org/

Components

  • Scrapy Engine
  • Scheduler
  • Downloader
  • Spiders
  • Item Pipeline (processes the items once they have been extracted by the spiders, e.g. storing them in a database; see the sketch after this list)
  • Downloader middlewares (sit between the Engine and the Downloader)
  • Spider middlewares (sit between the Engine and the Spiders)
  • Event-driven networking (built on Twisted, an asynchronous event-driven networking framework)
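
To make the Item Pipeline component concrete, here is a minimal pipeline sketch that writes each scraped item to a JSON-lines file. The class name, file path, and priority number are illustrative, not part of Scrapy itself:

import json

# enable it in settings.py, e.g.:
# ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}
class JsonWriterPipeline(object):

    # called when the spider is opened
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    # called when the spider is closed
    def close_spider(self, spider):
        self.file.close()

    # called for every item yielded by a spider; must return the item
    # (or raise DropItem) so later pipelines can run
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item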

 


Install

 

# install build dependencies (Debian/Ubuntu)
apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

# create and activate a virtualenv, then install Scrapy
virtualenv scrapy

source scrapy/bin/activate

pip install scrapy

 


Example 1

 

quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):

    # identifies the Spider
    name = "quotes"

    # A shortcut to the start_requests method
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # the "parse" callback handles the response
    # downloaded for each of the requests made
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# Run

# runspider: Scrapy looks for a Spider definition inside the file and runs it through its crawler engine

scrapy runspider quotes_spider.py -o quotes.json
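
The -o option serializes the scraped items through the built-in feed exports. quotes.json ends up as a JSON array with one object per yielded item, roughly (illustrative values):

[
  {"text": "...", "author": "..."},
  {"text": "...", "author": "..."}
]

Note that -o appends to an existing file, so delete quotes.json before re-running or the resulting JSON will be invalid.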

 


Settings

 

# environment variable: SCRAPY_SETTINGS_MODULE

# Command line options (-s | --set)

scrapy crawl myspider -s LOG_FILE=scrapy.log

# per-spider options: 'custom_settings = {}'

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

# Project settings module: settings.py

 

# Access Settings

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

 


Example 2 - Creating a project

 

# create a tutorial directory with the project files

scrapy startproject tutorial
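
startproject generates a skeleton roughly like this:

tutorial/
    scrapy.cfg          # deploy configuration file
    tutorial/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py     # project settings (see the Settings section above)
        spiders/
            __init__.py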

# a directory where you'll later put your spiders

Location: tutorial/tutorial/spiders/quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    # identifies the Spider
    name = "quotes"

    # must return an iterable of Requests
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # to handle the response downloaded for each of the requests made
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

# go to the project's top-level directory, then run

scrapy crawl quotes
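
# this saves quotes-1.html and quotes-2.html in the directory you ran the command from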

 


Scrapy shell

 

scrapy shell 'http://quotes.toscrape.com/page/1/'
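
The shell fetches the page and exposes the request and response objects plus helpers such as fetch() and view(). A short illustrative session:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('span.text::text').extract_first()
'...'
>>> fetch('http://quotes.toscrape.com/page/2/')   # replace response with another page
>>> view(response)                                # open the response in your browser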

 


Firefox add-ons

 

  • XPath Checker
  • Tamper Data
  • Firecookie

 

 


Doc

 

https://docs.scrapy.org/en/latest/

 

 


 

 
