scrapy

Last updated: 2017-12-21

Introduction

Scrapy is an application framework for crawling websites and extracting structured data.

Requests are scheduled and processed asynchronously.

Features

* Built-in support for selecting and extracting data from HTML/XML sources
  using extended CSS selectors and XPath expressions

* An interactive shell console

* Wide range of built-in extensions:
  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - feed exports (JSON, JSON lines, CSV, XML)

* A Telnet console for hooking into a Python console running inside your
  Scrapy process (to debug your crawler)

HomePage

https://scrapy.org/

Components

  • Scrapy Engine
  • Scheduler
  • Downloader
  • Spiders
  • Item Pipeline (processes the items once they have been extracted by the spiders, e.g. storing them in a database; see the sketch after this list)
  • Downloader middlewares (sit between the Engine and the Downloader)
  • Spider middlewares (sit between the Engine and the Spiders)
  • Event-driven networking (built on Twisted, an asynchronous event-driven networking framework)
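
To make the Item Pipeline component concrete, here is a minimal pipeline sketch that writes each scraped item to a JSON-lines file. The class name, file path, and priority number are illustrative, not part of Scrapy itself:

import json

# enable it in settings.py, e.g.:
# ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}
class JsonWriterPipeline(object):

    # called when the spider is opened
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    # called when the spider is closed
    def close_spider(self, spider):
        self.file.close()

    # called for every item yielded by a spider; must return the item
    # (or raise DropItem) so later pipelines can run
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item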

 


Install

 

# install build dependencies (Debian/Ubuntu)
apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

# create and activate a virtualenv, then install Scrapy
virtualenv scrapy

source scrapy/bin/activate

pip install scrapy

 


Example 1

 

quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):

    # identifies the Spider
    name = "quotes"

    # A shortcut to the start_requests method
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    # the "parse" callback handles the response
    # downloaded for each of the requests made
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# Run

# runspider: Scrapy looks for a Spider definition inside the file and runs it through its crawler engine

scrapy runspider quotes_spider.py -o quotes.json
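
The -o option serializes the scraped items through the built-in feed exports. quotes.json ends up as a JSON array with one object per yielded item, roughly (illustrative values):

[
  {"text": "...", "author": "..."},
  {"text": "...", "author": "..."}
]

Note that -o appends to an existing file, so delete quotes.json before re-running or the resulting JSON will be invalid.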

 


Settings

 

# environment variable: SCRAPY_SETTINGS_MODULE

# Command line options (-s | --set)

scrapy crawl myspider -s LOG_FILE=scrapy.log

# per-spider options: 'custom_settings = {}'

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

# Project settings module: settings.py

 

# Access Settings

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

 


Example 2 - Creating a project

 

# create a tutorial directory with the project files

scrapy startproject tutorial
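
startproject generates a skeleton roughly like this:

tutorial/
    scrapy.cfg          # deploy configuration file
    tutorial/           # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py     # project settings (see the Settings section above)
        spiders/
            __init__.py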

# a directory where you'll later put your spiders

Location: tutorial/tutorial/spiders/quotes_spider.py

import scrapy

class QuotesSpider(scrapy.Spider):
    # identifies the Spider
    name = "quotes"

    # must return an iterable of Requests
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # to handle the response downloaded for each of the requests made
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

# go to the project's top-level directory, then run

scrapy crawl quotes
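
# this saves quotes-1.html and quotes-2.html in the directory you ran the command from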

 


Scrapy shell

 

scrapy shell 'http://quotes.toscrape.com/page/1/'
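
The shell fetches the page and exposes the request and response objects plus helpers such as fetch() and view(). A short illustrative session:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('span.text::text').extract_first()
'...'
>>> fetch('http://quotes.toscrape.com/page/2/')   # replace response with another page
>>> view(response)                                # open the response in your browser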

 


Firefox add-ons

 

  • XPath Checker
  • Tamper Data
  • Firecookie

 

 


Doc

 

https://docs.scrapy.org/en/latest/

 

 


 

 
