Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Web scraping is a software technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet). In Python, web scraping can be done using Scrapy.
You can easily install it using pip (for other installation options, see Scrapy's official installation guide). Type the following into your command prompt.
pip install scrapy
Now, let’s get our hands dirty with some code.
Let’s start off by creating a scrapy project. Enter the directory of your choice and type in the following.
scrapy startproject tutorial
Something like this will be printed out for you.
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
You can start your first spider with:
scrapy genspider example example.com
This will create a directory tutorial with the following contents.
scrapy.cfg # deploy configuration file
tutorial/ # project’s Python module, you’ll import your code from here
items.py # project items definition file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you’ll later put your spiders
Now let’s create a spider, but what are spiders?
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, and optionally how to follow links in the pages and how to parse the downloaded page content to extract data.
We will be using examples from the official doc.
So save the following code in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
As you can see, our spider subclasses scrapy.Spider. Let’s see what each of the attributes and methods means.
name: identifies the spider. It must be unique within a project.
start_requests(): must return an iterable of requests that the spider will begin to crawl from. Subsequent requests will be generated successively from these initial ones.
parse(): a method that will be called to handle the response downloaded for each of the requests made. It usually parses the response, extracts the scraped data, and finds new URLs to follow.
Now let’s run our spider.
Go to the project’s top-level directory and type the following into your command prompt.
scrapy crawl quotes
This command runs the spider named quotes that we’ve just added; it will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:
… (omitted for brevity)
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Spider opened
2017-7-1 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-7-1 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
Two new files have been created in the directory you ran the command from: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
Beautiful! Isn’t it?
To learn to play with Scrapy further, check out the official Scrapy tutorial and documentation.