Python Scrapy Library

What is Scrapy??

 

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

 

But what do you mean be scraping data?

 

Web scraping is a computer software technique of extracting information from websites. This technique mostly focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

In python, web scraping can be done using scrapy.

 

Let’s get started.

Installation first.

You can easily install using pip. For other installation option, click here. Type the following to your command prompt.

Now, let’s get our hand on some coding.

Let’s start off by creating a scrapy project. Enter the directory of your choice and type in the following.

Something like this prints out for you.

You can start your first spider with:

This will create a directory tutorial with the following contents.

tutorial/

scrapy.cfg            # deploy configuration file

 

tutorial/             # project’s Python module, you’ll import your code from here

__init__.py

 

items.py          # project items definition file

 

pipelines.py      # project pipelines file

 

settings.py       # project settings file

 

spiders/          # a directory where you’ll later put your spiders

__init__.py

 

Now let’s create a spider, but what are spiders?

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must have a  subclass scrapy. Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

We will be using examples from the official doc.

So save the following code in a file named quotes_spider.py under the tutorial/spiders directory in your project:

As you can see, our Spider subclasses scrapy.Spider

Let’s see wha teach of the attributes and methods mean.

  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

 

Now let’s run our spider.

Go to the top level directory and type in the following in your cmd.

This command runs the spider with name quotes that we’ve just added, that will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:

… (omitted for brevity)

 

Source:https://doc.scrapy.org/en/latest/intro/tutorial.html

Note:

Two new files have been created in the directory you were at. quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Beautiful! Isn’t it?

Benefits of Scrapy:

  • Scrapy is a full framework for web crawling which has the tools to manage every stage of a web crawl,
  • Comparing with Beautiful Soup, you need to provide a specific url, and Beautiful Soup will help you get the data from that page. You can give Scrapy a start url, and it will go on, crawling and extracting data, without having to explicitly give it every single URL.
  • It can crawl the contents of your webpage prior to extracting.”

 

Challenges of Scrapy:

  • To parse just a few webpages, Scrapy is an overkill. Beautiful soup is better.

 

To learn to play with scrapy, check out

https://doc.scrapy.org/en/latest/intro/tutorial.html

www.tutorialspoint.com/scrapy/

 

Leave a Reply