Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Web scraping is a software technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet). In Python, web scraping can be done using Scrapy.
You can easily install it using pip (for other installation options, see Scrapy's official installation guide). Type the following into your command prompt.
pip install scrapy
Now, let’s get our hands dirty with some code.
Let’s start off by creating a scrapy project. Enter the directory of your choice and type in the following.
scrapy startproject tutorial
Something like this will be printed out for you.
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
You can start your first spider with:
scrapy genspider example example.com
This will create a directory tutorial with the following contents.
scrapy.cfg # deploy configuration file
tutorial/ # project’s Python module, you’ll import your code from here
items.py # project items definition file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you’ll later put your spiders
Now let’s create a spider, but what are spiders?
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, and optionally how to follow links in the pages and how to parse the downloaded page content to extract data.
We will be using examples from the official doc.
So save the following code in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
As you can see, our spider subclasses scrapy.Spider. Let’s see what each of the attributes and methods means.
name: identifies the spider. It must be unique within a project.
start_requests(): must return an iterable of requests that the spider will begin to crawl from. Subsequent requests will be generated successively from these initial ones.
parse(): a method that will be called to handle the response downloaded for each of the requests made. It usually parses the response, extracts the scraped data, and finds new URLs to follow.
Now let’s run our spider.
Go to the project’s top-level directory and type the following into your command prompt.
scrapy crawl quotes
This command runs the spider named quotes that we’ve just added; it will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:
… (omitted for brevity)
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Spider opened
2017-7-1 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-7-1 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
Two new files have been created in the directory you ran the command from: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
Beautiful! Isn’t it?
To learn to play with Scrapy further, check out the official Scrapy tutorial and documentation.