Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Web scraping is a software technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet).
In Python, web scraping can be done using Scrapy.
Installation first. You can easily install Scrapy using pip; other installation options are covered in the official installation guide. Type the following into your command prompt.
pip install scrapy
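Once it finishes, you can check that the install worked by asking Scrapy for its version (this should print the installed version number):

scrapy version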
Now, let's get our hands on some coding.
Let’s start off by creating a scrapy project. Enter the directory of your choice and type in the following.
scrapy startproject tutorial
Something like this prints out for you.
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/enfa/Desktop/BS4/tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Now let’s create a spider, but what are spiders?
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, and optionally how to follow links in the pages and how to parse the downloaded page content to extract data.
We will be using examples from the official doc.
So save the following code in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
As you can see, our spider subclasses scrapy.Spider. Let's see what each of the attributes and methods means.

name identifies the spider; it must be unique within a project.

start_requests() must return an iterable of requests that the spider will begin to crawl from.

The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
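To make that concrete, here is a minimal sketch of such a parse() method, written as a drop-in replacement for the one in QuotesSpider above. It assumes the HTML structure of quotes.toscrape.com (the div.quote / span.text / small.author markup the official tutorial works with); on a different site you would adjust the selectors.

    def parse(self, response):
        # each quote on the page becomes a dict of scraped data
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # follow the "Next" pagination link, if there is one,
        # and parse that page with this same method
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Yielding dicts like this, instead of writing files by hand, is what lets Scrapy's item pipelines and feed exports take over the storage side.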
Now let’s run our spider.
Go to the top-level directory and type the following in your command prompt.
scrapy crawl quotes
This command runs the spider named quotes that we've just added, which will send some requests to the quotes.toscrape.com domain. You will get output similar to this:
… (omitted for brevity)
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Spider opened
2017-7-1 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-7-1 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2017-7-1 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2017-7-1 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2017-7-1 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
…
Source: https://doc.scrapy.org/en/latest/intro/tutorial.html
Note:
Two new files have been created in the directory you ran the command from, quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse() method instructs.
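A side note from the official tutorial: instead of implementing start_requests() yourself, you can define a start_urls class attribute and let Scrapy's default start_requests() generate the initial requests from it. A minimal rewrite of our spider using that shortcut:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # the default start_requests() turns this list into the initial requests
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # same page-saving logic as before
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

And if your parse() yields dicts (as in the earlier sketch) rather than saving files, you can export them without writing any storage code at all, e.g. scrapy crawl quotes -o quotes.json.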
Beautiful, isn't it?
To learn more about playing with Scrapy, check out:
https://doc.scrapy.org/en/latest/intro/tutorial.html
www.tutorialspoint.com/scrapy/