Question:
Fetch all links from the Google home page.
Program:
import urllib2
import re

req = urllib2.Request('https://www.google.com')
# connect to a URL
website = urllib2.urlopen(req)
# read html code
html = website.read()
# use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
for i in links:
    print i
Explanation:
The urllib2 module (Python 2) defines functions and classes that help with opening URLs (mostly HTTP), including basic and digest authentication, redirections, cookies and more.
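The re module provides regular expression matching operations similar to those found in Perl. Both the patterns and the strings to be searched can be Unicode strings as well as 8-bit strings. Because the pattern used here contains two capturing groups, re.findall returns one tuple per match: the full URL, followed by the scheme matched by the inner group (http or ftp).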
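To see how the groups in the pattern shape the result, here is a minimal sketch that runs the same regular expression on a small hard-coded string instead of a downloaded page. The sample markup and the variable names are made up for illustration. The second pattern is a possible variation, not part of the original program: it marks the inner group as non-capturing with (?:...), so re.findall returns plain URL strings.

```python
import re

# Hypothetical sample markup, standing in for the downloaded page.
sample_html = (
    '<a href="https://www.example.com/a">A</a> '
    '<a href="ftp://files.example.com/b">B</a> '
    '<img src="/logo.png">'
)

# Original pattern: two capturing groups, so findall returns (url, scheme) tuples.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', sample_html)

# Variation: the inner group is non-capturing, so findall returns only the URLs.
urls = re.findall(r'"((?:http|ftp)s?://.*?)"', sample_html)

for url in urls:
    print(url)
```

The relative path /logo.png is skipped by both patterns, because they only match quoted strings that start with an http, https, ftp or ftps scheme.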
Output:
('http://schema.org/WebPage', 'http')
('https://www.google.co.in/imghp?hl=en&tab=wi', 'http')
('https://maps.google.co.in/maps?hl=en&tab=wl', 'http')
('https://play.google.com/?hl=en&tab=w8', 'http')
('https://www.youtube.com/?gl=IN&tab=w1', 'http')
('https://news.google.co.in/nwshp?hl=en&tab=wn', 'http')
('https://mail.google.com/mail/?tab=wm', 'http')
('https://drive.google.com/?tab=wo', 'http')
('https://www.google.co.in/intl/en/options/', 'http')
('http://www.google.co.in/history/optout?hl=en', 'http')
('https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.in/%3Fgfe_rd%3Dcr%26ei%3D83RYWbugK9WkvwT7j4SwCA', 'http')
('http://www.google.co.in/services/', 'http')
('https://plus.google.com/104205742743787718296', 'http')
('https://www.google.co.in/setprefdomain?prefdom=US&sig=__xO-fFja9LlrL0EjCUtIDcyG3flI%3D', 'http')
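The urllib2 module exists only in Python 2. On Python 3 its functionality lives in urllib.request, so a rough port of the program could look like the sketch below. The function names extract_links and fetch_links are hypothetical, and decoding the response as UTF-8 is an assumption about the page encoding. The regular expression is kept unchanged, so each result is still a (url, scheme) tuple.

```python
import re
from urllib.request import Request, urlopen

# Same pattern as the original program: two capturing groups.
LINK_PATTERN = re.compile(r'"((http|ftp)s?://.*?)"')


def extract_links(html):
    """Return a list of (url, scheme) tuples found in an HTML string."""
    return LINK_PATTERN.findall(html)


def fetch_links(url):
    """Download a page and return the links found in it."""
    req = Request(url)
    # connect to the URL and read the HTML code
    with urlopen(req) as website:
        # read() returns bytes in Python 3; UTF-8 is assumed here,
        # and undecodable bytes are replaced rather than raising an error.
        html = website.read().decode("utf-8", errors="replace")
    return extract_links(html)
```

Calling fetch_links('https://www.google.com') and printing each item should give output in the same shape as above, although the exact links depend on what the page serves at the time of the request.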