If you have ever wanted to find all the redirect URLs that a Scrapy spider hopped through, or the URL the spider is currently requesting, you can easily get both using the following example code.
Scrapy's Response Object
When you start a Scrapy spider, it stores the response details of each URL the spider requested inside a response object. The good part about this object is that it is available inside the parse method of the spider class. You can also access the response object while using the Scrapy shell.
The response object stores information about the current request inside a request object, which is available as response.request. The response.request object holds useful information such as the currently requested URL, and meta information such as all the redirect URLs followed during the current request. This information comes in handy when we want to retrieve the first redirected URL or the currently requested URL.
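As a rough illustration of how these two pieces of information fit together, the sketch below combines them into the full chain of URLs visited for one response. It assumes that meta['redirect_urls'] (filled in by Scrapy's redirect middleware) lists the URLs that were redirected away from, so the final hop is response.request.url. The helper name redirect_chain, the example.com URLs, and the SimpleNamespace stand-in for a real response are all illustrative assumptions, used so the snippet runs without starting a crawl.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return every URL visited for this response, in order (hypothetical helper)."""
    request = response.request
    # 'redirect_urls' is only present when at least one redirect was followed.
    followed = request.meta.get('redirect_urls', [])
    # The last hop is the request that actually produced this response.
    return list(followed) + [request.url]


# Stand-in for a Scrapy response that was redirected once (illustrative URLs).
fake_response = SimpleNamespace(
    request=SimpleNamespace(
        url='https://example.com/new',
        meta={'redirect_urls': ['https://example.com/old']},
    )
)

print(redirect_chain(fake_response))
```

Inside a real spider, the same helper could be called as redirect_chain(response) from the parse method.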
Examples
- Get Currently Requested URL
def parse(self, response):
    current_url = response.request.url
- Get All Followed Redirect URLs
def parse(self, response):
    redirect_url_list = response.request.meta.get('redirect_urls')
- Get First URL Followed by Spiders (the actual request URL provided in start_urls or in start_requests)
def parse(self, response):
    first_url = response.request.meta.get('redirect_urls')[0]
- If no redirect was followed during the crawl, the above code will fail with a TypeError, because meta.get('redirect_urls') returns None and None cannot be indexed. The code below safely extracts the first requested URL.
if response.request.meta.get('redirect_urls'):
    url = response.request.meta['redirect_urls'][0]
else:
    url = response.request.url
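The safe lookup above can be wrapped in a small reusable function. This is only a minimal sketch: the name first_requested_url is hypothetical, and the SimpleNamespace objects with example.com URLs stand in for real Scrapy responses so the snippet can run on its own.

```python
from types import SimpleNamespace


def first_requested_url(response):
    """Return the URL originally requested, before any redirects (hypothetical helper)."""
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        # At least one redirect was followed: the first entry is the original URL.
        return redirect_urls[0]
    # No redirect happened, so the current request URL is the original one.
    return response.request.url


# Stand-ins for Scrapy responses (illustrative URLs, not real crawl output).
redirected = SimpleNamespace(
    request=SimpleNamespace(
        url='https://example.com/final',
        meta={'redirect_urls': ['https://example.com/start', 'https://example.com/middle']},
    )
)
direct = SimpleNamespace(
    request=SimpleNamespace(url='https://example.com/page', meta={})
)

print(first_requested_url(redirected))
print(first_requested_url(direct))
```

In a spider, calling first_requested_url(response) inside parse would give the same result as the if/else block, without repeating it in every callback.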