Tech Monger

Programming, Web Development and Computer Science.


Get Currently Requested URL From Scrapy Spider

If you have ever wanted to see every redirect URL that a Scrapy spider hopped through, or the URL the spider is currently requesting, you can get both easily with the example code below.


Scrapy's Response Object

When a Scrapy spider crawls, the details of each URL it requests are stored in a response object. The good part about this object is that it is passed to the parse method of the spider class. You can also access the response object in the Scrapy shell.

The response object keeps the request that produced it in response.request. This request object holds useful information such as the currently requested URL and meta information, including the list of redirect URLs followed during the current request. This comes in handy when we want to retrieve the first URL of a redirect chain or the currently requested URL.


Examples

  • Get Currently Requested URL

    def parse(self, response):
        # URL of the request that produced this response
        current_url = response.request.url
  • Get All Followed Redirect URLs

    def parse(self, response):
        # URLs the request was redirected through; None if no redirect happened
        redirect_url_list = response.request.meta.get('redirect_urls')
  • Get First URL Followed by Spider
    (the actual request URL provided in start_urls or in start_requests, before any redirects)

    def parse(self, response):
        # First entry of the redirect chain, i.e. the URL originally requested
        first_url = response.request.meta.get('redirect_urls')[0]
  • If no redirect was followed during the crawl, the above code fails with TypeError: 'NoneType' object is not subscriptable, because meta.get('redirect_urls') returns None. The code below safely extracts the first requested URL.

    if response.request.meta.get('redirect_urls'):
        url = response.request.meta['redirect_urls'][0]
    else:
        url = response.request.url
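
The safe lookup above can be wrapped in a small helper so that every callback shares the same logic. The following is a minimal sketch rather than anything Scrapy provides: first_requested_url and fake_response are hypothetical names, and the SimpleNamespace objects and example.com URLs are stand-ins for real responses so that the snippet runs without starting a crawl.

```python
from types import SimpleNamespace


def first_requested_url(response):
    """Return the first URL of the redirect chain, or the request URL if there was no redirect."""
    # 'redirect_urls' is only present in meta when a redirect was followed.
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        return redirect_urls[0]
    return response.request.url


def fake_response(url, meta=None):
    """Hypothetical stand-in exposing only response.request.url and response.request.meta."""
    return SimpleNamespace(request=SimpleNamespace(url=url, meta=meta or {}))


# One response that followed a redirect, and one that did not.
redirected = fake_response(
    'https://example.com/new',
    meta={'redirect_urls': ['http://example.com/old']},
)
direct = fake_response('https://example.com/page')

print(first_requested_url(redirected))  # http://example.com/old
print(first_requested_url(direct))      # https://example.com/page
```

Inside a spider, the same helper would be called from parse as first_requested_url(response).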

Tagged Under : Open Source Python Scrapy