Tech Monger

Programming, Web Development and Computer Science.


How to Solve Scrapy's User Timeout Failure

If the website you are scraping is not responding to your requests, your spider will report a failure due to a request timeout and throw the following error.

2018-11-19 12:06:25 [scrapy.downloadermiddlewares.retry] DEBUG:
Retrying  (failed 1 times):
User timeout caused connection failure:
Getting took longer than 180.0 seconds..

By default, the spider will try requesting the URL 3 times and then give up on it completely with the following error.

User timeout caused connection failure:
Getting took longer than 180.0 seconds..
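The 180-second limit and the retry count come from Scrapy's DOWNLOAD_TIMEOUT and RETRY_TIMES settings. A minimal sketch of tuning them in your project's settings.py (the values below are illustrative, not recommendations):

```python
# settings.py -- tune timeout and retry behaviour while debugging
DOWNLOAD_TIMEOUT = 30  # give up on a single attempt after 30 s instead of the 180 s default
RETRY_ENABLED = True   # the retry middleware is on by default
RETRY_TIMES = 5        # extra attempts beyond the first (default is 2)
```

Lowering DOWNLOAD_TIMEOUT makes failures surface faster while you investigate; raise it only if the server is genuinely slow.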

Why the Request Is Getting Timed Out

There could be many potential reasons why the server is not responding to the requested URL, and the exact one can only be found by investigating on your own. Below are the most frequent causes of a request timing out.

  • The server has rate limited your IP address.
  • The server only responds to IP addresses from a specific region.
  • The server is too busy or has been under very heavy load for a long period of time.
  • The server only responds to specific User-Agent values.
  • The server only responds if cookies are present in the request headers.
  • ... and more reasons you may discover on your own.

The list above will always remain incomplete, and the exact cause can only be figured out by trial and error on the most probable cases.
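Several of the causes above map directly to Scrapy settings you can experiment with. A hedged sketch for settings.py (the User-Agent string and delay are placeholder values, not a recipe):

```python
# settings.py -- identity-related knobs to try when requests time out
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)  # present yourself as a desktop browser
COOKIES_ENABLED = True  # keep cookies between requests (this is the default)
DOWNLOAD_DELAY = 2      # slow down to avoid tripping rate limits
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
```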

Get Request Details


If the server responds to the requested URL inside a web browser from the same request IP address, open the browser's developer tools (inspect element) and check the request headers in the Network tab.


On Unix-like operating systems you can use tools like cURL to check request and response details in verbose mode.

curl -v ''

Cross Verify Request Details with Scrapy

You can now use Scrapy's Request object to compare the request details sent by Scrapy vs. the browser (cURL).

You can check the headers Scrapy sends by requesting some other URL of the same site.

If no URL is working for the site in question, you can check the request details against some other site for which requests do work, i.e. whose server responds with the same Scrapy settings.
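If you just want to see which headers a given client would send, you can also inspect them offline with Python's standard library, without a Scrapy project at all (example.com is a placeholder URL):

```python
import urllib.request

# Build a request without sending it, to inspect the headers it would carry.
req = urllib.request.Request(
    "https://example.com/",  # placeholder -- substitute the site you are debugging
    headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en"},
)

# urllib normalises header names to Capitalised-form internally.
print(req.get_header("User-agent"))  # Mozilla/5.0
print(req.header_items())            # all headers as (name, value) pairs
```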

Scrapy stores request headers inside request.headers and request metadata inside request.meta.

>>> request.headers
{'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36']}

>>> request.meta
{'handle_httpstatus_list': , 'download_timeout': 180.0, 'depth': 0, 'download_latency': 1.1379539966583252, 'download_slot': ''}

Below are the attributes available on Scrapy's request object.

>>> dir(request)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', 
'__getattribute__', '__hash__', '__init__', '__module__', '__new__', 
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
 '__slots__',  '__str__', '__subclasshook__', '__weakref__', '_body',
 '_encoding',  '_get_body', '_get_url', '_meta', '_set_body', '_set_url', '_url', 
'body',  'callback', 'cookies', 'copy', 'dont_filter', 'encoding', 
'errback',  'flags',  'headers', 'meta', 'method', 'priority', 'replace', 'url']

Settings Can Be Deceptive

While debugging the above error you may get trapped by the way Scrapy settings take precedence, and this is especially true if you are using middleware extensions like Fake User-Agent. In Scrapy you can define the same setting at different levels, and you must be aware of which value will take effect while you are investigating.
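Scrapy resolves settings by priority: command-line options override a spider's custom_settings, which override the project's settings.py, which overrides the built-in defaults. Conceptually this works like a chain of dictionaries where the first match wins; a stdlib sketch (the setting values are made up for illustration):

```python
from collections import ChainMap

# Highest priority first, mirroring Scrapy's precedence order.
cmdline  = {"USER_AGENT": "cli-agent"}      # e.g. scrapy crawl -s USER_AGENT=...
spider   = {"DOWNLOAD_TIMEOUT": 60}         # Spider.custom_settings
project  = {"USER_AGENT": "project-agent"}  # settings.py
defaults = {"USER_AGENT": "Scrapy/2.x", "DOWNLOAD_TIMEOUT": 180.0}

effective = ChainMap(cmdline, spider, project, defaults)
print(effective["USER_AGENT"])        # cli-agent, not project-agent
print(effective["DOWNLOAD_TIMEOUT"])  # 60, from the spider level
```

This is why a User-Agent set in settings.py silently loses to one injected by a middleware or a command-line flag.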

Scrapy shell is your Friend

You should request the URL from the Scrapy shell, outside the Scrapy project, to avoid getting trapped by settings precedence.

For example, if the server responds only to specific user agents, you can set the user agent to test with the Scrapy shell like below.

scrapy shell -s USER_AGENT='something-to-test' ''

You can also set custom headers and cookies within the Scrapy shell by modifying the request object inside the shell. Below we add custom headers before making the request from the Scrapy shell.

$ scrapy shell
from scrapy import Request

headers = {'Host': [''], 'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36']}

url = ""

request_object = Request(url, headers=headers)

fetch(request_object)  # fetch() is available inside the scrapy shell

A request timeout can occur for a host of reasons. To solve the timeout issue, try different request values while making requests from the Scrapy shell, and take the most frequent timeout causes into account.

Tagged Under : Python Scrapy Web