If sites you are crawling with Scrapy don't respond to your requests, you should use a randomly generated user agent in each request. Scrapy Fake User Agent is a useful open-source extension that helps you evade bot-detection programs easily.
Install Scrapy Fake Useragent
pip install scrapy-fake-useragent
Configure Fake User Agent
Fake User Agent can be configured in Scrapy by disabling Scrapy's default UserAgentMiddleware and activating RandomUserAgentMiddleware inside DOWNLOADER_MIDDLEWARES.
You can configure the random user agent middleware in a couple of ways.
- Spider Level - For the individual spider.
- Project Level - Globally for the complete scrapy project.
Spider Level Configuration
To configure a site-specific random user agent, override the global settings by defining DOWNLOADER_MIDDLEWARES inside custom_settings of the site's spider, like below.
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # Disable Scrapy's built-in user agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            # Enable the random user agent middleware
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }

    def start_requests(self):
        urls = ["https://example.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Inspect the request headers, including the randomly chosen User-Agent
        print(response.request.headers)
Project Level Configuration
To configure fake user agent globally at the project level, modify the settings.py file inside the project directory. This ensures that all sites crawled by the current Scrapy project are requested through the fake user agent middleware.
# Enable or disable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
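Under the hood, a downloader middleware like RandomUserAgentMiddleware simply sets the User-Agent header in its process_request hook before each request goes out. The sketch below illustrates the idea; it is not the library's actual implementation, the user agent strings are illustrative examples, and the Request class is a hypothetical stand-in so the example runs without Scrapy installed.

```python
import random


class Request:
    """Hypothetical stand-in for scrapy.Request, just enough for this demo."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


class RandomUserAgentSketch:
    """Sketch of a downloader middleware that rotates user agents."""

    # Illustrative pool of user agent strings (not the library's real list).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    ]

    def process_request(self, request, spider=None):
        # Scrapy calls this hook for every outgoing request; assigning the
        # header here and returning None lets the request continue through
        # the rest of the middleware chain.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None
```

Because each request passes through process_request independently, every request a site sees can carry a different user agent, which is what makes naive user-agent-based blocking ineffective.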