Scrapy integration with curl_cffi (curl-impersonate).
Install with pip:

```
pip install scrapy-curl-cffi
```

Optionally, to enable Scrapy's support for modern HTTP compression protocols, install the `compression` extra:

```
pip install scrapy-curl-cffi[compression]
```
Update your Scrapy project settings as follows:
```python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_curl_cffi.handler.CurlCffiDownloadHandler",
    "https": "scrapy_curl_cffi.handler.CurlCffiDownloadHandler",
}

DOWNLOADER_MIDDLEWARES = {
    "scrapy_curl_cffi.middlewares.CurlCffiMiddleware": 200,
    "scrapy_curl_cffi.middlewares.DefaultHeadersMiddleware": 400,
    "scrapy_curl_cffi.middlewares.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": None,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```
To download a `scrapy.Request` with curl_cffi, add the `curl_cffi_options` special key to the `Request.meta` attribute. The value should be a dict with any of the following options:
- `impersonate` - which browser version to impersonate
- `ja3` - JA3 string to impersonate
- `akamai` - Akamai string to impersonate
- `extra_fp` - extra fingerprint options, complementing the JA3 and Akamai strings
- `default_headers` - whether to set default browser headers when impersonating; defaults to `True`
- `verify` - whether to verify HTTPS certificates; defaults to `False`
See the curl_cffi documentation for more info on these options.
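As an illustration, here is a minimal sketch of setting these options on a single request (the URL and option values are illustrative):

```python
import scrapy


class MetaOptionsSpider(scrapy.Spider):
    name = "meta_options"

    def start_requests(self):
        # Per-request curl_cffi configuration via the special meta key.
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",
            meta={
                "curl_cffi_options": {
                    "impersonate": "chrome",   # impersonate a recent Chrome
                    "default_headers": True,   # send default browser headers
                }
            },
        )

    def parse(self, response):
        yield response.json()
```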
Alternatively, you can use the `curl_cffi_options` spider attribute or the `CURL_CFFI_OPTIONS` setting to automatically assign the `curl_cffi_options` meta key for all requests:
```python
import scrapy


class FingerprintsSpider(scrapy.Spider):
    name = "fingerprints"
    start_urls = ["https://tls.browserleaks.com/json"]
    curl_cffi_options = {"impersonate": "chrome"}

    def parse(self, response):
        yield response.json()
```
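The setting-based equivalent might look like this in `settings.py` (a sketch, assuming `CURL_CFFI_OPTIONS` accepts the same dict as the meta key and spider attribute):

```python
# settings.py
# Assumption: CURL_CFFI_OPTIONS takes the same dict as the
# curl_cffi_options meta key / spider attribute.
CURL_CFFI_OPTIONS = {"impersonate": "chrome"}
```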
`scrapy-curl-cffi` strives to adhere to established Scrapy conventions, so most Scrapy settings, spider attributes, request/response attributes, and meta keys configure the crawler's behavior as expected.
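For instance, standard Scrapy features should compose with the curl_cffi handler just as they do with the default one. A sketch (the setting, header, and meta key shown are ordinary Scrapy features, not part of this package):

```python
import scrapy


class ConventionalSpider(scrapy.Spider):
    name = "conventional"
    # An ordinary Scrapy setting, expected to apply as usual.
    custom_settings = {"DOWNLOAD_TIMEOUT": 30}
    curl_cffi_options = {"impersonate": "chrome"}

    def start_requests(self):
        yield scrapy.Request(
            "https://tls.browserleaks.com/json",
            # Standard request attributes and meta keys are expected
            # to behave as with Scrapy's default download handler.
            headers={"Accept-Language": "en-US"},
            meta={"dont_retry": True},
        )

    def parse(self, response):
        yield response.json()
```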