scrapy-impersonate is a Scrapy download handler. It integrates curl_cffi to perform HTTP requests, so it can impersonate browsers' TLS signatures or JA3 fingerprints.
pip install scrapy-impersonate
To use this package, replace the default http and https download handlers by updating the DOWNLOAD_HANDLERS setting:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
Setting USER_AGENT = None lets curl_cffi automatically choose the appropriate User-Agent based on the impersonated browser:
USER_AGENT = None
Also, be sure to install the asyncio-based Twisted reactor for proper asynchronous execution:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Set the impersonate Request.meta key to download a request using curl_cffi:
import scrapy


class ImpersonateSpider(scrapy.Spider):
    name = "impersonate_spider"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "USER_AGENT": None,
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_impersonate.ImpersonateDownloadHandler",
            "https": "scrapy_impersonate.ImpersonateDownloadHandler",
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_impersonate.RandomBrowserMiddleware": 1000,
        },
    }

    def start_requests(self):
        for _ in range(5):
            yield scrapy.Request(
                "https://tls.browserleaks.com/json",
                dont_filter=True,
            )

    def parse(self, response):
        # ja3_hash: 98cc085d47985d3cca9ec1415bbbf0d1 (chrome133a)
        # ja3_hash: 2d692a4485ca2f5f2b10ecb2d2909ad3 (firefox133)
        # ja3_hash: c11ab92a9db8107e2a0b0486f35b80b9 (chrome124)
        # ja3_hash: 773906b0efdefa24a7f2b8eb6985bf37 (safari15_5)
        # ja3_hash: cd08e31494f9531f560d64c695473da9 (chrome99_android)
        yield {"ja3_hash": response.json()["ja3_hash"]}
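The spider above leaves the browser choice to RandomBrowserMiddleware. To pin a browser per request, the impersonate key can be set explicitly. The following is a minimal sketch, not part of the package: `impersonate_request_kwargs` is a hypothetical helper, and the browser names are the ones listed in the comments of the spider above. It only builds keyword arguments, so it runs without Scrapy installed.

```python
from itertools import cycle

# Browser names taken from the comments in the spider above.
BROWSERS = ["chrome133a", "firefox133", "chrome124", "safari15_5", "chrome99_android"]


def impersonate_request_kwargs(url, browsers=BROWSERS, count=5):
    """Return keyword arguments for `count` requests, rotating browsers in order.

    Hypothetical helper: each dict is meant to be unpacked into scrapy.Request.
    """
    rotation = cycle(browsers)
    return [
        {"url": url, "dont_filter": True, "meta": {"impersonate": next(rotation)}}
        for _ in range(count)
    ]


if __name__ == "__main__":
    for kwargs in impersonate_request_kwargs("https://tls.browserleaks.com/json"):
        # In a spider's start_requests: yield scrapy.Request(**kwargs)
        print(kwargs["meta"]["impersonate"])
```

Inside start_requests, each dictionary would be unpacked as `yield scrapy.Request(**kwargs)`. This sketch assumes RandomBrowserMiddleware is not enabled; how the middleware treats a key that is already set is not covered here.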
You can pass any necessary arguments to curl_cffi through impersonate_args. For example:
yield scrapy.Request(
    "https://tls.browserleaks.com/json",
    dont_filter=True,
    meta={
        # A supported browser name, e.g. "chrome124"
        "impersonate": browser,
        "impersonate_args": {
            "verify": False,  # skip TLS certificate verification
            "timeout": 10,  # seconds
        },
    },
)
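When many requests share the same curl_cffi arguments, the meta dictionary can be built in one place. The sketch below is an illustration only: `impersonate_meta` and `DEFAULT_IMPERSONATE_ARGS` are hypothetical names, not part of scrapy-impersonate, and the default values simply mirror the example above.

```python
# Hypothetical project-wide defaults; the key mirrors the example above.
DEFAULT_IMPERSONATE_ARGS = {"timeout": 10}


def impersonate_meta(browser, **overrides):
    """Build a Request.meta dict, merging per-request overrides into the defaults."""
    # Later keys win, so overrides replace defaults with the same name.
    args = {**DEFAULT_IMPERSONATE_ARGS, **overrides}
    return {"impersonate": browser, "impersonate_args": args}


if __name__ == "__main__":
    print(impersonate_meta("chrome124", verify=False))
```

A request would then be created with `scrapy.Request(url, meta=impersonate_meta("chrome124", verify=False))`. Merging into a fresh dictionary keeps the shared defaults from being mutated by any single request.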
The following browsers can be impersonated
This project is inspired by the following projects:
- curl_cffi - Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
- curl-impersonate - A special build of curl that can impersonate Chrome & Firefox
- scrapy-playwright - Playwright integration for Scrapy