Closed
Hello everyone. I just recently upgraded to v0.23.2 (nightly build), and the proxy for Chrome is no longer working. I live in an internet-censored place, so a proxy is a must.
Here is my config:
```yaml
chrome:
  image: gcr.io/zenika-hub/alpine-chrome:123
  container_name: Hoarder-CHROME
  restart: unless-stopped
  command:
    - --no-sandbox
    - --disable-gpu
    - --disable-dev-shm-usage
    - --remote-debugging-address=0.0.0.0
    - --remote-debugging-port=9222
    - --hide-scrollbars
    - --proxy-server='https=172.21.0.1:1080' # Note: it was --proxy-server=172.21.0.1:1080 before, and that worked
```
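For what it's worth, the quoting in that last flag may matter: in compose's list-style `command` there is no shell to strip quotes, so Chrome likely receives the single quotes as part of the proxy spec (the `https=` prefix is Chrome's per-scheme mapping syntax and is valid on its own, but quotes inside the value are not). A small illustration of the argv split, assuming the quotes pass through verbatim:

```python
# Sketch of what docker-compose's YAML list likely hands Chrome as argv.
# In a YAML block sequence there is no shell, so quotes *inside* the
# scalar are literal characters and reach Chrome as part of the value.
yaml_args = [
    "--proxy-server='https=172.21.0.1:1080'",  # quotes kept: likely unparseable
    "--proxy-server=172.21.0.1:1080",          # bare value: worked before
]
for arg in yaml_args:
    flag, _, value = arg.partition("=")  # split on the first '=' only
    print(flag, "->", repr(value))
```

If the quotes were only ever meant for a shell, dropping them (`--proxy-server=https=172.21.0.1:1080`) would be the first thing to try.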
I also tried redeploying and restarting, the usual drills.
Logs (trying to access Google):
```
2025-04-14T10:20:22.380Z info: [Crawler][2909] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:20:22.380Z info: [Crawler][2909] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:20:27.382Z error: [Crawler][2909] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:22:27.492Z error: [Crawler][2909] Crawling job failed: TimeoutError: Navigation timeout of 120000 ms exceeded
TimeoutError: Navigation timeout of 120000 ms exceeded
    at new Deferred (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:65:60)
    at CdpFrame.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:136:29)
    at CdpFrame.<anonymous> (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:590:43)
    at crawlPage (/app/apps/workers/crawlerWorker.ts:3:2115)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async crawlAndParseUrl (/app/apps/workers/crawlerWorker.ts:3:9435)
    at async runCrawler (/app/apps/workers/crawlerWorker.ts:3:13098)
2025-04-14T10:22:30.921Z info: [Crawler][2909] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:22:30.922Z info: [Crawler][2909] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:22:35.924Z error: [Crawler][2909] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:24:36.035Z error: [Crawler][2909] Crawling job failed: TimeoutError: Navigation timeout of 120000 ms exceeded
TimeoutError: Navigation timeout of 120000 ms exceeded
    at new Deferred (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:59:34)
    at Deferred.create (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/Deferred.js:21:16)
    at new LifecycleWatcher (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/LifecycleWatcher.js:65:60)
    at CdpFrame.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/cdp/Frame.js:136:29)
    at CdpFrame.<anonymous> (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/util/decorators.js:98:27)
    at CdpPage.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/lib/cjs/puppeteer/api/Page.js:590:43)
    at crawlPage (/app/apps/workers/crawlerWorker.ts:3:2115)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async crawlAndParseUrl (/app/apps/workers/crawlerWorker.ts:3:9435)
    at async runCrawler (/app/apps/workers/crawlerWorker.ts:3:13098)
```
Update

With the proxy variable set to `--proxy-server=172.21.0.1:1080`, the logs are as follows. It seems the only remaining problem is "Failed to determine the content-type for the url https://www.google.com": Chrome is able to navigate and read the page, but the crawl still does not complete successfully.
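A possible explanation for that one remaining error: the content-type check aborts after exactly 5 seconds while the navigation through Chrome succeeds, which suggests the probe is a plain HTTP request made from the worker process itself, so it never goes through Chrome's `--proxy-server` and times out on a censored network. A rough sketch of that kind of probe (my guess at the behavior, not Hoarder's actual code; `probe_content_type` and its `proxy` parameter are hypothetical names):

```python
# Hypothetical sketch: a direct HEAD request with a ~5 s deadline,
# as the crawler's content-type check appears to behave. Without a
# proxy handler, this request bypasses Chrome's --proxy-server.
import urllib.request
from typing import Optional

def probe_content_type(url: str, proxy: Optional[str] = None,
                       timeout: float = 5.0) -> Optional[str]:
    handlers = []
    if proxy:
        # Route the probe through the same proxy Chrome uses.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req, timeout=timeout) as resp:
            return resp.headers.get("Content-Type")
    except Exception:
        return None  # mirrors the AbortError path seen in the log
```

If that guess is right, the worker would also need a proxy of its own (e.g. `HTTP_PROXY`/`HTTPS_PROXY` on the workers container, if it honors them) for the content-type step to succeed.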
```
2025-04-14T10:49:09.299Z info: [Crawler] Connecting to existing browser instance: http://chrome:9222
2025-04-14T10:49:09.305Z info: [Crawler] Successfully resolved IP address, new address: http://172.83.0.17:9222/
2025-04-14T10:49:17.801Z info: [Crawler][2913] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:49:17.801Z info: [Crawler][2913] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:49:22.801Z error: [Crawler][2913] Failed to determine the content-type for the url https://www.google.com: AbortError: The operation was aborted.
2025-04-14T10:49:35.272Z info: [Crawler][2913] Successfully navigated to "https://www.google.com". Waiting for the page to load ...
2025-04-14T10:49:37.079Z info: [Crawler][2913] Finished waiting for the page to load.
2025-04-14T10:49:37.095Z info: [Crawler][2913] Successfully fetched the page content.
2025-04-14T10:49:38.262Z info: [Crawler][2913] Finished capturing page content and a screenshot. FullPageScreenshot: true
2025-04-14T10:49:38.269Z info: [Crawler][2913] Will attempt to extract metadata from page ...
2025-04-14T10:49:39.257Z info: [Crawler][2913] Will attempt to extract readable content ...
2025-04-14T10:49:40.088Z info: [Crawler][2913] Done extracting readable content.
2025-04-14T10:49:40.268Z info: [Crawler][2913] Stored the screenshot as assetId: 70e73c50-d5b5-4a92-a168-97589ed3d483
2025-04-14T10:51:48.809Z info: [Crawler][2913] Will crawl "https://www.google.com" for link with id "ohywrfgs6c5l93scrd61t6hk"
2025-04-14T10:51:48.809Z info: [Crawler][2913] Attempting to determine the content-type for the url https://www.google.com
2025-04-14T10:51:48.820Z error: [Crawler][2913] Crawling job failed: Error: Timed-out after 150 secs
Error: Timed-out after 150 secs
    at Timeout._onTimeout (/app/apps/workers/utils.ts:2:1025)
    at listOnTimeout (node:internal/timers:594:17)
    at process.processTimers (node:internal/timers:529:7)
```
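The final "Timed-out after 150 secs" comes from a job-level watchdog in `utils.ts` rather than from Chrome itself: conceptually, the whole crawl is raced against a timer and fails as a unit once the deadline passes, even though the navigation succeeded earlier. A rough sketch of that pattern (my reading of the log, not the project's actual code):

```python
# Sketch of a job-level deadline: race the whole crawl against a timer
# and surface the same kind of error message seen in the log.
import asyncio

async def run_with_deadline(coro, secs: float = 150.0):
    try:
        return await asyncio.wait_for(coro, timeout=secs)
    except asyncio.TimeoutError:
        raise RuntimeError(f"Timed-out after {int(secs)} secs")
```

Under this reading, fixing the slow content-type step would also make the watchdog stop firing.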
My env variable: