Implement Web Scraper Modules #1851
This commit introduces a new `WebScraperService` to enable programmatic interaction with web content loaded within a `<browser>` element. The service is built using a JSWindowActor pair (`NRWebScraperParent` and `NRWebScraperChild`) for robust communication between the parent process and the content process. This allows for automating tasks like data extraction and form filling.
The initial implementation includes methods to:
- Retrieve the full HTML of a page.
- Input values into elements using CSS selectors.
Supporting changes include:
- Integrating the service via `BrowserGlue`.
- Expanding `XULBrowserElement` TypeScript definitions to improve type safety for browser interactions.
- Updating the Solid XUL runtime to allow the `messagemanagergroup` attribute required for actors.
This commit completely rewrites the `WebScraperService` to provide a more robust and modern implementation for headless browsing. The previous version used the legacy `createWindowlessBrowser` API and was largely a placeholder. The new implementation replaces this with the recommended `HiddenFrame` module.
Key improvements include:
- A fully asynchronous API for creating, navigating, and destroying browser instances.
- Reliable page load detection using `WebProgressListener`, complete with a timeout mechanism.
- Proper resource management and cleanup to prevent memory leaks.
- Use of `crypto.randomUUID` for more reliable instance identifiers.
The Gecko type definitions for `XULBrowserElement` have been updated to reflect the new API usage, specifically accessing `webProgress`.
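The timeout mechanism mentioned above can be illustrated with a small sketch. This is not the code from this PR: `withTimeout` is a hypothetical helper, and it assumes the page load has already been wrapped in a promise (for example one resolved from a `WebProgressListener` callback, which is not shown here).

```typescript
// Hypothetical helper (not from the PR): settle with the result of `work`,
// or reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string = "operation",
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // Start the deadline timer; it rejects the outer promise when it fires.
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);

    // Whichever way `work` settles, cancel the timer so it cannot leak.
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller could then write something like `await withTimeout(pageLoaded, 30000, "page load")`, where `pageLoaded` is an assumed promise that resolves once the load has finished. Clearing the timer on both paths keeps a late timeout from firing after the instance has been destroyed.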
- Update NRWebScraperChild to safely access and return page HTML as string
- Rename getCurrentURI to getURI for clarity in WebScraper service
- Implement getHTML method in WebScraper to fetch page HTML via actor query
- Add getScreenshot method stub for future screenshot functionality
- Enhance test scripts to verify HTML content retrieval and display previews
This commit introduces a comprehensive web scraping service to enable browser automation and data extraction. The implementation is split into two main components:
- `WebScraperService`: A main process module that manages browser instances and exposes a high-level API for scraping operations.
- `NRWebScraperChild`: A JSWindowActor that runs in the content process to safely execute DOM manipulations and data extraction tasks.
This new service provides the following capabilities:
- Get full page HTML or text from specific elements.
- Interact with elements (click, input text).
- Wait for an element to appear before proceeding.
- Execute custom JavaScript in the page context.
- Take various types of screenshots (viewport, element, full-page, region).
Supporting Gecko type definitions for `CSSStyleDeclaration` and `XULBrowserElement` have also been updated.
This commit introduces a new `fillForm` method to the `WebScraperService` to allow for filling multiple form fields in a single operation. The new `WebScraper:FillForm` action is handled by the `NRWebScraperChild` actor, which receives a map of CSS selectors to their corresponding values. For each entry, it finds the element, sets its value, and dispatches an `input` event to simulate user interaction and trigger any associated JavaScript. This provides a more efficient and streamlined way to automate form interactions compared to executing separate scripts for each input field.
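As a rough sketch of the loop described above (find the element, set its value, dispatch an `input` event), something like the following would work. The names `FieldLike`, `RootLike`, `FillResult` and the `makeInputEvent` parameter are hypothetical and chosen so the example runs without a DOM. How the real `NRWebScraperChild` actor reports missing elements is not stated in this PR, so collecting them in a `missing` list is an assumption.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
// In a real page, `RootLike` would be `document` and `E` would be `Event`.
interface FieldLike<E> {
  value: string;
  dispatchEvent(event: E): boolean;
}

interface RootLike<E> {
  querySelector(selector: string): FieldLike<E> | null;
}

interface FillResult {
  filled: string[]; // selectors whose element was found and updated
  missing: string[]; // selectors that matched nothing (assumed behaviour)
}

// Hypothetical stand-in for the actor's FillForm handler.
function fillForm<E>(
  root: RootLike<E>,
  fields: Record<string, string>,
  makeInputEvent: () => E,
): FillResult {
  const result: FillResult = { filled: [], missing: [] };
  for (const selector of Object.keys(fields)) {
    const field = root.querySelector(selector);
    if (field === null) {
      result.missing.push(selector);
      continue;
    }
    // Set the value first, then notify listeners as a user edit would.
    field.value = fields[selector];
    field.dispatchEvent(makeInputEvent());
    result.filled.push(selector);
  }
  return result;
}
```

In a content document the event factory would typically be `() => new Event("input", { bubbles: true })`, so that handlers attached higher up in the tree also see the change.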
Delete the `WebScraperService` module, which provided a custom implementation for headless web scraping using Mozilla's internal `HiddenFrame` and `XULBrowserElement` APIs. This approach was complex and tightly coupled to internal browser components. It is being replaced by a more robust and maintainable solution that leverages OS-level APIs, introduced in the new `os-apis/` directory. This change simplifies the architecture and improves stability.
The eager initialization of the WebScraperService at startup is no longer necessary. This commit removes the import from `BrowserGlue.sys.mts` to prevent the service from loading unnecessarily. The service will now be loaded on-demand when it is first used, improving startup performance and reducing initial memory consumption.
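The on-demand pattern described above can be sketched generically. This is only an illustration: `lazy` is a hypothetical helper, and the mechanism the project actually uses to defer loading the module is not shown in this PR.

```typescript
// Hypothetical helper: wrap a factory so the value is built on first use
// and the same instance is returned on every later call.
function lazy<T>(factory: () => T): () => T {
  let instance: T | undefined;
  let created = false;
  return () => {
    if (!created) {
      instance = factory(); // runs at most once, on first access
      created = true;
    }
    return instance as T;
  };
}
```

Nothing is constructed when `lazy` itself is called, which is the property that keeps startup cheap. The cost is paid by whichever caller touches the service first.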
Pull Request Overview
This PR implements a comprehensive web scraper module system that adds headless browser capabilities for automated web interaction and data extraction. The implementation provides a complete solution for creating isolated browser instances, navigating web pages, and performing various scraping operations.
- Adds a WebScraperService that manages headless browser instances using HiddenFrame
- Implements actor-based communication between parent and child processes for safe DOM access
- Extends XUL browser element types to support additional properties needed for web scraping
Reviewed Changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
src/packages/solid-xul/jsx-runtime.ts | Extends XULBrowserElement interface with additional properties for browser control |
src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts | Main service class providing web scraping functionality and browser instance management |
src/apps/modules/src/modules/BrowserGlue.sys.mts | Registers the NRWebScraper actor and removes commented test code |
src/apps/modules/src/actors/NRWebScraperParent.sys.mts | Parent process actor that forwards messages between service and child |
src/apps/modules/src/actors/NRWebScraperChild.sys.mts | Child process actor that performs actual DOM operations and screenshot capture |
Comments suppressed due to low confidence (2)
src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts:36
- The class name 'webScraper' should follow PascalCase naming convention. It should be renamed to 'WebScraper'.
class webScraper {
src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts:586
- This line creates an inconsistency where the class 'webScraper' is instantiated but exported as 'WebScraper'. Both should use consistent PascalCase naming.
export const WebScraper = new webScraper();
This commit introduces a new OS API layer designed to provide browser context, likely for features like LLM-generated workflows. A central `OSGlue` module is added to manage the initialization of various OS-level services. The first major service implemented is `BrowserInfo`, which can collect recent tabs, browsing history, and downloads. This new system is initialized during application startup. As part of this change, the existing `WebScraperService` is refactored into the new `os-apis` structure.
Rename the `WebScraperService.sys.mts` file to `WebScraperServices.sys.mts` to better reflect that the module manages multiple scraper instances. The import path in `OSGlue.sys.mts` has been updated to match the new filename.
Implement a right-click context menu for items in the Top Sites grid. This provides users with quick access to actions like pinning/unpinning sites, opening links in new tabs/private windows, and blocking a site from appearing. A new "Blocked Sites" section has been added to the Settings modal, allowing users to view and unblock sites. This also refactors data management for the new tab page. Settings are now centralized into a single object managed by `getNewTabSettings` and `saveNewTabSettings` in `utils/dataManager.ts`, simplifying state persistence. The previous `TopSites/dataManager.ts` has been removed.
This commit updates the restart functions in the RebootPanelMenu to use modern, recommended APIs, improving robustness and aligning with current platform practices.
- Replace direct environment variable manipulation for cache clearing with the `Services.appinfo.invalidateCachesOnRestart()` method.
- Use the `restart-in-safe-mode` observer notification instead of the direct `Services.startup.restartInSafeMode()` call.
- Use the `eAttemptQuit` flag instead of `eForceQuit` to allow for a more graceful application shutdown.
- Remove the unused "Restart with Profile Manager" option and its corresponding logic.
This commit updates the package version in `package.json` from 12.0.15 to 12.0.16 in preparation for a new release.
Check list