Implement Web Scraper Modules #1851

surapunoyousei · 2025-07-18T15:29:39Z

Check list

Referenced all related issues
Have tested the modifications

This commit introduces a new `WebScraperService` to enable programmatic interaction with web content loaded within a `<browser>` element. The service is built using a JSWindowActor pair (`NRWebScraperParent` and `NRWebScraperChild`) for robust communication between the parent process and the content process. This allows for automating tasks like data extraction and form filling.

The initial implementation includes methods to:

- Retrieve the full HTML of a page.
- Input values into elements using CSS selectors.

Supporting changes include:

- Integrating the service via `BrowserGlue`.
- Expanding `XULBrowserElement` TypeScript definitions to improve type safety for browser interactions.
- Updating the Solid XUL runtime to allow the `messagemanagergroup` attribute required for actors.
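The child side of such an actor pair can be pictured as a small message dispatcher that runs next to the page. The following is a minimal stand-alone sketch, not the code from this PR: the message names `WebScraper:GetHTML` and `WebScraper:InputElement`, the `PageLike` and `ElementLike` interfaces, and the `handleScraperMessage` function are all assumptions made for illustration.

```typescript
// Stand-ins for the parts of a content document the sketch needs (hypothetical).
interface ElementLike {
  value: string;
}
interface PageLike {
  documentElement: { outerHTML: string };
  querySelector(selector: string): ElementLike | null;
}

// Hypothetical message shapes; the real message names in the PR may differ.
type ScraperMessage =
  | { name: "WebScraper:GetHTML" }
  | { name: "WebScraper:InputElement"; data: { selector: string; value: string } };

// Child-side handler: touches the page and returns a plain, serializable
// result that the parent side can pass back to the service.
function handleScraperMessage(page: PageLike, message: ScraperMessage): string | boolean {
  switch (message.name) {
    case "WebScraper:GetHTML":
      return page.documentElement.outerHTML;
    case "WebScraper:InputElement": {
      const element = page.querySelector(message.data.selector);
      if (element === null) {
        return false; // the selector matched nothing
      }
      element.value = message.data.value;
      return true;
    }
    default:
      throw new Error("Unknown scraper message");
  }
}
```

Keeping the handler a pure function of the page and the message makes it easy to exercise with a fake page object, independent of any actor plumbing.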

This commit completely rewrites the `WebScraperService` to provide a more robust and modern implementation for headless browsing. The previous version used the legacy `createWindowlessBrowser` API and was largely a placeholder. The new implementation replaces this with the recommended `HiddenFrame` module.

Key improvements include:

- A fully asynchronous API for creating, navigating, and destroying browser instances.
- Reliable page load detection using `WebProgressListener`, complete with a timeout mechanism.
- Proper resource management and cleanup to prevent memory leaks.
- Use of `crypto.randomUUID` for more reliable instance identifiers.

The Gecko type definitions for `XULBrowserElement` have been updated to reflect the new API usage, specifically accessing `webProgress`.
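The load-detection-with-timeout idea can be sketched independently of Gecko. The example below is an illustration under assumptions, not the PR's implementation: `waitForLoad` and the `Subscribe` shape are hypothetical, and the subscription stands in for registering and removing a progress listener.

```typescript
// Hypothetical subscription shape: register a "load finished" callback and
// get back a function that removes the listener again.
type Subscribe = (onLoaded: () => void) => () => void;

// Resolves when the load callback fires, rejects after timeoutMs otherwise.
// The listener is removed in both cases so nothing leaks.
function waitForLoad(subscribe: Subscribe, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    let settled = false;
    let unsubscribe: () => void = () => {};

    const timer = setTimeout(() => {
      if (settled) {
        return;
      }
      settled = true;
      unsubscribe();
      reject(new Error(`Page load timed out after ${timeoutMs} ms`));
    }, timeoutMs);

    unsubscribe = subscribe(() => {
      if (settled) {
        return;
      }
      settled = true;
      clearTimeout(timer);
      unsubscribe();
      resolve();
    });

    // If the callback fired synchronously during subscribe(), the real
    // unsubscribe function was not available yet, so remove the listener now.
    if (settled) {
      unsubscribe();
    }
  });
}
```

The `settled` flag guards against the race where the load event and the timer fire close together, so the promise is settled exactly once.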

- Update NRWebScraperChild to safely access and return page HTML as string
- Rename getCurrentURI to getURI for clarity in WebScraper service
- Implement getHTML method in WebScraper to fetch page HTML via actor query
- Add getScreenshot method stub for future screenshot functionality
- Enhance test scripts to verify HTML content retrieval and display previews

This commit introduces a comprehensive web scraping service to enable browser automation and data extraction.

The implementation is split into two main components:

- `WebScraperService`: A main process module that manages browser instances and exposes a high-level API for scraping operations.
- `NRWebScraperChild`: A JSWindowActor that runs in the content process to safely execute DOM manipulations and data extraction tasks.

This new service provides the following capabilities:

- Get full page HTML or text from specific elements.
- Interact with elements (click, input text).
- Wait for an element to appear before proceeding.
- Execute custom JavaScript in the page context.
- Take various types of screenshots (viewport, element, full-page, region).

Supporting Gecko type definitions for `CSSStyleDeclaration` and `XULBrowserElement` have also been updated.

This commit introduces a new `fillForm` method to the `WebScraperService` to allow for filling multiple form fields in a single operation. The new `WebScraper:FillForm` action is handled by the `NRWebScraperChild` actor, which receives a map of CSS selectors to their corresponding values. For each entry, it finds the element, sets its value, and dispatches an `input` event to simulate user interaction and trigger any associated JavaScript. This provides a more efficient and streamlined way to automate form interactions compared to executing separate scripts for each input field.
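The per-entry loop described above could look roughly like the sketch below. It is a stand-alone approximation, not the actor's actual code: the `FieldLike`, `FormRootLike`, and `InputEventLike` interfaces are stand-ins for DOM types so the example runs outside a browser, and returning `true` only when every selector matched is an assumed contract. In a real content document the event would presumably be created with `new Event("input", { bubbles: true })`.

```typescript
// Minimal stand-ins for DOM types (hypothetical, for illustration only).
interface InputEventLike {
  type: string;
  bubbles: boolean;
}
interface FieldLike {
  value: string;
  dispatchEvent(event: InputEventLike): boolean;
}
interface FormRootLike {
  querySelector(selector: string): FieldLike | null;
}

// Fills every field named in the selector -> value map.
// Returns true only if every selector matched an element (assumed contract).
function fillForm(root: FormRootLike, fields: Record<string, string>): boolean {
  let allFilled = true;
  for (const selector of Object.keys(fields)) {
    const field = root.querySelector(selector);
    if (field === null) {
      allFilled = false; // remember the miss, but keep filling the rest
      continue;
    }
    field.value = fields[selector];
    // Notify page scripts that listen for user input on this field.
    field.dispatchEvent({ type: "input", bubbles: true });
  }
  return allFilled;
}
```

Sending the whole map in one message means one round trip between processes for the entire form rather than one per field.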

Delete the `WebScraperService` module, which provided a custom implementation for headless web scraping using Mozilla's internal `HiddenFrame` and `XULBrowserElement` APIs. This approach was complex and tightly coupled to internal browser components. It is being replaced by a more robust and maintainable solution that leverages OS-level APIs, introduced in the new `os-apis/` directory. This change simplifies the architecture and improves stability.

The eager initialization of the WebScraperService at startup is no longer necessary. This commit removes the import from `BrowserGlue.sys.mts` to prevent the service from loading unnecessarily. The service will now be loaded on-demand when it is first used, improving startup performance and reducing initial memory consumption.
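On-demand loading of this kind boils down to deferring construction until first use and caching the result. The sketch below shows the idea in plain TypeScript; the `lazy` helper and the `ScraperServiceStandIn` class are hypothetical and are not the mechanism used in this PR.

```typescript
// Wraps a factory so the value is created on first access and then reused.
function lazy<T>(create: () => T): () => T {
  let cached: { value: T } | null = null;
  return () => {
    if (cached === null) {
      cached = { value: create() };
    }
    return cached.value;
  };
}

// Stand-in for an expensive service; counts how often it is constructed.
let constructed = 0;
class ScraperServiceStandIn {
  constructor() {
    constructed += 1;
  }
}

// Nothing is constructed here; construction happens on the first call.
const getScraperService = lazy(() => new ScraperServiceStandIn());
```

Calling `getScraperService()` twice returns the same instance, so startup pays nothing and later callers pay the construction cost only once.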

Copilot

Pull Request Overview

This PR implements a comprehensive web scraper module system that adds headless browser capabilities for automated web interaction and data extraction. The implementation provides a complete solution for creating isolated browser instances, navigating web pages, and performing various scraping operations.

- Adds a WebScraperService that manages headless browser instances using HiddenFrame
- Implements actor-based communication between parent and child processes for safe DOM access
- Extends XUL browser element types to support additional properties needed for web scraping

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/packages/solid-xul/jsx-runtime.ts | Extends XULBrowserElement interface with additional properties for browser control |
| src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts | Main service class providing web scraping functionality and browser instance management |
| src/apps/modules/src/modules/BrowserGlue.sys.mts | Registers the NRWebScraper actor and removes commented test code |
| src/apps/modules/src/actors/NRWebScraperParent.sys.mts | Parent process actor that forwards messages between service and child |
| src/apps/modules/src/actors/NRWebScraperChild.sys.mts | Child process actor that performs actual DOM operations and screenshot capture |

Comments suppressed due to low confidence (2)

src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts:36

The class name 'webScraper' should follow PascalCase naming convention. It should be renamed to 'WebScraper'.

class webScraper {

src/apps/modules/src/modules/os-apis/WebScraperService.sys.mts:586

This line creates an inconsistency where the class 'webScraper' is instantiated but exported as 'WebScraper'. Both should use consistent PascalCase naming.

export const WebScraper = new webScraper();
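One detail worth noting about this suggestion: in TypeScript a class declaration and a `const` in the same scope cannot share a name, so renaming the class to `WebScraper` while keeping the `WebScraper` singleton would produce a duplicate identifier error. A possible way to keep both names in PascalCase is sketched below; the class name `WebScraperImpl` and the `createInstance` method are hypothetical and not taken from the PR.

```typescript
// Hypothetical PascalCase class name that does not collide with the singleton.
class WebScraperImpl {
  private nextId = 0;

  // Illustrative method only; returns a new instance identifier.
  createInstance(): string {
    this.nextId += 1;
    return `instance-${this.nextId}`;
  }
}

// Singleton under the public name; in the module this would be exported.
const WebScraper = new WebScraperImpl();
```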

This commit introduces a new OS API layer designed to provide browser context, likely for features like LLM-generated workflows. A central `OSGlue` module is added to manage the initialization of various OS-level services. The first major service implemented is `BrowserInfo`, which can collect recent tabs, browsing history, and downloads. This new system is initialized during application startup. As part of this change, the existing `WebScraperService` is refactored into the new `os-apis` structure.

Rename the `WebScraperService.sys.mts` file to `WebScraperServices.sys.mts` to better reflect that the module manages multiple scraper instances. The import path in `OSGlue.sys.mts` has been updated to match the new filename.

Implement a right-click context menu for items in the Top Sites grid. This provides users with quick access to actions like pinning/unpinning sites, opening links in new tabs/private windows, and blocking a site from appearing. A new "Blocked Sites" section has been added to the Settings modal, allowing users to view and unblock sites. This also refactors data management for the new tab page. Settings are now centralized into a single object managed by `getNewTabSettings` and `saveNewTabSettings` in `utils/dataManager.ts`, simplifying state persistence. The previous `TopSites/dataManager.ts` has been removed.

This commit updates the restart functions in the RebootPanelMenu to use modern, recommended APIs, improving robustness and aligning with current platform practices.

- Replace direct environment variable manipulation for cache clearing with the `Services.appinfo.invalidateCachesOnRestart()` method.
- Use the `restart-in-safe-mode` observer notification instead of the direct `Services.startup.restartInSafeMode()` call.
- Use the `eAttemptQuit` flag instead of `eForceQuit` to allow for a more graceful application shutdown.
- Remove the unused "Restart with Profile Manager" option and its corresponding logic.
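The cache-clearing restart path described above can be sketched as follows. This is an approximation under assumptions rather than the RebootPanelMenu code: the services are passed in through the hypothetical `AppInfoLike`, `StartupLike`, and `QuitFlags` interfaces so the example runs outside the browser, whereas the real code would presumably use `Services.appinfo`, `Services.startup`, and the platform's quit flag constants.

```typescript
// Hypothetical stand-ins for the platform services used by the restart path.
interface AppInfoLike {
  invalidateCachesOnRestart(): void;
}
interface StartupLike {
  quit(flags: number): void;
}
interface QuitFlags {
  eAttemptQuit: number;
  eRestart: number;
}

// Marks caches as invalid, then requests a graceful quit followed by a restart.
function restartAndClearCaches(
  appinfo: AppInfoLike,
  startup: StartupLike,
  flags: QuitFlags,
): void {
  appinfo.invalidateCachesOnRestart();
  // eAttemptQuit lets the application shut down gracefully, unlike a forced quit.
  startup.quit(flags.eAttemptQuit | flags.eRestart);
}
```

The cache invalidation has to be requested before the quit call, since nothing after a successful quit request is guaranteed to run.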

This commit updates the package version in `package.json` from 12.0.15 to 12.0.16 in preparation for a new release.

surapunoyousei added 3 commits July 18, 2025 14:30

surapunoyousei added 4 commits July 19, 2025 21:31

surapunoyousei requested a review from Copilot July 20, 2025 01:37

Copilot AI reviewed Jul 20, 2025

refactor: Rename WebScraperService to WebScraperServices

2269755

surapunoyousei added 2 commits July 23, 2025 23:33

chore: Bump version to 12.0.16

b4c5f2e

surapunoyousei merged commit ac24cfe into main Jul 23, 2025
1 check passed
