Refactor link checker to use FileBatch class #557

Eric-Arellano · 2023-12-28T17:12:30Z

In #495, we will start checking historical API docs. For performance, it's important that we don't load every single file in the project at once in memory. Instead, we want to operate in batches: e.g. check all of 0.44, then drop from memory and move on to 0.43.

This PR is a pre-factor to add a new FileBatch class. For now, we only have a single FileBatch, the same as before. This class still differentiates between files "to load" vs "to check", per #496.
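The batching idea described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `Batch`, `checkInBatches`, and the two callbacks are hypothetical names, and an in-memory map stands in for the file system. The point is only that each batch's contents are scoped to one loop iteration.

```typescript
// Illustrative only: Batch, checkInBatches, and the callbacks are hypothetical names.
interface Batch {
  description: string; // e.g. "0.44"
  filePaths: string[];
}

function checkInBatches(
  batches: Batch[],
  readFile: (path: string) => string,
  check: (loaded: Map<string, string>) => number,
): Map<string, number> {
  const resultPerBatch = new Map<string, number>();
  for (const batch of batches) {
    // `loaded` is scoped to this iteration, so nothing keeps its contents
    // alive once the next batch starts.
    const loaded = new Map<string, string>();
    for (const path of batch.filePaths) {
      loaded.set(path, readFile(path));
    }
    resultPerBatch.set(batch.description, check(loaded));
  }
  return resultPerBatch;
}

// Usage with an in-memory stand-in for the file system.
const fakeDisk = new Map<string, string>([
  ["api/0.44/a.md", "# A"],
  ["api/0.44/b.md", "# B"],
  ["api/0.43/a.md", "# A (old)"],
]);
const filesHeldPerBatch = checkInBatches(
  [
    { description: "0.44", filePaths: ["api/0.44/a.md", "api/0.44/b.md"] },
    { description: "0.43", filePaths: ["api/0.43/a.md"] },
  ],
  (path) => fakeDisk.get(path) ?? "",
  (loaded) => loaded.size, // report how many files were held at once
);
```

Here the `check` callback just reports how many files were held at once, so the result shows that the three files are never all in memory together.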

Eric-Arellano · 2023-12-28T17:14:32Z

I didn't add tests for FileBatch because it's too complex to test: it needs actual files to be set up.

frankharkins

Thanks!

frankharkins · 2023-12-29T10:04:20Z

scripts/lib/links/FileBatch.ts

+  async load(): Promise<[File[], Link[], Link[]]> {
+    const files: File[] = [];
+    for (let filePath of this.toLoad) {
+      const [_, anchors] = await getMarkdownAndAnchors(filePath);
+      files.push(new File(filePath, anchors));
+    }
+
+    const linksToOriginFiles = new Map<string, string[]>();
+    for (const filePath of this.toCheck) {
+      const [markdown, anchors] = await getMarkdownAndAnchors(filePath);
+      files.push(new File(filePath, anchors));
+      await addLinksToMap(filePath, markdown, linksToOriginFiles);
+    }


This isn't specific to this PR so is non-blocking, but I'm a bit confused about "to load" vs "to check"; it seems we're "loading" both (through getMarkdownAndAnchors)?

Or is toCheck a subset of toLoad?

Eric-Arellano

The key difference is the call to addLinksToMap(filePath, markdown, linksToOriginFiles), which we only make for toCheck. For toLoad, we record each file's anchors because other files might link to them. For toCheck, we also record the file's own links so we can make sure that those links are valid.

I'll add comments.
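A minimal sketch of the "to load" vs "to check" distinction, using in-memory stand-ins rather than the real markdown parsing. The `Doc` type and `findBrokenLinks` function are hypothetical and exist only for illustration; links are assumed to be written as `path#anchor`.

```typescript
// Hypothetical in-memory stand-in; the real code parses markdown files.
interface Doc {
  path: string;
  anchors: string[]; // anchors defined in this file, e.g. "#intro"
  links: string[]; // outgoing links, written here as "path#anchor"
}

function findBrokenLinks(toLoad: Doc[], toCheck: Doc[]): string[] {
  // Anchors are recorded for BOTH groups, since either can be a link target.
  const knownTargets = new Set<string>();
  for (const doc of [...toLoad, ...toCheck]) {
    for (const anchor of doc.anchors) {
      knownTargets.add(`${doc.path}${anchor}`);
    }
  }

  // Outgoing links are validated only for the toCheck group.
  const broken: string[] = [];
  for (const doc of toCheck) {
    for (const link of doc.links) {
      if (!knownTargets.has(link)) {
        broken.push(`${doc.path} -> ${link}`);
      }
    }
  }
  return broken;
}

const brokenLinks = findBrokenLinks(
  // Loaded only as a link target: its own bad link is not reported.
  [{ path: "api/old.md", anchors: ["#intro"], links: ["nowhere.md#x"] }],
  // Checked: both of its links are validated.
  [
    {
      path: "guide.md",
      anchors: ["#top"],
      links: ["api/old.md#intro", "api/old.md#gone"],
    },
  ],
);
```

In this sketch only the link from guide.md to the missing `#gone` anchor is reported; the bad link inside api/old.md is ignored because that file is in the "to load" group.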

arnaucasau

The refactor looks really good! I tested it locally and all works well. Thank you Eric!

arnaucasau · 2023-12-29T14:33:25Z

scripts/commands/checkLinks.ts

  const otherFiles = [
    ...(await globby("{public,docs}/**/*.{png,jpg,gif,svg}")).map(
-      (fp) => new File(fp, [], false),
+      (fp) => new File(fp, []),
    ),
    ...SYNTHETIC_FILES.map((fp) => new File(fp, [], true)),
  ];


Just an idea: what do you think about passing otherFiles to the constructor of the FileBatch class and using fromGlobs for them? That way, we can avoid having an extra argument in the check call. I think that argument can be confusing because we have already provided files to the class.

Eric-Arellano

Yeah, I think that could work too. The reason I didn't do it is that we'd end up recomputing otherFiles many times in the follow-up PR to add historical APIs, since each historical version is its own FileBatch.

arnaucasau

Oh, I see, that's a good reason. However, this list loads all the images for all the versions every time, when perhaps it's not necessary to load any images besides the ones belonging to that version, and maybe the current one. For example, Qiskit v0.33 doesn't need the images in public/images/api/qiskit/0.19/ to be loaded. Regardless, I really like the refactor, so if you think it's okay because the number of images is not that large, I'm good with it too 😃
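The per-version filtering suggested above could look something like the sketch below. It is not part of the PR. `imagesForVersion` is a hypothetical helper, and it assumes that historical API images live under public/images/api/<project>/<version>/ (as in the example path) and that anything outside a version-numbered folder belongs to the current version or is unrelated to versioning.

```typescript
// Hypothetical helper, not part of the PR.
// Assumption: historical images live under public/images/api/<project>/<version>/;
// every other image is kept because any batch might link to it.
function imagesForVersion(
  allImages: string[],
  project: string,
  version: string,
): string[] {
  const projectRoot = `public/images/api/${project}/`;
  return allImages.filter((fp) => {
    if (!fp.startsWith(projectRoot)) return true; // not an API image for this project
    const firstSegment = fp.slice(projectRoot.length).split("/")[0];
    const isVersionFolder = /^\d+\.\d+$/.test(firstSegment);
    // Keep current-version images and images for the requested version only.
    return !isVersionFolder || firstSegment === version;
  });
}

const imagesFor033 = imagesForVersion(
  [
    "public/images/api/qiskit/0.19/a.png",
    "public/images/api/qiskit/0.33/b.png",
    "public/images/api/qiskit/c.png",
    "public/images/guides/d.png",
  ],
  "qiskit",
  "0.33",
);
```

The trade-off raised in the thread still applies: the full image list would be globbed once and then filtered per batch, which only pays off if the number of images is large.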

arnaucasau · 2023-12-29T14:33:36Z

scripts/lib/links/LinkChecker.ts

@@ -25,7 +25,7 @@ export class File {
   *    path: Path to the file
   * anchors: Anchors available in the file
   */
-  constructor(path: string, anchors: string[], synthetic: boolean) {
+  constructor(path: string, anchors: string[], synthetic: boolean = false) {


Good idea with the default value!

Refactor link checker to use FileBatch class

55a8fff

Eric-Arellano added the infra 🏗️ label Dec 28, 2023

Fix arg order

b39712b

frankharkins approved these changes Dec 29, 2023

Document toCheck vs toLoad

d5bebc3

Eric-Arellano enabled auto-merge December 29, 2023 14:28

Eric-Arellano added this pull request to the merge queue Dec 29, 2023

Merged via the queue into main with commit a322836 Dec 29, 2023

Eric-Arellano deleted the EA/link-checker-batches branch December 29, 2023 14:30

arnaucasau reviewed Dec 29, 2023
