
Conversation

Eric-Arellano
Collaborator

Closes #461.

Before, the script took me about 42 seconds. Now, about 5.7 seconds:

```
❯ hyperfine 'npm run check:links'
Benchmark 1: npm run check:links
  Time (mean ± σ):      5.680 s ±  0.143 s    [User: 7.285 s, System: 0.471 s]
  Range (min … max):    5.548 s …  5.964 s    10 runs
```

This PR differentiates between files we want to check and files we only need to know exist because other files might link to them. For the latter, we still need to read in those files to determine their anchors.

The key insight of this PR is that 99% of the historical API docs can be ignored because the stable docs don't link to them. We only need to load a couple of historical API docs because our migration guides link to them.

--

I think we will probably want to add an argument to allow checking links for historical API docs: #495. This should work well with that change; we'd update the list based on the `Arguments` passed in.
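The load-vs-check split described above can be sketched roughly as follows. This is a hedged illustration, not the script's actual code: `partitionFiles`, `isHistorical`, and the path pattern are all hypothetical names.

```typescript
// Hypothetical sketch of the load-vs-check split; names and patterns
// are illustrative, not the repo's actual implementation.
type FilePartition = { toCheck: string[]; toLoadOnly: string[] };

function partitionFiles(
  allFiles: string[],
  shouldCheck: (file: string) => boolean,
): FilePartition {
  const toCheck: string[] = [];
  const toLoadOnly: string[] = [];
  for (const file of allFiles) {
    (shouldCheck(file) ? toCheck : toLoadOnly).push(file);
  }
  return { toCheck, toLoadOnly };
}

// Example predicate: treat versioned API paths (e.g. /api/qiskit/0.19/)
// as historical, so they are loadable but not link-checked.
const isHistorical = (file: string) => /\/api\/[^/]+\/0\.\d+\//.test(file);
```

Both sets still get read to collect anchors; only the `toCheck` set has its outbound links validated.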

@@ -116,7 +142,6 @@ const readArgs = (): Arguments => {
.version(false)
.option("external", {
type: "boolean",
demandOption: false,
Eric-Arellano (Collaborator, Author):

This is the default. No need to set it.
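For reference, `demandOption` defaults to `false` in the standard yargs API, so the two definitions below behave identically (a sketch assuming yargs is installed):

```typescript
import yargs from "yargs";

// `demandOption: false` is yargs's default, so these are equivalent:
yargs(process.argv.slice(2)).option("external", { type: "boolean", demandOption: false });
yargs(process.argv.slice(2)).option("external", { type: "boolean" });
```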

@@ -134,57 +159,60 @@ function markdownFromNotebook(source: string): string {
return markdown;
}

async function getMarkdownAndAnchors(
Eric-Arellano (Collaborator, Author):

Same logic as before.
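As a rough illustration of what collecting anchors involves, here is a hedged sketch that approximates GitHub-style heading slugs. This is not the repo's `getMarkdownAndAnchors` implementation; the function name and slug rules are assumptions.

```typescript
// Hypothetical sketch: derive anchor slugs from markdown headings,
// approximating GitHub-style slugification. Not the repo's exact code.
function anchorsFromMarkdown(markdown: string): Set<string> {
  const anchors = new Set<string>();
  for (const line of markdown.split("\n")) {
    const match = line.match(/^#+\s+(.+)/);
    if (match) {
      const slug = match[1]
        .toLowerCase()
        .trim()
        .replace(/[^\w\s-]/g, "") // drop punctuation
        .replace(/\s+/g, "-");    // spaces become hyphens
      anchors.add(`#${slug}`);
    }
  }
  return anchors;
}
```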

@arnaucasau arnaucasau (Collaborator) left a comment:

Thank you Eric! On my end, the execution time goes from 1 minute 40 seconds to roughly 15 seconds! Awesome work!!

Indeed, the problem with the script was the number of historical files whose anchors we were parsing. Adding an extra glob to decide which ones to load is a really good idea; on top of the speedup, we now avoid needing extra logic to ignore files.

Eric-Arellano and others added 2 commits December 8, 2023 10:17
Co-authored-by: Arnau Casau <47946624+arnaucasau@users.noreply.github.com>
@Eric-Arellano Eric-Arellano added this pull request to the merge queue Dec 8, 2023
Merged via the queue into main with commit d1279aa Dec 8, 2023
@Eric-Arellano Eric-Arellano deleted the EA/speed-up-link-checker branch December 8, 2023 16:43
github-merge-queue bot pushed a commit that referenced this pull request Dec 29, 2023
In #495, we will start
checking historical API docs. For performance, it's important that we
don't load every single file in the project at once in memory. Instead,
we want to operate in batches: e.g. check all of 0.44, then drop from
memory and move on to 0.43.

This PR is a pre-factor to add a new `FileBatch` class. For now, we only
have a single `FileBatch`, the same as before. This class still
differentiates between files "to load" vs "to check", per
#496.
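A minimal sketch of what such a `FileBatch` might look like, under the load/check split described above. Field and method names here are assumptions, not necessarily the merged class's API.

```typescript
// Hypothetical FileBatch sketch; names are assumptions, not the merged API.
class FileBatch {
  constructor(
    readonly toCheck: string[], // files whose outbound links we validate
    readonly toLoad: string[],  // files loaded only so their anchors are known
  ) {}

  // Everything must be read for anchors, but only `toCheck` files
  // have their links validated.
  allFilesToLoad(): string[] {
    return [...this.toCheck, ...this.toLoad];
  }
}

// Batching keeps memory bounded: process one historical version at a
// time, letting the previous batch be garbage-collected before the next.
async function checkBatches(
  batches: FileBatch[],
  check: (batch: FileBatch) => Promise<void>,
): Promise<void> {
  for (const batch of batches) {
    await check(batch);
  }
}
```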
frankharkins pushed a commit to frankharkins/documentation that referenced this pull request Jul 22, 2024
frankharkins pushed a commit to frankharkins/documentation that referenced this pull request Jul 22, 2024
Successfully merging this pull request may close: Profile link checker (#461).