Skip to content

Conversation

Eric-Arellano
Copy link
Collaborator

In #495, we will start checking historical API docs. For performance, it's important that we don't load every single file in the project at once in memory. Instead, we want to operate in batches: e.g. check all of 0.44, then drop from memory and move on to 0.43.

This PR is a pre-factor to add a new FileBatch class. For now, we only have a single FileBatch, the same as before. This class still differentiates between files "to load" vs "to check", per #496.

@Eric-Arellano
Copy link
Collaborator Author

I didn't add tests for FileBatch since it's too complex to test with needing actual files set up etc.

Copy link
Member

@frankharkins frankharkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +49 to +61
async load(): Promise<[File[], Link[], Link[]]> {
const files: File[] = [];
for (let filePath of this.toLoad) {
const [_, anchors] = await getMarkdownAndAnchors(filePath);
files.push(new File(filePath, anchors));
}

const linksToOriginFiles = new Map<string, string[]>();
for (const filePath of this.toCheck) {
const [markdown, anchors] = await getMarkdownAndAnchors(filePath);
files.push(new File(filePath, anchors));
await addLinksToMap(filePath, markdown, linksToOriginFiles);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't specific to this PR so is non-blocking, but I'm a bit confused about "to load" vs "to check"; it seems we're "loading" both (through getMarkdownAndAnchors)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is toCheck a subset of toLoad?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key difference is await addLinksToMap(filePath, markdown, linksToOriginFiles), which we only do with toCheck. For toLoad, we record all of its links because they might be linked to from other files. For toCheck, we also record its own links to make sure that those links are valid.

I'll add comments.

@Eric-Arellano Eric-Arellano added this pull request to the merge queue Dec 29, 2023
Merged via the queue into main with commit a322836 Dec 29, 2023
@Eric-Arellano Eric-Arellano deleted the EA/link-checker-batches branch December 29, 2023 14:30
Copy link
Collaborator

@arnaucasau arnaucasau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactor looks really good! I tested it locally and all works well. Thank you Eric!

Comment on lines 68 to 73
const otherFiles = [
...(await globby("{public,docs}/**/*.{png,jpg,gif,svg}")).map(
(fp) => new File(fp, [], false),
(fp) => new File(fp, []),
),
...SYNTHETIC_FILES.map((fp) => new File(fp, [], true)),
];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as an idea. What do you think about having the otherFiles in the constructor of the fileBatch class and using the fromGlobs for them? By doing that, we can avoid having an extra argument in the check call. I think that the argument can be confusing because we already provided files to the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that could work too. The reason I didn't do it is we'd end up recomputing otherFiles many times in the follow up PR to add historical APIs, since each historical version is its own FileBatch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, that's a good reason. However, this list loads all the images for all the versions every time, when perhaps it's not necessary to load any image besides the ones belonging to that version, and maybe the current one. For example, Qiskit v0.33 doesn't need to have the images in public/images/api/qiskit/0.19/ load. Regardless of that, I like a lot you did the refactor, so if you think it's okay because the amount of images is not that large, I'm good with it too 😃

@@ -25,7 +25,7 @@ export class File {
* path: Path to the file
* anchors: Anchors available in the file
*/
constructor(path: string, anchors: string[], synthetic: boolean) {
constructor(path: string, anchors: string[], synthetic: boolean = false) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea with the default value!

frankharkins pushed a commit to frankharkins/documentation that referenced this pull request Jul 22, 2024
In Qiskit#495, we will start
checking historical API docs. For performance, it's important that we
don't load every single file in the project at once in memory. Instead,
we want to operate in batches: e.g. check all of 0.44, then drop from
memory and move on to 0.43.

This PR is a pre-factor to add a new `FileBatch` class. For now, we only
have a single `FileBatch`, the same as before. This class still
differentiates between files "to load" vs "to check", per
Qiskit#496.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants