
scottdoc commented Nov 18, 2024

Similar to how xlsx detection is short-circuited based on the filename in the ZIP header, I have done the same for pptx.

Please feel free to reject this if it is not a safe or acceptable way to detect pptx files. In my testing, this commit reduced the detection time for this file from 7 minutes to a few seconds.

Resolves #688
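
For readers unfamiliar with the shortcut described above, here is a minimal sketch of the idea, not the actual file-type implementation. It assumes the bytes of a ZIP local file header are already in memory; `readLocalFileHeaderName` and `guessTypeFromEntryName` are hypothetical helper names, and only the `xl/` and `ppt/` prefixes mentioned in this thread are covered.

```typescript
// Hypothetical helpers for illustration; not part of file-type's API.

// Little-endian value of the ZIP local file header signature "PK\x03\x04".
const LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50;
// Size of the fixed part of a local file header, before the filename.
const FIXED_HEADER_SIZE = 30;

// Returns the entry filename stored in the local file header at `offset`,
// or undefined if the bytes do not form a complete header.
function readLocalFileHeaderName(bytes: Uint8Array, offset = 0): string | undefined {
  if (bytes.length < offset + FIXED_HEADER_SIZE) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint32(offset, true) !== LOCAL_FILE_HEADER_SIGNATURE) return undefined;

  // The filename length is a 16-bit little-endian field at byte 26 of the header.
  const filenameLength = view.getUint16(offset + 26, true);
  const nameStart = offset + FIXED_HEADER_SIZE;
  if (bytes.length < nameStart + filenameLength) return undefined;

  // Decode byte by byte; sufficient for the ASCII prefixes checked below.
  let filename = "";
  for (let i = 0; i < filenameLength; i++) {
    filename += String.fromCharCode(bytes[nameStart + i]);
  }
  return filename;
}

// Guesses the Office format from the path prefix of a single ZIP entry.
// This is a heuristic: any ZIP archive can contain a "ppt/" folder.
function guessTypeFromEntryName(filename: string): "pptx" | "xlsx" | undefined {
  if (filename.startsWith("ppt/")) return "pptx";
  if (filename.startsWith("xl/")) return "xlsx";
  return undefined;
}
```

The speed-up comes from returning as soon as one entry name matches, instead of continuing to walk the archive. The trade-off is the one raised in the review: a path prefix is a guess, not proof of the format.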

Borewit changed the title from "Return pptx mime type if detected in the zipHeader.filename (#688)" to "Return pptx mime type if detected in the zipHeader.filename" on Nov 18, 2024
Borewit (Collaborator) commented Nov 18, 2024

In many cases, the tokenizer-s3 library can significantly enhance the speed of S3 file detection.

I am not in favor of relying solely on matching the ppt/ path, as this approach feels more like an educated guess than a reliable method for identifying a pptx file.

A more robust solution might involve extracting and analyzing the docProps/app.xml or [Content_Types].xml file, which offers a much more specific detection mechanism. However, this approach requires decompression and XML parsing, which adds complexity. Consequently, it might be better to delegate this functionality to an add-on for Office or zipped formats.
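
As a rough illustration of the [Content_Types].xml approach, the sketch below assumes the file has already been located in the archive and decompressed into a string, which is the costly part noted above. `detectOoxmlFromContentTypes` is a hypothetical name, and the regular expression is a stand-in for real XML parsing. The content type strings are the standard Office Open XML main-part types.

```typescript
// Hypothetical helper for illustration; not part of file-type's API.

interface OoxmlType {
  ext: string;
  mime: string;
}

// Content type of each format's main part, as declared in [Content_Types].xml.
const MAIN_PART_TYPES = new Map<string, OoxmlType>([
  [
    "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml",
    { ext: "pptx", mime: "application/vnd.openxmlformats-officedocument.presentationml.presentation" },
  ],
  [
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml",
    { ext: "xlsx", mime: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" },
  ],
  [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
    { ext: "docx", mime: "application/vnd.openxmlformats-officedocument.wordprocessingml.document" },
  ],
]);

// Scans the decompressed XML text for ContentType attributes and returns the
// first one that identifies a known main part. Assumption: a regex scan is
// good enough for a sketch; a production version should use an XML parser.
function detectOoxmlFromContentTypes(xml: string): OoxmlType | undefined {
  const pattern = /ContentType\s*=\s*(["'])(.*?)\1/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    const hit = MAIN_PART_TYPES.get(match[2] ?? "");
    if (hit) return hit;
  }
  return undefined;
}
```

Unlike the path-prefix check, this relies on what the package itself declares as its main part, so an unrelated ZIP archive that happens to contain a ppt/ folder would not match.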

Borewit changed the title from "Return pptx mime type if detected in the zipHeader.filename" to "Return pptx type based on /ppt* path prefix occurrence in zipped files" on Nov 18, 2024
scottdoc (Author) commented

Yes, I do use the @tokenizer/s3 library.

My code is more or less:

import { fileTypeFromTokenizer } from "file-type";
import { makeTokenizer } from "@tokenizer/s3";

// s3Client and location are defined elsewhere in the application.
const tokenizer = await makeTokenizer(s3Client, {
  Bucket: location.bucket,
  Key: location.key
});

// Detect the file type through the tokenizer, then release it.
const type = await fileTypeFromTokenizer(tokenizer);
await tokenizer.close();

Development

Successfully merging this pull request may close these issues.

Detecting a pptx taking a very long time