Description
In this ticket I want to address three topics:

- The s3 tokenizer guide in the readme has not been updated for the exports of more recent versions of `@tokenizer/s3`: `makeTokenizer` must be replaced with `makeChunkedTokenizerFromS3`.
- I noticed that, after running my Jest tests, the S3 'connection' to the file passed to the tokenizer is still 'open', even though a file type was already detected. Is there a way to enforce closing it?
- The bug described below.
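On the second topic, here is a minimal sketch of the pattern I would expect to work, assuming the tokenizer exposes a `close()` method (strtok3-based tokenizers declare one on their tokenizer interface). The mock below stands in for a real `@tokenizer/s3` tokenizer, and the real `close()` returns a Promise:

```javascript
// Mock tokenizer standing in for a real @tokenizer/s3 tokenizer;
// I assume the real one exposes close() to release its S3 connection.
const makeMockTokenizer = () => {
  let open = true;
  return {
    get isOpen() { return open; },
    close() { open = false; }, // the real close() is async
  };
};

// Always close the tokenizer, whether or not detection succeeds.
const detectAndClose = (tokenizer, detect) => {
  try {
    return detect(tokenizer); // stand-in for fileTypeFromTokenizer(tokenizer)
  } finally {
    tokenizer.close();
  }
};

const tokenizer = makeMockTokenizer();
const mime = detectAndClose(tokenizer, () => 'application/pdf');
```

With the real async APIs the same shape applies, just with `await` inside the `try`/`finally`.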
I have a very small test PDF file of just 855 bytes, which I am running file-type v20.0.0 upon. The PDF file lives in an S3 environment, so I use v1.0.0 of the `@tokenizer/s3` package to retrieve it. When calling `fileTypeFromTokenizer` with this file, I receive an `End-Of-File` error every time.
Through some debugging, I traced it to this line (line 744 in 3945d7f):

`throw error;`
As the comment there states, the error must be ignored if the file is not large enough. However, the error I receive is:

```
End-Of-File
    at RangeRequestTokenizer.loadRange (/node_modules/@tokenizer/range/lib/range-request-tokenizer.js:101:19)
```

which is not an instance of `strtok3.EndOfStreamError` and is therefore not caught by the ignore-error if statement. This looks like a bug to me. Commenting out the `throw error` resolved the issue for me, but that is of course not a permanent fix ;)
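A tiny self-contained illustration of the mismatch (these are local stand-in classes, not the real strtok3 / `@tokenizer/range` exports):

```javascript
// Stand-in for strtok3's EndOfStreamError, which the ignore-error
// check in file-type tests against with instanceof.
class EndOfStreamError extends Error {}

// @tokenizer/range instead throws a plain Error with this message.
const rangeError = new Error('End-Of-File');

// How the check effectively behaves today: the range tokenizer's
// error is not an EndOfStreamError, so it is rethrown.
const ignoredToday = rangeError instanceof EndOfStreamError;

// A message-based fallback would also catch the range tokenizer's error.
const ignoredWithMessageCheck =
  rangeError instanceof EndOfStreamError ||
  rangeError.message === 'End-Of-File';
```

Matching on the message is fragile, of course; the cleaner fix would be for both packages to throw (or re-export) the same error class.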
Below are snippets of my code to ease the debugging.
To create the PDF:

```js
import { fs } from "zx"; // which is the fs-extra package
import { PDFDocument } from 'pdf-lib';

const s3DestinationDir = ""; // Change to your liking

const createPdf = async (filePath, content = 'Random generated test PDF') => {
  const asset = `${s3DestinationDir}/${filePath}`;
  await fs.ensureFile(asset);
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage();
  page.drawText(content);
  const pdfBytes = await pdfDoc.save();
  await fs.writeFile(asset, pdfBytes);
  return asset;
};

await createPdf("somefile.pdf");
```
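As a sanity check that the generated file is a detectable PDF at all, the magic bytes can be inspected directly; as far as I know, file-type keys PDF detection on the leading `%PDF-` bytes. The helper below is my own, not part of file-type:

```javascript
// Check whether a buffer starts with the PDF signature "%PDF-".
const hasPdfSignature = (buf) =>
  buf.length >= 5 && buf.subarray(0, 5).toString('latin1') === '%PDF-';

// Example against in-memory samples; a real check would read the
// first bytes of the somefile.pdf created above.
const looksLikePdf = hasPdfSignature(Buffer.from('%PDF-1.7\n'));
const notPdf = hasPdfSignature(Buffer.from('hello world'));
```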
Then, to detect the file type:

```js
import { S3Client } from "@aws-sdk/client-s3";
import { fileTypeFromTokenizer } from 'file-type';
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';

// Initialize the S3 client
const client = new S3Client({});

// Initialize the S3 tokenizer
const s3Tokenizer = await makeChunkedTokenizerFromS3(client, {
  Bucket: "NAME",
  Key: "FILE",
});

await fileTypeFromTokenizer(s3Tokenizer);
```
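Until the instanceof check is fixed, my call-site workaround is to treat any error whose message is `End-Of-File` as "type undetectable". This is a sketch only: `detect` stands in for `() => fileTypeFromTokenizer(s3Tokenizer)`, which in reality is async and would be awaited inside the `try`:

```javascript
// Treat the range tokenizer's plain "End-Of-File" error as "no type
// detected" (undefined) instead of letting it propagate.
const detectSafely = (detect) => {
  try {
    return detect();
  } catch (error) {
    if (error instanceof Error && error.message === 'End-Of-File') {
      return undefined; // file too small to classify
    }
    throw error; // anything else is a genuine failure
  }
};

// Simulates the failure from this report:
const result = detectSafely(() => {
  throw new Error('End-Of-File');
});
```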
Existing Issue Check
- I have searched the existing issues and could not find any related to my problem.
ESM (ECMAScript Module) Requirement Acknowledgment
- My project is an ESM project and my `package.json` contains the following entry: `"type": "module"`.
File-Type Scope Acknowledgment
- I understand that file-type detects binary file types and not text or other formats.