
PDF detection fails on uncaught Adobe AI check error on very small files #725

@Jimmy89

Description


In this ticket I want to address three topics:

  1. The S3 tokenizer guide in the readme has not been updated for the exports of more recent versions of @tokenizer/s3: makeTokenizer must be replaced with makeChunkedTokenizerFromS3.
  2. I noticed that, after running my Jest tests, the S3 connection for the file passed to the tokenizer is still open, even though a file type was already detected. Is there a way to enforce closing it?
  3. The bug described below.

I have a very small test PDF file of just 855 bytes, against which I am running file-type v20.0.0.
The PDF file lives in an S3 environment, so I use v1.0.0 of the @tokenizer/s3 package to retrieve it.

When calling fileTypeFromTokenizer with this file, I receive an End-Of-File error every time.
Through some debugging, I found this line to be the cause:

throw error;

As the comment states: if the file is not large enough, the error must be ignored. However, the error I receive is:

End-Of-File
        at RangeRequestTokenizer.loadRange (/node_modules/@tokenizer/range/lib/range-request-tokenizer.js:101:19)

This is not an instance of strtok3.EndOfStreamError and is therefore not caught by the ignore-error if statement. This looks like a bug to me.
Commenting out the throw error resolved the issue for me, but that is not a permanent fix ;)
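As a temporary workaround on my side, I classify the error by its message instead of relying only on the instanceof check. This is just a sketch; isEndOfFileLike and detectSafely are my own helper names, not part of file-type:

```javascript
// Workaround sketch: treat the range tokenizer's "End-Of-File" error the same
// way a strtok3 EndOfStreamError would be treated, by inspecting name/message.
const isEndOfFileLike = (error) =>
  error instanceof Error &&
  (error.name === "EndOfStreamError" || error.message === "End-Of-File");

// Wraps a detection call; returns undefined when the file is simply too small
// for a check to read further, and rethrows anything else.
const detectSafely = async (tokenizer, detect) => {
  try {
    return await detect(tokenizer);
  } catch (error) {
    if (isEndOfFileLike(error)) {
      return undefined; // file not large enough; ignore, as the comment says
    }
    throw error;
  }
};
```

In my tests I pass fileTypeFromTokenizer as the detect argument, so small files yield undefined instead of crashing.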

Below, I have inserted snippets of my code to ease debugging.

To create the PDF:

import { fs } from "zx"; // re-exports the fs-extra package
import { PDFDocument } from "pdf-lib";

const s3DestinationDir = ""; // Change to your liking

const createPdf = async (filePath, content = "Random generated test PDF") => {
  const asset = `${s3DestinationDir}/${filePath}`;
  await fs.ensureFile(asset);
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage();
  page.drawText(content);
  const pdfBytes = await pdfDoc.save();
  await fs.writeFile(asset, pdfBytes);
  return asset;
};

await createPdf("somefile.pdf");

Then

import { S3Client } from "@aws-sdk/client-s3";
import { fileTypeFromTokenizer } from "file-type";
import { makeChunkedTokenizerFromS3 } from "@tokenizer/s3";

// Initialize the S3 client (region/credentials come from my environment).
const client = new S3Client({});

// Initialize the S3 tokenizer
const s3Tokenizer = await makeChunkedTokenizerFromS3(client, {
  Bucket: "NAME",
  Key: "FILE",
});

await fileTypeFromTokenizer(s3Tokenizer);

Existing Issue Check

  • I have searched the existing issues and could not find any related to my problem.

ESM (ECMAScript Module) Requirement Acknowledgment

  • My project is an ESM project and my package.json contains the following entry: "type": "module".

File-Type Scope Acknowledgment

  • I understand that file-type detects binary file types and not text or other formats.
