jpeg: Add parsing of DHT parameters #934

matmat · 2024-04-21T00:30:18Z

This is my attempt at adding parsing of the Huffman table parameters for the DHT segment in JPEG files. Feel free to clean it up, as I do not speak Go very well :)
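For readers unfamiliar with the segment, here is a minimal Python sketch of the DHT payload layout as described in the JPEG standard: each table starts with one byte holding Tc (table class, high nibble) and Th (destination id, low nibble), followed by 16 counts L1..L16 and then the code values grouped by code length. This is not the Go code from the PR; the function name `parse_dht` and the dictionary keys are assumptions made for illustration only.

```python
def parse_dht(payload: bytes) -> list:
    """Parse the payload of a DHT segment (the bytes after the 2-byte length).

    Hypothetical helper for illustration; not fq's decoder.
    """
    tables = []
    pos = 0
    while pos < len(payload):
        if len(payload) - pos < 17:
            raise ValueError("truncated DHT table header")
        tc_th = payload[pos]
        # L1..L16: number of codes of each bit length 1..16.
        counts = list(payload[pos + 1:pos + 17])
        pos += 17
        if pos + sum(counts) > len(payload):
            raise ValueError("truncated DHT code values")
        # Vi,j: the symbol values, grouped by code length.
        values = []
        for n in counts:
            values.append(list(payload[pos:pos + n]))
            pos += n
        tables.append({
            "tc": tc_th >> 4,    # table class: 0 = DC, 1 = AC
            "th": tc_th & 0x0F,  # table destination identifier
            "counts": counts,
            "values": values,
        })
    return tables
```

A single DHT segment may carry several tables back to back, which is why the sketch loops until the payload is consumed.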

wader · 2024-04-21T07:27:37Z

Hey, thanks! looks good i think. Do you know if the dht tables are large or usually small? looks small when i tried it on a few images.

Please run go test ./format ./pkg/interp -update (run without -update to just see the diff) to write new expected test output, review the changes, and amend the commit if it looks good.

If you want you could also add a new test file, the dht in 4x4.jpeg looks quite simple, maybe want something more realistic?

wader · 2024-04-25T11:18:46Z

Hi again, i fmt:ed the code and updated the tests

wader · 2024-04-25T11:29:53Z

@matmat Thanks!

matmat · 2024-04-28T09:57:45Z

Thank you for merging and cleaning it up! Sorry for not coming back sooner, unfortunately I did not have the time. As to your question about the length. According to[1] "The maximum number of DCT byte codes possible in the baseline JPEG format is 348", though they observed a maximum of 277 in the datasets they looked at.

[1] https://commons.erau.edu/jdfsl/vol13/iss2/7/

Would you accept a similar PR for missing parameters for other markers? (eg. "Ri" for DRI)

wader · 2024-04-28T11:12:47Z

> Thank you for merging and cleaning it up! Sorry for not coming back sooner, unfortunately I did not have the time. As to your question about the length. According to[1] "The maximum number of DCT byte codes possible in the baseline JPEG format is 348", though they observed a maximum of 277 in the datasets they looked at.

Good 👍 I was mostly worried that if something can decode into millions of fields, then decoding of it should maybe be made optional using a format option.

> [1] https://commons.erau.edu/jdfsl/vol13/iss2/7/

> Would you accept a similar PR for missing parameters for other markers? (eg. "Ri" for DRI)

Sure! will accept anything that is either in standards or used in public. The whole point of fq is to decode as detailed as possible, except maybe decode to actual pixels (maybe that also in some cases) so i'm very happy if you want to help fill in missing things! 😄

wader · 2024-06-23T17:56:56Z

@matmat just noticed https://www.diva-portal.org/smash/get/diva2:1870437/FULLTEXT02.pdf congratulations! 🥳 have only briefly scrolled thru it yet but will surely have a deeper look! how was it to use fq? is there any more info how it was used?

matmat · 2024-06-23T23:56:55Z

@wader Thank you! :) We mainly used fq to extract the marker segments and their parameters as an intermediate step towards transforming the data to tabular form suitable for ML processing. This sure saved us a lot of time! fq already supporting extraction of this information in a structured way was very, very helpful. So many thanks for a useful tool!

I have now documented some details here (all very hacky): https://github.com/matmat/jpeg_encoder_ml_classification/

I guess maybe the first three steps are the most relevant from an fq perspective:

jpmarkers2.py is a custom script that always removes the image data (the ECS "segment") from a JPEG, along with the marker segments specified with -r. We do this because we are not interested in the image data and want smaller files to work with in the next steps.

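# Build the -r list as one word; whitespace inside it would split the argument.
markers=APP1,APP2,APP3,APP4,APP5,APP6,APP7,APP8,APP9,APP10
markers=$markers,APP11,APP12,APP13,APP14,APP15,RST0,RST1,RST2,RST3,RST4
markers=$markers,RST5,RST6,RST7
for f in *.jpg; do
    jpmarkers2.py -r "$markers" -i "$f" -o "cleaned_$f"
done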
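To make the cleaning step concrete, the following is a rough Python sketch of how one might walk the marker segments of a JPEG, drop selected markers, and skip the entropy-coded data that follows SOS. It is not the actual jpmarkers2.py; the names `strip_jpeg` and `STANDALONE` are hypothetical, and the sketch assumes a well-formed file.

```python
import struct

# Markers that carry no length field: TEM, SOI, EOI and RST0..RST7.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def strip_jpeg(data: bytes, remove=frozenset()) -> bytes:
    """Return data without entropy-coded segments and without the
    marker segments whose marker byte is in `remove` (e.g. 0xE1 for APP1)."""
    out = bytearray()
    i, n = 0, len(data)
    while i + 1 < n:
        if data[i] != 0xFF:
            raise ValueError(f"expected marker at offset {i}")
        marker = data[i + 1]
        if marker == 0xFF:  # fill byte before a marker
            i += 1
            continue
        if marker in STANDALONE:
            seg_end = i + 2
        else:
            # The big-endian length includes its own two bytes.
            (length,) = struct.unpack_from(">H", data, i + 2)
            seg_end = i + 2 + length
        if marker not in remove:
            out += data[i:seg_end]
        i = seg_end
        if marker == 0xDA:
            # After SOS: skip entropy-coded data, including stuffed FF 00
            # bytes and RSTn markers, until the next real marker.
            while i + 1 < n and not (
                data[i] == 0xFF
                and data[i + 1] not in (0x00, 0xFF)
                and not 0xD0 <= data[i + 1] <= 0xD7
            ):
                i += 1
        if marker == 0xD9:  # EOI
            break
    return bytes(out)
```

Because RSTn markers only occur inside the entropy-coded data, this sketch drops them together with the ECS rather than treating them as separate removable segments.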

Extract features with fq and pipe through jq for pretty printing.

for f in cleaned_*.jpg; do
    fq -r '. | tojson' "$f" | jq . > "$(basename -s .jpg "$f").json"
done

Transform the JSON output from fq to TSV and also do some slight post-processing, such as concatenating qtables into hex strings, among other small things.

for f in *.json; do
    # Strip the .json suffix (not .jpg) so the output is named <name>.tsv.
    transform.py < "$f" > "tsv/$(basename -s .json "$f").tsv"
done
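As an illustration of the JSON-to-TSV idea in the last step, here is a small Python sketch that flattens a nested JSON value into one header row and one value row. It is not the actual transform.py; the functions `flatten` and `to_tsv` and the dotted/bracketed column naming are assumptions, and the sketch does not escape tabs inside string values.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def to_tsv(doc) -> str:
    """Render one document as a two-line TSV: column names, then values."""
    rows = list(flatten(doc))
    header = "\t".join(path for path, _ in rows)
    # Strings are written as-is; other leaves use their JSON spelling.
    line = "\t".join(
        leaf if isinstance(leaf, str) else json.dumps(leaf) for _, leaf in rows
    )
    return header + "\n" + line + "\n"
```

A script built on this could read the JSON from standard input with `json.load(sys.stdin)` and write `to_tsv(...)` to standard output, matching the redirection used in the loop above.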

matmat · 2024-06-24T00:21:56Z

@wader Also, while working on this, I stumbled on the concept of Interval Parsing Grammars. This would be very interesting to explore further to make robust parsers for binary file formats. But maybe that is a bit out of scope for fq?

wader · 2024-06-24T13:59:34Z

Great to hear it was useful! this kind of usage is the reason why fq exists to begin with :) was first created to query media files in various exotic ways while developing and debugging codec and packaging software. ... but also just a way for me to learn more about media files :)

BTW instead of fq -r '.|tojson' $f | jq . you can probably do fq tovalue $f, or the same thing using -V: fq -V . $f (tovalue converts the decode tree into a jq value, which then gets output as JSON)

Nope haven't heard of IPG before, looks very interesting, thanks for sharing! something like that is very much in scope for fq. I've been exploring various ways to do "runtime" formats for fq but nothing finished yet. There is a WIP prototype to add kaitai support, and i see it's mentioned in the paper, looks a bit similar. Usage would then be something like fq -d /path/to.ksy <query> file

jpeg: Add parsing of DHT parameters

6e13b4b

wader force-pushed the dht-patch branch from dcc2b85 to 6e13b4b Compare April 25, 2024 11:17

wader merged commit b8eec40 into wader:master Apr 25, 2024
