This repo contains the files and scripts for our Master's thesis, "Exploring JPEG File Containers Without Metadata: A Machine Learning Approach for Encoder Classification".

Below are the steps we used to construct the TSV used for ML processing.

The process is a bit hacky and could probably be done in a much cleaner way, but these are the steps we performed to construct the TSV. They are shown for the Forchheim image data set; the process was very similar for the FloreView one. The final TSV used (floraview_forchheim.tsv) is the concatenation of these two extractions.

jpmarkers2.py is a custom script that always removes the image data from a JPEG (the ECS "segment"), along with any marker segments specified with -r. We are not interested in the image data, and stripping it gives us smaller files to work with in the next steps.

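# Keep the marker list free of whitespace so the shell passes it to -r as one argument.
markers=APP1,APP2,APP3,APP4,APP5,APP6,APP7,APP8,APP9,APP10,APP11,APP12,APP13,APP14,APP15,RST0,RST1,RST2,RST3,RST4,RST5,RST6,RST7
for f in *.jpg; do
    jpmarkers2.py -r "$markers" -i "$f" -o "cleaned_$f"
done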
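To illustrate what this kind of stripping involves, here is a minimal Python sketch. It is not the code of jpmarkers2.py: the function name `strip_jpeg` and its interface are hypothetical, and it assumes a well-formed baseline JPEG stream. It walks the marker segments, drops the entropy-coded data that follows SOS, and drops any segment whose marker code is in `remove`.

```python
import struct

# Markers without a length field: TEM, SOI, EOI and RST0-RST7.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def strip_jpeg(data, remove):
    """Return `data` without entropy-coded bytes and without the
    segments whose marker code (e.g. 0xE1 for APP1) is in `remove`."""
    out = bytearray()
    i, n = 0, len(data)
    while i + 1 < n:
        if data[i] != 0xFF:
            i += 1                      # entropy-coded (image) byte: drop it
            continue
        code = data[i + 1]
        if code == 0x00:
            i += 2                      # stuffed FF00 inside the scan: drop it
            continue
        if code == 0xFF:
            i += 1                      # fill byte before a marker
            continue
        if code in STANDALONE:
            if code not in remove:
                out += data[i:i + 2]
            i += 2
            continue
        if i + 4 > n:
            break                       # truncated segment header
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        if code not in remove:
            out += data[i:i + 2 + length]   # marker + length field + payload
        i += 2 + length
    return bytes(out)
```

With `remove={0xE1}`, for example, the APP1 (Exif) segment is dropped while the SOS header is kept, since only the bytes after that header are image data.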

Extract features with fq and pipe the output through jq for pretty-printing.

for f in cleaned_*.jpg; do
    fq -r '.|tojson' "$f" | jq . > "$(basename -s .jpg "$f").json"
done

Transform the JSON output from fq to TSV and do some light post-processing, such as concatenating the quantization tables (qtables) into hex strings.

mkdir -p tsv
for f in *.json; do
    transform.py < "$f" > "tsv/$(basename -s .json "$f").tsv"
done
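As a rough illustration of this transformation, the sketch below turns one flat JSON object into a two-line TSV and renders integer lists (such as a quantization table) as a hex string. It is not the code of transform.py: the function names and the keys `width` and `qtable_0` are made up for the example, and the real fq output is nested and would need flattening first.

```python
import json


def to_cell(value):
    """Render one JSON value as a TSV cell; integer lists become hex strings."""
    if isinstance(value, list) and all(isinstance(v, int) for v in value):
        return "".join(f"{v:02x}" for v in value)
    return str(value)


def json_to_tsv(text):
    """Convert a flat JSON object into a header line plus one data line."""
    record = json.loads(text)
    header = "\t".join(record)
    row = "\t".join(to_cell(v) for v in record.values())
    return header + "\n" + row + "\n"


# Hypothetical input; the keys are not taken from real fq output.
print(json_to_tsv('{"width": 640, "qtable_0": [16, 11, 10, 255]}'), end="")
```

Encoding each table as a single hex string keeps it in one column, so it can be treated as a categorical feature later on.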

Extract the header row from every TSV file and combine them into a single list of unique headers:

for file in *.tsv; do
    head -1 "$file" >> all_headers.tsv
done
sort -u all_headers.tsv > unique_headers.tsv
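Note that `sort -u` deduplicates whole header lines. If a list of unique column names is wanted instead, the header lines can be split on tabs first. A minimal sketch, where the helper name `unique_columns` is hypothetical:

```python
def unique_columns(header_lines):
    """Return the distinct column names in first-seen order."""
    seen = {}
    for line in header_lines:
        for name in line.rstrip("\n").split("\t"):
            seen.setdefault(name, None)   # dicts preserve insertion order
    return list(seen)


print(unique_columns(["a\tb\n", "b\tc\n"]))
```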

Concatenate the TSVs and align their columns:

tsv_fix.py tsv/ combined.tsv
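Aligning columns here means that every row ends up with the union of all columns, with empty cells where a file lacked a column. The sketch below shows one way to do that with the standard csv module. It is an assumption about what the step does, not the script's actual code, and `combine_tsv` is a hypothetical name.

```python
import csv
import io


def combine_tsv(texts):
    """Merge several TSV documents into one, padding missing columns."""
    rows, columns = [], []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)      # keep first-seen column order
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)                # restval fills the missing cells
    return out.getvalue()


print(combine_tsv(["a\tb\n1\t2\n", "b\tc\n3\t4\n"]), end="")
```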

Join on IDs to add columns for manufacturer, date, etc.:

add_ids_and_more.py combined.tsv IDs.tsv combined_with_IDs.tsv
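Conceptually this is a left join of the feature rows against the ID table. A minimal sketch follows, assuming both tables share a key column. The column names `id`, `width` and `manufacturer` and the function `join_on_id` are hypothetical and may not match IDs.tsv or add_ids_and_more.py.

```python
def join_on_id(rows, id_rows, key="id"):
    """Left-join `rows` with `id_rows` on `key`; unmatched rows are kept as-is."""
    lookup = {r[key]: r for r in id_rows}
    joined = []
    for row in rows:
        extra = lookup.get(row[key], {})
        joined.append({**extra, **row})   # values from `row` win on conflicts
    return joined


features = [{"id": "1", "width": "640"}, {"id": "2", "width": "800"}]
ids = [{"id": "1", "manufacturer": "Acme"}]
print(join_on_id(features, ids))
```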

matmat/jpeg_encoder_ml_classification

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages