
JPG Forensics

10 Posts
4 Users
0 Likes
2,477 Views
(@tootypeg)
Posts: 173
Estimable Member
Topic starter
 

Hi all,

Quick question: I've been looking into JPG forensics a little and I have a load of cached JPGs from a single website. They all have the same quantization values, suggesting this is a trait of the site, but in addition they also have similar 'Progressive DCT, SOF2, 17 byte' values. I just can't find any resources that explain this to me in simple terms. What is this value?

 
Posted : 29/09/2019 4:25 pm
(@athulin)
Posts: 1156
Noble Member
 

Hi all,

Quick question: I've been looking into JPG forensics a little and I have a load of cached JPGs from a single website. They all have the same quantization values, suggesting this is a trait of the site, but in addition they also have similar 'Progressive DCT, SOF2, 17 byte' values. I just can't find any resources that explain this to me in simple terms. What is this value?

'Progressive' refers to progressive coding, used to speed up display over the internet: the image is first coded very 'coarsely', giving a kind of out-of-focus view, then coded with progressively 'finer' resolution, showing more and more detail. Good when you have a slow link, as you get a rough approximation of the image fairly quickly. And provided your browser/viewer does progressive JPEG – lack of support used to be a problem in some quarters.

'DCT' = Discrete Cosine Transform, which is the backbone of JPEG compression. Ask a mathematician to explain that …

SOF = 'Start of Frame', a magic number if you like. JPEG has multiple SOFs to flag the kind of data being described. One of these is the baseline DCT SOF (SOF0); another is the progressive DCT SOF (SOF2).

Following the SOF marker are a number of fixed bytes, with the same content layout for all types of SOF (8 bytes, I think), and a number of additional fields, the count of which is specified by one of the fixed fields. I guess that the particular SOF header you have … produces 17 bytes either in toto, or in the SOF fields, or in only the dynamic SOF fields …

You probably need the JPEG standard to make sense of the structure of a JPEG file.
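To make the layout concrete, here is a minimal Python sketch (the function name and the sample payload are my own invention, not from any library) that decodes the payload of an SOFn segment. Note that for a typical three-component (YCbCr) image the whole segment works out to 2 (length field) + 6 (fixed header) + 3×3 (component fields) = 17 bytes, which may be where the '17 byte' figure comes from:

```python
import struct

def parse_sof(payload: bytes) -> dict:
    """Decode an SOFn segment payload (the bytes after the 2-byte length).

    Layout: precision (1 byte), height (2), width (2), component count (1),
    then 3 bytes per component.
    """
    precision, height, width, ncomp = struct.unpack(">BHHB", payload[:6])
    comps = []
    for i in range(ncomp):
        cid, sampling, qtab = payload[6 + 3 * i : 9 + 3 * i]
        comps.append({
            "id": cid,
            "h_sampling": sampling >> 4,    # high nibble
            "v_sampling": sampling & 0x0F,  # low nibble
            "quant_table": qtab,
        })
    return {"precision": precision, "height": height,
            "width": width, "components": comps}

# Hypothetical SOF2 payload for an 8-bit 640x480 YCbCr image
payload = bytes([8, 0x01, 0xE0, 0x02, 0x80,  # precision, height 480, width 640
                 3,                          # three components
                 1, 0x22, 0,                 # Y: 2x2 sampling, quant table 0
                 2, 0x11, 1,                 # Cb: 1x1 sampling, quant table 1
                 3, 0x11, 1])                # Cr: 1x1 sampling, quant table 1
info = parse_sof(payload)
print(info["width"], info["height"], len(info["components"]))  # 640 480 3
```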

 
Posted : 29/09/2019 5:11 pm
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

Around pages 34-35 here?

https://www.w3.org/Graphics/JPEG/itu-t81.pdf

jaclaz

 
Posted : 29/09/2019 5:20 pm
(@trewmte)
Posts: 1877
Noble Member
 

From what you have stated, could it be that you are looking at the 17 bytes associated with DHT?

DHT
The DHT (Define Huffman Table) marker defines (or redefines) Huffman tables, which are identified by a class (AC or DC) and a number. A single DHT marker can define multiple tables; however, baseline mode is limited to two of each type, and progressive and sequential modes are limited to four. The only restriction on the placement of DHT markers is that if a scan requires a specific table identifier and class, it must have been defined by a DHT marker earlier in the file.

The structure of the DHT marker is shown below. Each Huffman table is 17 bytes of fixed data followed by a variable field of up to 256 additional bytes. The first fixed byte contains the identifier for the table. The next 16 form an array of unsigned 1-byte integers whose elements give the number of Huffman codes for each possible code length (1-16). The sum of the 16 code lengths is the number of values in the Huffman table. The values are 1 byte each and follow, in order of Huffman code, the length counts.

The number of Huffman tables defined by the DHT marker is determined from the length field. An application needs to maintain a counter that is initialized with the value of the length field minus 2. Each time you read a table you subtract its length from the counter. When the counter reaches zero all the tables have been read. No padding is allowed in a DHT marker, so if the counter becomes negative the file is invalid.

Field Size…….Description
1 byte………..The 4 high-order bits specify the table class: a value of 0 means a DC table, a value of 1 means an AC table. The 4 low-order bits specify the table identifier. This value is 0 or 1 for baseline frames and 0, 1, 2, or 3 for progressive and extended frames

16 bytes……..The count of Huffman codes of each length from 1 to 16. Each count is stored in 1 byte

Variable……..The 1-byte symbols, sorted by Huffman code. The number of symbols is the sum of the 16 code counts
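As a rough illustration of the 17-bytes-of-fixed-data layout described above, a DHT payload could be walked like this in Python (a sketch under my own naming, not production code – it omits the length-counter validation described earlier):

```python
def parse_dht(payload: bytes) -> list:
    """Walk the Huffman tables in a DHT segment payload (after the length field).

    Each table: 1 class/identifier byte + 16 count bytes (= 17 fixed bytes),
    followed by sum(counts) symbol bytes.
    """
    tables, pos = [], 0
    while pos < len(payload):
        tc_th = payload[pos]                # class (high nibble) / id (low nibble)
        counts = list(payload[pos + 1 : pos + 17])
        nsyms = sum(counts)                 # number of symbol bytes that follow
        symbols = list(payload[pos + 17 : pos + 17 + nsyms])
        tables.append({
            "class": "AC" if tc_th >> 4 else "DC",
            "id": tc_th & 0x0F,
            "counts": counts,
            "symbols": symbols,
        })
        pos += 17 + nsyms
    return tables

# Toy DC table 0: two codes of length 2, one code of length 3, symbols 0, 1, 2
payload = bytes([0x00] + [0, 2, 1] + [0] * 13 + [0, 1, 2])
table = parse_dht(payload)[0]
print(table["class"], table["id"], sum(table["counts"]))  # DC 0 3
```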

 
Posted : 29/09/2019 5:45 pm
(@tootypeg)
Posts: 173
Estimable Member
Topic starter
 

Thanks everyone, all three of you have really helped me move this forward.

My reason for asking is I'm putting together some work and want to make sure I'm doing it correctly. I have collected the quantization tables from cached browser images from 50 pornography sites. Basically, I think the quant tables can allow me to tell which site an image has come from, but in addition I also collected the Start Of Frame (Progressive DCT) 17 byte value, as this also seemed to be of value from a data/pattern-matching point of view. But I'm worried that I've misunderstood it, and therefore I'm tentative about including it.

So, I'm guessing from the data I've got that each site seems to encode its hosted JPGs in its own way (and all JPGs hosted on each individual site have consistent quant tables). So, I think if someone gave me a random image from one of the 50 sites, I could match it to one by comparing the quant tables. This seems to be my interpretation, but I'm also not a JPG expert. Does that sound useful, and does my understanding seem right?
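For what it's worth, the kind of comparison I have in mind could be sketched roughly like this in Python (a hypothetical sketch – the function names are invented, and a real tool would need to handle odd or truncated files more carefully):

```python
import struct

def extract_dqt_tables(jpeg_bytes: bytes) -> dict:
    """Collect every quantization table (DQT segments, marker 0xFFDB) in a JPEG.

    Returns a dict mapping table id -> tuple of coefficients, which can be
    compared across files as a crude fingerprint.
    """
    tables, pos = {}, 2              # skip the SOI marker (FF D8)
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break                    # lost sync; give up on this sketch
        marker = jpeg_bytes[pos + 1]
        length = struct.unpack(">H", jpeg_bytes[pos + 2 : pos + 4])[0]
        if marker == 0xDB:           # DQT: one or more tables in this segment
            seg = jpeg_bytes[pos + 4 : pos + 2 + length]
            i = 0
            while i < len(seg):
                pq, tq = seg[i] >> 4, seg[i] & 0x0F
                step = 128 if pq else 64          # 16-bit vs 8-bit entries
                tables[tq] = tuple(seg[i + 1 : i + 1 + step])
                i += 1 + step
        elif marker == 0xDA:         # SOS: entropy-coded data follows, stop
            break
        pos += 2 + length
    return tables

def same_source(tables_a: dict, tables_b: dict) -> bool:
    """A minimal match test: identical table ids and coefficients."""
    return tables_a == tables_b

# Synthetic JPEG: SOI + one DQT segment (8-bit table 0) + EOI
dqt = bytes([0xFF, 0xDB, 0x00, 0x43, 0x00]) + bytes(range(64))
fake = bytes([0xFF, 0xD8]) + dqt + bytes([0xFF, 0xD9])
t = extract_dqt_tables(fake)
print(len(t), len(t[0]))  # 1 64
```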

 
Posted : 29/09/2019 7:32 pm
(@trewmte)
Posts: 1877
Noble Member
 

tootypeg, a bit of extra info:

SOFn
The SOFn (Start of Frame) marker defines a frame. Although there are many frame types, all have the same format. The SOF marker consists of a fixed header after the marker length followed by a list of structures that define each component used by the frame. The structure of the fixed header and the structure of a component definition are shown below.

Components are identified by an integer in the range 0 to 255. The JFIF standard is more restrictive and specifies that the components be defined in the order {Y, Cb, Cr} with the identifiers {1, 2, 3} respectively. Unfortunately, some encoders do not follow the standard and assign other identifiers to the components. The most inclusive way for a decoder to match the colourspace component with the identifier is to go by the order in which the components are defined and to accept whatever identifier the encoder assigns. There can be only one SOFn marker per JPEG file and it must precede any SOS markers.

Fixed Portion of an SOF Marker

Field Size…….Description
1 byte……….Sample precision in bits (can be 8 or 12)
2 bytes………Image height in pixels
2 bytes………Image width in pixels
1 byte………Number of components in the image

Component-Specific Area of an SOF Marker

Field Size…….Description
1 byte………..Component identifier. JPEG allows this to be 0 to 255. JFIF restricts it to 1 (Y), 2 (Cb), or 3 (Cr)

1 byte………..The 4 high-order bits specify the horizontal sampling for the component. The 4 low-order bits specify the vertical sampling. Either value can be 1, 2, 3, or 4 according to the standard. We do not support values of 3 in our code

1 byte………..The quantization table identifier for the component. Corresponds to the identifier in a DQT marker. Can be 0, 1, 2, or 3
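Given the layout above, the total size of an SOF segment is simple arithmetic, and for a typical three-component (YCbCr) image it comes to exactly 17 bytes, which may well be the '17 byte value' observed (a quick sketch):

```python
def sof_segment_length(ncomponents: int) -> int:
    """Bytes in an SOFn segment after the FFC0/FFC2 marker itself:
    2 (length field) + 6 (fixed header) + 3 per component."""
    return 2 + 6 + 3 * ncomponents

print(sof_segment_length(3))  # 17 -- three-component YCbCr
print(sof_segment_length(1))  # 11 -- single-component greyscale
```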

 
Posted : 29/09/2019 7:46 pm
(@athulin)
Posts: 1156
Noble Member
 

So, I'm guessing from the data I've got that each site seems to encode its hosted JPGs in its own way (and all JPGs hosted on each individual site have consistent quant tables).

That sounds interesting, but somewhat odd. What mechanism in JPEG encoding/compression or web delivery would account for that? A tool artifact, I could accept. A user artifact ('Our user X set up the tool …') perhaps. Web server plugins like ImageResizer – probably falls into the tool artifact category.

Or is it an artifact associated with the original data? (Difficult to see how – why would tools 1, 2, and 3 do the same thing just because the input data had a particular format?)

Or is it related to the content? Don't see how. Or image sizes?

Or is there some kind of copyright protection involved? Some kind of watermarking component involved that works depending on the identity of the publisher or the web site?

So, I think if someone gave me a random image from one of the 50 sites, I could match it to one by comparing the quant tables. This seems to be my interpretation, but I'm also not a JPG expert. Does that sound useful, and does my understanding seem right?

It's a start, but … What would happen if you took images from outside the master collection? Say, the Corel image library, or images from some of those CDs that are archived over at archive.org? Are they identified as 'not recognized'?

Can you get the same result out of encoding tools *you* choose? Or have you found something that works only because of some kind of 'best practice' in the porn industry?

 
Posted : 30/09/2019 5:34 am
(@tootypeg)
Posts: 173
Estimable Member
Topic starter
 

Trewmte - thanks for this additional information! I think I am slowly understanding.

Athulin – I agree, it is weird, but for example, some of the cached images collected are preview stills from hosted videos. Would these stills be generated by some technology as part of the website, so that regardless of what format was uploaded, the website standardises them as part of the upload process?

 
Posted : 30/09/2019 8:02 am
jaclaz
(@jaclaz)
Posts: 5133
Illustrious Member
 

Should it be of use, this thingy here

http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=75264&lngWId=1

can actually - among other features - parse/decode the SOF's and show the values in human readable format.

I personally doubt that the values can be used as a fingerprint of a given site. As I see it:
50/<actual number of pornographic sites> =~ 1/1 petafantazillion =~ 0 +/- 0
or, if you prefer, for each of the 50 sites you analysed there is at least a megazillion of similar sites using the same software to get (progressive) JPEGs out of the videos.

jaclaz

 
Posted : 30/09/2019 8:40 am
(@athulin)
Posts: 1156
Noble Member
 

Athulin – I agree, it is weird, but for example, some of the cached images collected are preview stills from hosted videos. Would these stills be generated by some technology as part of the website, so that regardless of what format was uploaded, the website standardises them as part of the upload process?

No idea … but not impossible. Do Wappalyzer or similar tools tell you anything about the web platforms? I mean, if it's all Nginx and open software, you may be able to replicate the behaviour. If it's some kind of common web service provider, on the other hand …

 
Posted : 30/09/2019 3:59 pm