SAME content DIFFERENT hash values - Explanation Plz

9 Posts
7 Users
0 Likes
6,232 Views
(@nerimatrixx)
Posts: 26
Eminent Member
Topic starter
 

I recently found out (through testing & reading) that files with the same content can generate different hashes. I tested this via copy & paste (to ensure the same amount of whitespace) with .txt, .docx, and .pdf files.

https://cloudnine.com/ediscoverydaily/electronic-discovery/dont-be-duped-files-with-different-hash-values-can-still-be-the-same-ediscovery-best-practices/

I was so confident in … the odds of any two dissimilar/different files having the same MD5 hash is one in 2^128 (340 billion billion billion billion); and the odds of any two dissimilar/different files having the same SHA1 hash is one in 2^160.

I was taught that file headers, filename, etc were not calculated in the value, so my question is … 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?

 
Posted : 08/10/2019 10:23 am
Bunnysniper
(@bunnysniper)
Posts: 257
Reputable Member
 

Sorry, you did not understand the article you mentioned. It says that you can have the same text in a txt, docx and PDF file and get different hashes. And that is fine and true, because all these files have different headers, structures and meta information inside.

Think of a liter of water: you can put it into a glass, a bottle or a bath. It will always be 1 liter of water, but it will look very different.

But every time you hash a file and do not change it, you will always get the same hash. Otherwise hashing would be nonsense. What you may be thinking of is a hash collision, but that occurs when different files generate the same hash. This is different from what you were writing about.
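
To make the point concrete, here is a minimal Python sketch (not from the article). It hashes the same sentence once as raw text bytes and once wrapped in a ZIP container. The ZIP is only a hypothetical stand-in for a format like docx, which is ZIP-based; the entry name `document.txt` and the fixed timestamp are assumptions chosen for illustration.

```python
import hashlib
import io
import zipfile

TEXT = "the quick brown fox jumps over the lazy dog"


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as hex."""
    return hashlib.md5(data).hexdigest()


def as_plain_text(text: str) -> bytes:
    """The 'txt' form: nothing but the encoded characters."""
    return text.encode("utf-8")


def as_zip_container(text: str) -> bytes:
    """A stand-in for a container format: the same text inside a ZIP.

    The entry name and timestamp are illustrative assumptions.
    """
    buf = io.BytesIO()
    info = zipfile.ZipInfo("document.txt", date_time=(2019, 10, 8, 0, 0, 0))
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, text)  # stored uncompressed by default
    return buf.getvalue()


plain = as_plain_text(TEXT)
wrapped = as_zip_container(TEXT)

# Hashing the same bytes again always gives the same digest.
print("same bytes, same hash:     ", md5_hex(plain) == md5_hex(as_plain_text(TEXT)))
# The text is inside the container byte for byte...
print("text present in container: ", plain in wrapped)
# ...but the container has headers and structure around it, so the hash differs.
print("txt hash == container hash:", md5_hex(plain) == md5_hex(wrapped))
```

The hash function sees every byte of the file, so the container's own headers and directory records count just as much as the sentence inside it.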

I would suggest you read some articles and books about forensic basics.

regards, Robin

 
Posted : 08/10/2019 12:17 pm
(@rich2005)
Posts: 535
Honorable Member
 

The article you reference isn't talking about hash collisions.
It's talking about two completely different files having different hashes.
That's what you'd expect.
It doesn't matter if the text is the same.
I could write "the quick brown fox jumps over the lazy dog" on a piece of paper and then get someone else to do the same.
The TEXT would be the same but the document is different.
So, using the example from your link, creating a PDF from a Word Document is generating a completely different document.
It therefore would have a different hash.
The link you're referring to isn't wrong and I suspect neither is the material you're referring to. I've not seen DFIR/lecturers teaching nonsense (although that's possible obviously).

Hash collisions are something completely different.

 
Posted : 08/10/2019 3:03 pm
(@athulin)
Posts: 1156
Noble Member
 

I was taught that file headers, filename, etc were not calculated in the value, so my question is … 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?

If you were indeed taught that as a general truth (and not something that was true only in special cases, such as for EnCase hashing EnCase image files, etc.), you have teachers who don't understand what they are teaching. If that really is the case, it must be addressed.

But I would suggest starting at the other end. While it is difficult to assume that you are wrong, it is a useful approach, as sooner or later you will have to convince someone that you are not committing a similar error towards them.

Your question suggests that your understanding of what 'file content' means may not be entirely in line with what is meant when the term hashing is used. That may be a good place to start: how do standard hashing tools, and more particularly the one you used for your tests, work in detail?

Once you know how your hashing tool works, check the file content (at the same level as the hashing tool works on) of the files you've been doing your tests with.
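
As one way of looking at content "at the same level as the hashing tool", the sketch below (an illustration, not a description of any particular forensic tool) hashes the same visible text stored with different encodings and line endings. The sample string and the chosen variants are assumptions.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as hex."""
    return hashlib.md5(data).hexdigest()


text = "Same visible text\n"

# Four ways an editor might store text that looks identical on screen.
variants = {
    "utf-8, LF": text.encode("utf-8"),
    "utf-8, CRLF": text.replace("\n", "\r\n").encode("utf-8"),
    "utf-8 with BOM": text.encode("utf-8-sig"),
    "utf-16": text.encode("utf-16"),
}

for label, data in variants.items():
    # Show the size, the first few raw bytes, and the resulting hash.
    print(f"{label:15} {len(data):3d} bytes  first bytes={data[:6].hex()}  md5={md5_hex(data)}")
```

Each variant displays the same characters, but the bytes differ, and a hashing tool works on the bytes.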

 
Posted : 08/10/2019 3:19 pm
benfindlay
(@benfindlay)
Posts: 142
Estimable Member
 

I was so confident in … the odds of any two dissimilar/different files having the same MD5 hash is one in 2^128 (340 billion billion billion billion); and the odds of any two dissimilar/different files having the same SHA1 hash is one in 2^160.

No, the odds of an MD5 collision for 2 different files are, I believe, 1 in 2^64 and not 1 in 2^128, but that is still astronomically unlikely. This is because odds of collision and total number of combinations are NOT the same thing.

I was taught that file headers, filename, etc were not calculated in the value, so my question is … 1. How is this possible and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?

To refer to Brian Carrier's reference model, the only data included in the hash calculation is that which is classified as being in the file's content category. Metadata like the filename and filesystem information like dates and times etc. are not a factor in the hash calculation. File headers (for clarity of terminology, by this we mean file signature/magic number) are, because they are IN the file.

As alluded to by others, Word Docs etc. have other internal data present (like author details) which is not visible in the same manner as the file's textual content. This still classifies as "file content" in Carrier's model, but is perhaps more akin to being termed "embedded" or "internal" metadata.
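
A small Python sketch of that distinction, assuming a throwaway temporary file (the names `report.txt` and `renamed.dat` are made up for the example): renaming the file and changing its timestamps leaves the hash alone, while changing a single byte inside the file does not.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str) -> str:
    """Hash a file's bytes in chunks; the name and timestamps are never read."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "report.txt")
    with open(original, "wb") as f:
        f.write(b"evidence\n")
    before = md5_of_file(original)

    # Change filesystem metadata only: new name, new access/modified times.
    renamed = os.path.join(tmp, "renamed.dat")
    os.rename(original, renamed)
    os.utime(renamed, (1_000_000_000, 1_000_000_000))
    after_metadata_change = md5_of_file(renamed)

    # Change the content: append one byte inside the file.
    with open(renamed, "ab") as f:
        f.write(b" ")
    after_content_change = md5_of_file(renamed)

print("unchanged by rename/timestamps:", before == after_metadata_change)
print("unchanged by one extra byte:   ", before == after_content_change)
```

Embedded metadata such as a Word document's author field behaves like the appended byte here, because it is stored inside the file.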

Hope this helps,

Ben

 
Posted : 08/10/2019 5:01 pm
(@nerimatrixx)
Posts: 26
Eminent Member
Topic starter
 

To refer to Brian Carrier's reference model, the only data included in the hash calculation is that which is classified as being in the file's content category. Metadata like the filename and filesystem information like dates and times etc. are not a factor in the hash calculation. File headers (for clarity of terminology, by this we mean file signature/magic number) are, because they are IN the file.

As alluded to by others, Word Docs etc. have other internal data present (like author details) which is not visible in the same manner as the file's textual content. This still classifies as "file content" in Carrier's model, but is perhaps more akin to being termed "embedded" or "internal" metadata.

Hope this helps,

Ben

Oh, ok. It's the definition of File Content that was not communicated to us properly (a group of us from class had tested it). We were taught that file content is the visible text within the file. Once the file header and embedded data are used to calculate the hash value … of course the hash will be different. I will update my notes with this.

THANKS A MILLION!!!

 
Posted : 08/10/2019 11:38 pm
tracedf
(@tracedf)
Posts: 169
Estimable Member
 

No, the odds of an MD5 collision for 2 different files are, I believe, 1 in 2^64 and not 1 in 2^128, but that is still astronomically unlikely. This is because odds of collision and total number of combinations are NOT the same thing.

The odds of two random files having the same MD5 hash are 1 in 2^128. Similarly, the odds of a file having the same hash as any particular file are 1 in 2^128. Finding some pair of files with the same hash, however, takes only about 2^64 attempts. The difference in the latter circumstance is that if we are trying to find *any* collision rather than a specific one, we don't care which two files match, so we can hash many different files and look for any collision between them. This is referred to as the birthday problem or the birthday paradox.

If you were to ask people their birthday, you would have to ask about 253 people (assuming they all answer you) before you had an even chance of finding someone with your birthday. But you'd only have to ask 23 people before you had an even chance of finding two people with the same birthday. In the first instance, you're matching to a specific birthday and there is only a 1 in 365 chance each time you ask. In the second, you are comparing each birthday to every other birthday and don't care if Person-1 matches Person-2, or Person-2 matches Person-3, or Person-3 matches Person-1, etc.
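
The birthday reasoning can be checked numerically. The sketch below assumes 365 equally likely birthdays and uses the common convention of asking how many people are needed for at least a 50% chance of a match. The last part applies the standard square-root approximation to a 128-bit hash, treating its outputs as uniformly random, which is an idealisation.

```python
import math


def p_shared_birthday(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_match_specific(n: int, days: int = 365) -> float:
    """Probability that at least one of n people has one given birthday."""
    return 1.0 - ((days - 1) / days) ** n


def people_for_even_odds(prob_fn) -> int:
    """Smallest group size for which prob_fn reaches 50%."""
    n = 1
    while prob_fn(n) < 0.5:
        n += 1
    return n


any_pair = people_for_even_odds(p_shared_birthday)
specific = people_for_even_odds(p_match_specific)
print("people needed for any shared birthday:", any_pair)
print("people needed to match one birthday:  ", specific)

# Same idea for an idealised 128-bit hash: roughly sqrt(2 * ln 2 * N)
# random inputs give an even chance of some collision among them.
bits = 128
log2_inputs = bits / 2 + math.log2(math.sqrt(2 * math.log(2)))
print(f"inputs for an even chance of any collision: about 2^{log2_inputs:.1f}")
```

Matching one specific value scales with the number of possible values, while finding any matching pair scales with roughly its square root, which is where the 2^128 versus 2^64 difference comes from.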

 
Posted : 08/10/2019 11:46 pm
benfindlay
(@benfindlay)
Posts: 142
Estimable Member
 

<SNIP>

This is referred to as the birthday problem or the birthday paradox.

If you were to ask people their birthday, you would have to ask about 253 people (assuming they all answer you) before you had an even chance of finding someone with your birthday. But you'd only have to ask 23 people before you had an even chance of finding two people with the same birthday. In the first instance, you're matching to a specific birthday and there is only a 1 in 365 chance each time you ask. In the second, you are comparing each birthday to every other birthday and don't care if Person-1 matches Person-2, or Person-2 matches Person-3, or Person-3 matches Person-1, etc.

Nice, I like this analogy…thanks for sharing!

 
Posted : 09/11/2019 8:39 am
keydet89
(@keydet89)
Posts: 3568
Famed Member
 

I was taught that file headers, filename, etc were not calculated in the value, so my question is … 1. How is this possible

File names are not…headers, being part of the file content, are.

and 2. Why are DFIR materials and Lecturers still teaching the FALSE doctrine?

No accountability.

If it's incorrect, someone who is aware of it needs to say something, but not on social media. Make the change where the change can be made. Otherwise, what's the point?

 
Posted : 10/11/2019 1:31 pm