OpenForensics: A Digital Forensics GPU Pattern Matching Approach For The 21st Century

Ethan Bayne discusses his research at DFRWS EU 2018.

Ethan: Hello, everyone. I understand I am the last talk separating you from coffee, so I’ll try to keep it brief.

So, my name is Ethan Bayne, and I come to you today from Abertay University. Thank you very much for that introduction. For anyone that doesn’t know, that’s in Scotland. So, not sunny like it is here.

Before I start, I’m just going to say a little bit about myself. I’m a lecturer at the university, but before I became a lecturer, I actually did my PhD in what I’m talking to you about today – OpenForensics. If this talk does interest you and you want to see exactly how the research has progressed to this point, certainly have a look at my thesis, and you’ll find all the answers in there.

Before I started my PhD, I came from a background of ethical hacking and computer security. But before I start introducing OpenForensics, I just want to go over the context that brought about this research. More and more modern crimes are now being committed with some sort of electronic device attached to them. That’s increased the number of digital forensic cases considerably. Alongside this, we’ve also had a ginormous increase in the capacity of storage devices that consumers can go out and buy. For example, today, anyone can just go online and buy an 8TB drive. That’s scary, the amount of data that someone can hold in a single hard drive.



Of course, these two points reflect the current state of digital forensics today. There was a report from HMIC that stated that delays of 12 months or more were not uncommon when they reviewed the case files from six police forces in the UK. That introduces a problem: there’s a very high demand for digital forensic investigation, and it’s calling for better ways of doing forensic analysis and processing all the data that is gathered from digital forensic investigations.

And there have been many suggestions on how to do this. Before this research, looking into GPUs and the Aho-Corasick algorithm, we had solutions such as Scalpel that produced wonderful results when utilizing GPUs for digital forensics, particularly for physical file carving. But aside from that, people have suggested moving processing to clusters, so grouping computers together to share that burden of processing, or uploading it to processing servers in the cloud. Even the solution presented upstairs just now is a fantastic and intuitive approach to trying to solve this problem.

But this research looks at the physical file carving problem in particular, and at how digital forensic investigators can accelerate it on their own workstations, without the need for any specialist equipment. So, the processes and the methodology that I present here today have been applied to file carving itself, but there may be other problems that this could be applied to.

The aim that I’ve set out with OpenForensics is to improve the performance of file carving of forensic data. First of all, to investigate the application of the parallel failureless Aho-Corasick algorithm to digital forensics for performing pattern matching, but also to investigate the application of GPUs to the problem once again. The third, more of a lesser point, is actually highlighting the way the data is read from storage devices.

The pattern matching problem that we face is that pattern matching, in any sort of digital forensics context, is one of the most computationally intensive tasks that we can run on data. The typical rule of thumb is that the more data the analyst has, the more time is required to perform any sort of analysis on it. Alongside this, pattern-matching algorithms are constantly evolving. As we come into new tech, better tech, more intuitive ways of processing data, pattern-matching algorithms are evolving to make the best use of the hardware that’s available now. To aid the computation on modern workstations, we have to make the best of these advancements in the algorithms.

Before I start outlining the algorithm that OpenForensics uses, I’m just going to explain briefly the algorithm utilized in Foremost. Foremost uses a modified Boyer-Moore algorithm. The approach used by Foremost is able to search for multiple patterns that might occur in data with a single read, and then it conducts a second pass to actually reproduce the physical files from the data. Acceleration of reading the data is gained through the use of a skip table, quite a pivotal part of the Boyer-Moore algorithm. The whole point of a skip table is, for the example up there, if I was searching for ABCDAD: if, for example, in step one I was looking for a D and I come across an F, the skip table would advise me that there’s no possible match in the next four characters, so skip forward five characters.
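
To make the skip-table idea concrete, here is a minimal sketch of a Boyer-Moore-Horspool-style bad-character table in Python. This illustrates the general technique rather than Foremost’s actual implementation, and the pattern is just the example from the slide.

```python
def build_skip_table(pattern: bytes) -> dict:
    """Bad-character skip table: how far the search window can jump,
    based on the byte aligned with the end of the window."""
    m = len(pattern)
    # A byte not present in the pattern lets us skip the full pattern length.
    skip = {b: m for b in range(256)}
    # Bytes inside the pattern (except the last position) allow smaller skips.
    for i, b in enumerate(pattern[:-1]):
        skip[b] = m - 1 - i
    return skip

def horspool_search(data: bytes, pattern: bytes):
    """Yield every offset in `data` where `pattern` occurs."""
    m, n = len(pattern), len(data)
    skip = build_skip_table(pattern)
    i = 0
    while i <= n - m:
        if data[i:i + m] == pattern:
            yield i
        # Jump ahead based on the byte under the window's last position.
        i += skip[data[i + m - 1]]

# Example with the pattern from the slide.
print(list(horspool_search(b"xxABCDADyyABCDAD", b"ABCDAD")))  # [2, 10]
```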

Of course, this approach is very quick, especially if you’re only searching for one search pattern with a single thread. But the algorithm itself can be very hard to parallelize, so as to make use of modern, highly parallelized hardware. You can split the data into several processing threads, but by doing so you risk introducing other factors, such as delays with traditional mechanical hard drives that will take longer to seek to separate areas of the partition.

The algorithm that I proposed in OpenForensics uses the parallel failureless Aho-Corasick algorithm. I’m just going to call it PFAC for short from now on, because it’s quite a mouthful. Like the Boyer-Moore algorithm, it’s able to search for multiple patterns with a single pass over the data. It relies on a tree topology state machine to search through the data, so you can see the example there: if I was searching for team, telephone and elephant, that’s what the tree it creates would look like.
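
As an illustration of that tree topology (a sketch of the technique, not the OpenForensics source itself), a failureless Aho-Corasick trie for those three example words could be built along these lines in Python:

```python
def build_pfac_trie(patterns):
    """Build a failureless Aho-Corasick trie: a transition table plus the
    pattern (if any) that completes at each state.  No failure links are
    needed because every starting offset gets its own search thread."""
    transitions = [{}]          # transitions[state][char] -> next state
    output = [None]             # output[state] -> completed pattern or None
    for pat in patterns:
        state = 0               # state 0 is the root ("state one" in the talk)
        for ch in pat:
            if ch not in transitions[state]:
                transitions.append({})
                output.append(None)
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        output[state] = pat     # mark this state as matching `pat`
    return transitions, output

# The three example words from the slide.
trie, out = build_pfac_trie(["team", "telephone", "elephant"])
```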

This tree topology utilizes states, which you can see in the circles there. Each byte of the data that we input is started in its own unique thread, starting from state one, before the T and the E in the example. This might seem massively unoptimized, but it’s not – it’s incredibly fast when you throw this algorithm onto massively parallel processors. If you imagine it, a thread starts looking for any one of the three words up there, and if it doesn’t find the first letter – so if it doesn’t find a T or an E – it will immediately drop that thread, only using a few processing cycles, which is very small in the greater scheme of things.
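
A rough CPU-side simulation of what each PFAC “thread” does with that trie might look like the following; on a GPU, each starting offset would be a hardware thread rather than a loop iteration. Again, this is a sketch of the technique, not the tool’s actual kernel.

```python
def pfac_match(data, transitions, output):
    """Simulate the PFAC search: conceptually, one thread per starting byte.
    Each 'thread' walks the trie from its own offset and gives up as soon
    as the next byte has no transition (the 'failureless' part)."""
    matches = []
    for start in range(len(data)):          # on a GPU, each start is a thread
        state = 0
        for ch in data[start:]:
            state = transitions[state].get(ch)
            if state is None:               # mismatch: drop this thread cheaply
                break
            if output[state] is not None:   # reached a pattern-ending state
                matches.append((start, output[state]))
                break
    return matches

# With the trie from the previous sketch:
#   pfac_match("the telephone team saw an elephant", trie, out)
#   -> [(4, 'telephone'), (14, 'team'), (26, 'elephant')]
```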

Alongside the parallel failureless Aho-Corasick algorithm, I also looked into how to actually make it scalable with modern hardware. Because, after all, workstation machines can vary, from the number of cores the CPU might have available to how many GPUs are available on the target machine. So, OpenForensics actually proposes using a thread for each logical core in the CPU. And the threads employed are self-managed – each thread that’s created can individually request a new block of data when it’s free, when it’s doing nothing.

Worker threads themselves can concurrently work with the PFAC algorithm to perform pattern matching, with or without the GPU’s assistance. The worker threads can also concurrently process the results once they’ve been delivered back – either the results that they’ve generated themselves or the results that they grab back from the GPU. But more importantly, each worker thread is not reliant on the data that other threads may have. They’re all self-sufficient.

So, the scaling works quite nicely between CPUs and GPUs. It’s rather simple, in fact. If there were a Core i7 workstation with two GPUs in it, it would simply divide the worker threads between the two GPUs. So, the processing method works something like this just now, in its current state.
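
A simplified sketch of that self-scheduling worker model is below; the `read_block` and `process_block` callables are hypothetical placeholders standing in for the real reading and matching code, not OpenForensics’ actual API.

```python
import os
import queue
import threading

def run_workers(read_block, process_block, gpu_count=0):
    """One worker thread per logical CPU core; each worker pulls the next
    data block whenever it is idle, so no thread waits on another."""
    blocks = queue.Queue()
    for block in read_block():            # e.g. yields (offset, 100 MB chunk)
        blocks.put(block)

    def worker(worker_id):
        # If GPUs are present, spread the workers across them round-robin.
        gpu = worker_id % gpu_count if gpu_count else None
        while True:
            try:
                block = blocks.get_nowait()
            except queue.Empty:
                return                    # nothing left to do
            process_block(block, gpu)     # pattern match on CPU or via `gpu`

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(os.cpu_count())]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```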

The only difference between the CPU and the GPU processing is the actual pattern matching part. There are plans to expand upon the processing that’s done by the GPU; however, at this current stage, it’s more of a proof of concept, to see whether this delivers significant speed-ups.

You can see there that if there are no patterns found within the 100MB block of data, the thread just terminates there. It doesn’t do anything else with the data. However, if there are headers or footers found for the files that it’s looking for, it’ll actually pair them up and record them in the database. After all the data has been processed, it’ll display the results back to the user.
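
For illustration only, a naive version of that header/footer pairing step could look like the following; the offsets and JPEG-style signatures are made-up examples, and the tool’s actual pairing logic may differ.

```python
def pair_headers_footers(matches, header, footer):
    """Pair each header with the next footer of the same file type.
    `matches` is a list of (offset, pattern) tuples; this naive pairing
    deliberately ignores nested or fragmented files."""
    results = []
    open_header = None
    for offset, pattern in sorted(matches):
        if pattern == header:
            open_header = offset
        elif pattern == footer and open_header is not None:
            results.append((open_header, offset + len(footer)))
            open_header = None
    return results

# Hypothetical JPEG-style example: header FF D8 FF, footer FF D9.
hits = [(100, b"\xff\xd8\xff"), (900, b"\xff\xd9"),
        (2048, b"\xff\xd8\xff"), (3000, b"\xff\xd9")]
print(pair_headers_footers(hits, b"\xff\xd8\xff", b"\xff\xd9"))
# [(100, 902), (2048, 3002)]
```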

Alongside the processing angle, we also briefly considered the reading process. Now that we live in an era where there are very fast SSDs and NVMe drives that are capable of delivering gigabits per second, the way that we actually read the data is becoming more and more important. We actually investigated how storage device benchmarking tools work, and based our reading methods on some of their techniques, by utilizing threaded, queued reads for better performance.

This of course resulted in much faster throughput from the devices that we tested, compared to just using standard library functions. The figure down there explains how we actually separated out the data: a stream buffer size of four kilobytes fills a read instruction of 1MB, and these 1MB read instructions are queued to a maximum of 32, so only 32 read instructions can be in flight at once, but it takes 100 of them to fill a 100MB data block.
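
As a rough illustration of that queued reading scheme (ignoring the 4KB stream-buffer detail), the following Python sketch issues up to 32 concurrent 1MB reads to fill a 100MB block. `os.pread` is Unix-specific, and the device path in the usage comment is only an example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

READ_SIZE   = 1 * 1024 * 1024      # each read instruction covers 1 MB
QUEUE_DEPTH = 32                   # at most 32 reads in flight at once
BLOCK_SIZE  = 100 * 1024 * 1024    # one 100 MB processing block = 100 reads

def read_block(fd, block_start):
    """Fill one 100 MB processing block with up to 32 concurrent 1 MB reads.
    os.pread lets each worker read at its own offset without a shared seek."""
    offsets = range(block_start, block_start + BLOCK_SIZE, READ_SIZE)
    with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        chunks = pool.map(lambda off: os.pread(fd, READ_SIZE, off), offsets)
        return b"".join(chunks)

# Usage sketch (device path is only an example):
# fd = os.open("/dev/sdb", os.O_RDONLY)
# block = read_block(fd, 0)
```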

On to the meat and bones of this research. We performed case studies that measured the time taken to perform pattern matching on data from two sources. The first one was a 20GB external hard drive image – a DD image that was stored on the analyst’s SSD, on the machine. The second case study that we performed was of a physical 256GB NVMe device that was hooked up internally on the test machine.

Before I go on to the three tools: the external hard drive image had been purposely created for the study, so we actually knew the ground truth of what files were stored on the image. With the physical device, it’s more of a realistic case study – we don’t actually know exactly what’s stored on there, but the devices used were at least 80% full, so they presented quite a realistic case study, as it were. The three tools that were used were, of course, my own tool; Foremost, due to its open-source nature; and we also tried out RecoverMyFiles, which is a commercial option.

For the case studies, we searched for a small set of 10 patterns, so 10 unique headers and footers, and a larger set of 40 patterns, so 40 unique header and footer combinations, to determine how these programs handle the size of the search – how many targets are specified. So, before I show you the results, I just want to acknowledge the limitations of this case study. OpenForensics in its current state is very, very simple. It doesn’t perform any sort of advanced file checks; it simply searches for the headers and footers of the files being searched for, pairs them up and shows the results. Foremost and RecoverMyFiles, on the other hand, employ far more complex file integrity checks.
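
For a sense of what such a target list looks like, here is a hypothetical header/footer set using well-known file signatures; the actual 10- and 40-pattern sets used in the study are in the paper and in the configuration file in the GitHub repository.

```python
# A hypothetical header/footer target list, in the spirit of Foremost's
# configuration file; not the actual sets used in the study.
SEARCH_TARGETS = {
    "jpg": {"header": b"\xff\xd8\xff", "footer": b"\xff\xd9"},
    "png": {"header": b"\x89PNG\r\n\x1a\n", "footer": b"IEND\xaeB`\x82"},
    "gif": {"header": b"GIF8", "footer": b"\x00\x3b"},
    "pdf": {"header": b"%PDF", "footer": b"%%EOF"},
}

# Each file type contributes one header and one footer pattern, so the four
# types here give eight patterns; the case studies used 10 and 40.
patterns = [p for t in SEARCH_TARGETS.values() for p in (t["header"], t["footer"])]
```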

The other limitation to this study is, of course, that RecoverMyFiles is a commercial bit of software, so as a researcher I’m not really able at this moment to verify the search methods that it employs. The other thing is that I wasn’t able to specify exact headers and footers for RecoverMyFiles. So, instead, for the smaller pattern test I defined four different file types, and for the larger pattern test, 19 different file types. Now, as a disclaimer, you can’t really directly compare the results of RecoverMyFiles to the other two because of these unknowns, but it does serve as an indication of how optimized the searching is.

When performing pattern matching on the 20GB DD image of an external hard drive, the graph to the left is of a workstation PC, and you can see some patterns emerging there. Foremost, expectedly, performed the worst in these tests, with RecoverMyFiles actually performing significantly better. OpenForensics, on the other hand, produced the best results, probably because of the simplicity of the search, it has to be said. But looking at the results, there’s no difference in the time required to search for 10 patterns and 40 patterns.

To the right are the results gained from our laptop – an [inaudible] 15, if you’re technically minded. The same sort of patterns emerge between the two devices used in this case study. Likewise, when investigating a physical drive, the same patterns are produced once again, with the same sort of results. Interestingly, OpenForensics also doesn’t show any deterioration in the time required to search for files on the physical device.

But it’s when we actually perform an analysis of the throughput that we see what we’re really interested in. We can see that OpenForensics makes use of the full data transfer capabilities of most of the devices, whereas both Foremost and RecoverMyFiles produce much lower throughput.

On to the significant findings. The study shows that the PFAC algorithm with GPU acceleration presents some really noticeable improvements compared to the CPU methods employed by Foremost and RecoverMyFiles. That’s hardly surprising, considering the optimized algorithm and the fact that we’re throwing a massively parallel processor at the job. But the implementation actually showed little or no variation, in some cases, between the time required to search for 40 and 10 search patterns. And while storage devices do admittedly limit the potential throughput, there is a clear need to integrate some smarter searching into current forensic tools, so that they can leverage the computational power that might be available on the analyst’s machine.

Where to next? The source code for OpenForensics is up and available on GitHub, so you can browse it yourself. But more than that, if you want to try it out right away, there are Windows binaries up there in the repository, so I do encourage you, if you’re interested, to download it – trust me, no malware – and give it a shot. You might be surprised.

The good thing about it is that it’s been programmed with OpenCL as the runtime, so even if you don’t have a discrete graphics card, you should be able to run it on your laptop’s integrated graphics processor. So, do try it out.
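
As a quick way to check what OpenCL devices a machine exposes, here is a short sketch using the third-party pyopencl binding; this is just a convenient illustration and not anything from OpenForensics itself.

```python
import pyopencl as cl  # third-party OpenCL binding: pip install pyopencl

# List every OpenCL platform and device on this machine.  Integrated GPUs
# typically show up here alongside discrete cards, which is why an OpenCL
# tool can run without a dedicated graphics card.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(f"{platform.name}: {device.name} "
              f"({cl.device_type.to_string(device.type)})")
```
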
Moving on from here, I’m going to be doing some more algorithm optimization, including working on [inaudible] PFAC and compression techniques.

Also, as mentioned, I’m looking into possibly integrating some GPU file verification processing techniques. We also have a project on the go just now, trying to implement some of the techniques of OpenForensics into Foremost. And I’m very much open to any collaborations, so if you’re a software developer or a researcher wanting to integrate some of these methods into your own software, do get in touch with me.

Thanks for your time.

Host: Thanks, Ethan. Any questions?

Audience member: [Mike Park] here from Magnet Forensics. I’m curious – all your pattern matching, was it just static patterns, or were you using regular expressions at all in your matching? And have you looked at trying to run thousands of patterns, as opposed to 10 or 40? Because, when it comes to what we do, 40 is still a very small number.

Ethan: No, I acknowledge that. First of all, for the first part of the question, it was just standard headers and footers; most of them were actually sourced from Foremost’s configuration file. What you’ll find up on GitHub is the actual configuration file that was used, so you can peruse it there. Also, in the paper, I specify exactly what we searched for in both of the case studies that I performed.
For the second part, I personally have not tried thousands upon thousands of search targets; however, I do anticipate that you would see the same performance benefits of employing PFAC as well as GPU processing.

Audience member: Okay, thanks.

Ethan: Thank you.

Host: Other questions?

Ethan: I think they’re all desperate to get some coffee. [laughs]

Host: So, it is almost three o’clock. We will be back at 3:30, right? In half an hour. Just a quick reminder – if you want to play in tonight’s rodeo, please bring with you the USB stick that is in your pack, okay? Because the dataset for the rodeo is there. It’s encrypted, so if you’re able to crack the zip file, you will have a little advantage. I know that a lot of people are trying to do it. I will explain the rules to play tonight, but you can start creating your teams – the minimum is one person, the maximum is four. So, you can start building. We will have a dedicated presentation during the dinner. So, just a reminder to bring the USB stick. If you do not have your USB stick, we can provide you with a copy; it’s around 1GB of data. Okay? Thank you.

[applause]

End of transcript
