Using FastAI to Classify Malware using Deep Learning

Juan Cruz Alric Cortabarria
Jun 21, 2021

This is one of my first projects implementing a predictive model using what I’ve learned from watching Jeremy Howard’s fastai course: https://course.fast.ai/.

First of all, I started reading this paper.

Secondly, I looked for a dataset that already contained images generated from malware binaries’ hexadecimal files and found this Dropbox.

All of the heavy lifting was already done, so I could focus all my efforts on the model-creation part.

I started by saving those images in my Google Drive so that later on I could easily access them from a Google Colab instance.

Start by importing all the necessary libraries. We are obviously going to use Fastai.

from fastai.vision.all import *
from utils import *

Connect your Google Drive account.

from google.colab import drive 
drive.mount('/content/drive')

and indicate the path where your malware images are.

path = '/content/drive/MyDrive/malimg_paper_dataset_imgs'

add your base_dir and create a Path object.

root_dir = 'drive/MyDrive/'
base_dir = root_dir + 'malimg_paper_dataset_imgs'
path = Path(base_dir)
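
As a quick sanity check (not strictly necessary; path.ls() is just fastai’s convenience helper for listing a directory), you can confirm that each malware family shows up as its own subfolder:

# Each subfolder should correspond to one malware family
path.ls()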

Now the fun part begins. Create the DataBlock, specifying where to get the files, what to use as labels, how to split off a validation set, which item transform to apply (resizing), and which augmentation transforms to apply to the training batches.

fields = DataBlock(blocks=(ImageBlock, CategoryBlock),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(224),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),
                   batch_tfms=aug_transforms())

Now that we have our DataBlock we can create the dataloaders. We only need to specify the path.

dls = fields.dataloaders(path)
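
As another quick check (standard fastai, nothing fancy), you can print the class names the DataBlock inferred from the folder structure:

# Malware family names picked up from the parent folder names
print(dls.vocab)
print(f'{len(dls.vocab)} classes')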

Show some of the training batches.
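
In fastai this is a one-liner (the max_n and figsize values below are just examples):

# Display a grid of training images together with their family labels
dls.show_batch(max_n=9, figsize=(8, 8))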

Training images

Now we only need to create the learner. We are going to use a CNN (short for convolutional neural network), and we are also going to reduce the computation needed by using half-precision (fp16) floating-point operations.

learn = cnn_learner(dls, resnet50, metrics=accuracy).to_fp16()

Once the learner was created, I wanted to check which learning rate would best suit this data.

lr_min, lr_steep = learn.lr_find()
Learning Rate graph

I decided to use 0.01 and then fitted the model.

learn.fit_one_cycle(5, 0.01)
Validation Accuracy over epochs

We got a pretty good result, but maybe we can push it a bit more.

learn.unfreeze()

Unfreeze the model and fit it again, but now pass a slice of learning rates so that the earlier layers train with a smaller learning rate than the later ones, like the example below.

learn.fit_one_cycle(5, lr_max=slice(1e-5, 3e-3))
Validation accuracy over epochs

WOW, 99% accuracy… I was pretty skeptical and thought that maybe this was caused by overfitting the model… However, when plotting the confusion matrix I was astonished by the results.

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(10,10))
Confusion Matrix
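
If you want to dig a bit deeper than the matrix itself, fastai can also list the most frequently confused class pairs (the min_val threshold below is just an example):

# (actual, predicted, count) tuples for classes confused at least twice
interp.most_confused(min_val=2)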

Conclusions:

  • You can clearly see that the dataset is not perfectly balanced; even so, the model did really well at assigning the correct label across the classes (see the quick class-count sketch below).
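
Here is a minimal sketch of how you could quantify that imbalance, assuming the same one-folder-per-family layout and the fastai helpers already imported above:

from collections import Counter

# Count how many images each malware family contributes
files = get_image_files(path)
class_counts = Counter(parent_label(f) for f in files)
for family, n in class_counts.most_common():
    print(f'{family}: {n}')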

What do you think? What would you recommend I get better at?

Any piece of advice or criticism is always welcome. I need to continue growing and I started this journey so that I can share my progress and insights.

If you liked what you read, please help me by giving this post a clap.
