
Generative AI – Supercharging malware and vulnerability detection


Generative AI, and Large Language Models (LLMs) in particular, is finding important applications across many industries, including cybersecurity. Organizations are using LLMs and other generative AI models effectively to detect malware and software vulnerabilities.

Generative AI can be applied to malware and vulnerability detection in several ways. Organizations use LLMs to generate powerful features for existing machine learning (ML) based detection methods. Another approach is to use LLMs for code analysis, discovering malicious intent or vulnerabilities directly in the code. Finally, generative models capable of producing new data are used to create synthetic datasets that make detection models more robust against adversarial attacks. We will delve into each of these scenarios in the remainder of the article.

Traditional detection methods

State-of-the-art methods to detect malware and software vulnerabilities rely primarily on static and dynamic code analysis. In static analysis, analysts examine code without executing it; by analyzing its syntactic structure, data flow, and control flow, they can discover patterns or instructions indicative of malicious intent. In dynamic analysis, the code is executed in a controlled environment and its behavior is monitored for malicious actions.
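To make the static-analysis idea concrete, here is a minimal sketch (not a production detector) that walks a Python abstract syntax tree and flags calls commonly abused in malicious scripts. Real static analyzers layer data-flow and control-flow analysis on top of pattern matching like this:

```python
# Minimal illustration of static analysis: walk a Python AST and flag
# calls that are common in malicious scripts, without executing anything.
import ast

SUSPICIOUS_CALLS = {"eval", "exec", "compile", "system", "popen"}

def flag_suspicious(source: str) -> list[str]:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # Handle both bare names (eval(...)) and attributes (os.system(...)).
            name = getattr(node.func, "id", getattr(node.func, "attr", None))
            if name in SUSPICIOUS_CALLS:
                findings.append(f"line {node.lineno}: call to {name}")
    return findings

print(flag_suspicious("import os\nos.system('rm -rf /tmp/x')\n"))
# -> ['line 2: call to system']
```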

Both methods suffer from low recall (they fail to detect all malicious code) and are easily evaded by skilled adversaries.

Large language model (LLM) based feature engineering

Because of these limitations of static and dynamic analysis, researchers turned to machine learning based detection methods for malware and vulnerability detection. Initial attempts relied on hand-crafted features, which proved time-consuming to build and less effective. LLMs offer a powerful alternative due to their unique capabilities:

  • LLMs excel at understanding the context and relationships between different parts of source code. This allows them to automatically extract meaningful features that manual feature engineering might miss.
  • LLMs can process various data types, including source code, assembly code, and even behavioral logs, such as API call sequences or system call logs. This makes them versatile tools for feature engineering in different cybersecurity tasks.

Smaller LLMs, like BERT and its variants, are proving very useful for engineering features for ML models designed for malware and vulnerability detection. Some specific examples are listed below, followed by a minimal sketch of the underlying embedding-as-features pattern:

  • BERTroid: This Android malware detection model uses BERT to generate embeddings from sequences of permissions requested by applications. These embeddings are then used to train a classifier that can distinguish between malicious and benign apps. BERTroid has shown promising results in detecting and classifying Android malware [1].
  • BERT-Cuckoo15: This model leverages BERT to analyze relationships between 15 different feature types derived from dynamic analysis of malware samples in the Cuckoo sandbox. By capturing the complex interdependencies between these features, BERT-Cuckoo15 achieves high accuracy in malware detection [2].
  • VulDeBERT: This vulnerability detection tool uses BERT to analyze C and C++ source code and identify vulnerable code patterns. It extracts abstract code fragments and uses BERT to learn representations that can effectively detect vulnerabilities [3].
  • XGV-BERT: This framework combines BERT with a graph convolutional network (GCN) to detect software vulnerabilities. It leverages both code semantics, captured by BERT, and graph structure, captured by the GCN, to achieve higher accuracy in vulnerability detection [4].
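The common pattern behind these models can be sketched in a few lines. The following is a minimal, illustrative example rather than the published implementation of any system above: it uses the Hugging Face transformers library to turn permission-like strings into BERT embeddings, then trains an ordinary scikit-learn classifier on them.

```python
# Sketch of LLM-based feature engineering (the general pattern behind models
# like BERTroid): encode each sample's text (e.g. a permission sequence) with
# a pretrained BERT, then train a standard classifier on the embeddings.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one [CLS] embedding per input string."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0, :].numpy()  # [CLS] token as the sample's feature vector

# Toy permission sequences; real training data would be labeled APK metadata.
samples = ["INTERNET SEND_SMS READ_CONTACTS", "INTERNET ACCESS_NETWORK_STATE"]
labels = [1, 0]  # 1 = malicious, 0 = benign
clf = LogisticRegression().fit(embed(samples), labels)
```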

These smaller LLMs offer reduced computational cost, faster inference, in-house hosting for IP protection, and the ability to fine-tune on custom data for better performance.

Code analysis using large language models

LLMs are trained on data containing code samples in many programming languages, so they are inherently capable of analyzing code snippets and discovering hidden functionality, malicious intent, and potential attack vectors. One limitation, however, is the LLM's input token limit: cybersecurity analysts must fragment large code bases into snippets before sending them to the model, which can cause the overall functionality and behavior of the code to be missed. This is particularly true when analyzing binary files.
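A simple way to work within the token limit is to split the code by token count before analysis. The sketch below uses the tiktoken tokenizer; the file name and prompt wording are illustrative assumptions, and the resulting prompts would be sent to whichever LLM API you use:

```python
# Sketch of working around an LLM's context window: split a large source
# file into token-bounded chunks before analysis.
import tiktoken

def chunk_by_tokens(code: str, max_tokens: int = 4000) -> list[str]:
    """Split source code into pieces that fit within the model's input limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(code)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def build_prompt(chunk: str) -> str:
    # Hypothetical prompt; send this to whatever LLM API you use.
    return ("Analyze the following code fragment for malicious behavior "
            "or vulnerabilities and explain your reasoning:\n\n" + chunk)

source = open("suspect.py").read()  # illustrative: the code under analysis
prompts = [build_prompt(c) for c in chunk_by_tokens(source)]
```

Note that per-chunk analysis is exactly where the limitation above bites: behavior that spans chunk boundaries can be missed.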

A binary must first be disassembled into its corresponding assembly code, which is usually a very large listing. The assembly can then be decompiled to reconstruct the original source code, although the result is often only approximate. Manually reverse engineering malware has long been one of the successful approaches to detection, but it requires deep expertise and is very time-consuming.
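The disassembly step itself is straightforward to automate. Here is a minimal sketch using the Capstone disassembly library on a tiny hard-coded x86-64 stub; in practice you would extract the code sections from the binary first and likely chunk the resulting listing as described above:

```python
# Sketch of the first step in automated reverse engineering: disassemble raw
# machine code into assembly text that an LLM can read.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

# Tiny x86-64 stub: push rbp; mov rbp, rsp; xor eax, eax; pop rbp; ret
code = b"\x55\x48\x89\xe5\x31\xc0\x5d\xc3"

md = Cs(CS_ARCH_X86, CS_MODE_64)
asm_lines = [f"0x{insn.address:x}: {insn.mnemonic} {insn.op_str}"
             for insn in md.disasm(code, 0x1000)]
asm_text = "\n".join(asm_lines)  # feed this (chunked if needed) to the LLM
print(asm_text)
```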

Thanks to recent advancements in LLM training, which extend input limits beyond a million tokens and incorporate massive code bases, including assembly language, into the training data, it is now possible to scale and automate reverse engineering and detect malware from binaries in minutes instead of days or weeks [5].

Preventing adversarial attacks using synthetic datasets

Another use of generative AI in malware and vulnerability detection is making existing models more robust. Models such as Generative Adversarial Networks (GANs), and even LLMs, can generate adversarial examples to test the robustness of malware detection models. By slightly modifying existing malware samples, these models create variations that can evade detection, helping researchers identify weaknesses in existing models and develop more resilient detection systems [6].
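As a toy illustration of the idea (not MalGAN or any published system), the sketch below trains a detector on synthetic binary features, then greedily turns on extra benign-looking features of a flagged sample, keeping each change that lowers the detector's malicious score. This add-only constraint mirrors real evasion attacks, which must preserve the malware's functionality:

```python
# Toy sketch of adversarial-example generation against a malware classifier,
# greatly simplified from GAN-based approaches. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))  # 30 binary features, e.g. API-call flags
# Toy ground truth: features 0-4 look malicious, features 5-9 look benign.
y = ((X[:, :5].sum(axis=1) - X[:, 5:10].sum(axis=1)) > 1).astype(int)
detector = RandomForestClassifier(random_state=0).fit(X, y)

sample = X[y == 1][0].copy()  # one sample the detector should flag
score = detector.predict_proba([sample])[0, 1]
for f in range(5, 30):  # try turning ON each feature (never remove any)
    trial = sample.copy()
    trial[f] = 1
    new_score = detector.predict_proba([trial])[0, 1]
    if new_score < score:  # keep changes that make the sample look more benign
        sample, score = trial, new_score
print(f"malicious score after perturbation search: {score:.2f}")
```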

What’s next

The rise of generative AI has given cybersecurity professionals very powerful tools to combat cyber threats. LLMs now range in size from a few hundred million to a few hundred billion parameters, so one can select the model appropriate for the use case, whether that is feature engineering for a deep learning detection model, reverse engineering a malware binary, or making a detection model more robust. Inevitably, adversaries will use the same tools to create new malware and exploit undetected vulnerabilities. It is therefore crucial that cybersecurity professionals stay ahead of the curve and incorporate generative AI based tools into their practice to fight cyberattacks.

Join us @ RSA 2025, where my fellow data scientist, Nakkul Khuraana, and I will be speaking about 'How To Use LLMs to Augment Threat Alerts with the MITRE Framework.'

Sources

  1. Detecting Android Malware: From Neural Embeddings to Hands-On Validation with BERTroid
  2. BERT-Cuckoo15: A Comprehensive Framework for Malware Detection using 15 Dynamic Feature Types
  3. VulDeBERT: A Vulnerability Detection System Using BERT
  4. XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection
  5. From Assistant to Analyst: The Power of Gemini 1.5 Pro for Malware Analysis
  6. Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches
