BUILDING AND EVALUATING SOFTWARE VULNERABILITY
DATASETS
Combining Static Analysis Alerts and Software Metrics to
Automatically Detect Vulnerable C/C++ Functions

Antunes, João Miguel Namorado Clímaco Henggeler

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/98395

Title:	BUILDING AND EVALUATING SOFTWARE VULNERABILITY DATASETS Combining Static Analysis Alerts and Software Metrics to Automatically Detect Vulnerable C/C++ Functions
Other Titles:	BUILDING AND EVALUATING SOFTWARE VULNERABILITY DATASETS Combining Static Analysis Alerts and Software Metrics to Automatically Detect Vulnerable C/C++ Functions
Authors:	Antunes, João Miguel Namorado Clímaco Henggeler
Advisor:	Vieira, Marco Paulo Amorim
Keywords:	Software Security; Vulnerability Detection; Static Code Analysis; Software Metrics; Machine Learning
Issue Date:	10-Nov-2021
Serial title, monograph or event:	BUILDING AND EVALUATING SOFTWARE VULNERABILITY DATASETS Combining Static Analysis Alerts and Software Metrics to Automatically Detect Vulnerable C/C++ Functions
Place of publication or event:	DEI-FCTUC
Abstract:	Software vulnerabilities can have serious consequences when exploited, such as unauthorized access, data breaches, and financial losses. Manually reviewing an entire codebase for weaknesses is cumbersome, time-consuming, and sometimes infeasible depending on a project's size. At the same time, software companies are increasingly pressured to deploy and update their products as quickly as possible. Static analysis tools (SATs) can automatically generate security alerts that highlight potential vulnerabilities in an application's source code, but they produce a large number of false positives (alerts that misidentify code as vulnerable) and are not always reliable enough to detect vulnerabilities.
In this work, we present an automated process capable of collecting new vulnerabilities from the CVE Details website, retrieving affected files, functions, and classes from a project's repository, generating software metrics and security alerts (i.e. potential vulnerabilities), and building robust datasets that can be fed to machine learning algorithms. We put this mechanism into practice by creating datasets of vulnerable code units for five large and widely known C/C++ projects: Mozilla, Linux Kernel, Xen Hypervisor, Apache HTTP Server, and GNU C Library.
Additionally, the vulnerable function dataset is validated by training machine learning models under a wide assortment of parameters, so as to find the classifiers that best label functions as vulnerable, neutral, or belonging to a specific vulnerability category. Results show that both software metrics and security alerts can be used to detect vulnerable functions, with precision, recall, and F-score values as high as 93.7%, 95.1%, and 93.9%, respectively. We also analyzed how the year in which a vulnerability was discovered influences the classifiers' performance. However, it could not be determined whether static data from previous years can be used to detect vulnerable functions in later ones.
Description:	Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology
URI:	https://hdl.handle.net/10316/98395
Rights:	openAccess
Appears in Collections:	UC - Dissertações de Mestrado

Files in This Item:

File	Description	Size	Format
João-Henggeler-Dissertação-2020-2021.pdf		4.72 MB	Adobe PDF

Page view(s):	100 (checked on Apr 16, 2024)
Download(s):	115 (checked on Apr 16, 2024)

This item is licensed under a Creative Commons License