CyBERT: Cybersecurity Claim Classification by Fine-Tuning the BERT Language Model

Kimia Ameri, Michael Hempel, Hamid Sharif, Juan Lopez, Kalyan Perumalla

November 2021

Abstract

We introduce CyBERT, a cybersecurity feature claims classifier based on bidirectional encoder representations from transformers and a key component in our semi-automated cybersecurity vetting for industrial control systems (ICS). To train CyBERT, we created a corpus of labeled sequences from ICS device documentation collected across a wide range of vendors and devices. This corpus provides the foundation for fine-tuning BERT’s language model, including a prediction-guided relabeling process. We propose an approach to obtain optimal hyperparameters, including the learning rate, the number of dense layers, and their configuration, to increase the accuracy of our classifier. Fine-tuning all hyperparameters of the resulting model led to an increase in classification accuracy from 76% obtained with BertForSequenceClassification’s original architecture to 94.4% obtained with CyBERT. Furthermore, we evaluated CyBERT for the impact of randomness in the initialization, training, and data-sampling phases. CyBERT demonstrated a standard deviation of ±0.6% during validation across 100 random seed values. Finally, we also compared the performance of CyBERT to other well-established language models including GPT2, ULMFiT, and ELMo, as well as neural network models such as CNN, LSTM, and BiLSTM. The results showed that CyBERT outperforms these models on the validation accuracy and the F1 score, validating CyBERT’s robustness and accuracy as a cybersecurity feature claims classifier.
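The seed-robustness evaluation mentioned in the abstract (validation accuracy measured across 100 random seed values) could be organized roughly as in the sketch below. This is only an illustration under stated assumptions: `train_and_validate` and `seed_sweep` are hypothetical names, the accuracy values are synthetic placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
import random
import statistics


def train_and_validate(seed: int) -> float:
    """Hypothetical stand-in for one fine-tuning run.

    A real run would seed the framework's random number generators,
    fine-tune the classifier, and return its validation accuracy.
    Here the value is synthetic noise so the sketch runs on its own;
    it is NOT a result from the paper.
    """
    rng = random.Random(seed)
    return min(1.0, max(0.0, rng.gauss(0.9, 0.01)))


def seed_sweep(seeds):
    """Run one training per seed; return (mean, sample std) of the accuracies."""
    accuracies = [train_and_validate(s) for s in seeds]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


mean_acc, std_acc = seed_sweep(range(100))
print(f"validation accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over 100 seeds")
```

Replacing the body of `train_and_validate` with an actual fine-tuning and evaluation routine would leave the aggregation logic unchanged.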

Type

Journal article

Publication

Journal of Cybersecurity and Privacy

Open Access: https://www.mdpi.com/2624-800X/1/4/31.
This article belongs to the Special Issue Machine Learning and Data Analytics for Cyber Security.

Cyber-Physical, Cybersecurity, Energy Grid, AI, ML, NLP, CYVET

Kalyan Perumalla

Kalyan Perumalla is a computer scientist focused on research in supercomputing, quantum computing, and artificial intelligence. He has served as a research staff member, faculty member, and program manager with the U.S. government, national labs, and universities. As a Federal Program Manager in Advanced Scientific Computing Research at the U.S. Dept. of Energy, Office of Science, he managed a $100-million R&D portfolio covering AI, HPC, Quantum, SciDAC, and Basic Computer Science. In his 25 years of R&D leadership, he previously led advanced R&D as a Distinguished Research Staff Member at the Oak Ridge National Laboratory (ORNL), developing scalable software and applications on the world’s largest supercomputers for 17 years, including as a line manager and a founding group leader. He has held senior faculty and adjunct appointments at UTK, GT, and UNL, and was an IAS Fellow at Durham University.
