In this blog post, we:

- Perform a case study on using Transformer models to solve cyber security problems
- Train a Transformer model to detect malicious URLs under multiple training regimes
- Compare our model against other deep learning methods, and show it performs on par with other top-scoring models
- Identify issues with applying generative pre-training to malicious URL detection, which is a cornerstone of Transformer training in natural language processing (NLP) tasks
- Introduce a novel loss function that balances classification and generative loss to achieve improved performance on the malicious URL detection task

Over the past three years, Transformer machine learning (ML) models, or "Transformers" for short, have yielded impressive breakthroughs in a variety of sequence modeling problems, particularly in natural language processing (NLP). For example, OpenAI's latest GPT-3 model is capable of generating long segments of grammatically correct prose from scratch. Spinoff models, such as those developed for question answering, are capable of correlating context over multiple sentences. AI Dungeon, a single- and multiplayer text adventure game, uses Transformers to generate plausible, effectively unlimited content in a variety of fantasy settings.

Transformers' NLP modeling capabilities are so powerful that they pose security risks in their own right, in terms of their potential to spread disinformation; yet on the other side of the coin, they can be used as powerful tools to detect and mitigate disinformation campaigns. For example, in previous research by the FireEye Data Science team, an NLP Transformer was fine-tuned to detect disinformation on social media sites.

Given the power of these Transformer models, it seems natural to wonder whether we can apply them to other types of cyber security problems that do not necessarily involve natural language per se. In this blog post, we discuss a case study in which we apply Transformers to malicious URL detection. Studying Transformer performance on the URL detection problem is a logical first step toward extending Transformers to more generic cyber security tasks, since URLs are not technically natural language sequences but share some common characteristics with natural language.

In the following sections, we outline a typical Transformer architecture and discuss how we adapt it to URLs with a character-focused tokenization. We then discuss the loss functions we employ to guide the training of the model, and finally compare our training approaches to more conventional ML-based modeling options.

Our URL Transformer operates at the character level, where each character in the URL corresponds to an input token. When a URL is input to our Transformer, it is appended with special tokens: a classification token ("CLS") that conditions the model to produce a prediction, and padding tokens ("PAD") that normalize the input to a fixed length to allow for parallel training. Each token in the input string is then projected into a character embedding space, followed by a stack of Attention and Feed-Forward Neural Network (FFNN) layers. This stack of layers is similar to the architecture introduced in the original Transformers paper. At a high level, the Attention layers allow each input to be associated with long-distance context from other characters that are important for the classification task, similar to the notion of attention in humans, while the FFNN layers provide capacity for learning the relationships among the combinations of inputs and their respective contexts.

Additionally, the URL Transformer employs a masking strategy in its Attention calculation that enforces a left-to-right (L-R) dependence. This means that only input characters to the left of a given character influence that character's representation in each layer of the attention stack. An illustration of our architecture is shown in Figure 1. The network outputs one embedding for each input character, which captures all information learned by the model about the character sequence up to that point in the input.

Figure 1: High-level overview of the URL Transformer architecture

Loss Functions and Training Regimes

Once the model is trained, we can use the URL Transformer to perform several different tasks, such as generatively predicting the next character in the input sequence by using the sequence embedding as input to another neural network with a softmax output over the possible vocabulary of characters. A specific example of this is shown in Figure 1, where we take the embedding of the input "firee" and use it to predict the next most likely character, "y." Similarly, we can use the embedding produced after the classification token to predict other properties of the input sequence, such as its likelihood of maliciousness.