※ Open Samizdat


Abstraction Layers for LLM Security

Abstract: I introduce the concept of abstraction layers into LLM safety & security. These layers are my attempt to categorize risks according to the perspective from which we view LLMs. I believe this categorization can aid discussions about the nature of risks, their ownership, and the appropriate approach to addressing them.
Existing attempts to systematize LLM security & safety often result in lists enumerating possible risks. These lists are great, but one thing that constantly irks me is that the risks often do not seem to be on the same level of abstraction. For example, a DoS attack on an LLM API and the ability of LLMs to generate disinformation seem conceptually mismatched. A DoS attack makes me think about the underlying network infrastructure, and I would expect network security people to handle it, for example by deploying rate limiting. Disinformation generation, on the other hand, is a societal problem. It would probably be tackled by AI ethics people or even social scientists, and the solution would probably focus on aligning the LLM somehow. The people who own these risks will probably be different, they might think about LLMs differently, and the solutions that mitigate these risks might sit at different levels of abstraction as well.
In this post, I attempt to conceptualize this idea by proposing 4 layers of abstraction for how we can think about LLMs in the context of their risks. This is a work in progress, and the layers are quite fuzzy and overlapping, but I still think they can be a useful tool. Each layer requires different expertise and offers a different perspective on what can be done. The layers also form an intuitive hierarchy regarding how close to the metal they are. Attacks on lower layers can lead to problems on higher layers, and problems on higher layers can sometimes be addressed at lower layers.

Layer 1: LLMs as software

At this layer, we do not care about the text generation capabilities or the architecture of LLMs. Instead, we perceive an LLM as just another software component that comes with all the usual software security considerations. We are abstracting away everything specific to LLMs. We can use existing cybersecurity frameworks, such as the 7 cybersecurity layers, to think about the risks. There is already a ton of pre-existing expertise that can readily be applied.
Examples: DoS attacks, Network security, Supply chain attacks, Code security.
Expertise: Cybersecurity.
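
To make the "just another software component" view concrete, here is a minimal sketch of rate limiting, the DoS mitigation mentioned in the introduction. It is a generic token bucket and knows nothing about LLMs. The class name `TokenBucket` and its parameters are illustrative assumptions, not part of any particular framework.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second.

    Hypothetical helper for illustration; a real deployment would usually rely
    on the rate limiting offered by its API gateway or reverse proxy.
    """

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill according to elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per API key or client IP would be checked before a request ever reaches the model, which is why this mitigation needs no ML knowledge at all.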

Layer 2: LLMs as machine learning software

We still do not care about the text generation capabilities, but we become aware that LLMs are in fact ML models and those have some unique engineering practices. ML models have specific architectures, their behavior is defined by their weights, they use specific libraries, etc. These aspects create new attack vectors. The risks we are thinking about are usually similar to the previous layer, but they are focused on these aspects.
For example, regular DoS attacks belong to the previous layer, but here we can think about attackers DoS-ing an LLM by tricking it into generating excessively long texts and thus performing large amounts of unnecessary calculations. Knowledge about how ML software works under the hood (e.g., how batching works, what the stop conditions are, how long inputs are handled) is necessary to understand the risks here.
Examples: Adversarial DoS attacks, Weights security.
Expertise: Cybersecurity, ML engineering.
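
The adversarial DoS described above can be partly contained by bounding how much work a single request may trigger. The following is a minimal sketch under stated assumptions: `token_stream` stands in for whatever iterator of decoded tokens a serving stack exposes, and the function name and both limits are hypothetical.

```python
import itertools
import time
from typing import Iterable, Iterator


def bounded_generation(
    token_stream: Iterable[str],
    max_new_tokens: int,
    max_seconds: float,
    clock=time.monotonic,
) -> Iterator[str]:
    """Yield tokens until the stream ends or either budget is exhausted.

    `token_stream` is an assumed abstraction over the decoding loop,
    not a real library API.
    """
    deadline = clock() + max_seconds
    for count, token in enumerate(token_stream):
        # Stop on whichever budget runs out first: token count or wall-clock time.
        if count >= max_new_tokens or clock() >= deadline:
            break
        yield token


# A stream that never emits a stop token, as an attacker might try to provoke.
endless = itertools.repeat("la")
capped = list(bounded_generation(endless, max_new_tokens=5, max_seconds=60.0))
```

Choosing sensible values for the two budgets is where the ML engineering knowledge comes in, since they depend on batching behavior and on how long legitimate outputs tend to be.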

Layer 3: LLMs as text generators

We realize that LLMs are a technology created from heaps of Web data that generates natural language text based on natural language instructions. This poses a never-before-seen set of risks from the security point of view, and this is also the layer where expertise has grown tremendously over the past year. The practice of red teaming LLMs is also mainly focused on this layer. Many attacks here are unique to LLMs, and understanding them requires non-trivial NLP knowledge. Techniques such as prompt injection, data poisoning, or adversarial examples can be used to hijack and derail the text generation process. The goal here is to ensure that the generated texts are safe for the user by default and that an attacker cannot change that.
Examples: Copyrighted data generation, Personal data leakage, Data poisoning, Prompt injection.
Expertise: NLP.

Layer 4: LLMs as social actors

As social actors, LLMs interact with human society and influence it. The risks are connected mainly to how the texts generated by LLMs might harm society, and how attackers can orchestrate such harms. These harms can be very difficult to identify and monitor, because doing so requires observing how users interact with the LLM. Researchers might use various probes and benchmarks (e.g., the GEST dataset) to get proxy measurements, but I believe that this practice is far from perfect. This is the layer where we are the least ready to face adversarial attacks. In my opinion, nobody is really prepared for bad actors poisoning Web data with disinformation or for SEO attacks on RAG systems.
Examples: Misinformation generation, Harmful stereotypical behavior, Information bubbles.
Expertise: NLP/AI ethics, Social science.

Although I mention different types of expertise for each layer, vertical knowledge transfer between them is very important to make sure that there are no gaps. Some risks might need to be addressed at different layers at the same time. Consider the case of a user being tricked by an attacker into using a prompt that generates an SQL injection attack against a database. This can be tackled at the text generation layer. We can retrain the LLM to not generate dangerous SQL commands, or we can devise some sort of self-verification step that filters out suspicious outputs. But at the same time, a more robust and simpler approach would be to address it at the software layer by setting adequate database permissions.
Higher layers are in general more affected by risks resulting from the training data, but also by the stochastic nature of LLMs. For this reason, it is impossible to guarantee near-100% security on these layers. Only lower layers can provide such strong guarantees. On the other hand, some risks might be just too fuzzy for the lower layers, and they might require higher-layer techniques.
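
The software-layer fix in the SQL example can be sketched in a few lines. This is an illustration under stated assumptions: SQLite's read-only mode stands in for "adequate database permissions" (on a database server, the equivalent would be a role that is granted only `SELECT`), and the helper names `make_demo_db` and `run_llm_sql` are hypothetical.

```python
import os
import sqlite3
import tempfile


def make_demo_db(path: str) -> None:
    """Create a tiny database using a normal, writable connection."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('alice')")
    conn.commit()
    conn.close()


def run_llm_sql(path: str, sql: str):
    """Execute LLM-generated SQL over a read-only connection.

    Any statement that tries to write raises sqlite3.OperationalError,
    regardless of what the model was tricked into generating.
    """
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
make_demo_db(db_path)
rows = run_llm_sql(db_path, "SELECT name FROM users")
```

A generated `DROP TABLE users` passed to `run_llm_sql` fails at the database rather than at the model, so the guarantee does not depend on the LLM behaving well.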


  author       = "Matúš Pikuliak",
  title        = "Abstraction Layers for LLM Security",
  howpublished = "https://www.opensamizdat.com/posts/security_layers",
  month        = "02",
  year         = "2024",