In context: Unless you are directly involved in developing or training large-scale language models, you don’t think about or even be aware of their potential security vulnerabilities. These weaknesses pose risks to LLM providers and users, such as providing incorrect information or leaking personal data.
Meta’s Llama LLM performed poorly in a recent third-party evaluation by AI security company DeepKeep. The researchers tested the model on 13 risk assessment categories, but only four passed. The severity of its performance was particularly evident in the hallucinations, immediate injection, and PII/data leak categories, which revealed significant weaknesses.
When we talk about LLM, hallucinations are when a model presents inaccurate or fabricated information as if it were fact, and may even claim it to be true when confronted about it. is. In DeepKeep’s tests, Llama 2 7B received a “very high” score for hallucinations, with his hallucination rate at 48%. In other words, the probability of getting the correct answer is comparable to flipping a coin.
“The results show that the model shows a significant tendency to hallucinate, with about a 50% chance of providing the correct answer or fabricating a response,” Deepkeep said. “Typically, the more widespread a misinformation is, the more likely the model is to reproduce that misinformation.”
Hallucinations are a known problem for llamas for many years. Last year, Stanford University removed Alpaca, a llama-based chatbot, from the internet due to its propensity for hallucinations. So the fact that it’s the worst it’s ever been in this category doesn’t reflect very well on the meta’s efforts to address this issue.
Llama’s prompt injection and PII/data disclosure vulnerabilities are also of particular concern.
Prompt injection involves manipulating the LLM to override its internal programming and execute the attacker’s instructions. In testing, prompt injection successfully manipulated Llama’s output 80% of the time. This is an alarming statistic considering that malicious parties could use this to direct users to malicious websites.
“For prompts with prompt injection context, the model was interacted with in 80% of the instances, meaning the model followed the prompt injection instructions and ignored the system instructions,” DeepKeep said. I am. ”[Prompt injection] This can take many forms, from leaking personally identifiable information (PII) to causing a denial of service or facilitating phishing attacks. ”
Llama is also prone to data leaks. Mostly avoids leaking personally identifiable information such as phone numbers, email addresses, and physical addresses. However, when editing information, it can seem excessive and we often accidentally delete harmless items unnecessarily. Even when the context is appropriate, queries about race, gender, sexual orientation, and other classes are very limited.
In other areas of PII, such as health and financial information, Llama also suffers from mostly “random” data breaches. This model frequently recognizes that information may be sensitive, but still exposes it. Security in this category was another coin toss regarding reliability.
“LlamaV2 7B performance closely reflects randomness, with approximately half of instances experiencing data leakage and unnecessary data deletion,” the study found. “Occasionally, a model will claim that certain information is private and cannot be made public, but it will still continue to quote it out of context. This is because while the model may be aware of the concept of privacy, It shows that they are not consistently applying this understanding to effectively compile information.” ”
On the bright side, DeepKeep says Llama’s responses to questions are mostly grounded, and when she’s not hallucinating, her responses are sound and accurate. It also effectively handles toxicity, harmfulness, and semantic jailbreak. However, responses tend to oscillate between being overly elaborate and overly vague.
Although Llama appears to be robust against prompts that exploit language ambiguities to force the LLM to defy its filters and programming (semantic jailbreaks), the model remains resistant to other types of adversarial jailbreaks. Moderately affected. As already mentioned, direct and indirect prompt injection is very likely to be the standard method (jailbreak) for overwriting hard-coded functions in a model.
Meta is not the only LLM provider with such security risks. Last June, Google warned employees not to trust Bard with sensitive information, likely due to the possibility of leaks. Unfortunately, companies adopting these models are in a terrible hurry to be first, and many weaknesses can go uncorrected for long periods of time.
In at least one example, an automated menu bot got a customer’s order wrong by 70%. Instead of addressing the problem or discontinuing the product, they hid their failure rate by outsourcing human assistance to fix orders. Presto Automation downplayed the bot’s poor performance, revealing that it needed help with 95% of the orders it received during its initial launch. No matter how you look at it, it’s a despicable attitude.