AI Red Teaming Methodology
Red-teaming an AI system is not a classic pentest — the failures are behavioural and probabilistic, so you hunt for harmful output…
Give a model tools and a loop and it stops being a chatbot and becomes an actor — it can send the email, run the code, move the mo…
Large language models do not only generalise — they memorise, and memorised text can be pulled back out word for word. Feed the ri…
These attacks do not make a model misbehave — they make it confess. By reading ordinary outputs (a label, a confidence score), an …
A perturbation too small for a human to see can flip a model from "panda, 58%" to "gibbon, 99%". Evasion attacks nudge an input ac…
A vector database is still a database — it just holds embeddings and metadata behind an API. In the rush to ship RAG, teams skippe…
RAG bolts a search step onto the model: before answering, the app retrieves chunks from a knowledge base and pastes them into the …
A model's output is untrusted input to whatever consumes it next. The app trusts it because "we generated it" — but the model is s…
A model holds secrets in its context — the system prompt, retrieved documents, earlier turns, tool outputs. Exfiltration is the pr…
A refusal is a learned behaviour, not an enforced rule — which is exactly why it can be steered around. Personas, fictional framin…