OpenAI introduces a new framework set on making models less susceptible to manipulation
The Instruction Hierarchy categorizes instructions into different priority levels, significantly enhancing model security by ensuring that system-generated commands override less reliable user inputs or third-party data.
Using synthetic data and a method known as "context distillation," the model is trained to recognize and prioritize instructions from higher authority levels.
It effectively disregards lower-level inputs that conflict with more critical directives.
The method increased resistance to prompt extraction by up to 63% and reduced jailbreaking attempts by 30%.
The model occasionally rejected user inputs, but its performance remained stable overall.
Developing such frameworks is crucial to dulling the double-edged nature of LLMs, increasing security, and protecting data.
Paper -> https://arxiv.org/pdf/2404.13208