OpenAI introduces a new framework set on making models less susceptible to manipulation

The Instruction Hierarchy categorizes instructions into different priority levels, significantly enhancing model security by ensuring that system-generated commands override less reliable user inputs or third-party data.

Using synthetic data and a method known as "context distillation," the model is trained to recognize and prioritize instructions from higher authority levels.

It effectively disregards lower-level inputs that conflict with more critical directives.

The method increased resistance to prompt extraction by up to 63% and reduced jailbreaking attempts by 30%.

The model occasionally rejected user inputs, but its performance remained stable overall.

Developing such frameworks is crucial to dulling the double-edged nature of LLMs, increasing security, and protecting data.

Paper -> https://arxiv.org/pdf/2404.13208

Previous
Previous

If you want to understand GPTs inner workings, this video is for you

Next
Next

Are you getting the full benefits of generative AI?