OpenAI introduces a new framework set on making models less susceptible to manipulation

The Instruction Hierarchy categorizes instructions into different priority levels, significantly enhancing model security by ensuring that system-generated commands override less reliable user inputs or third-party data.

Using synthetic data and a method known as "context distillation," the model is trained to recognize and prioritize instructions from higher authority levels.

It effectively disregards lower-level inputs that conflict with more critical directives.

The method increased resistance to prompt extraction by up to 63% and reduced jailbreaking attempts by 30%.

The model occasionally rejected user inputs, but its performance remained stable overall.

Developing such frameworks is crucial to dulling the double-edged nature of LLMs, increasing security, and protecting data.

Paper -> https://arxiv.org/pdf/2404.13208

OpenAI introduces a new framework set on making models less susceptible to manipulation

If you want to understand GPTs inner workings, this video is for you

Are you getting the full benefits of generative AI?

Navigation