LW - [Paper] Programming Refusal with Conditional Activation Steering by Bruce W. Lee

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Programming Refusal with Conditional Activation Steering, published by Bruce W. Lee on September 12, 2024 on LessWrong.
For full content, refer to the arXiv preprint at https://arxiv.org/abs/2409.05907. This post is a lighter, 15-minute version.
Abstract
Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants.
We propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context.
Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse."
This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization.
We release an open-source activation steering toolkit at https://github.com/IBM/activation-steering.
Introduction
Problem: Lack of conditional control in activation steering.
Activation steering offers a promising alternative to optimization-based techniques by directly manipulating the model's native representations, often requiring only a simple activation addition step during each forward call. Our work here builds on "Refusal in LLMs is mediated by a single direction," which has shown promise in altering LLM behavior, such as removing or inducing refusal behavior. However, the key limitation of current methods is the inability to condition when and what to refuse.
That is, adding a "refusal vector" using existing activation steering methods increases refusal rates indiscriminately across all inputs, limiting the model's utility.
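To make the mechanics concrete, here is a minimal sketch of unconditional activation steering with a PyTorch forward hook. The model choice, layer index, steering strength, and random placeholder vector are illustrative assumptions, not the paper's setup; in practice the refusal vector would be extracted from contrasting prompt sets.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions: model, layer, and alpha are placeholders that
# only show the shape of the intervention, not the paper's configuration.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

refusal_vector = torch.randn(model.config.hidden_size)  # placeholder direction
alpha = 8.0  # steering strength

def steering_hook(module, inputs, output):
    # Add the scaled refusal direction to every token's hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * refusal_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

layer = model.transformer.h[6]  # a middle layer, chosen arbitrarily here
handle = layer.register_forward_hook(steering_hook)

ids = tokenizer("How do I bake bread?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Because the vector is added on every input, even a harmless prompt like the one above is pushed toward refusal, which is exactly the limitation CAST addresses.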
Contribution: Expanding activation steering formulation.
We introduce Conditional Activation Steering (CAST), a method that enables fine-grained, context-dependent control over LLM behaviors. CAST adds a new type of steering vector to the activation steering formulation, the condition vector, representing certain activation patterns induced by the prompt during the inference process.
A simple similarity calculation between this condition vector and the model's activation at inference time effectively serves as a switch, determining whether to apply the refusal vector. This approach allows for selective refusal of harmful prompts while maintaining the ability to respond to harmless ones, as depicted below.
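As a sketch of that switch, assuming a condition vector and a refusal vector have already been extracted (the random tensors, threshold value, and mean-pooling over tokens here are hypothetical stand-ins, not the paper's exact procedure):

```python
import torch
import torch.nn.functional as F

# Hypothetical inputs: both vectors would normally be extracted from
# contrastive prompt sets; the threshold is tuned on held-out examples.
condition_vector = torch.randn(768)
refusal_vector = torch.randn(768)
threshold = 0.35
alpha = 8.0

def conditional_steering_hook(module, inputs, output):
    # Registered on a middle layer, as in the previous sketch.
    hidden = output[0] if isinstance(output, tuple) else output
    # Pool the prompt's hidden states into a single probe vector.
    probe = hidden[0].mean(dim=0)
    similarity = F.cosine_similarity(
        probe, condition_vector.to(hidden.dtype), dim=0
    )
    if similarity > threshold:
        # Condition met: steer toward refusal.
        hidden = hidden + alpha * refusal_vector.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
```

Flipping the comparison, steering when similarity falls below the threshold, yields the complementary rule, e.g. "if input is not about legal advice, then refuse."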
Application: Selecting what to refuse.
Many alignment goals concern contextually refusing specific classes of instructions. Traditional methods like preference modeling are resource-intensive and struggle with subjective, black-box rewards. Additionally, the definition of harmful content varies across contexts, complicating the creation of universal harm models.
The usage context further complicates this variability; for instance, discussing medical advice might be harmful in some situations but essential in others, such as in medical chatbots.
We show CAST can implement behavioral rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse", allowing for selective modification of responses to specific content without weight optimization.
On a technical level, our primary insight is that different prompts consistently activate distinct patterns in the model's hidden states during inference. These patterns can be extracted as a steering vector and used as reference points for detecting specific prompt categories or contexts. This observation allows us to use steering vectors not only as behavior modification mechanisms but also as condition ...
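One plausible way to extract such a reference point is the standard difference-of-means recipe from the activation steering literature; the layer index and the prompt sets below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def mean_activation(model, tokenizer, prompts, layer_idx):
    """Average hidden state at one layer over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # Mean over tokens of the chosen layer's hidden states.
        acts.append(out.hidden_states[layer_idx][0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# Illustrative usage: contrast prompts that satisfy the condition
# (e.g. hate-speech requests) against prompts that do not, at an
# arbitrarily chosen middle layer.
# condition_vector = (mean_activation(model, tokenizer, positive_prompts, 12)
#                     - mean_activation(model, tokenizer, negative_prompts, 12))
```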