
Project · LLM Alignment

LLM Steering via Content–Style Modeling

LLM Steering · Activation Engineering · Content–Style Modeling · Toxicity · Sycophancy
Figure: Content–Style decomposition for LLM activations (image source: ChatGPT).
Status: ongoing research project.

What I'm Working On

I'm developing a lightweight content–style activation-steering method that reduces toxic and sycophantic generations from large language models without sacrificing the semantic meaning of the original response.

The core idea is to factorize an LLM's internal activation $\boldsymbol{a} \in \mathbb{R}^d$ into a content component $\boldsymbol{c}$ (what the model is trying to say) and a style component $\boldsymbol{s}$ (how it is being expressed). The working hypothesis is that toxicity and sycophancy show up almost entirely in style, while reasoning, intent, and topical content live in content. If that holds, steering the style sub-space at inference time should flip the response from negative (toxic or sycophantic) to positive expression while leaving the underlying meaning untouched.
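To make the factorization concrete, here is a minimal NumPy sketch of one way it could look. It is an illustration under stated assumptions, not the project's actual method: it assumes the decomposition is additive ($\boldsymbol{a} = \boldsymbol{c} + \boldsymbol{s}$), that style lives in a $k$-dimensional linear sub-space with a known orthonormal basis, and that steering means moving the style coordinates toward a target. The names `style_basis`, `target_style_coords`, and `alpha` are hypothetical, and the basis here is random rather than learned.

```python
import numpy as np

def split_content_style(a, style_basis):
    """Split activation `a` into (content, style).

    Assumption: `style_basis` is a (d, k) matrix with orthonormal columns
    spanning the style sub-space. How such a basis would be learned is not
    covered here.
    """
    s = style_basis @ (style_basis.T @ a)  # orthogonal projection onto style
    c = a - s                              # remainder is treated as content
    return c, s

def steer_style(a, style_basis, target_style_coords, alpha=1.0):
    """Move the style coordinates of `a` toward a target, keeping content fixed.

    alpha=0 returns `a` unchanged; alpha=1 replaces the style coordinates
    with `target_style_coords` (both are hypothetical parameters).
    """
    c, _ = split_content_style(a, style_basis)
    coords = style_basis.T @ a
    new_coords = (1.0 - alpha) * coords + alpha * target_style_coords
    return c + style_basis @ new_coords

# Toy example with random data standing in for real activations.
rng = np.random.default_rng(0)
d, k = 16, 3
U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # (d, k), orthonormal columns
a = rng.normal(size=d)
target = rng.normal(size=k)

steered = steer_style(a, U, target, alpha=1.0)
c_before, _ = split_content_style(a, U)
c_after, _ = split_content_style(steered, U)
```

In this toy setup the content component is identical before and after steering, which is the property the method is after. Whether real LLM activations admit such a clean linear split is exactly what the project has to establish.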

Datasets

Models I'm Steering

Project by: Subash Timilsina.