Let’s make chatbots act like pirates! We’ll tweak open-weight language models into pirate mode by modifying their internal state. Given the prompt “Find the bug in this python snippet: max(range(5, 0))”, the original Gemma-2-9B-IT model answers: “The problem with max(range(5, 0)) lies in how Python's range function works…”. Our steered Gemma-2-9B-IT model answers: “Ahoy, matey! It seems ye be havin' trouble with a scurvy ol’ code snippet. "max(range(5, 0))" be…”.
Try it here: https://www.neuronpedia.org/gemma-2-9b-it/steer?saved=cm58jn8420011p2phi2tydv7e
In language models, many concepts are represented as linear directions in activation space. We can nudge the model toward a concept by adding that concept’s direction to the model’s activations during inference. In this workshop, we’ll derive steering directions in two ways, a supervised approach (activation differences over counterfactual pairs) and an unsupervised approach (features from pretrained sparse autoencoders), and use them to steer model behavior. Bring your laptop!
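As a taste of the supervised approach, here is a minimal sketch of difference-of-means steering using the Hugging Face transformers library. The layer index, steering scale, and counterfactual pairs are illustrative placeholders, not the workshop’s exact recipe:

```python
# Minimal activation-steering sketch (supervised, counterfactual pairs).
# Assumptions: transformers + torch installed; LAYER and SCALE are
# hypothetical hyperparameters you would tune in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b-it"
LAYER = 20    # hypothetical: which residual stream to steer
SCALE = 8.0   # hypothetical: steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs `accelerate`
)

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1
    # matches the output of model.model.layers[LAYER].
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Counterfactual pairs: same content, with and without the concept.
pairs = [
    ("Ahoy matey, let me explain this code like a pirate.",
     "Let me explain this code."),
    ("Arr, ye best be checkin' yer loop bounds, sailor!",
     "You should check your loop bounds."),
]
# Steering vector = mean activation difference across the pairs.
steer = torch.stack(
    [mean_residual(pos) - mean_residual(neg) for pos, neg in pairs]
).mean(dim=0)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the hidden states come first.
    hidden = output[0] + SCALE * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
# For best results with an instruction-tuned model, wrap the prompt
# with tok.apply_chat_template; a raw prompt keeps the sketch short.
prompt = "Find the bug in this python snippet: max(range(5, 0))"
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

For the unsupervised variant, the same hook works unchanged: swap the difference-of-means vector for a feature direction from a pretrained sparse autoencoder (a row of its decoder matrix), as in the Neuronpedia demo linked above.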