Let’s make chatbots act like pirates! We’ll tweak open-weight language models into pirate mode by modifying their internal state. Given the prompt “Find the bug in this python snippet: max(range(5, 0))”, the original Gemma-2-9B-IT model answers: “The problem with max(range(5, 0)) lies in how Python's range function works…”. Our steered Gemma-2-9B-IT model answers: “Ahoy, matey! It seems ye be havin' trouble with a scurvy ol’ code snippet. "max(range(5, 0))" be…”.
Try it here: https://www.neuronpedia.org/gemma-2-9b-it/steer?saved=cm58jn8420011p2phi2tydv7e
In language models, many concepts are represented as linear directions in activation space. We can nudge the model toward a concept by adding that concept’s direction to the model’s activations during inference. In this workshop, we’ll derive steering directions in two ways, a supervised approach (activation differences over counterfactual pairs) and an unsupervised approach (features from pretrained sparse autoencoders), and use them to steer model behavior. Bring your laptop!
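As a taste of the supervised approach, here is a minimal sketch of difference-of-means steering using the Hugging Face transformers library. The layer index, steering scale, and counterfactual pairs are illustrative placeholders, not the workshop’s exact recipe:

```python
# Minimal activation-steering sketch (supervised, counterfactual pairs).
# Assumptions: transformers + torch installed; LAYER and SCALE are
# hypothetical hyperparameters you would tune in practice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b-it"
LAYER = 20    # hypothetical: which residual stream to steer
SCALE = 8.0   # hypothetical: steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs `accelerate`
)

def mean_residual(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1
    # matches the output of model.model.layers[LAYER].
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Counterfactual pairs: same content, with and without the concept.
pairs = [
    ("Ahoy matey, let me explain this code like a pirate.",
     "Let me explain this code."),
    ("Arr, ye best be checkin' yer loop bounds, sailor!",
     "You should check your loop bounds."),
]
# Steering vector = mean activation difference across the pairs.
steer = torch.stack(
    [mean_residual(pos) - mean_residual(neg) for pos, neg in pairs]
).mean(dim=0)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; the hidden states come first.
    hidden = output[0] + SCALE * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)
# For best results with an instruction-tuned model, wrap the prompt
# with tok.apply_chat_template; a raw prompt keeps the sketch short.
prompt = "Find the bug in this python snippet: max(range(5, 0))"
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0],
                 skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

For the unsupervised variant, the same hook works unchanged: swap the difference-of-means vector for a feature direction from a pretrained sparse autoencoder (a row of its decoder matrix), as in the Neuronpedia demo linked above.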