LLM Unlearning

Machine Unlearning for Large Language Models. Making LLMs forget Harry Potter!

In this research project, we explore the problem of unlearning in large language models (LLMs). We build on a recent paper (Eldan & Russinovich, 2023, "Who's Harry Potter? Approximate Unlearning in LLMs") showing that LLMs can be made to forget specific concepts by training them on a carefully designed dataset. We extend this work by proposing a variation of the unlearning method that targets flaws in the original approach, and we show that the resulting method is able to unlearn the concept of Harry Potter. We evaluate it by benchmarking against synthetic Harry Potter questions generated with GPT-4, while verifying that performance on common LLM benchmarks is retained.
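
As a rough illustration of the kind of pipeline involved, the sketch below fine-tunes a causal LM on text whose Harry Potter-specific terms have been swapped for generic stand-ins, in the spirit of the baseline paper. The model name, the `GENERIC_MAP` dictionary, and the toy corpus are hypothetical placeholders, not this project's actual code or data.

```python
# Minimal sketch of a fine-tuning-based unlearning step: rewrite the target
# text so concept-specific terms become generic, then train the model on the
# rewritten text with the standard causal-LM objective. All names below are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# Hypothetical mapping from Harry Potter terms to generic stand-ins.
GENERIC_MAP = {
    "Harry Potter": "Jon Smith",
    "Hogwarts": "the academy",
    "Hermione": "Laura",
    "wand": "tool",
}


def genericize(text: str) -> str:
    """Replace concept-specific terms with generic counterparts."""
    for term, generic in GENERIC_MAP.items():
        text = text.replace(term, generic)
    return text


def unlearn(model, tokenizer, corpus, epochs=1, lr=1e-5, device="cuda"):
    """Fine-tune the model on the genericized corpus so its completions
    drift away from the target concept."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for passage in corpus:
            batch = tokenizer(
                genericize(passage),
                return_tensors="pt",
                truncation=True,
                max_length=512,
            ).to(device)
            # Standard next-token loss on the rewritten text.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    corpus = ["Harry Potter raised his wand outside Hogwarts."]  # toy data
    unlearn(model, tokenizer, corpus)
```

Note that the baseline paper also uses a "reinforced" model to identify concept-linked tokens; the plain string replacement above is only the simplest version of that idea.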