Skewed Memorization in Large Language Models: Quantification and Decomposition

This project investigates how large language models memorize data in a skewed way, where some sequences are far more likely to be remembered than others. We study how training duration, dataset size, and inter-sample similarity shape memorization during supervised fine-tuning, and link these effects to the token generation process. By combining theoretical analysis with empirical evaluation, we quantify and decompose memorization behaviors, providing new strategies to detect and mitigate memorization risks. The goal is to build LLMs that are more privacy-preserving and secure, reducing the chance that they unintentionally reproduce sensitive or copyrighted training data.
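To make the idea of quantifying per-sequence memorization concrete, the sketch below scores each training sequence by how much of its suffix a model reproduces greedily when prompted with the sequence's prefix, then inspects the skew of the resulting score distribution. This is a minimal illustrative sketch, not the project's exact metric: the model name, the `memorization_score` helper, and the 16-token prefix length are all assumptions made for the example.

```python
# Minimal sketch of per-sequence memorization scoring (illustrative only).
# Assumes a Hugging Face causal LM; "gpt2" stands in for the fine-tuned
# model under study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder for the supervised fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def memorization_score(text: str, prefix_tokens: int = 16) -> float:
    """Fraction of a sequence's suffix the model reproduces verbatim when
    prompted with its prefix; 1.0 indicates full verbatim memorization."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if len(ids) <= prefix_tokens:
        return 0.0
    prefix, suffix = ids[:prefix_tokens], ids[prefix_tokens:]
    with torch.no_grad():
        out = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(suffix),
            do_sample=False,  # greedy decoding: the single most likely continuation
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = out[0, prefix_tokens:]  # generate() returns prompt + continuation
    n = min(len(generated), len(suffix))
    hits = (generated[:n] == suffix[:n]).sum().item()
    return hits / len(suffix)  # an early-stopped tail counts as a miss

# Scoring a fine-tuning set; a heavy right tail in this distribution is the
# "skewed memorization" the project studies.
train_texts = ["first training sequence ...", "second training sequence ..."]
scores = sorted(memorization_score(t) for t in train_texts)
print(scores)
```

Re-running such a scoring pass across checkpoints, dataset sizes, or similarity-grouped subsets would show how the factors named above (training duration, dataset size, inter-sample similarity) shift the distribution's tail.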

Publications:

Full paper

Project information

  • Category: Future Healthcare, Health State Estimation, and Large Language Models
  • Contact Person: Amir M. Rahmani
  • Project Title: Skewed Memorization in Large Language Models: Quantification and Decomposition
