PROJECTS
Skewed Memorization in Large Language Models: Quantification and Decomposition
This project investigates skewed memorization in large language models, where a small fraction of training sequences is far more likely to be reproduced than the rest. We study how training duration, dataset size, and inter-sample similarity shape memorization during supervised fine-tuning, and trace these effects to the token-by-token generation process. Combining theoretical analysis with empirical evaluation, we quantify and decompose memorization behavior and derive new strategies for detecting and mitigating memorization risks. The goal is to build LLMs that are more privacy-preserving and secure, reducing the chance that they unintentionally reproduce sensitive or copyrighted data.