Large Language Models in Healthcare
Large Language Models (LLMs) have great potential to transform healthcare by providing services such as symptom assessment, health recommendations, emotional support, and health education, among others. However, the application of such models in healthcare is hindered by several challenges, including the perpetuation of biases from training data, privacy and security issues, and the need to address ethical implications (e.g., fairness) and the role of human expertise. Hallucination is another shortcoming: LLMs may generate inaccurate or misleading information. The lack of sufficient and diverse training datasets is a further hurdle, as healthcare data are often limited and highly sensitive. Finally, thorough evaluation is crucial for assessing LLM performance, yet it remains a significant challenge. Evaluation plays a vital role in enabling researchers and organizations to understand the limitations, potential, risks, and capabilities of these models.
In this project, IFH – in collaboration with teams from Stanford University and HealthUnity – aims to develop LLM-based methods to address the existing challenges in the integration, utilization, and evaluation of chatbots in healthcare. To achieve this, we propose three specific aims:
- We aim to develop and investigate a range of evaluation metrics for healthcare chatbots.
- We will gather, generate, and compile the largest health-related dataset for benchmarking healthcare chatbots. The dataset contains over 7.2 million conversations, more than half of which are in English and the rest in other languages. Our goal is to translate the non-English dialogues into English, which will significantly enlarge our English data pool and improve the quality of our LLMs.
- We will conduct a thorough benchmarking process to create a leaderboard for healthcare chatbots, leveraging the evaluation metrics along with the health-related datasets. To achieve this, we will assess the performance of all major state-of-the-art LLMs, including both open-source models and models behind closed APIs, such as ChatGPT, Vicuna, and MedAlpaca. These assessments will provide valuable insights that contribute to the development of healthcare chatbots.
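
As a rough illustration of the third aim, the sketch below shows one way per-metric scores could be combined into a leaderboard. It is a minimal example under stated assumptions, not the project's actual method: the function name `build_leaderboard`, the metric names, the model names, the unweighted mean, and all numbers are hypothetical placeholders.

```python
from statistics import mean


def build_leaderboard(scores):
    """Rank models by the unweighted mean of their per-metric scores.

    scores: dict mapping model name -> dict of metric name -> score in [0, 1].
    Returns a list of (model, mean_score) tuples, best first.
    Assumption: every metric is "higher is better" and equally important.
    """
    rows = [(model, mean(metrics.values())) for model, metrics in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder numbers for illustration only; these are not measured results.
example_scores = {
    "model_a": {"accuracy": 0.80, "empathy": 0.60, "safety": 0.70},
    "model_b": {"accuracy": 0.70, "empathy": 0.90, "safety": 0.80},
}

leaderboard = build_leaderboard(example_scores)
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {score:.2f}")
```

An unweighted mean is only the simplest possible aggregation. A real leaderboard would likely report each metric separately as well, since a single averaged score can hide weaknesses in areas such as safety.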