In India, the most populous country in the world, efficient management of healthcare data (access, storage and retrieval) is increasingly critical. Imagine having access to the health records of millions of patients, a treasure trove of information that could dramatically improve public health policies, advance medical research and enhance patient care. However, this also brings a significant challenge: protecting patients’ privacy.

A recent study, “Generation and De-Identification of Indian Clinical Discharge Summaries using LLMs”, by Sanjeet Singh et al. from the Indian Institute of Technology Kanpur (IIT Kanpur) and the technology company Miimansa, dives into this pressing issue.

The researchers explored how artificial intelligence (AI) can be harnessed to de-identify patient records, ensuring that sensitive information remains confidential while still being useful for research and policy-making.

Healthcare data is incredibly valuable. It can reveal patterns about the spread of diseases, the effectiveness of treatments, and the needs of different patient groups. In India, over 330 million patient records have already been linked with unique central IDs. This vast amount of data, roughly equivalent to the population of the US, represents an underutilised resource with the potential to revolutionise public health. However, it also poses a risk. If not handled properly, this data can expose individuals to privacy breaches. The consequences can be severe, from personal embarrassment to identity theft and financial loss.

To mitigate these risks, healthcare data must be de-identified, stripping it of any personal information that could reveal the patient’s identity. Natural Language Processing (NLP), a branch of AI that deals with the interaction between computers and human language, offers powerful tools for de-identification. NLP can scan through text, identify personal health information (PHI), and mask it.
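
To make the scan, identify and mask steps concrete, here is a minimal Python sketch. It is not the method used in the study: it relies on a few hand-written regular expressions, whereas real de-identification systems typically use trained NLP models. The pattern set `PHI_PATTERNS`, the function `mask_phi` and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately simple patterns (assumption: not from the study).
# Production systems use trained named-entity recognition models instead.
PHI_PATTERNS = {
    # Dates such as 12/03/2023 or 12-03-23
    "DATE": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    # Ten-digit mobile numbers starting with 6-9
    "PHONE": re.compile(r"\b[6-9]\d{9}\b"),
    # A title followed by one or more capitalised words
    "NAME": re.compile(
        r"\b(?:Mr|Mrs|Ms|Dr|Shri|Smt)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
    ),
}


def mask_phi(text: str) -> str:
    """Replace every match of each pattern with a placeholder tag."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Mr. Ramesh Kumar was admitted on 12/03/2023. Contact: 9876543210."
print(mask_phi(sample))
# prints: [NAME] was admitted on [DATE]. Contact: [PHONE].
```

Rules like these miss names without titles, unusual date formats and local address conventions, which is one reason learned models trained on representative data matter.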

However, there’s a catch: AI systems are only as good as the data they are trained on. Most existing systems have been trained on data from Western countries, and they might not perform well on Indian data, given the cultural and linguistic differences.

De-identification of PHI is also critical for compliance with privacy regulations such as India’s Digital Personal Data Protection Act, 2023 (DPDPA), and similar laws such as GDPR in Europe and HIPAA in the US.

The study from IIT Kanpur and Miimansa tackled this challenge head-on. Using a dataset of fully de-identified discharge summaries from an Indian hospital (the Sanjay Gandhi Post Graduate Institute of Medical Sciences, Lucknow), the researchers ran existing de-identification models, including commercial solutions. These models were originally trained on non-Indian datasets, which primarily included data from US healthcare institutions. The results were telling: models trained on non-Indian data did not perform well, a clear indication that AI models need to be trained on region-specific data to be effective.

Synthetic solution

To overcome this limitation, the researchers turned to a clever solution: synthetic data. By using large language models (LLMs) like Gemini, Gemma, Mistral and Llama3, they generated synthetic clinical reports that mimicked real patient data but did not correspond to actual patients, avoiding privacy issues. Training AI models on synthetic data dramatically improved their performance on the real Indian data.

This approach also ensures that healthcare data can be used safely for research and policy-making without risking patient privacy. For India, this could mean more accurate health statistics and better public health interventions.

While the results of this study are promising, there is still a long way to go. AI systems need continuous improvement and validation. The researchers plan to establish an active learning workflow that combines AI models with human expertise. This means that while AI will do the heavy lifting, human experts will refine and validate the results, creating a feedback loop that continuously enhances the system’s accuracy and reliability.
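
One common way to build such a feedback loop is to send the model's least confident predictions to human reviewers. The Python sketch below illustrates that idea only. The article does not describe the researchers' planned workflow in detail, so the `Prediction` class, the `triage` function and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    text: str          # the span the model flagged
    label: str         # the PHI category the model assigned
    confidence: float  # model confidence between 0 and 1


def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted and human-review queues.

    The threshold is an illustrative assumption, not a value from the study.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    # Least confident first, so reviewers see the hardest cases early.
    review = sorted(
        (p for p in predictions if p.confidence < threshold),
        key=lambda p: p.confidence,
    )
    return accepted, review


preds = [
    Prediction("Ramesh Kumar", "NAME", 0.97),
    Prediction("Gomti Nagar", "LOCATION", 0.55),
    Prediction("12/03/2023", "DATE", 0.72),
]
accepted, review = triage(preds)
print([p.text for p in accepted])  # prints: ['Ramesh Kumar']
print([p.text for p in review])    # prints: ['Gomti Nagar', '12/03/2023']
```

Labels corrected by reviewers would then be added to the training data before the model is retrained, which closes the loop.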

In a country as diverse and populous as India, this blend of technology and human touch will be crucial in building a robust, resilient and responsive healthcare system.