RT Conference Proceedings
T1 Named Entity Recognition for De-identifying Real-World Health Records in Spanish.
A1 López-García, Guillermo
A1 Moreno-Barea, Francisco J.
A1 Mesa, Héctor
A1 Jerez-Aragonés, José Manuel
A1 Ribelles, Nuria
A1 Alba-Conejo, Emilio
A1 Veredas-Navarro, Francisco Javier
K1 Medical records - Access control
K1 Data - Protection
K1 Natural language processing (Computer science)
AB A growing and renewed interest has emerged in Electronic Health Records (EHRs) as a source of information for decision-making in clinical practice. In this context, the automatic de-identification of EHRs constitutes an essential task, since their dissociation from personal data is a mandatory first step before their distribution. However, the majority of previous studies on this subject have been conducted on English EHRs, due to the limited availability of annotated corpora in other languages, such as Spanish. In this study, we addressed the automatic de-identification of medical documents in Spanish. A private corpus of 599 real-world clinical cases was annotated with 8 different protected health information categories. We tackled the predictive problem as a named entity recognition task, developing two different deep learning-based methodologies: a first strategy based on recurrent neural networks (RNNs) and an end-to-end approach based on transformers. Additionally, we developed a data augmentation procedure to increase the number of texts used to train the models. The results obtained show that transformers outperform RNNs on the de-identification of Spanish clinical data. In particular, the best performance was obtained by the XLM-RoBERTa large transformer, with a strict-match micro-averaged value of 0.946 for precision, 0.954 for recall and 0.95 for F1-score, when trained on the augmented version of the corpus.
The performance achieved by transformers in this study proves the viability of applying these state-of-the-art models in real-world clinical scenarios.
PB Springer Nature
YR 2023
FD 2023
LK https://hdl.handle.net/10630/27313
UL https://hdl.handle.net/10630/27313
LA eng
NO López-García, G. et al. (2023). Named Entity Recognition for De-identifying Real-World Health Records in Spanish. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 10475. Springer, Cham. https://doi.org/10.1007/978-3-031-36024-4_17
NO The authors acknowledge the support from the Ministerio de Economía y Empresa (MINECO) through grant TIN2017-88728-C2-1-R, from the Ministerio de Ciencia e Innovación (MICINN) under project PID2020-116898RB-I00, from the Universidad de Málaga and Junta de Andalucía through grant UMA20-FEDERJA-045, from the Malaga-Pfizer consortium for AI research in Cancer - MAPIC, from the Instituto de Investigación Biomédica de Málaga - IBIMA (all including FEDER funds) and from Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
DS RIUMA. Repositorio Institucional de la Universidad de Málaga
RD 20 Jan 2026