Federated Retrieval-Augmented Generation for Indonesian-Language Misinformation Detection in Multi-Institutional Environments: A Review
Abstract
The proliferation of misinformation in the Indonesian digital ecosystem presents a critical challenge for public discourse, democratic integrity, and social cohesion. Conventional centralized detection systems, while effective, impose significant privacy risks on contributing institutions, including media organizations, universities, and government agencies, that hold unique and sensitive corpora. This review investigates the emerging paradigm of Federated Retrieval-Augmented Generation (Federated RAG), which combines Federated Learning (FL) with Retrieval-Augmented Generation (RAG) to enable privacy-preserving, collaborative misinformation detection across multi-institutional environments. The review finds that, although Federated RAG is a nascent yet promising frontier, no prior study has applied this paradigm to Indonesian-language misinformation detection in a cross-silo institutional setting. It identifies key technical gaps, proposes a novel architectural taxonomy, and provides a roadmap for future empirical investigations. The framework presented herein is designed to be extensible to other low-resource languages across Southeast Asia and beyond.
Article Details

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.