Integrating BERTopic and Large Language Models for Thematic Identification of Indonesian Legal Documents

EasyChair Preprint 15272
6 pages • Date: October 21, 2024

Abstract

The increasing complexity and volume of legal documents pose significant challenges for information retrieval and text analysis. Traditional text analysis methods are often inadequate, resulting in time-consuming and labor-intensive processes. This study applies advanced natural language processing (NLP) techniques, specifically BERTopic and large language models (LLMs), to cluster and identify themes within Indonesian legal paragraphs. The methodology includes data collection, preprocessing, BERTopic topic modeling, and LLM-based topic refinement. Results show that the "intfloat/multilingual-e5-large-instruct" embedding model, with a minimum cluster size of 40, achieves optimal performance with a Silhouette Score of 0.723 and a Davies-Bouldin Index of 0.340. Subsequent LLM refinement using Meta’s LLaMA-3-8B-Instruct language model enhances the readability and relevance of the extracted topics. The approach enhances the organization and analysis of complex legal documents, with practical implications for improving legal information retrieval and management.

Keyphrases: BERTopic, Indonesian Legal Documents, Natural Language Processing, large language models, topic modeling
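
The two cluster-quality metrics reported in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' evaluation code: the function names, the Euclidean distance, and the toy 2-D points are assumptions made for illustration, and the study's actual embeddings and pipeline are not reproduced here. The Silhouette Score ranges from -1 to 1 (higher means tighter, better-separated clusters), while a lower Davies-Bouldin Index indicates better separation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Average worst-case scatter-to-separation ratio (needs >= 2 clusters)."""
    clusters = group_by_label(points, labels)
    centroids = {
        key: [sum(col) / len(members) for col in zip(*members)]
        for key, members in clusters.items()
    }
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {
        key: sum(euclidean(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

print("silhouette:", silhouette_score(toy_points, toy_labels))
print("davies-bouldin:", davies_bouldin_index(toy_points, toy_labels))
```

In practice these metrics would be computed on the document embeddings and the cluster labels produced by the topic model; how outlier paragraphs are treated before scoring is a design choice the abstract does not specify.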