Integrating BERTopic and Large Language Models for Thematic Identification of Indonesian Legal Documents

EasyChair Preprint 15272

6 pages • Date: October 21, 2024

Moses Ananta, Rahayu Utari, Amany Akhyar and Gusti Ayu Putri Saptawati

Abstract

The increasing complexity and volume of legal documents pose significant challenges for information retrieval and text analysis. Traditional text analysis methods are often inadequate, resulting in time-consuming and labor-intensive processes. This study applies advanced natural language processing (NLP) techniques, specifically BERTopic and large language models (LLMs), to cluster and identify themes within Indonesian legal paragraphs. The methodology includes data collection, preprocessing, BERTopic topic modeling, and LLM-based topic refinement. Results show that the "intfloat/multilingual-e5-large-instruct" embedding model, with a minimum cluster size of 40, achieves the best clustering performance, with a Silhouette Score of 0.723 and a Davies-Bouldin Index of 0.340. Subsequent LLM refinement using Meta’s LLaMA-3-8B-Instruct language model improves the readability and relevance of the extracted topics. The approach helps organize and analyze complex legal documents, with practical implications for improving legal information retrieval and management.
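
The abstract reports two cluster-quality metrics, the Silhouette Score (higher is better, at most 1) and the Davies-Bouldin Index (lower is better, at least 0). The following is a minimal pure-Python sketch of how these two metrics are commonly defined, evaluated on made-up 2D points; it is not the authors' code, and the function names and toy data are assumptions for illustration only. In practice, scikit-learn's `silhouette_score` and `davies_bouldin_score` would normally be applied to the document embeddings instead.

```python
import math
from collections import defaultdict


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _group(points, labels):
    """Group points by their cluster label."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = _group(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other members of the point's own cluster
        # (the sum includes the point itself at distance 0)
        a = sum(_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(_dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin(points, labels):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    clusters = _group(points, labels)
    centroids = {
        lab: [sum(coord) / len(pts) for coord in zip(*pts)]
        for lab, pts in clusters.items()
    }
    # scatter: mean distance from each member to its cluster centroid
    scatter = {
        lab: sum(_dist(p, centroids[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Toy data (hypothetical): two tight, well-separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]

print("good clustering:", silhouette(points, good_labels),
      davies_bouldin(points, good_labels))
print("bad clustering: ", silhouette(points, bad_labels),
      davies_bouldin(points, bad_labels))
```

On this toy data, the labeling that matches the two visible groups yields a higher silhouette value and a lower Davies-Bouldin value than the alternating labeling, which is the direction in which the abstract's reported scores should be read.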

Keyphrases: BERTopic, Indonesian Legal Documents, Natural Language Processing, large language models, topic modeling

Links:

https://easychair.org/publications/preprint/zf7C

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:15272,
  author    = {Moses Ananta and Rahayu Utari and Amany Akhyar and Gusti Ayu Putri Saptawati},
  title     = {Integrating BERTopic and Large Language Models for Thematic Identification of Indonesian Legal Documents},
  howpublished = {EasyChair Preprint 15272},
  year      = {EasyChair, 2024}}
