A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies

EasyChair Preprint 15251

10 pages. Date: October 18, 2024

Abstract

The Retrieval-Augmented Generation (RAG) framework has become widely adopted because it addresses weaknesses of large language models (LLMs), such as hallucination and outdated knowledge, by incorporating external knowledge. This approach is instrumental in medicine, finance, healthcare, and law. However, it is unclear how system performance is affected when colloquial queries must be matched against documents that contain specialized terminology in a different language. We also explore whether LLMs can enhance retrieval in these challenging scenarios. To simulate real-world conditions, we selected data from the Laws Articles Database of Taiwan as the knowledge source and collected legal questions frequently asked by the public from government websites as our test data. In addition, we built a synthetic statutory article retrieval QA dataset with large language models, based on legal regulations and news articles, to augment the dataset. We compared term-based sparse retrieval methods and dense retrieval methods as baselines. Our findings show that using LLMs to expand the query by generating relevant statutory articles, combined with a dense retriever, achieved the best performance.
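
The query-expansion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `generate_pseudo_article` is a hypothetical stub standing in for an LLM call, `embed` is a toy bag-of-words vector standing in for a real dense encoder, the example is monolingual English rather than cross-lingual, and the two article texts are invented for illustration rather than quoted from any statute.

```python
import math
from collections import Counter

# Invented placeholder texts, NOT real statutory articles.
ARTICLES = {
    "article_a": "the lessor may terminate the lease if the lessee fails to pay rent for two months",
    "article_b": "a person who takes the property of another with intent to unlawfully possess it commits theft",
}

def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def generate_pseudo_article(query):
    """Hypothetical stub for an LLM that drafts a statute-like passage
    for a colloquial query. A real system would call a model here."""
    canned = {
        "my landlord kicked me out early":
            "the lessor may terminate the lease only if the lessee fails to pay rent",
    }
    return canned.get(query, "")

def retrieve(query, articles, expand=True, top_k=1):
    """Rank articles against the query, optionally expanded with generated text."""
    text = query + " " + generate_pseudo_article(query) if expand else query
    q_vec = embed(text)
    scored = [(name, cosine(q_vec, embed(body))) for name, body in articles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

if __name__ == "__main__":
    question = "my landlord kicked me out early"
    print(retrieve(question, ARTICLES, expand=False, top_k=2))
    print(retrieve(question, ARTICLES, expand=True, top_k=2))
```

In this toy setting the colloquial question shares no vocabulary with either article, so the unexpanded query scores zero against both, while the generated passage supplies the legal terms that let the retriever rank the lease article first.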

Keyphrases: Cross-lingual, Generation-Augmented Retrieval, LLM, Retrieval Augmented Generation, Statutory Article Retrieval, evaluation

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:15251,
  author    = {Yen-Hsiang Wang and Feng-Dian Su and Tzu-Yu Yeh and Yao-Chung Fan},
  title     = {A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies},
  howpublished = {EasyChair Preprint 15251},
  year      = {EasyChair, 2024}}