A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies

EasyChair Preprint 15251
10 pages • Date: October 18, 2024

Abstract

The Retrieval-Augmented Generation (RAG) framework has been widely adopted because it mitigates known weaknesses of large language models (LLMs), such as hallucination and outdated knowledge, by incorporating external knowledge. This approach is instrumental in domains such as medicine, finance, healthcare, and law. However, it is unclear how system performance is affected when colloquial queries must be matched against documents that use specialized terminology and are written in a different language. We also explore whether LLMs can improve retrieval in these challenging scenarios. To simulate real-world conditions, we selected data from the Laws Articles Database of Taiwan as the knowledge source and collected legal questions frequently asked by the public from government websites as our test data. In addition, we used LLMs to build a synthetic statutory article retrieval QA dataset from legal regulations and news articles to augment the dataset. We compared term-based sparse retrieval methods and dense retrieval methods as baselines. Our findings show that using LLMs to expand the query by generating relevant statutory articles, combined with a dense retriever, achieved the best performance.

Keyphrases: Cross-lingual, Generation-Augmented Retrieval, LLM, Retrieval Augmented Generation, Statutory Article Retrieval, evaluation