Date of Award
6-9-2025
Document Type
Thesis
Publisher
Santa Clara : Santa Clara University, 2025
Department
Computer Science and Engineering
First Advisor
Yuhong Liu
Abstract
Modern large language models (LLMs) speed up software development but frequently hallucinate code, forcing developers to debug and rewrite the generated output manually. This project introduces Documentation-Aware Code Generation, a retrieval-augmented generation (RAG) framework that injects authoritative documentation into the LLM prompt to reduce hallucinations in generated code. We build an ingestion pipeline that scrapes, cleans, chunks, and embeds official API documentation into a Pinecone vector database, then design a retrieval and re-ranking pipeline that selects the most relevant snippets for each user query. Results show that augmenting prompts with documentation lowers code hallucination by up to 60%. The system is exposed through backend APIs and a Next.js frontend, offering developers a tool that reliably generates code.
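As a rough illustration of the chunk, embed, retrieve, and prompt-augmentation flow the abstract describes, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `INDEX` list stands in for the Pinecone vector database, the re-ranking stage is omitted, and every name here (`chunk`, `embed`, `retrieve`, `build_prompt`, `DOCS`, `INDEX`) along with the abbreviated sample documentation strings is a hypothetical chosen for this example.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split cleaned documentation text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model (assumption): a term-count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, top_k=2):
    """Return the top_k chunk texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query, snippets):
    """Place the retrieved documentation ahead of the user's task in the prompt."""
    context = "\n---\n".join(snippets)
    return (
        "Use only the documentation below when writing code.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Task: {query}\n"
    )


# Abbreviated sample documentation strings, used purely for illustration.
DOCS = [
    "requests.get(url, params=None) sends a GET request and returns a Response object.",
    "json.dumps(obj, indent=None) serializes obj to a JSON formatted string.",
    "pathlib.Path.read_text(encoding=None) returns the decoded contents of the file as a string.",
]

# In-memory stand-in for the vector database: (chunk text, vector) pairs.
INDEX = [(c, embed(c)) for doc in DOCS for c in chunk(doc)]

if __name__ == "__main__":
    question = "serialize a dict to a JSON string"
    print(build_prompt(question, retrieve(question, INDEX, top_k=1)))
```

In a production system along the lines the abstract outlines, the term-count vectors would be replaced by dense embeddings stored in a vector database, and a re-ranking step would reorder the retrieved candidates before they are placed in the prompt.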
Recommended Citation
Amirtharaj, Daniel and Norwood, Rahmin, "Documentation-Aware Code Generation Via Retrieval Augmented Generation" (2025). Computer Science and Engineering Senior Theses. 319.
https://scholarcommons.scu.edu/cseng_senior/319
