I'm new to this field and have been having a tough time learning from various tutorials for a while now. I have around 4,000 detailed question-and-answer pairs specifically related to construction law, and I'm curious how to build a chatbot that can answer questions using this data and a law library while minimizing inaccuracies or 'hallucinations'. I'm eager to learn rather than just seeking a ready-made solution, and I'm open to investing in model training or paying for API usage. Any advice would be greatly appreciated!
2 Answers
While it's nearly impossible to eliminate hallucination entirely, you can reduce it significantly with Retrieval-Augmented Generation (RAG). With RAG, the chatbot retrieves relevant passages from your own dataset and grounds its answer in them, which also lets it point to where each answer came from. If you're looking for something user-friendly, I recommend NotebookLM. If you'd rather use OpenAI models, it's worth diving into how RAG works and experimenting with their API to build it yourself.
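To make the idea concrete, here's a minimal sketch of that loop with the OpenAI Python SDK. Everything specific is an assumption: the `qa_pairs` structure is a stand-in for your dataset, and the embedding and chat model names are just examples you can swap.

```python
# Minimal RAG sketch (pip install openai numpy).
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

qa_pairs = [
    {"question": "Who is liable for defects after handover?", "answer": "..."},
    # ... your ~4,000 Q&A pairs ...
]

# 1. Embed every stored question once and cache the vectors.
#    For ~4,000 items, send the inputs in batches (the API caps inputs per request).
texts = [p["question"] for p in qa_pairs]
emb = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = np.array([e.embedding for e in emb.data])

def answer(user_question: str, k: int = 5) -> str:
    # 2. Embed the incoming question and find the k most similar stored ones.
    q = np.array(
        client.embeddings.create(
            model="text-embedding-3-small", input=[user_question]
        ).data[0].embedding
    )
    # Cosine similarity between the query and every stored question.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    context = "\n\n".join(
        f"Q: {qa_pairs[i]['question']}\nA: {qa_pairs[i]['answer']}" for i in top
    )
    # 3. Ask the model to answer ONLY from the retrieved references.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": (
                "Answer using only the reference Q&A below. "
                "If the references don't cover the question, say you don't know."
            )},
            {"role": "user", "content": f"References:\n{context}\n\nQuestion: {user_question}"},
        ],
    )
    return resp.choices[0].message.content
```

The system prompt is the main hallucination-control lever here: restricting the model to the retrieved references, and telling it to admit when they don't cover the question, is what keeps answers anchored to your data.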
To start off, consider the size of your dataset, because it determines whether the whole thing can fit into a model's context window or you'll need retrieval. If your collection is about 128k tokens or less, OpenAI's o3 is an excellent choice. For larger datasets, around 1M tokens, Gemini 2.5 Pro might be better. If you're looking for a budget-friendly option, check out GPT-4.1 or Gemini 2.5 Flash. Here's a benchmark that could help you decide: [benchmark link]
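If you want to check which bucket your data falls into, count the tokens first. A small sketch using the tiktoken library; `dataset.txt` is a placeholder path, and cl100k_base only approximates tokenizers from vendors other than OpenAI:

```python
# Estimate the token count of your dataset (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("dataset.txt", encoding="utf-8") as f:
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"~{n_tokens:,} tokens")
# At or under roughly 128k, the whole dataset may fit in one prompt;
# well beyond that, plan on retrieval instead.
```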
Thanks for the insight! My dataset is closer to 3.5 million tokens, plus even more for the law library, so I appreciate the suggestions!
Thanks so much for the response! I tried RAG, retrieving the 5 most similar stored questions to build each answer, but the results weren't good enough. I'm planning to clean up my dataset and narrow down the set of laws I'm using.
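In case it helps anyone else, the cleanup I have in mind is splitting the law texts into overlapping chunks before re-embedding, so retrieval returns smaller, more focused passages instead of whole documents. A rough sketch; the chunk size, overlap, and file name are guesses I still need to tune:

```python
# Split a long law text into overlapping chunks for embedding.
# 800/200 are starting guesses; tuning them against real queries is the point.
def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk.strip():
            chunks.append(chunk)
        if start + size >= len(text):
            break  # avoid a final chunk that's entirely inside the previous one
    return chunks

law_text = open("building_code.txt", encoding="utf-8").read()  # placeholder file
chunks = chunk_text(law_text)
print(f"{len(chunks)} chunks")
```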