I'm a tech trainee and relatively new to the job scene, so I could use some assistance with a project I'm working on. I've been using a template called Claude Haiku 3 to extract important details from specific documents. Recently, I've noticed that the template started giving me incorrect or missing information. This mainly happens when the data isn't available—for example, it refers to U.S. identifiers, but in my country, we use equivalents like the Social Security Number (SSN) and Employer Identification Number (EIN).
Sometimes the template works perfectly and displays the correct numbers, but other times, especially when the necessary information is incomplete, it either invents data or confuses the two identifiers. Overall, it seems to work fine about 90% of the time, but I've never had this kind of issue before, and it's only cropped up recently after several months of smooth operation.
Do you think there's a fix for this I could try with my prompts? I've already attempted to modify them for clarity and set specific conditions, but those efforts haven't really helped. Is it worth trying a different template, like Haiku 3.5, or is this a limitation of the model in use? I can't afford to switch to any more expensive templates.
1 Answer
It sounds like the issues you're facing are a bit of a common challenge with large language models (LLMs) when it comes to 'hallucination.' Have you tried using a few-shot prompt? By giving an explicit example where the SSN is missing, you can instruct the model not to create a filler value. But just keep in mind, there might be bias in the model that pushes it towards trying to fill in those gaps, which could lead to incorrect outputs.
I did give that a shot! I also made sure the temperature was set to minimum. Yet, I'm still facing issues with this specific document, especially with certain entries. For example, in a partnership document, Subject A has an SSN but no EIN, and Subject B has neither. The model seems to get confused, using SSN data for the EIN field, which shouldn't be the case. I get why it's happening—SSNs and EINs share similar formatting.