I'm working with a dataset of around 400 million entries of company-owned products, where the names and addresses come in inconsistent formats. For example, one entry might be "Company A, 5th Avenue, Product: A," while another could be "Company A inc, New York, Product B."

My goal is to match these records against a ground truth dataset that contains clean names and parsed addresses for these companies, without relying on geocoding, since I don't have geocoded data in my ground truth. Ideally, the approach would take a parsed address and company name (and possibly extra signals like industry) and return the best matching candidates from the clean dataset with a score between 0 and 1.

There are also cases where the address is vague (e.g. just a city name) and the Google API might not give a definitive result. Do you have recommendations for handling datasets of this size and for managing ambiguous addresses? Finally, can the Google API handle global addresses, or would a language model be more effective at parsing addresses from different regions? Any help would be greatly appreciated!
2 Answers
You might want to check out pypostal, the Python bindings for libpostal; it does great address parsing and normalization. It could help standardize things a bit before you match them up.
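For reference, here is a minimal sketch of what the parse and expand calls look like, assuming the libpostal C library is installed and the Python bindings (the `postal` package) are available:

```python
# Requires the libpostal C library plus its Python bindings,
# installed as the "postal" package (repo: openvenues/pypostal).
from postal.parser import parse_address
from postal.expand import expand_address

raw = "Company A inc, 5th Avenue, New York"

# parse_address returns (value, label) tuples such as
# ("5th avenue", "road") or ("new york", "city").
print(parse_address(raw))

# expand_address returns normalized variants with abbreviations expanded,
# which is useful as a canonical form before matching.
print(expand_address("123 Main St., NYC"))
```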
First, clean and normalize your data as much as possible: expand abbreviations and make states and countries consistent. After that, it's essentially an entity resolution problem. Keep in mind that the quality of this first cleaning pass matters a lot for everything downstream. A sketch of the matching step follows below.
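As a rough sketch of that matching step (not a full entity-resolution pipeline): after normalization you can block on something cheap like the city, then score candidates with a fuzzy string metric. The example below uses rapidfuzz, a made-up `ground_truth` list, and placeholder normalization rules; scores are rescaled to the 0–1 range you asked for.

```python
# Sketch only: blocking + fuzzy scoring with rapidfuzz.
# `ground_truth` and the normalization rules are placeholders.
from collections import defaultdict
from rapidfuzz import fuzz

ground_truth = [
    {"name": "company a", "city": "new york"},
    {"name": "company b", "city": "boston"},
]

def normalize(s):
    # Lowercase and strip common legal suffixes; extend as needed.
    s = s.lower()
    for suffix in (" inc", " llc", " ltd", " corp"):
        s = s.removesuffix(suffix)
    return s.strip(" .,")

# Block ground-truth records by city so each dirty record is compared
# against a small candidate set instead of the whole clean dataset.
blocks = defaultdict(list)
for rec in ground_truth:
    blocks[rec["city"]].append(rec)

def best_matches(dirty_name, dirty_city, top_k=3):
    # Fall back to a full scan when the city is missing or unknown.
    candidates = blocks.get(dirty_city, ground_truth)
    scored = [
        (rec, fuzz.token_sort_ratio(normalize(dirty_name), rec["name"]) / 100.0)
        for rec in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

print(best_matches("Company A inc", "new york"))
```

At 400 million rows you would replace the plain dict with a proper blocking/indexing setup (e.g. the recordlinkage library or an inverted index on name tokens), but the scoring idea stays the same.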
Absolutely! I did something similar, and converting everything to lowercase and replacing common terms like 'trib' with 'tributary' really helped. I bet addresses will follow a similar pattern!
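Something like this, with a hypothetical abbreviation map you'd build from the terms that actually appear in your data:

```python
import re

# Hypothetical abbreviation map; populate it from your own data.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "trib": "tributary"}

def normalize_terms(text):
    # Lowercase, split on non-word characters, expand known abbreviations.
    tokens = re.split(r"\W+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens if tok)

print(normalize_terms("5th Ave., New York"))  # -> "5th avenue new york"
```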