What’s the best way to normalize product data from over 200 retailers?

Asked By CuriousCat37 On

I'm tackling a complex technical challenge for a cross-retailer purchase memory system. The main task involves ingesting order confirmation emails from various retailers and normalizing the inconsistent product data into a standardized format. Each retailer has its own way of presenting product information, including names, sizes, SKUs, and categories, which makes it tough to compare products across platforms. For example, how do you match a product like "Men's Classic Fit Chino Pants - Khaki / 32x30" from one retailer to a similar item elsewhere?

Currently, I'm parsing confirmation emails using read-only OAuth access for post-purchase notifications and extracting product details through a multi-LLM pipeline that uses OpenAI and Anthropic models for more accurate categorization. I then normalize the data against a product catalog of over 500,000 indexed products and assess purchase outcomes (such as whether items were kept or returned) based on follow-up emails.

The major hurdles include maintaining product identity across retailers (where the same product may have vastly different names), achieving consistency in category taxonomy, managing incomplete data from less structured emails, and attributing outcomes when return emails aren't clear. I'm keen to hear about strategies for handling large-scale product normalization and approaches to the fuzzy matching problem, whether it involves embedding-based similarity, structured extraction, or other methods that are effective at scale.
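As a starting point on the structured-extraction side, a few heuristics can pull rough attributes out of a free-form title before any matching happens. This is a hypothetical sketch that assumes the "Name - Color / Size" layout from the chino example; real confirmation emails would need per-retailer rules on top of it.

```python
import re

def parse_product_title(title):
    """Extract rough structured attributes from a free-form product title.

    Heuristic sketch: assumes the common 'Name - Color / Size' layout;
    real emails will need per-retailer rules layered on top.
    """
    attrs = {"name": title.strip(), "color": None, "size": None}
    # Size tokens like "32x30" (waist x inseam) or a single letter size
    m = re.search(r"/\s*([0-9]{2}x[0-9]{2}|XS|S|M|L|XL|XXL)\s*$", title)
    if m:
        attrs["size"] = m.group(1)
        title = title[: m.start()]
    # Color appears after a trailing " - " separator
    m = re.search(r"-\s*([A-Za-z ]+?)\s*$", title)
    if m:
        attrs["color"] = m.group(1).strip()
        title = title[: m.start()]
    attrs["name"] = title.strip().rstrip("-/ ").strip()
    return attrs
```

Running it on the example title yields a tuple-friendly dict (`name`, `color`, `size`) that downstream matching can key on instead of the raw string.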

3 Answers

Answered By DataDude82 On

Honestly, your LLM pipeline might be more than you need. You're essentially facing an entity resolution problem, which has well-established solutions. Using techniques like Levenshtein distance along with basic NLP can get you about 80% of the way there. For the tricky part—cross-retailer identity—consider normalizing everything to a tuple of (brand, category, key attributes like size/color). This way, you can fuzzy match on that instead of trying to deal with the chaos of free-form product names. Your catalog of 500k products should help create reasonable lookup tables for each retailer once you figure out their common variants.
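The tuple-and-fuzzy-match idea above can be sketched like this, using `difflib.SequenceMatcher` from the standard library as a stand-in for a dedicated Levenshtein library. The field weights are illustrative, not tuned values:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1] (difflib stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(p1, p2, weights=(0.5, 0.2, 0.3)):
    """Score two (brand, category, key-attributes) tuples.

    Weighting brand highest is an illustrative choice; tune on labeled pairs.
    """
    w_brand, w_cat, w_attr = weights
    return (w_brand * similarity(p1[0], p2[0])
            + w_cat * similarity(p1[1], p2[1])
            + w_attr * similarity(p1[2], p2[2]))
```

Comparing on the tuple rather than the raw title means a retailer's cosmetic renaming of the product name field only degrades one component of the score instead of sinking the whole match.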

AnalyzerAlex -

That's a useful way to frame it! The tuple approach makes sense, but I've found even the brand field can be messy. Retailers often mangle brand names in unexpected ways, which makes fuzzy matching necessary even for that primary key. As for the LLM pipeline, I do agree that it shines more in handling the complexities of poorly structured emails rather than the matching itself.
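For the mangled-brand problem specifically, one common approach is to fuzzy-match each raw brand string against a small canonical brand list before it ever enters the tuple. A minimal sketch using `difflib.get_close_matches`, with an illustrative brand list and a cutoff that would need tuning:

```python
from difflib import get_close_matches

# Illustrative canonical list; in practice this comes from the product catalog
CANONICAL_BRANDS = ["Levi's", "Banana Republic", "J.Crew"]

def normalize_brand(raw, cutoff=0.6):
    """Map a retailer's mangled brand string to a canonical brand, if close enough.

    Returns None when nothing clears the cutoff, so callers can route the
    record to a review queue instead of guessing.
    """
    candidates = {b.lower(): b for b in CANONICAL_BRANDS}
    hits = get_close_matches(raw.strip().lower(), candidates.keys(),
                             n=1, cutoff=cutoff)
    return candidates[hits[0]] if hits else None
```

Variants like "LEVIS" then normalize to the canonical spelling, while genuinely unknown strings fall through to manual review rather than silently mismatching.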

Answered By NostalgiaNerd On

Wow, I can relate to this! Back in the dot-com boom, I worked at a startup where we faced a similar challenge. We convinced retailers to provide their product data, and each formatted it differently. It’s a really tough nut to crack; sorry I can't provide current solutions, but it definitely brings back memories from those days!

Answered By TechieTina On

You might have to create composite keys for some products and rely on SKUs for others. It's a good example of a problem where AI alone struggles and human insight still matters; there's a clear distinction between intelligence and wisdom here. A model trained with unsupervised learning might help, but reaching out to machine learning communities could provide you with more tailored guidance.
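The mixed-key idea can be sketched as a small dispatch: prefer a retailer SKU when one was extracted, otherwise fall back to a normalized attribute tuple. The key scheme here is hypothetical, just to show the shape:

```python
def composite_key(brand, category, size=None, color=None, sku=None):
    """Build a matching key for a product record.

    Hypothetical scheme: an exact SKU key when available, otherwise a
    normalized attribute tuple for fuzzy/blocking lookups.
    """
    if sku:
        return ("sku", sku.strip().upper())
    parts = [brand, category, size or "", color or ""]
    return ("attrs", "|".join(p.strip().lower() for p in parts))
```

Tagging each key with its type ("sku" vs "attrs") keeps exact and fuzzy matches in separate buckets, so an attribute collision can never be mistaken for a SKU-level identity.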

CuriousCat37 -

Thanks! I appreciate the insight!
