I'm working on a project involving a cross-retailer purchase memory system. The main challenge is normalizing inconsistent product data extracted from order confirmation emails from over 200 retailers. Each retailer formats product information differently: names, variants, sizes, SKUs, categories, and prices all vary significantly. For example, matching a product like "Men's Classic Fit Chino Pants - Khaki / 32x30" from one retailer to the equivalent item at another requires fuzzy matching rather than exact lookups.
Currently, my approach involves:
- Parsing email confirmations through OAuth (read-only, for post-purchase emails only).
- Using a multi-LLM pipeline that incorporates OpenAI and Anthropic for precise category-specific extraction.
- Normalizing the data against a catalog of over 500,000 indexed products.
- Classifying outcome signals (like kept, returned, replaced) from follow-up emails.
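For concreteness, this is roughly the record shape the pipeline normalizes toward; the field names here are illustrative, not my actual schema:

```python
from typing import Optional, TypedDict

# Illustrative target schema for a normalized purchase record; field
# names are examples, not the actual production schema.
class NormalizedProduct(TypedDict):
    brand: Optional[str]         # e.g. "Dockers"
    category: str                # taxonomy path, e.g. "apparel/pants/chinos"
    name: str                    # cleaned display name
    size: Optional[str]          # e.g. "32x30"
    color: Optional[str]         # e.g. "khaki"
    retailer_sku: Optional[str]  # retailer-specific identifier, if present
    price_cents: Optional[int]   # stored in cents to avoid float issues
    outcome: Optional[str]       # "kept" | "returned" | "replaced" | None
```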
The tough parts: ensuring product identity consistency across retailers, which often use different names and SKUs; maintaining category taxonomies; handling incomplete data from unstructured retailer emails; and attributing vague return outcomes from follow-up messages. Has anyone else tackled large-scale product normalization from diverse data sources? I'm curious about effective approaches for fuzzy matching, whether embedding-based similarity, structured extraction, or something else.
3 Answers
It sounds like your use of LLMs might be overkill for this task. What you're describing is essentially entity resolution, a well-studied problem. A combination of Levenshtein distance and basic NLP techniques can take you a long way, often to around 80% accuracy. The real challenge is cross-retailer identity, and my advice is: don't overthink it. Normalize each product to a canonical key of (brand, category, key attributes like size/color) and fuzzy match on that tuple rather than on free-form names (sketch below). Your 500,000-product catalog should let you build lookup tables that account for retailer-specific variants. Many people find parsing the emails to be the harder part, since some retailers are quite cryptic with their formatting!
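Here's that sketch, using only the standard library (difflib's SequenceMatcher as a stand-in for Levenshtein; the 0.8 threshold is a placeholder to tune on your data):

```python
import difflib
import re

def norm(s: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", " ", s.lower())).strip()

def canonical_key(brand: str, category: str, size: str, color: str) -> tuple:
    """Blocking key: exact match on normalized structured attributes."""
    return (norm(brand), norm(category), norm(size), norm(color))

def best_match(name: str, candidates: list[str], threshold: float = 0.8):
    """Fuzzy match a free-form name only within one key bucket."""
    scored = [(difflib.SequenceMatcher(None, norm(name), norm(c)).ratio(), c)
              for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None
```

Bucketing by the canonical key first keeps the fuzzy step cheap: you compare each incoming name against a handful of candidates in its bucket instead of all 500,000 products.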
I remember dealing with a similar situation years ago when I managed content for a startup. We faced the same inconsistency with product data from retailers—each one had their own way of presenting information. It's frustrating but very common. I wish I could provide a solution, but this discussion brings back memories of navigating all those differing formats!
You will likely need to create composite keys, or rely on SKUs where they exist. This is one of those problems that highlights the gap between raw model intelligence and practical judgment: labor-intensive data problems often need more than an AI pipeline. Unsupervised methods can handle parts of it, but human review still matters, especially when you're untangling ambiguous cross-retailer relationships. A rough sketch of that fallback ladder follows.
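This reuses the canonical_key and best_match helpers from the answer above; the two index structures are hypothetical and would be built once over the catalog:

```python
from typing import Optional

def resolve(record: dict, sku_index: dict, key_index: dict) -> Optional[dict]:
    """Resolution ladder: exact SKU, then composite key, then fuzzy name.

    sku_index maps (retailer, sku) -> catalog entry; key_index maps a
    canonical_key tuple -> list of catalog entries. Both are hypothetical
    indexes built over the 500k-product catalog.
    """
    # 1. Retailer-scoped SKU match: highest precision when a SKU is present.
    sku = record.get("retailer_sku")
    if sku and (record["retailer"], sku) in sku_index:
        return sku_index[(record["retailer"], sku)]

    # 2. Composite key on structured attributes; unambiguous if the bucket
    #    contains exactly one catalog entry.
    key = canonical_key(record.get("brand", ""), record.get("category", ""),
                        record.get("size", ""), record.get("color", ""))
    bucket = key_index.get(key, [])
    if len(bucket) == 1:
        return bucket[0]

    # 3. Fuzzy name match within the bucket; None means "queue for human review".
    name = best_match(record.get("name", ""), [e["name"] for e in bucket])
    return next((e for e in bucket if e["name"] == name), None)
```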
Thanks for the insight! It's really helpful to get different perspectives as I navigate this challenge.

You're right, normalizing to a primary key tuple is a solid approach! Many retailers represent brand names and SKUs differently, which complicates things, so it's crucial to account for those variations. I think leveraging embeddings for categorization may be necessary, especially for apparel, where terminology shifts drastically between retailers; rough sketch below.
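A minimal sketch of prototype-based categorization with embeddings, assuming sentence-transformers; the model choice, taxonomy nodes, and prototype wording are all placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

# Prototype text per taxonomy node; both the nodes and the wording are
# illustrative placeholders, not a real taxonomy.
prototypes = {
    "apparel/pants/chinos": "men's flat-front cotton twill chino trousers",
    "apparel/pants/jeans": "five-pocket denim jeans",
    "apparel/tops/tshirts": "short-sleeve crew-neck t-shirt",
}
labels = list(prototypes)
proto_vecs = model.encode(list(prototypes.values()), normalize_embeddings=True)

def categorize(product_name: str) -> str:
    """Return the taxonomy node whose prototype embedding is nearest (cosine)."""
    vec = model.encode([product_name], normalize_embeddings=True)[0]
    return labels[int(np.argmax(proto_vecs @ vec))]

print(categorize("Men's Classic Fit Chino Pants - Khaki / 32x30"))
# expected: "apparel/pants/chinos" given this toy prototype set
```

Nearest-prototype classification is a cheap starting point; if categories get fine-grained, a classifier fine-tuned on labeled examples would likely beat hand-written prototypes.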