When discussing the scaling of e-commerce, people often focus on seemingly grand technological challenges like distributed search, inventory management, and recommendation engines. But what truly plagues every e-commerce platform is a far more fundamental issue: inconsistent product attribute values.
Attribute values drive the entire product discovery system. They support filtering, comparison, search ranking, and recommendation logic. However, in real product catalogs, attribute values are rarely clean. Duplication, inconsistent formats, and ambiguous semantics are the norm.
Take a look at a seemingly simple “Size” attribute: [“XL”, “Small”, “12cm”, “Large”, “M”, “S”]
And a “Color” attribute: [“RAL 3020”, “Crimson”, “Red”, “Dark Red”]
On their own, these may seem manageable, but when you have over 3 million SKUs, each with dozens of attributes, the problem becomes a system-level challenge. Search becomes chaotic, recommendations fail, operations are overwhelmed with manual corrections, and user experience declines steadily.
Breaking the Black Box Mindset: Design Principles of a Hybrid Intelligent System
Faced with this challenge, the key is to avoid falling into the trap of “black box AI”—systems that mysteriously sort items without human understanding or control.
The correct approach is to build a pipeline with these characteristics:
High interpretability
Predictable behavior
Scalable operation
Accepts manual intervention
The ultimate solution is a Hybrid AI Pipeline: combining LLMs’ contextual understanding with explicit rules and manual controls. It operates intelligently when needed but always remains controllable. This is AI with guardrails, not out-of-control AI.
Offline Processing: The Foundation of Scalable Architecture
All attribute processing is performed in backend offline tasks, not in real-time. This is not a compromise but a strategic architectural decision.
Real-time pipelines sound attractive, but at e-commerce scale, they lead to:
Unpredictable latency fluctuations
Fragile dependency chains
Spikes in computational costs
Fragility in operations
Offline tasks, on the other hand, offer:
High throughput: batch processing massive data without impacting customer systems
Resilience: failures never reach user traffic
Cost control: computations scheduled during low-traffic periods
Isolation: LLM latency independent of product pages
Atomic consistency: updates are fully predictable and synchronized
When handling tens of millions of SKUs, isolating customer systems from data processing pipelines is critical.
Data Cleaning: The Highest ROI Step
Before applying AI, rigorous preprocessing is necessary—this step appears simple but yields significant results.
Cleaning pipeline includes:
Removing leading/trailing spaces
Eliminating null values
Deduplication
Simplifying categorical paths into structured strings
This ensures the LLM receives clean, unambiguous input. In large-scale systems, even minor noise can snowball into major issues downstream. Garbage in, garbage out: this fundamental rule only gets more brutal at the scale of millions of records.
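The steps above can be sketched as a single cleaning pass (function and field names here are illustrative, not the production code):

```python
def clean_values(raw_values, category_path=None):
    """Strip whitespace, drop nulls/empties, dedupe while preserving order,
    and flatten the category path into a breadcrumb string."""
    seen, cleaned = set(), []
    for value in raw_values:
        if value is None:
            continue
        value = str(value).strip()
        if not value or value.lower() in {"null", "n/a"}:
            continue
        if value not in seen:
            seen.add(value)
            cleaned.append(value)
    breadcrumb = " > ".join(category_path) if category_path else ""
    return cleaned, breadcrumb
```

For example, `clean_values(["  XL ", None, "XL", "", "S"], ["Apparel", "Shirts"])` yields the deduplicated values plus the breadcrumb “Apparel > Shirts” that the LLM stage consumes.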
Contextual Empowerment for LLM Services
LLMs are not just sorting attribute values alphabetically. They truly understand their meanings.
This service receives:
Cleaned attribute values
Category information (breadcrumbs)
Attribute metadata
With this context, the model can understand:
“Voltage” in power tools should be sorted numerically
“Size” in clothing follows a predictable progression (S→M→L→XL)
“Color” may use RAL standards (e.g., RAL 3020 codes)
“Material” in hardware has semantic relationships (Steel → Stainless Steel → Carbon Steel)
The model returns:
Sorted value sequences
Complete attribute names
A decision flag: whether to use deterministic sorting or context-aware sorting
This enables the pipeline to handle various attribute types without hardcoding rules for each category.
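As a sketch, the exchange with the sorting service can be modeled like this (the JSON field names and the strategy values are assumptions, not a documented API):

```python
import json
from dataclasses import dataclass

@dataclass
class SortResult:
    sorted_values: list
    full_attribute_name: str
    strategy: str  # "DETERMINISTIC" or "CONTEXTUAL"

def build_context(attribute, values, breadcrumb):
    """Bundle the cleaned values with category context for the model prompt."""
    return {"attribute": attribute, "category_breadcrumb": breadcrumb, "values": values}

def parse_response(raw_json):
    """Validate the model's JSON reply into a typed result with a decision flag."""
    data = json.loads(raw_json)
    strategy = data.get("strategy", "CONTEXTUAL")
    if strategy not in {"DETERMINISTIC", "CONTEXTUAL"}:
        raise ValueError(f"unknown strategy: {strategy}")
    return SortResult(data["sorted_values"], data["full_attribute_name"], strategy)
```

Validating the reply into a typed result keeps a malformed model response from silently corrupting downstream writes.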
Deterministic Fallback: Knowing When Not to Use AI
Not every attribute requires AI. In fact, many are better handled with deterministic logic: numerical ranges, normalized values, and simple sets all benefit from it.
The pipeline automatically detects these cases and applies deterministic logic, maintaining system efficiency and avoiding unnecessary LLM calls.
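A minimal sketch of such detection, assuming the simple case of number-plus-unit values:

```python
import re

_NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z\"']*)\s*$")

def try_deterministic_sort(values):
    """Return a numeric sort when every value is a number plus an optional
    shared unit; return None to signal that the LLM path is needed."""
    matches = [_NUMERIC.match(v) for v in values]
    if not all(matches):
        return None
    units = {m.group(2).lower() for m in matches}
    if len(units) > 1:
        return None  # mixed units are ambiguous; defer to context-aware sorting
    return sorted(values, key=lambda v: float(_NUMERIC.match(v).group(1)))
```

Returning `None` instead of guessing is the point: the deterministic path only claims the cases it can handle unambiguously.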
Power Balance: Merchant Tagging System
Merchants need to retain control, especially over key attributes. Therefore, each category can be tagged as:
LLM_SORT — model decides
MANUAL_SORT — merchants define the order manually
This dual-label system allows humans to retain ultimate authority while AI handles most of the work. It also builds trust—merchants know they can override model decisions at any time without disrupting the pipeline.
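A sketch of how the tag could route each category (the tag names come from the text; the callback shape is an assumption):

```python
def sort_by_tag(values, tag, manual_order=None, llm_sort=None):
    """Route an attribute's values by its category tag.
    MANUAL_SORT wins outright; LLM_SORT defers to the model service."""
    if tag == "MANUAL_SORT":
        rank = {v: i for i, v in enumerate(manual_order or [])}
        # unknown values sink to the end, alphabetically, so overrides never crash
        return sorted(values, key=lambda v: (rank.get(v, len(rank)), v))
    if tag == "LLM_SORT" and llm_sort is not None:
        return llm_sort(values)
    return sorted(values)  # safe alphabetical fallback
```

Because the manual branch tolerates values missing from the merchant's list, an override can never break the pipeline, which is exactly the trust property described above.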
Data Persistence: Using MongoDB as the Single Source of Truth
All results are directly written into the Product MongoDB, keeping architecture simple and centralized. MongoDB becomes the sole operational store for:
Sorted attribute values
Complete attribute names
Category-level sorting tags
Product-level sorting fields
This makes change auditing, value overrides, re-categorization, and synchronization with other systems straightforward.
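For illustration, the write could be shaped as a single `$set` update per attribute (the field names are hypothetical; in production the document would be handed to `update_one` on the product collection via pymongo):

```python
def build_product_update(attribute, sorted_values, full_name, sort_tag):
    """Build the MongoDB $set payload for one attribute's sorted results.
    In production: products.update_one({"sku": sku}, build_product_update(...))."""
    prefix = f"attributes.{attribute}"
    return {
        "$set": {
            f"{prefix}.sorted_values": sorted_values,
            f"{prefix}.full_name": full_name,
            f"{prefix}.sort_tag": sort_tag,
        }
    }
```

Dot-notation paths let each attribute be updated atomically without rewriting the whole product document, which keeps overrides and re-runs cheap.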
Closed-Loop Search Layer: From Data to Discovery
Once sorting is complete, values flow into:
Elasticsearch — keyword-driven search
Vespa — semantic and vector-based search
This ensures:
Filter options appear in logical order
Product pages display consistent attributes
Search results are more accurately ranked
Browsing categories is intuitive and smooth
The power of attribute sorting is most evident in search, where consistency is critical.
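One way to carry the order into the search layer is to index a rank alongside each value, so filters render in the intended sequence (a sketch; the field naming is an assumption, not the actual index schema):

```python
def build_search_doc(sku, sorted_attributes):
    """Build one index document: raw values for matching, plus per-value
    ranks so Elasticsearch/Vespa can order filter facets consistently."""
    doc = {"sku": sku}
    for name, values in sorted_attributes.items():
        doc[name] = values
        doc[f"{name}_rank"] = {v: i for i, v in enumerate(values)}
    return doc
```

The same document serves both engines: keyword matching reads the plain values, while facet rendering reads the rank map.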
System Overview: From Raw Data to User Interface
To operate this system over millions of SKUs, I designed a modular pipeline centered around backend tasks, AI inference, and search integration:
Data flow:
Product data sourced from the product information system
Attribute extraction tasks pull attribute values and category context
These are sent to the AI sorting service
Updated product documents are written into Product MongoDB
Outbound sync tasks push sorted results back to the product info system
Elasticsearch and Vespa sync tasks update their respective search indexes
API services connect search engines with client applications
This process ensures that every attribute value—whether sorted by AI or manual override—is reflected in search, shelf management, and the final customer experience.
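The flow above can be condensed into one offline pass per product (a sketch with hypothetical names; `sort_service` stands in for the AI sorting call):

```python
def process_product(product, sort_service):
    """One offline pass: extract attributes with category context, sort each,
    and emit the $set update destined for Product MongoDB."""
    updates = {}
    for attr, values in product["attributes"].items():
        cleaned = [v.strip() for v in values if v and v.strip()]
        updates[f"attributes.{attr}.sorted_values"] = sort_service(
            attr, cleaned, product.get("category_path", [])
        )
    return {"$set": updates}
```

A batch job would map this function over a chunk of SKUs and hand the resulting updates to the MongoDB writer and the search sync tasks.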
Practical Outcomes of the Transformation
How are chaotic raw values transformed?
| Attribute | Raw Chaotic Values | Sorted Output |
|---|---|---|
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
These examples demonstrate how the pipeline combines contextual understanding with clear rules to produce clean, understandable sequences.
Why Offline Instead of Real-Time?
Real-time processing would introduce:
Unpredictable latency fluctuations
Higher computational costs
Fragile dependency chains
Operational complexity
Offline tasks provide:
Batch processing efficiency
Asynchronous LLM calls
Retry logic and dead-letter queues
Manual review windows
Fully predictable costs
The trade-off is a slight delay from data ingestion to display, but the benefit is large-scale consistency—what customers truly value.
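The retry and dead-letter behavior mentioned above can be sketched as a small wrapper (parameter names are illustrative):

```python
import time

def run_with_retry(task, payload, max_attempts=3, backoff=1.0, dead_letter=None):
    """Run a batch task with exponential backoff between attempts; after the
    final failure, park the payload in a dead-letter list for manual review."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(payload)
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter is not None:
                    dead_letter.append({"payload": payload, "error": str(exc)})
                return None
            time.sleep(backoff * 2 ** (attempt - 1))
```

Because everything runs offline, a parked batch simply waits in the dead-letter queue for the next manual review window; no customer request is ever blocked on it.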
Business Impact
The results are quite significant:
Attribute sorting consistency across 3 million+ SKUs
Predictable numeric sorting via deterministic fallback
Fine-grained manual control for merchants through tagging
Cleaner product pages and intuitive filtering
Improved search relevance
Increased user trust and conversion rates
This is not just a technical victory but a win for user experience and revenue.
Key Takeaways
Hybrid pipelines outperform pure AI solutions at scale. Guardrails are essential.
Context significantly improves LLM accuracy
Offline tasks are the backbone of throughput and fault tolerance
Manual override mechanisms build trust and acceptance
Clean input is the foundation of reliable AI output
Conclusion
Sorting attribute values may seem simple, but when scaling to millions of products, it becomes a real challenge. By combining the intelligence of LLMs with clear rules and merchant controls, this invisible yet pervasive problem is transformed into a clean, scalable system.
A reminder: the greatest victories often come from solving those boring, overlooked problems—those that appear on every product page every day.