Practical approaches to solving large-scale e-commerce product attribute chaos with AI

When discussing e-commerce at scale, people often focus on grand technical challenges such as distributed search, inventory management, and recommendation engines. But what truly plagues every platform is a far more fundamental issue: inconsistent product attribute values.

Attribute values drive the entire product discovery system. They support filtering, comparison, search ranking, and recommendation logic. However, in real product catalogs, attribute values are rarely clean. Duplication, inconsistent formats, and ambiguous semantics are the norm.

Take a look at a seemingly simple “Size” attribute: [“XL”, “Small”, “12cm”, “Large”, “M”, “S”]

And “Color”: [“RAL 3020”, “Crimson”, “Red”, “Dark Red”]

On their own, these may seem manageable, but when you have over 3 million SKUs, each with dozens of attributes, the problem becomes a system-level challenge. Search becomes chaotic, recommendations fail, operations are overwhelmed with manual corrections, and user experience declines steadily.

Breaking the Black Box Mindset: Design Principles of a Hybrid Intelligent System

Faced with this challenge, the key is to avoid falling into the trap of “black box AI”—systems that mysteriously sort items without human understanding or control.

The correct approach is to build a pipeline with these characteristics:

  • High interpretability
  • Predictable behavior
  • Scalable operation
  • Accepts manual intervention

The ultimate solution is a Hybrid AI Pipeline: combining LLMs’ contextual understanding with explicit rules and manual controls. It operates intelligently when needed but always remains controllable. This is AI with guardrails, not out-of-control AI.

Offline Processing: The Foundation of Scalable Architecture

All attribute processing is performed in backend offline tasks, not in real-time. This is not a compromise but a strategic architectural decision.

Real-time pipelines sound attractive, but at e-commerce scale, they lead to:

  • Unpredictable latency fluctuations
  • Fragile dependency chains
  • Spikes in computational costs
  • Fragility in operations

Offline tasks, on the other hand, offer:

  • High throughput: batch processing massive data without impacting customer systems
  • Resilience: failures never reach user traffic
  • Cost control: computations scheduled during low-traffic periods
  • Isolation: LLM latency independent of product pages
  • Atomic consistency: updates are fully predictable and synchronized

When handling millions of SKUs, isolating customer-facing systems from data processing pipelines is critical.

Data Cleaning: The Highest ROI Step

Before applying AI, rigorous preprocessing is necessary—this step appears simple but yields significant results.

Cleaning pipeline includes:

  • Removing leading/trailing spaces
  • Eliminating null values
  • Deduplication
  • Simplifying categorical paths into structured strings

This ensures the LLM receives clean, unambiguous input. In large-scale systems, even minor noise compounds into major issues downstream. Garbage in → garbage out: this fundamental rule becomes even more brutal at the scale of millions of records.
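The cleaning steps above can be sketched as a single pass over the raw values. This is a minimal illustration, not the production code; the function name, breadcrumb separator, and field shapes are assumptions:

```python
def clean_attribute_values(raw_values, category_path=None):
    """Normalize raw attribute values before they reach the LLM.

    Mirrors the pipeline steps above: trim whitespace, drop nulls
    and empties, deduplicate while preserving order, and flatten
    the category path into one breadcrumb string.
    """
    seen = set()
    cleaned = []
    for value in raw_values:
        if value is None:
            continue  # eliminate null values
        value = value.strip()  # remove leading/trailing spaces
        if not value or value in seen:
            continue  # skip empties and duplicates
        seen.add(value)
        cleaned.append(value)
    # Simplify a categorical path like ["Apparel", "Shirts"]
    # into a structured string "Apparel > Shirts"
    breadcrumb = " > ".join(category_path) if category_path else ""
    return cleaned, breadcrumb

values, crumb = clean_attribute_values(
    ["  XL ", None, "Small", "XL", "", "12cm"],
    category_path=["Apparel", "Shirts"],
)
# values == ["XL", "Small", "12cm"]; crumb == "Apparel > Shirts"
```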

Contextual Empowerment for LLM Services

LLMs are not just sorting attribute values alphabetically. They truly understand their meanings.

This service receives:

  • Cleaned attribute values
  • Category information (breadcrumbs)
  • Attribute metadata

With this context, the model can understand:

  • “Voltage” in power tools should be sorted numerically
  • “Size” in clothing follows a predictable progression (S→M→L→XL)
  • “Color” may use RAL standards (e.g., RAL 3020 codes)
  • “Material” in hardware has semantic relationships (Steel → Stainless Steel → Carbon Steel)

The model returns:

  • Sorted value sequences
  • Complete attribute names
  • A decision flag: whether to use deterministic sorting or context-aware sorting

This enables the pipeline to handle various attribute types without hardcoding rules for each category.
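A rough sketch of the request and response contract described above. The article does not specify the service's schema, so every field name here is an assumption for illustration:

```python
import json

def build_sort_request(attribute_name, values, breadcrumb):
    """Assemble the context the LLM sorting service receives.

    Field names are illustrative; the real service's schema
    is not documented in the article.
    """
    return {
        "attribute": attribute_name,        # attribute metadata
        "values": values,                   # cleaned attribute values
        "category_breadcrumb": breadcrumb,  # e.g. "Apparel > Shirts"
    }

def parse_sort_response(raw_json):
    """Extract the three outputs the model returns."""
    reply = json.loads(raw_json)
    return (
        reply["sorted_values"],        # sorted value sequence
        reply["full_attribute_name"],  # complete attribute name
        reply["use_deterministic"],    # decision flag
    )

sorted_vals, full_name, deterministic = parse_sort_response(
    '{"sorted_values": ["S", "M", "L", "XL"],'
    ' "full_attribute_name": "Clothing Size",'
    ' "use_deterministic": false}'
)
```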

Deterministic Fallback: Knowing When Not to Use AI

Not every attribute requires AI. In fact, many are better handled with deterministic logic.

Numerical ranges, normalized values, and simple enumerated sets benefit from:

  • Faster processing
  • Fully predictable ordering
  • Lower costs
  • Zero ambiguity

The pipeline automatically detects these cases and applies deterministic logic, maintaining system efficiency and avoiding unnecessary LLM calls.
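One way to detect such cases is a simple numeric-with-unit check before any LLM call. The detection rule below is a simplified guess at the real heuristic, not the system's actual logic:

```python
import re

# Matches a number optionally followed by a unit, e.g. "12cm", "5.5", "20%"
_NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z%]*)\s*$")

def try_deterministic_sort(values):
    """Return a numerically sorted list if every value is numeric
    with a single consistent unit; otherwise return None so the
    pipeline falls back to the context-aware (LLM) path.
    """
    matches = [_NUMERIC.match(v) for v in values]
    if not all(matches):
        return None  # non-numeric values present: needs context-aware sorting
    if len({m.group(2) for m in matches}) > 1:
        return None  # mixed units (cm vs mm) are ambiguous without context
    return sorted(values, key=lambda v: float(_NUMERIC.match(v).group(1)))

print(try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]))  # ['2cm', '5cm', '12cm', '20cm']
print(try_deterministic_sort(["XL", "Small", "12cm"]))         # None
```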

Power Balance: Merchant Tagging System

Merchants need to retain control, especially over key attributes. Therefore, each category can be tagged as:

  • LLM_SORT — model decides
  • MANUAL_SORT — merchants define the order manually

This dual-label system allows humans to retain ultimate authority while AI handles most of the work. It also builds trust—merchants know they can override model decisions at any time without disrupting the pipeline.
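The dispatch implied by the dual-label system is small enough to sketch. The tag names come from the article; the function and its arguments are illustrative:

```python
def resolve_sort_order(category_tag, llm_sorted, manual_order):
    """Pick the final value order for a category based on its tag.

    MANUAL_SORT always wins, so merchants can override the model
    at any time without disrupting the rest of the pipeline.
    """
    if category_tag == "MANUAL_SORT" and manual_order:
        return manual_order
    return llm_sorted  # LLM_SORT (or no manual order defined): model decides

# Merchant override takes precedence over the model's suggestion:
resolve_sort_order("MANUAL_SORT", ["M", "S"], ["S", "M"])  # ["S", "M"]
resolve_sort_order("LLM_SORT", ["S", "M"], None)           # ["S", "M"]
```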

Data Persistence: Using MongoDB as the Single Source of Truth

All results are directly written into the Product MongoDB, keeping architecture simple and centralized. MongoDB becomes the sole operational store for:

  • Sorted attribute values
  • Complete attribute names
  • Category-level sorting tags
  • Product-level sorting fields

This makes change auditing, value overrides, re-categorization, and synchronization with other systems straightforward.
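A sketch of what one persisted update might look like. The article lists what is stored but not the exact document schema, so the field paths below are assumptions:

```python
def build_attribute_update(attribute_name, sorted_values, sort_tag):
    """Build the MongoDB $set document for one attribute of a product.

    Field paths are illustrative; the real schema is not specified.
    """
    return {
        "$set": {
            f"attributes.{attribute_name}.sorted_values": sorted_values,
            f"attributes.{attribute_name}.sort_tag": sort_tag,
        }
    }

update = build_attribute_update("size", ["S", "M", "L", "XL"], "LLM_SORT")
# With pymongo, this would be applied along the lines of:
# db.products.update_one({"_id": product_id}, update)
```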

Closed-Loop Search Layer: From Data to Discovery

Once sorting is complete, values flow into:

  • Elasticsearch — keyword-driven search
  • Vespa — semantic and vector-based search

This ensures:

  • Filter options appear in logical order
  • Product pages display consistent attributes
  • Search results are more accurately ranked
  • Browsing categories is intuitive and smooth

The power of attribute sorting is most evident in search, where consistency is critical.

System Overview: From Raw Data to User Interface

To operate this system over millions of SKUs, I designed a modular pipeline centered around backend tasks, AI inference, and search integration:

Data flow:

  • Product data sourced from the product information system
  • Attribute extraction tasks pull attribute values and category context
  • These are sent to the AI sorting service
  • Updated product documents are written into Product MongoDB
  • Outbound sync tasks push sorted results back to the product info system
  • Elasticsearch and Vespa sync tasks update their respective search indexes
  • API services connect search engines with client applications

This process ensures that every attribute value—whether sorted by AI or manual override—is reflected in search, shelf management, and the final customer experience.
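The overall batch task can be sketched as a loop over products that delegates to a sorting service and a store. Both interfaces here are stand-ins invented for illustration, not the real components:

```python
def run_attribute_sort_batch(products, sort_service, store):
    """Offline batch task tying the data flow together.

    `sort_service` stands in for the AI sorting service and
    `store` for the Product MongoDB write; their signatures
    are assumptions for this sketch.
    """
    for product in products:
        for attr, values in product["attributes"].items():
            sorted_values = sort_service(attr, values, product["category"])
            store(product["id"], attr, sorted_values)

# Usage with trivial stand-ins:
results = {}
run_attribute_sort_batch(
    [{"id": "p1", "category": "Apparel", "attributes": {"size": ["M", "S"]}}],
    sort_service=lambda attr, vals, cat: sorted(vals),
    store=lambda pid, attr, vals: results.setdefault(pid, {}).update({attr: vals}),
)
```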

Practical Outcomes of the Transformation

How are chaotic raw values transformed?

| Attribute | Raw Chaotic Values                              | Sorted Output                        |
|-----------|-------------------------------------------------|--------------------------------------|
| Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm            |
| Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020     |
| Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm                 |

These examples demonstrate how the pipeline combines contextual understanding with clear rules to produce clean, understandable sequences.

Why Offline Instead of Real-Time?

Real-time processing would introduce:

  • Unpredictable latency fluctuations
  • Higher computational costs
  • Fragile dependency chains
  • Operational complexity

Offline tasks provide:

  • Batch processing efficiency
  • Asynchronous LLM calls
  • Retry logic and dead-letter queues
  • Manual review windows
  • Fully predictable costs

The trade-off is a slight delay from data ingestion to display, but the benefit is large-scale consistency—what customers truly value.

Business Impact

The results are quite significant:

  • Attribute sorting consistency across 3 million+ SKUs
  • Predictable numeric sorting via deterministic fallback
  • Fine-grained manual control for merchants through tagging
  • Cleaner product pages and intuitive filtering
  • Improved search relevance
  • Increased user trust and conversion rates

This is not just a technical victory but a win for user experience and revenue.

Key Takeaways

  • Hybrid pipelines outperform pure AI solutions at scale. Guardrails are essential.
  • Context significantly improves LLM accuracy
  • Offline tasks are the backbone of throughput and fault tolerance
  • Manual override mechanisms build trust and acceptance
  • Clean input is the foundation of reliable AI output

Conclusion

Sorting attribute values may seem simple, but when scaling to millions of products, it becomes a real challenge. By combining the intelligence of LLMs with clear rules and merchant controls, this invisible yet pervasive problem is transformed into a clean, scalable system.

A reminder: the greatest victories often come from solving those boring, overlooked problems—those that appear on every product page every day.
