E-Commerce at Scale: How Software Engineers Systematically Solve Attribute Chaos

Sorting product attributes may seem trivial until you have to do it for three million SKUs. The hidden complexity of e-commerce systems lies not in headline challenges like distributed search or real-time inventory, but in the quieter backbone of data consistency: sizes, colors, materials, and other product attributes must be structured precisely and predictably.

The problem is real. In actual product catalogs, you see chaotic values: sizes like “XL,” “Small,” “12cm,” “Large,” “M,” “S” mixed together. Colors like “RAL 3020,” “Crimson,” “Red,” “Dark Red.” Materials like “Steel,” “Carbon Steel,” “Stainless,” “Stainless Steel.” Each inconsistency seems harmless on its own, but multiplied across millions of products, it becomes systemic. Filters behave unpredictably, search engines lose relevance, and the customer experience suffers.

Core Strategy: Hybrid Intelligence with Clear Rules

Instead of deploying a black-box AI, a software engineer designed a controlled hybrid pipeline. The goal was not magical automation but a solution that:

  • Is explainable
  • Works predictably
  • Scales over millions of records
  • Can be controlled by humans

This pipeline combines the contextual reasoning of large language models (LLMs) with deterministic rules and merchant oversight. It acts intelligently but stays transparent at every step: AI with guardrails, not AI out of control.
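As a concrete illustration of such a guardrail, a service can refuse any model output that is not an exact permutation of its input. A minimal Python sketch (the function name and fallback behavior are assumptions, not the original implementation):

```python
def validate_llm_order(original: list[str], proposed: list[str]) -> list[str]:
    """Guardrail: accept the model's ordering only if it is an exact
    permutation of the input; otherwise keep the original order."""
    if sorted(original) == sorted(proposed):
        return proposed
    return original  # reject orderings that drop or invent values
```

A check like this keeps the model's creativity bounded: it may reorder values, but it can never silently add or remove them.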

Offline Processing Instead of Real-Time: A Strategic Decision

All attribute processing runs in background jobs, not in real-time systems. This was a deliberate choice because real-time pipelines at e-commerce scale lead to:

  • Unpredictable latency
  • Fragile dependencies
  • Costly compute peaks
  • Operational instability

Offline jobs, on the other hand, offer:

  • High throughput via batch processing without impacting live systems
  • Resilience, as failures do not affect customer traffic
  • Cost control through scheduled processing during off-peak hours
  • Isolation of LLM latency from product pages
  • Atomic, predictable updates

This separation between customer interfaces and data processing pipelines is crucial when dealing with millions of SKUs.
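A minimal Python sketch of what such an offline job skeleton might look like (the stage callables and the batch size are placeholders, not the actual implementation):

```python
def run_sorting_job(fetch_batch, sort_batch, persist_batch, batch_size=500):
    """Skeleton of the offline job: walk the catalog in batches,
    fully isolated from customer-facing traffic.

    fetch_batch, sort_batch, and persist_batch stand in for the
    real pipeline stages described in the sections below.
    """
    while batch := fetch_batch(batch_size):
        try:
            persist_batch(sort_batch(batch))
        except Exception:
            # A failed batch would be logged and retried later;
            # customer traffic is never affected.
            continue
```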

The Processing Pipeline: From Raw Data to Intelligence

Before applying AI, a critical preprocessing step occurs:

  • Trim whitespace
  • Remove empty values
  • Remove duplicate values
  • Structure category context information

This step massively reduces noise and significantly improves the language model’s reasoning. The rule is simple: clean input means reliable output. At scale, even small upstream errors compound into systemic problems.
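A minimal sketch of this cleaning pass in Python (the case-insensitive deduplication is an assumption; the original system may normalize differently):

```python
def preprocess(values: list[str]) -> list[str]:
    """Trim whitespace, drop empty values, and deduplicate
    (case-insensitively) while preserving first-seen order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in values:
        value = raw.strip()
        key = value.lower()
        if not value or key in seen:
            continue
        seen.add(key)
        cleaned.append(value)
    return cleaned

# preprocess([" XL ", "", "xl", "M"]) -> ["XL", "M"]
```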

The LLM service then receives:

  • Cleaned attribute values
  • Category breadcrumbs for contextualization
  • Attribute metadata

With this context, the model can recognize that “spannung” (voltage) in power tools is numeric, that “size” in clothing follows standard apparel sizing, and that “color” may correspond to RAL standards. The output consists of:

  • Ordered values in logical sequence
  • Refined attribute names
  • A decision: deterministic or contextual sorting
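One plausible way to assemble that context into a request payload (the JSON shape and field names are illustrative assumptions, not the real service contract):

```python
import json

def build_llm_request(attribute: str, values: list[str],
                      breadcrumbs: list[str]) -> str:
    """Assemble the context the LLM service receives."""
    return json.dumps({
        "attribute": attribute,
        "values": values,                          # cleaned attribute values
        "category_path": " > ".join(breadcrumbs),  # e.g. "Tools > Power Tools"
        "task": "order values logically, refine the attribute name, "
                "and decide between deterministic and contextual sorting",
    })
```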

Deterministic Fallbacks: AI Only Where Necessary

Not every attribute requires AI processing. The pipeline automatically detects which attributes are better handled by deterministic logic:

  • Numeric ranges (faster, more predictable)
  • Unit-based values (5cm, 12cm, 2cm, 20cm → 2cm, 5cm, 12cm, 20cm)
  • Simple quantities (no ambiguity)

This reduces unnecessary LLM calls and keeps the system efficient.
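A sketch of such a deterministic fallback for unit-based values (the regex, the unit set, and the None-means-escalate convention are assumptions):

```python
import re

# Assumed unit set; a production version would normalize across units.
UNIT_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s*(mm|cm|m)$")

def try_deterministic_sort(values: list[str]) -> list[str] | None:
    """Sort unit-based values numerically; return None when any value
    does not parse, signalling that the attribute needs the LLM path."""
    parsed = []
    for v in values:
        match = UNIT_PATTERN.match(v.strip().lower())
        if match is None:
            return None  # not purely unit-based
        parsed.append((float(match.group(1)), v))
    return [v for _, v in sorted(parsed)]

# try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"])
# -> ["2cm", "5cm", "12cm", "20cm"]
```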

Human Control and Trust

Each category can be tagged as LLM_SORT (model decides) or MANUAL_SORT (merchant defines). This dual system ensures humans make the final decisions while AI handles the heavy lifting. Merchants can override the model at any time without disrupting the pipeline—a key trust mechanism.
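In code, the tag and override might look roughly like this (the helper is illustrative; the source only specifies the two tag values):

```python
from enum import Enum

class SortTag(str, Enum):
    LLM_SORT = "LLM_SORT"        # the model proposes the order
    MANUAL_SORT = "MANUAL_SORT"  # the merchant-defined order is authoritative

def effective_order(tag: SortTag, llm_order: list[str],
                    manual_order: list[str]) -> list[str]:
    """The merchant override always wins."""
    return manual_order if tag is SortTag.MANUAL_SORT else llm_order
```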

All results are persisted in a MongoDB database:

  • Sorted attribute values
  • Refined attribute names
  • Category-specific sort tags
  • Product-specific sortOrder fields

This allows easy review, overriding, reprocessing, and synchronization with other systems.
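A persistence sketch using pymongo (the connection string, database, collection, and any field names beyond those listed above are assumptions):

```python
from pymongo import MongoClient

# Connection string and collection names are assumed for this sketch.
client = MongoClient("mongodb://localhost:27017")
collection = client["catalog"]["attribute_sorting"]

def persist_result(category_id: str, attribute: str, result: dict) -> None:
    """Upsert one sorted attribute so review, override, and
    reprocessing stay idempotent."""
    collection.update_one(
        {"categoryId": category_id, "attribute": attribute},
        {"$set": {
            "sortedValues": result["ordered_values"],
            "refinedName": result["refined_name"],
            "sortTag": result["sort_tag"],
        }},
        upsert=True,
    )
```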

Data Flow: From Raw Data to Search

After sorting, data flows into:

  • Elasticsearch for keyword-driven search with consistent filter logic
  • Vespa for semantic and vector-based search

This ensures:

  • Filters appear in logical order
  • Product pages show consistent attributes
  • Search rankings are more accurate
  • Customers browse categories more intuitively
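For illustration, the Elasticsearch sync step might index each product with its precomputed sortOrder fields so filters render in the intended order. A sketch assuming the official Elasticsearch Python client (the document shape and cluster address are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

doc = {
    "sku": "ABC-123",  # hypothetical product
    "attributes": [
        {"name": "Size", "value": "Small", "sortOrder": 1},
        {"name": "Size", "value": "M", "sortOrder": 2},
        {"name": "Size", "value": "XL", "sortOrder": 4},
    ],
}
es.index(index="products", id=doc["sku"], document=doc)
```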

Architecture Overview

The modular pipeline follows this flow:

  1. Product data comes from the product information system
  2. The attribute extraction job pulls values and category context
  3. These are passed to the AI Sorting Service
  4. Updated product documents land in MongoDB
  5. The outbound sync job updates the product information system
  6. Elasticsearch and Vespa sync jobs transfer sorted data into their search systems
  7. API services connect search systems with client applications

This cycle ensures that every sorted or manually set attribute value is reflected in search, merchandising, and customer experience.
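As an orchestration sketch, the cycle can be expressed as an ordered list of stages (the stage names are hypothetical shorthands for the jobs above, not identifiers from the real system):

```python
from typing import Callable

# Hypothetical stage names mirroring steps 2-7 above.
STAGES = [
    "extract_attributes",   # pull values and category context
    "ai_sorting_service",   # hybrid sorting
    "persist_mongodb",      # store results
    "sync_pim_outbound",    # update the product information system
    "sync_elasticsearch",   # keyword search
    "sync_vespa",           # semantic search
]

def run_cycle(registry: dict[str, Callable[[], None]]) -> None:
    """Run one full pipeline cycle, stage by stage."""
    for stage in STAGES:
        registry[stage]()
```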

Practical Results

The transformation from raw values to structured output:

Attribute | Raw Values                                      | Sorted Output
Size      | XL, Small, 12cm, Large, M, S                    | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red                | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                            | 2cm, 5cm, 12cm, 20cm

These examples demonstrate the interplay of contextual thinking and clear rules.

Measurable Impact

  • Consistent attribute sorting across 3M+ SKUs
  • Predictable numeric sorting via deterministic fallbacks
  • Full merchant control through manual tagging
  • Cleaner product pages and more intuitive filters
  • Improved search relevance and ranking
  • Increased customer trust and rising conversion rates

Key Takeaways

  • Hybrid pipelines outperform pure AI at scale
  • Context is fundamental for LLM accuracy
  • Offline jobs are essential for throughput and resilience
  • Human override mechanisms build trust
  • Clean input data is the foundation for reliable AI output

The biggest lesson: the most important e-commerce problems are often not the spectacular ones but the silent challenges that affect every product page daily. Through intelligent system architecture and hybrid AI approaches, chaos is made systematic and scalable.
