Sorting product attributes may seem trivial—until you have to do it for three million SKUs. The hidden complexity of e-commerce systems doesn’t lie in major challenges like distributed search or real-time inventory. The real backbone is data consistency: sizes, colors, materials, and other product attributes must be structured precisely and predictably.
The problem is real. In actual product catalogs, you see chaotic values: sizes like “XL,” “Small,” “12cm,” “Large,” “M,” “S” mixed together. Colors like “RAL 3020,” “Crimson,” “Red,” “Dark Red.” Materials like “Steel,” “Carbon Steel,” “Stainless,” “Stainless Steel.” Each inconsistency seems harmless on its own, but multiplied across millions of products, it becomes systemic. Filters behave unpredictably, search engines lose relevance, and the customer experience suffers.
Core Strategy: Hybrid Intelligence with Clear Rules
Instead of deploying a black-box AI, a software engineer designed a controlled hybrid pipeline. The goal was not magical automation but a solution that:
Is explainable
Works predictably
Scales over millions of records
Can be controlled by humans
This pipeline combines the contextual reasoning of large language models (LLMs) with deterministic rules and merchant oversight. It acts intelligently but always remains transparent—AI with guardrails, not AI out of control.
Offline Processing Instead of Real-Time: A Strategic Decision
All attribute processing runs in background jobs, not in real-time systems. This was a deliberate choice because real-time pipelines at e-commerce scale lead to:
Unpredictable latency
Fragile dependencies
Costly compute peaks
Operational instability
Offline jobs, on the other hand, offer:
High throughput via batch processing without impacting live systems
Resilience, as failures do not affect customer traffic
Cost control through scheduled processing during off-peak hours
Isolation of LLM latency from product pages
Atomic, predictable updates
This separation between customer interfaces and data processing pipelines is crucial when dealing with millions of SKUs.
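A minimal sketch of such a background job, assuming the catalog is processed in fixed-size chunks; the data-access helpers and the batch size are illustrative placeholders, not the actual implementation:

```python
import logging

BATCH_SIZE = 500  # illustrative chunk size, not the real value

def fetch_attribute_batch(offset, limit):
    """Hypothetical accessor: read a slice of raw attribute data from the product store."""
    return []  # placeholder

def clean_sort_and_persist(rows):
    """Hypothetical stage: preprocess, sort (LLM or deterministic), and write results."""
    pass  # placeholder

def run_offline_sorting_job():
    """Runs in the background (e.g. scheduled off-peak), fully isolated from live traffic."""
    offset = 0
    while True:
        rows = fetch_attribute_batch(offset, BATCH_SIZE)
        if not rows:
            break
        try:
            clean_sort_and_persist(rows)  # a failed chunk never touches customer-facing requests
        except Exception:
            logging.exception("chunk at offset %s failed; continuing with the next one", offset)
        offset += BATCH_SIZE
```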
The Processing Pipeline: From Raw Data to Intelligence
Before applying AI, a critical preprocessing step occurs:
Trim whitespace
Remove empty values
Remove duplicate values
Structure category context information
This step massively reduces noise and significantly improves the language model’s reasoning ability. The rule is simple: clean input = reliable output. At scale, even small errors upstream compound into cumulative problems downstream.
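A minimal sketch of this preprocessing step, assuming attribute values arrive as plain strings together with a category breadcrumb; function and field names are illustrative:

```python
def preprocess_attribute_values(raw_values, category_breadcrumb):
    """Clean raw attribute values before any LLM call.

    - strips surrounding whitespace
    - drops empty entries
    - removes duplicates while preserving the original order
    - packages the category context alongside the cleaned values
    """
    cleaned = []
    seen = set()
    for value in raw_values:
        value = value.strip()
        if not value:
            continue  # drop empty values
        if value in seen:
            continue  # deduplicate, keep the first occurrence
        seen.add(value)
        cleaned.append(value)
    return {
        "values": cleaned,
        "category": " > ".join(category_breadcrumb),  # e.g. "Clothing > Men > Shirts"
    }

# Example: messy size values from a clothing category
print(preprocess_attribute_values(
    ["XL ", "Small", "", "12cm", "Large", "M", "M", " S"],
    ["Clothing", "Men", "Shirts"],
))
```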
The LLM service then receives:
Cleaned attribute values
Category breadcrumbs for contextualization
Attribute metadata
With this context, the model can recognize that “spannung” (voltage) in a power-tool category is numeric, that “size” in clothing follows standard apparel sizing, and that “color” may correspond to RAL standards. The output consists of:
Ordered values in logical sequence
Refined attribute names
A decision: deterministic or contextual sorting
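A minimal sketch of what such a request and response might look like, assuming a JSON payload; the field names and shapes are assumptions modeled on the description above, not the actual service contract:

```python
import json

def build_sorting_request(attribute_name, cleaned_values, category_breadcrumb, metadata):
    """Assemble the context the model needs: values, breadcrumb, and attribute metadata."""
    return {
        "attribute": attribute_name,
        "values": cleaned_values,
        "category": " > ".join(category_breadcrumb),
        "metadata": metadata,  # e.g. unit hints, data type, locale
    }

# Hypothetical response, mirroring the three outputs listed above
example_response = json.loads("""
{
  "refined_name": "Size",
  "sorted_values": ["Small", "M", "Large", "XL", "12cm"],
  "strategy": "contextual"
}
""")

request = build_sorting_request(
    "size",
    ["XL", "Small", "12cm", "Large", "M", "S"],
    ["Clothing", "Men", "Shirts"],
    {"type": "string"},
)
print(request["category"], "->", example_response["sorted_values"])
```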
Deterministic Fallbacks: AI Only Where Necessary
Not every attribute requires AI processing. The pipeline automatically detects attributes that are better handled by deterministic logic: purely numeric or unit-based values, for example, can be parsed and ordered without a model call.
This reduces unnecessary LLM calls and keeps the system efficient.
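A minimal sketch of one such fallback, assuming the deterministic path handles purely numeric or unit-suffixed values like those in the results table below; the detection heuristic is illustrative:

```python
import re

# Matches values like "12", "2.5", or "20cm" (a number with an optional unit suffix)
_NUMERIC_VALUE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]*)\s*$")

def is_numeric_attribute(values):
    """Detect attributes whose values are all numbers, optionally with a unit suffix."""
    return all(_NUMERIC_VALUE.match(v) for v in values)

def sort_numeric_values(values):
    """Deterministic sort by numeric magnitude; no LLM call needed.
    Assumes a single unit per attribute, which keeps the comparison meaningful."""
    def key(value):
        match = _NUMERIC_VALUE.match(value)
        return float(match.group(1).replace(",", "."))
    return sorted(values, key=key)

values = ["5cm", "12cm", "2cm", "20cm"]
if is_numeric_attribute(values):
    print(sort_numeric_values(values))  # ['2cm', '5cm', '12cm', '20cm']
```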
Human Control and Trust
Each category can be tagged as LLM_SORT (model decides) or MANUAL_SORT (merchant defines). This dual system ensures humans make the final decisions while AI handles the heavy lifting. Merchants can override the model at any time without disrupting the pipeline—a key trust mechanism.
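A minimal sketch of how that tagging could gate the result; the LLM_SORT and MANUAL_SORT tags come from the description above, while the function and data shapes are assumptions:

```python
LLM_SORT = "LLM_SORT"        # model decides the order
MANUAL_SORT = "MANUAL_SORT"  # merchant-defined order always wins

def resolve_sort_order(category_tag, llm_sorted_values, manual_order):
    """Merchant override: if the category is tagged MANUAL_SORT, the model result is ignored."""
    if category_tag == MANUAL_SORT and manual_order:
        return manual_order
    return llm_sorted_values

# The merchant's manual order takes precedence without touching the rest of the pipeline
print(resolve_sort_order(
    MANUAL_SORT,
    ["Red", "Dark Red", "Crimson", "RAL 3020"],   # model output
    ["RAL 3020", "Red", "Dark Red", "Crimson"],   # hypothetical merchant-defined order
))
```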
All results are persisted in a MongoDB database:
Sorted attribute values
Refined attribute names
Category-specific sort tags
Product-specific sortOrder fields
This allows easy review, overriding, reprocessing, and synchronization with other systems.
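A minimal persistence sketch using pymongo; the collection name and document fields are assumptions modeled on the list above, not the actual schema:

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")    # placeholder connection string
collection = client["catalog"]["sorted_attributes"]  # hypothetical database/collection

def persist_sorted_attributes(results):
    """Upsert sorted attributes in bulk so they can be reviewed, overridden, or reprocessed."""
    ops = [
        UpdateOne(
            {"categoryId": r["category_id"], "attribute": r["attribute"]},
            {"$set": {
                "sortedValues": r["sorted_values"],        # ordered values
                "refinedName": r["refined_name"],          # refined attribute name
                "sortTag": r.get("sort_tag", "LLM_SORT"),  # category-specific sort tag
                # product-specific sortOrder fields would be derived from this
                # order downstream (assumption about how they are filled)
            }},
            upsert=True,
        )
        for r in results
    ]
    if ops:
        collection.bulk_write(ops, ordered=False)
```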
Data Flow: From Raw Data to Search
After sorting, data flows into:
Elasticsearch for keyword-driven search with consistent filter logic
Vespa for semantic and vector-based search
This ensures:
Filters appear in logical order
Product pages show consistent attributes
Search rankings are more accurate
Customers browse categories more intuitively
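A minimal sketch of the Elasticsearch sync, assuming the resolved sort position is denormalized into each product document; the index name and document shape are illustrative:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def sync_products_to_elasticsearch(products):
    """Bulk-index product documents whose attribute values carry their resolved sort position."""
    actions = (
        {
            "_index": "products",   # hypothetical index name
            "_id": product["sku"],
            "_source": {
                "title": product["title"],
                # e.g. {"size": {"value": "M", "sortOrder": 1}} so filters render in order
                "attributes": product["attributes"],
            },
        }
        for product in products
    )
    helpers.bulk(es, actions)
```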
Architecture Overview
The modular pipeline follows this flow:
Product data comes from the product information system
The attribute extraction job pulls values and category context
These are passed to the AI Sorting Service
Updated product documents land in MongoDB
The outbound sync job updates the product information system
Elasticsearch and Vespa sync jobs transfer sorted data into their search systems
API services connect search systems with client applications
This cycle ensures that every sorted or manually set attribute value is reflected in search, merchandising, and customer experience.
Practical Results
The transformation from raw values to structured output:
| Attribute | Raw Values | Sorted Output |
| --- | --- | --- |
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
These examples demonstrate the interplay of contextual thinking and clear rules.
Measurable Impact
Consistent attribute sorting across 3M+ SKUs
Predictable numeric sorting via deterministic fallbacks
Full merchant control through manual tagging
Cleaner product pages and more intuitive filters
Improved search relevance and ranking
Increased customer trust and rising conversion rates
Key Takeaways
Hybrid pipelines outperform pure AI at scale
Context is fundamental for LLM accuracy
Offline jobs are essential for throughput and resilience
Human override mechanisms build trust
Clean input data is the foundation for reliable AI output
The biggest lesson: the most important e-commerce problems are often not the spectacular ones but the silent challenges that affect every product page daily. With intelligent system architecture and a hybrid AI approach, that chaos can be turned into something systematic and scalable.