When discussing the scaling of e-commerce, people often focus on seemingly grand technological challenges like distributed search, inventory management, and recommendation engines. But what truly plagues every e-commerce platform is a far more fundamental issue: inconsistent product attribute values.
Attribute values drive the entire product discovery system. They support filtering, comparison, search ranking, and recommendation logic. However, in real product catalogs, attribute values are rarely clean. Duplication, inconsistent formats, and ambiguous semantics are the norm.
Take a look at a seemingly simple “Size” attribute: [“XL”, “Small”, “12cm”, “Large”, “M”, “S”]
And a “Color” attribute: [“RAL 3020”, “Crimson”, “Red”, “Dark Red”]
On their own, these may seem manageable, but when you have over 3 million SKUs, each with dozens of attributes, the problem becomes a system-level challenge. Search becomes chaotic, recommendations fail, operations are overwhelmed with manual corrections, and user experience declines steadily.
Breaking the Black Box Mindset: Design Principles of a Hybrid Intelligent System
Faced with this challenge, the key is to avoid falling into the trap of “black box AI”—systems that mysteriously sort items without human understanding or control.
The correct approach is to build a pipeline with these characteristics:
High interpretability
Predictable behavior
Scalable operation
Accepts manual intervention
The ultimate solution is a Hybrid AI Pipeline: combining LLMs’ contextual understanding with explicit rules and manual controls. It operates intelligently when needed but always remains controllable. This is AI with guardrails, not out-of-control AI.
Offline Processing: The Foundation of Scalable Architecture
All attribute processing is performed in backend offline tasks, not in real-time. This is not a compromise but a strategic architectural decision.
Real-time pipelines sound attractive, but at e-commerce scale, they lead to:
Unpredictable latency fluctuations
Fragile dependency chains
Spikes in computational costs
Fragility in operations
Offline tasks, on the other hand, offer:
High throughput: batch processing massive data without impacting customer systems
Resilience: failures never reach user traffic
Cost control: computations scheduled during low-traffic periods
Isolation: LLM latency independent of product pages
Atomic consistency: updates are fully predictable and synchronized
When handling tens of millions of SKUs, isolating customer systems from data processing pipelines is critical.
Data Cleaning: The Highest ROI Step
Before applying AI, rigorous preprocessing is necessary—this step appears simple but yields significant results.
Cleaning pipeline includes:
Removing leading/trailing spaces
Eliminating null values
Deduplication
Simplifying categorical paths into structured strings
This ensures the LLM receives clean, unambiguous input. In large-scale systems, even minor noise can snowball into major issues downstream. Garbage in, garbage out: this fundamental rule only gets more brutal at the scale of millions of records.
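The steps above can be sketched as a single cleaning pass (function and field names here are illustrative, not the production code):

```python
def clean_values(raw_values, category_path=None):
    """Strip whitespace, drop nulls/empties, dedupe while preserving order,
    and flatten the category path into a breadcrumb string."""
    seen, cleaned = set(), []
    for value in raw_values:
        if value is None:
            continue
        value = str(value).strip()
        if not value or value.lower() in {"null", "n/a"}:
            continue
        if value not in seen:
            seen.add(value)
            cleaned.append(value)
    breadcrumb = " > ".join(category_path) if category_path else ""
    return cleaned, breadcrumb
```

For example, `clean_values(["  XL ", None, "XL", "", "S"], ["Apparel", "Shirts"])` yields the deduplicated values plus the breadcrumb “Apparel > Shirts” that the LLM stage consumes.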
Contextual Empowerment for LLM Services
LLMs are not just sorting attribute values alphabetically. They truly understand their meanings.
This service receives:
Cleaned attribute values
Category information (breadcrumbs)
Attribute metadata
With this context, the model can understand:
“Voltage” in power tools should be sorted numerically
“Size” in clothing follows a predictable progression (S→M→L→XL)
“Color” may use RAL standards (e.g., RAL 3020 codes)
“Material” in hardware has semantic relationships (Steel → Stainless Steel → Carbon Steel)
The model returns:
Sorted value sequences
Complete attribute names
A decision flag: whether to use deterministic sorting or context-aware sorting
This enables the pipeline to handle various attribute types without hardcoding rules for each category.
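As a sketch, the exchange with the sorting service can be modeled like this (the JSON field names and the strategy values are assumptions, not a documented API):

```python
import json
from dataclasses import dataclass

@dataclass
class SortResult:
    sorted_values: list
    full_attribute_name: str
    strategy: str  # "DETERMINISTIC" or "CONTEXTUAL"

def build_context(attribute, values, breadcrumb):
    """Bundle the cleaned values with category context for the model prompt."""
    return {"attribute": attribute, "category_breadcrumb": breadcrumb, "values": values}

def parse_response(raw_json):
    """Validate the model's JSON reply into a typed result with a decision flag."""
    data = json.loads(raw_json)
    strategy = data.get("strategy", "CONTEXTUAL")
    if strategy not in {"DETERMINISTIC", "CONTEXTUAL"}:
        raise ValueError(f"unknown strategy: {strategy}")
    return SortResult(data["sorted_values"], data["full_attribute_name"], strategy)
```

Validating the reply into a typed result keeps a malformed model response from silently corrupting downstream writes.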
Deterministic Fallback: Knowing When Not to Use AI
Not every attribute requires AI. In fact, many are better handled with deterministic logic: numerical ranges, normalized values, and simple sets all benefit from it.
The pipeline automatically detects these cases and applies deterministic logic, maintaining system efficiency and avoiding unnecessary LLM calls.
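A minimal sketch of such detection, assuming the simple case of number-plus-unit values:

```python
import re

_NUMERIC = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z\"']*)\s*$")

def try_deterministic_sort(values):
    """Return a numeric sort when every value is a number plus an optional
    shared unit; return None to signal that the LLM path is needed."""
    matches = [_NUMERIC.match(v) for v in values]
    if not all(matches):
        return None
    units = {m.group(2).lower() for m in matches}
    if len(units) > 1:
        return None  # mixed units are ambiguous; defer to context-aware sorting
    return sorted(values, key=lambda v: float(_NUMERIC.match(v).group(1)))
```

Returning `None` instead of guessing is the point: the deterministic path only claims the cases it can handle unambiguously.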
Power Balance: Merchant Tagging System
Merchants need to retain control, especially over key attributes. Therefore, each category can be tagged as:
LLM_SORT — model decides
MANUAL_SORT — merchants define the order manually
This dual-label system allows humans to retain ultimate authority while AI handles most of the work. It also builds trust—merchants know they can override model decisions at any time without disrupting the pipeline.
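A sketch of how the tag could route each category (the tag names come from the text; the callback shape is an assumption):

```python
def sort_by_tag(values, tag, manual_order=None, llm_sort=None):
    """Route an attribute's values by its category tag.
    MANUAL_SORT wins outright; LLM_SORT defers to the model service."""
    if tag == "MANUAL_SORT":
        rank = {v: i for i, v in enumerate(manual_order or [])}
        # unknown values sink to the end, alphabetically, so overrides never crash
        return sorted(values, key=lambda v: (rank.get(v, len(rank)), v))
    if tag == "LLM_SORT" and llm_sort is not None:
        return llm_sort(values)
    return sorted(values)  # safe alphabetical fallback
```

Because the manual branch tolerates values missing from the merchant's list, an override can never break the pipeline, which is exactly the trust property described above.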
Data Persistence: Using MongoDB as the Single Source of Truth
All results are directly written into the Product MongoDB, keeping architecture simple and centralized. MongoDB becomes the sole operational store for:
Sorted attribute values
Complete attribute names
Category-level sorting tags
Product-level sorting fields
This makes change auditing, value overrides, re-categorization, and synchronization with other systems straightforward.
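For illustration, the write could be shaped as a single `$set` update per attribute (the field names are hypothetical; in production the document would be handed to `update_one` on the product collection via pymongo):

```python
def build_product_update(attribute, sorted_values, full_name, sort_tag):
    """Build the MongoDB $set payload for one attribute's sorted results.
    In production: products.update_one({"sku": sku}, build_product_update(...))."""
    prefix = f"attributes.{attribute}"
    return {
        "$set": {
            f"{prefix}.sorted_values": sorted_values,
            f"{prefix}.full_name": full_name,
            f"{prefix}.sort_tag": sort_tag,
        }
    }
```

Dot-notation paths let each attribute be updated atomically without rewriting the whole product document, which keeps overrides and re-runs cheap.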
Closed-Loop Search Layer: From Data to Discovery
Once sorting is complete, values flow into:
Elasticsearch — keyword-driven search
Vespa — semantic and vector-based search
This ensures:
Filter options appear in logical order
Product pages display consistent attributes
Search results are more accurately ranked
Browsing categories is intuitive and smooth
The power of attribute sorting is most evident in search, where consistency is critical.
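One way to carry the order into the search layer is to index a rank alongside each value, so filters render in the intended sequence (a sketch; the field naming is an assumption, not the actual index schema):

```python
def build_search_doc(sku, sorted_attributes):
    """Build one index document: raw values for matching, plus per-value
    ranks so Elasticsearch/Vespa can order filter facets consistently."""
    doc = {"sku": sku}
    for name, values in sorted_attributes.items():
        doc[name] = values
        doc[f"{name}_rank"] = {v: i for i, v in enumerate(values)}
    return doc
```

The same document serves both engines: keyword matching reads the plain values, while facet rendering reads the rank map.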
System Overview: From Raw Data to User Interface
To operate this system over millions of SKUs, I designed a modular pipeline centered around backend tasks, AI inference, and search integration:
Data flow:
Product data sourced from the product information system
Attribute extraction tasks pull attribute values and category context
These are sent to the AI sorting service
Updated product documents are written into Product MongoDB
Outbound sync tasks push sorted results back to the product info system
Elasticsearch and Vespa sync tasks update their respective search indexes
API services connect search engines with client applications
This process ensures that every attribute value—whether sorted by AI or manual override—is reflected in search, shelf management, and the final customer experience.
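The flow above can be condensed into one offline pass per product (a sketch with hypothetical names; `sort_service` stands in for the AI sorting call):

```python
def process_product(product, sort_service):
    """One offline pass: extract attributes with category context, sort each,
    and emit the $set update destined for Product MongoDB."""
    updates = {}
    for attr, values in product["attributes"].items():
        cleaned = [v.strip() for v in values if v and v.strip()]
        updates[f"attributes.{attr}.sorted_values"] = sort_service(
            attr, cleaned, product.get("category_path", [])
        )
    return {"$set": updates}
```

A batch job would map this function over a chunk of SKUs and hand the resulting updates to the MongoDB writer and the search sync tasks.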
Practical Outcomes of the Transformation
How are chaotic raw values transformed?
| Attribute | Raw Chaotic Values | Sorted Output |
|---|---|---|
| Size | XL, Small, 12cm, Large, M, S | Small, M, Large, XL, 12cm |
| Color | RAL 3020, Crimson, Red, Dark Red | Red, Dark Red, Crimson, RAL 3020 |
| Material | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel |
| Numeric | 5cm, 12cm, 2cm, 20cm | 2cm, 5cm, 12cm, 20cm |
These examples demonstrate how the pipeline combines contextual understanding with clear rules to produce clean, understandable sequences.
Why Offline Instead of Real-Time?
Real-time processing would introduce:
Unpredictable latency fluctuations
Higher computational costs
Fragile dependency chains
Operational complexity
Offline tasks provide:
Batch processing efficiency
Asynchronous LLM calls
Retry logic and dead-letter queues
Manual review windows
Fully predictable costs
The trade-off is a slight delay from data ingestion to display, but the benefit is large-scale consistency—what customers truly value.
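The retry and dead-letter behavior mentioned above can be sketched as a small wrapper (parameter names are illustrative):

```python
import time

def run_with_retry(task, payload, max_attempts=3, backoff=1.0, dead_letter=None):
    """Run a batch task with exponential backoff between attempts; after the
    final failure, park the payload in a dead-letter list for manual review."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(payload)
        except Exception as exc:
            if attempt == max_attempts:
                if dead_letter is not None:
                    dead_letter.append({"payload": payload, "error": str(exc)})
                return None
            time.sleep(backoff * 2 ** (attempt - 1))
```

Because everything runs offline, a parked batch simply waits in the dead-letter queue for the next manual review window; no customer request is ever blocked on it.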
Business Impact
The results are quite significant:
Attribute sorting consistency across 3 million+ SKUs
Predictable numeric sorting via deterministic fallback
Fine-grained manual control for merchants through tagging
Cleaner product pages and intuitive filtering
Improved search relevance
Increased user trust and conversion rates
This is not just a technical victory but a win for user experience and revenue.
Key Takeaways
Hybrid pipelines outperform pure AI solutions at scale. Guardrails are essential.
Context significantly improves LLM accuracy
Offline tasks are the backbone of throughput and fault tolerance
Manual override mechanisms build trust and acceptance
Clean input is the foundation of reliable AI output
Conclusion
Sorting attribute values may seem simple, but when scaling to millions of products, it becomes a real challenge. By combining the intelligence of LLMs with clear rules and merchant controls, this invisible yet pervasive problem is transformed into a clean, scalable system.
A reminder: the greatest victories often come from solving those boring, overlooked problems—those that appear on every product page every day.