Scalable Data Management: How Attribute Values Remain Consistent in Large E-Commerce Catalogs
In e-commerce businesses, technical discussions often focus on topics like distributed search systems, real-time inventory management, or checkout optimization. However, an often underestimated but systemic problem remains hidden beneath the surface: the reliable management and standardization of product attributes across millions of SKUs.
The Hidden Problem: Attribute Chaos in Reality
Attributes form the foundation of product discovery. They control filter functionality, product comparisons, search ranking algorithms, and recommendation systems. In real product catalogs, these values are rarely structured and consistent. A simple example: the attribute “Size” might appear in a dataset as [“XL”, “Small”, “12cm”, “Large”, “M”, “S”], while “Color” could be recorded as [“RAL 3020”, “Crimson”, “Red”, “Dark Red”].
In isolation, such inconsistencies seem trivial. But scaled across 3 million SKUs with dozens of attributes each, they create a critical systemic problem. Filters become unpredictable, search engines lose relevance, and customer navigation becomes increasingly frustrating. For operators of large e-commerce platforms, manual cleanup of these attribute values becomes an operational nightmare.
A Hybrid Approach: AI with Constraints Instead of Black-Box Systems
The challenge was to create a system that is explainable, predictable, scalable, and human-controlled. The key was not in an opaque AI black box, but in a hybrid pipeline combining Large Language Models (LLMs) with deterministic rules and control mechanisms.
This concept merges intelligent contextual reasoning with clear, traceable rules. The system acts intelligently when needed but always remains predictable and controllable.
Architectural Decision: Offline Processing Instead of Real-Time
All attribute processing is performed not in real time but via asynchronous background jobs. This was not a compromise but a deliberate architectural choice:
Real-time pipelines would lead to unpredictable latency, fragile dependencies, processing spikes, and operational instability. Offline jobs, on the other hand, offer:
High throughput: Massive data volumes can be processed without affecting live systems
Fault tolerance: Errors in data processing never impact customer traffic
Cost control: Computations can be scheduled during low-traffic periods
System isolation: LLM latency has no impact on product page performance
Atomic consistency: Updates are predictable and conflict-free
Strict separation between customer-facing systems and data processing pipelines is essential when working with millions of SKUs.
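To make the offline pattern concrete, here is a minimal sketch of how such a background job could be shaped: a scheduled batch process that walks the catalog category by category, completely decoupled from request-serving code. The batch size, logging, and per-category placeholder are illustrative assumptions, not the actual production job.

```python
# Hypothetical shape of the offline attribute job: a scheduled batch process that walks
# the catalog category by category, fully isolated from customer-facing traffic.

import logging
from typing import Iterable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("attribute-job")

def run_attribute_job(category_ids: Iterable[str], batch_size: int = 100) -> None:
    """Process categories in fixed-size batches so throughput stays predictable."""
    batch: list[str] = []
    for category_id in category_ids:
        batch.append(category_id)
        if len(batch) >= batch_size:
            _process_batch(batch)
            batch.clear()
    if batch:
        _process_batch(batch)

def _process_batch(batch: list[str]) -> None:
    for category_id in batch:
        try:
            # Placeholder for Phases 1-4 described below; a failure here never
            # touches live traffic and only affects this background run.
            log.info("processing category %s", category_id)
        except Exception:
            log.exception("category %s failed; it will be retried on the next run", category_id)

if __name__ == "__main__":
    # Typically triggered by a scheduler (e.g. cron) during low-traffic periods.
    run_attribute_job(["cat-001", "cat-002", "cat-003"], batch_size=2)
```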
The Attribute Processing Pipeline: From Raw Data to Structured Attributes
Phase 1: Data Cleaning and Normalization
Before applying AI models to attribute values, each dataset underwent comprehensive preprocessing. This seemingly simple step was crucial for the quality of subsequent results:
Whitespace trimming
Removal of empty values
Deduplication
Contextual simplification of category hierarchies
This cleaning ensured that the LLM received clean, clear inputs—a fundamental requirement for consistent outputs. The principle “Garbage In, Garbage Out” becomes even more critical at scale.
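As a minimal sketch of this phase, assuming nothing about the real code beyond the steps listed above, the cleanup could look roughly like this (the function names and the breadcrumb depth are illustrative):

```python
# Illustrative Phase 1 cleanup: trimming, empty-value removal, deduplication,
# and a simplified category breadcrumb used later as LLM context.

def clean_values(raw_values: list[str]) -> list[str]:
    """Trim whitespace, drop empties, and deduplicate while keeping first-seen order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for value in raw_values:
        v = " ".join(value.split())          # collapse inner whitespace, trim ends
        if not v or v.lower() in seen:
            continue
        seen.add(v.lower())
        cleaned.append(v)
    return cleaned

def simplify_breadcrumb(breadcrumb: list[str], depth: int = 3) -> str:
    """Keep only the most specific levels of the category hierarchy as context."""
    return " > ".join(breadcrumb[-depth:])

print(clean_values(["  XL ", "Small", "", "XL", "12cm "]))        # ['XL', 'Small', '12cm']
print(simplify_breadcrumb(["Home", "Fashion", "Men", "Shirts"]))  # 'Fashion > Men > Shirts'
```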
Phase 2: Intelligent Attribute Analysis via LLMs
The LLM stage did not simply sort values alphabetically; it understood the semantic context of each attribute. The service received:
Cleaned attribute values
Category breadcrumbs with hierarchical context
Metadata about attribute types
With this context, the model could, for example, understand that:
“Voltage” in power tools should be interpreted numerically
“Size” in clothing follows a known size progression
“Color” in certain categories might meet RAL standards
“Material” in hardware products has semantic relationships
The model returned ordered values, refined attribute names, and a classification indicating whether deterministic or contextual sorting should apply.
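A hedged sketch of what this request/response contract could look like is shown below. The prompt wording, the generic call_llm placeholder, and the JSON field names are assumptions; only the three returned pieces of information (ordered values, a refined name, and a sorting classification) come from the description above. The final filter that drops values not present in the input is an added safety idea, not something stated in the article.

```python
# Hypothetical Phase 2 contract: build a context-rich prompt, call an LLM client,
# and parse a structured response. Field names and prompt wording are illustrative.

import json

def build_prompt(attribute: str, values: list[str], breadcrumb: str) -> str:
    return (
        f"Category: {breadcrumb}\n"
        f"Attribute: {attribute}\n"
        f"Values: {json.dumps(values)}\n"
        "Return JSON with the keys ordered_values, refined_name, and "
        "sort_type ('deterministic' or 'contextual')."
    )

def analyze_attribute(attribute: str, values: list[str], breadcrumb: str, call_llm) -> dict:
    # call_llm is a placeholder for whichever LLM client is actually used.
    raw = call_llm(build_prompt(attribute, values, breadcrumb))
    result = json.loads(raw)
    # Added safety idea (assumption): keep only values that were in the input,
    # so a hallucinated value can never reach the catalog.
    result["ordered_values"] = [v for v in result.get("ordered_values", []) if v in values]
    return result
```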
Phase 3: Deterministic Fallbacks for Efficiency
Not every attribute required AI processing. Numeric ranges, unit-based values, and simple categories benefited from:
Faster processing
Predictable sorting
Lower processing costs
Complete elimination of ambiguities
The pipeline automatically recognized these cases and applied deterministic logic—an efficiency measure that avoided unnecessary LLM calls.
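As an illustration of such a deterministic fallback, a simple check for numeric and unit-based values might look like the sketch below; the unit list and regular expression are assumptions for demonstration.

```python
# Illustrative deterministic fallback (Phase 3): sort numerically when every value
# is a plain number or a number with a known unit; otherwise signal "not handled".

import re

_NUMERIC_VALUE = re.compile(r"^(\d+(?:\.\d+)?)\s*(mm|cm|m|g|kg|ml|l|v|w)?$", re.IGNORECASE)

def try_deterministic_sort(values: list[str]) -> list[str] | None:
    """Return a numerically sorted list if all values are numeric/unit-based, else None."""
    parsed = []
    for value in values:
        match = _NUMERIC_VALUE.match(value.strip())
        if not match:
            return None                      # at least one value needs contextual handling
        parsed.append((float(match.group(1)), value))
    return [original for _, original in sorted(parsed)]

print(try_deterministic_sort(["18 V", "12 V", "36 V"]))   # ['12 V', '18 V', '36 V']
print(try_deterministic_sort(["XL", "Small", "12cm"]))    # None -> falls through to the LLM
```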
Phase 4: Manual Tagging and Merchant Control
While automation formed the basis, merchants needed control over critical attributes. Each category could be tagged with:
LLM_SORT: The model decides the sorting order
MANUAL_SORT: Merchants define the final order
This dual tagging system allowed humans to make intelligent decisions while AI handled most of the work. It also built trust, as merchants could override when needed.
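A small sketch of how these tags could gate the final ordering is shown below; the tag names come from the list above, while the data shapes and the merge behavior for unlisted values are illustrative assumptions.

```python
# Illustrative Phase 4 gate: merchant-defined order wins when MANUAL_SORT is set,
# otherwise the model's ordering is used as-is.

from enum import Enum

class SortTag(str, Enum):
    LLM_SORT = "LLM_SORT"        # the model decides the order
    MANUAL_SORT = "MANUAL_SORT"  # merchants define the final order

def final_order(tag: SortTag, llm_order: list[str], manual_order: list[str] | None) -> list[str]:
    if tag is SortTag.MANUAL_SORT and manual_order:
        # Assumption: values the merchant did not list keep the model's relative order.
        rest = [v for v in llm_order if v not in manual_order]
        return manual_order + rest
    return llm_order

print(final_order(SortTag.MANUAL_SORT, ["S", "M", "L", "XL"], ["XL", "L", "M", "S"]))
# ['XL', 'L', 'M', 'S']
```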
Data Persistence and Synchronization
All results were stored directly in the Product MongoDB, forming the sole operational storage for:
Sorted attribute values
Refined attribute names
Category-specific sort tags
Product-related sorting metadata
This centralized data management enabled easy review, overwriting, and reprocessing of categories.
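As a rough illustration, assuming pymongo and invented collection and field names, persisting one result per category and attribute could be an idempotent upsert like this:

```python
# Hypothetical persistence step: one document per category/attribute in the Product
# MongoDB, written via an upsert so reprocessing simply overwrites earlier results.
# Connection details, collection name, and field names are illustrative assumptions.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["catalog"]["attribute_sort_config"]

def persist_result(category_id: str, attribute: str, result: dict, sort_tag: str) -> None:
    collection.update_one(
        {"category_id": category_id, "attribute": attribute},
        {"$set": {
            "refined_name": result["refined_name"],
            "ordered_values": result["ordered_values"],
            "sort_tag": sort_tag,                           # LLM_SORT or MANUAL_SORT
            "updated_at": datetime.now(timezone.utc),
        }},
        upsert=True,
    )
```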
Integration with Search Systems
After sorting, the standardized attribute values were synchronized into the search systems, so that filters, search ranking, and recommendations all worked with the same consistent values stored in the catalog.
Practical Transformation: From Chaos to Structure
The pipeline transformed chaotic raw values into consistent, usable sequences: contextual reasoning combined with clear rules turned mixed inputs like the “Size” and “Color” lists from the introduction into readable, logical orderings.
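As a purely hypothetical before/after using the “Size” values from the introduction, the end result might look like the snippet below; the exact ordering, and especially how an outlier like “12cm” is handled, would depend on the model output and any merchant override.

```python
# Hypothetical before/after for the "Size" attribute from the introduction.
raw_values = ["XL", "Small", "12cm", "Large", "M", "S"]

# One plausible output after cleaning, LLM-based ordering, and merchant review;
# the placement of the outlier "12cm" is an assumption, not a documented result.
ordered_values = ["S", "Small", "M", "Large", "XL", "12cm"]
```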
Operational Impact and Business Results
Implementing this attribute management strategy yielded measurable results. The success was not only technical: it directly impacted user experience and business metrics.
Key Takeaways
Offline processing is indispensable for throughput, fault tolerance, and predictable resource usage
Human override mechanisms build trust and operational acceptance
Data quality is fundamental: clean inputs lead to reliable AI outputs
Conclusion
Managing and standardizing attributes may seem superficially trivial, but it becomes a real engineering challenge when scaled to millions of products. By combining LLM-based reasoning with traceable rules and operational control, a hidden but critical problem was transformed into a scalable, maintainable system. It serves as a reminder that often the greatest business successes stem from solving seemingly “boring” problems—those that are easy to overlook but appear on every product page.