Scalable Data Management: How Attribute Values Remain Consistent in Large E-Commerce Catalogs

In e-commerce businesses, technical discussions often focus on topics like distributed search systems, real-time inventory management, or checkout optimization. However, an often underestimated but systemic problem remains hidden beneath the surface: the reliable management and standardization of product attributes across millions of SKUs.

The Hidden Problem: Attribute Chaos in Reality

Attributes form the foundation of product discovery. They control filter functionality, product comparisons, search ranking algorithms, and recommendation systems. In real product catalogs, these values are rarely structured and consistent. A simple example: the attribute “Size” might appear in a dataset as [“XL”, “Small”, “12cm”, “Large”, “M”, “S”], while “Color” could be recorded as [“RAL 3020”, “Crimson”, “Red”, “Dark Red”].

In isolation, such inconsistencies seem trivial. Scaled across 3 million SKUs with dozens of attributes each, however, they become a critical systemic problem. Filters become unpredictable, search engines lose relevance, and customer navigation grows increasingly frustrating. For operators of large e-commerce platforms, manually cleaning up these attribute values becomes an operational nightmare.

A Hybrid Approach: AI with Constraints Instead of Black-Box Systems

The challenge was to create a system that is explainable, predictable, scalable, and human-controlled. The key was not an opaque AI black box, but a hybrid pipeline combining Large Language Models (LLMs) with deterministic rules and control mechanisms.

This concept merges intelligent contextual reasoning with clear, traceable rules. The system acts intelligently when needed but always remains predictable and controllable.

Architectural Decision: Offline Processing Instead of Real-Time

All attribute processing is performed not in real-time but via asynchronous background jobs. This was not a compromise but a deliberate architectural choice:

Real-time pipelines would lead to unpredictable latency, fragile dependencies, processing spikes, and operational instability. Offline jobs, on the other hand, offer:

  • High throughput: Massive data volumes can be processed without affecting live systems
  • Fault tolerance: Errors in data processing never impact customer traffic
  • Cost control: Computations can be scheduled during low-traffic periods
  • System isolation: LLM latency has no impact on product page performance
  • Atomic consistency: Updates are predictable and conflict-free

Strict separation between customer-facing systems and data processing pipelines is essential when working with millions of SKUs.
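
To make this concrete, here is a minimal sketch of such an offline pass in Python; the batch size, the job structure, and the helper names are illustrative assumptions rather than the platform's actual job framework.

    # Hypothetical offline batch runner: the catalog is processed in fixed-size
    # chunks by a background job, so the live product store never carries the load.
    from typing import Iterable, List

    BATCH_SIZE = 1000  # tuned per environment; an assumption for this sketch

    def process_attributes(batch: List[dict]) -> None:
        # Phases 1-4 of the pipeline described below would run here
        # (cleaning, LLM analysis, deterministic fallbacks, tag handling).
        pass

    def load_sku_batches(skus: List[dict], batch_size: int = BATCH_SIZE) -> Iterable[List[dict]]:
        """Yield SKUs in fixed-size chunks for background processing."""
        for start in range(0, len(skus), batch_size):
            yield skus[start:start + batch_size]

    def run_offline_pass(skus: List[dict]) -> None:
        for batch in load_sku_batches(skus):
            try:
                process_attributes(batch)
            except Exception as exc:
                # Failures are logged and retried later; customer traffic is unaffected.
                print(f"batch failed, will retry later: {exc}")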

The Attribute Processing Pipeline: From Raw Data to Structured Attributes

Phase 1: Data Cleaning and Normalization

Before applying AI models to attribute values, each dataset underwent comprehensive preprocessing. This seemingly simple step was crucial for the quality of subsequent results:

  • Whitespace trimming
  • Removal of empty values
  • Deduplication
  • Contextual simplification of category hierarchies

This cleaning ensured that the LLM received clean, clear inputs—a fundamental requirement for consistent outputs. The principle “Garbage In, Garbage Out” becomes even more critical at scale.
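
As an illustration, the listed cleaning steps could be expressed roughly as follows; the function name and the case-insensitive deduplication rule are assumptions, and the contextual simplification of category hierarchies is omitted for brevity.

    # Minimal sketch of the cleaning step: trim whitespace, drop empty values,
    # and deduplicate while preserving the order of first appearance.
    from typing import List

    def clean_attribute_values(raw_values: List[str]) -> List[str]:
        seen = set()
        cleaned = []
        for value in raw_values:
            value = value.strip()          # whitespace trimming
            if not value:                  # removal of empty values
                continue
            key = value.casefold()
            if key in seen:                # deduplication (case-insensitive, an assumption)
                continue
            seen.add(key)
            cleaned.append(value)
        return cleaned

    print(clean_attribute_values(["  XL", "", "Small", "small ", "12cm"]))
    # -> ['XL', 'Small', '12cm']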

Phase 2: Intelligent Attribute Analysis via LLMs

The LLM system didn’t just sort values alphabetically; it understood their semantic context. The service received:

  • Cleaned attribute values
  • Category breadcrumbs with hierarchical context
  • Metadata about attribute types

With this context, the model could, for example, understand that:

  • “Voltage” in power tools should be interpreted numerically
  • “Size” in clothing follows a known size progression
  • “Color” in certain categories might meet RAL standards
  • “Material” in hardware products has semantic relationships

The model returned ordered values, refined attribute names, and a classification of each attribute as requiring deterministic or contextual sorting.
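
A hedged sketch of the kind of request and response contract such a service could use; the prompt wording, payload fields, and response shape are illustrative assumptions, and no actual model call is made here.

    # Illustrative request/response shapes for the LLM sorting service.
    # build_prompt only assembles the context described above: cleaned values,
    # category breadcrumbs, and attribute metadata.
    import json
    from typing import Any, Dict, List

    def build_prompt(attribute: str, values: List[str], breadcrumbs: List[str], attr_type: str) -> str:
        payload: Dict[str, Any] = {
            "attribute": attribute,
            "values": values,
            "category_path": " > ".join(breadcrumbs),
            "attribute_type": attr_type,
        }
        return (
            "Order the following attribute values in the sequence a shopper would expect, "
            "refine the attribute name if needed, and state whether the ordering is "
            "deterministic or contextual.\n" + json.dumps(payload, ensure_ascii=False)
        )

    # Expected (illustrative) response shape from the model:
    example_response = {
        "sorted_values": ["S", "M", "Large", "XL"],
        "refined_name": "Size",
        "sort_type": "contextual",   # or "deterministic"
    }

    print(build_prompt("Size", ["XL", "Small", "M", "S"], ["Clothing", "Shirts"], "text"))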

Phase 3: Deterministic Fallbacks for Efficiency

Not every attribute required AI processing. Numeric ranges, unit-based values, and simple categories were handled by deterministic sorting rules instead, which offered:

  • Faster processing
  • Predictable sorting
  • Lower processing costs
  • Complete elimination of ambiguities

The pipeline automatically recognized these cases and applied deterministic logic—an efficiency measure that avoided unnecessary LLM calls.
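
One plausible shape for such a deterministic fallback, sketched here for unit-based values; the regex and the decision to fall back to contextual sorting on any non-numeric value are assumptions.

    # Deterministic fallback for unit-based values such as "2cm", "12cm", "20cm":
    # extract the numeric part and sort on it, so no LLM call is needed.
    import re
    from typing import List, Optional

    _NUMERIC = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([a-zA-Z%]*)\s*$")

    def numeric_key(value: str) -> Optional[float]:
        match = _NUMERIC.match(value)
        if not match:
            return None
        return float(match.group(1).replace(",", "."))

    def try_deterministic_sort(values: List[str]) -> Optional[List[str]]:
        """Return a numerically sorted list, or None if any value is non-numeric."""
        keys = [numeric_key(v) for v in values]
        if any(k is None for k in keys):
            return None   # hand the attribute over to the LLM / contextual path
        return [v for _, v in sorted(zip(keys, values))]

    print(try_deterministic_sort(["5cm", "12cm", "2cm", "20cm"]))
    # -> ['2cm', '5cm', '12cm', '20cm']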

Phase 4: Manual Tagging and Merchant Control

While automation formed the basis, merchants needed control over critical attributes. Each category could be tagged with:

  • LLM_SORT: The model decides the sorting order
  • MANUAL_SORT: Merchants define the final order

This dual tagging system allowed humans to make intelligent decisions while AI handled most of the work. It also built trust, as merchants could override when needed.
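
A minimal sketch of how these two tags could drive the pipeline's dispatch logic; the tag names come from the list above, while the function names and the merchant-defined order lookup are hypothetical.

    # Dispatch between LLM-driven and merchant-defined ordering based on the
    # category's sort tag (LLM_SORT / MANUAL_SORT).
    from typing import Dict, List

    def llm_sort(values: List[str]) -> List[str]:
        return values  # placeholder for the Phase 2 service call

    def resolve_order(tag: str, values: List[str],
                      manual_order: Dict[str, List[str]], category: str) -> List[str]:
        if tag == "MANUAL_SORT":
            # Merchants define the final order; unknown values are appended at the end.
            preferred = manual_order.get(category, [])
            ranked = {v: i for i, v in enumerate(preferred)}
            return sorted(values, key=lambda v: ranked.get(v, len(preferred)))
        # LLM_SORT: the model decides the sorting order.
        return llm_sort(values)

    print(resolve_order("MANUAL_SORT", ["XL", "S", "M"], {"Shirts": ["S", "M", "XL"]}, "Shirts"))
    # -> ['S', 'M', 'XL']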

Data Persistence and Synchronization

All results were stored directly in the Product MongoDB, forming the sole operational storage for:

  • Sorted attribute values
  • Refined attribute names
  • Category-specific sort tags
  • Product-related sorting metadata

This centralized data management enabled easy review, overwriting, and reprocessing of categories.
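
A hedged sketch of what such a persisted document and upsert could look like with pymongo; the connection string, database, collection, and field names are assumptions rather than the actual schema.

    # Illustrative upsert of one sorting result into MongoDB (pymongo).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")        # assumed local instance
    collection = client["catalog"]["attribute_sorting"]      # hypothetical collection

    result_doc = {
        "category": "Clothing > Shirts",
        "attribute": "Size",
        "refined_name": "Size",
        "sorted_values": ["S", "M", "Large", "XL"],
        "sort_tag": "LLM_SORT",
    }

    collection.update_one(
        {"category": result_doc["category"], "attribute": result_doc["attribute"]},
        {"$set": result_doc},
        upsert=True,   # reprocessing a category simply overwrites the previous result
    )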

Integration with Search Systems

After sorting, the standardized attribute values were synchronized into search solutions, as sketched at the end of this section:

  • Elasticsearch: For keyword-driven search
  • Vespa: For semantic and vector-based search logic

This ensured that:

  • Filters appeared in logical order
  • Product pages showed consistent attribute displays
  • Search engines ranked products more accurately
  • Customers could browse categories intuitively
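
As a rough illustration of the Elasticsearch side of this sync, the snippet below pushes one sorted-attribute document through the standard document REST API; the index name, document ID, field names, and local cluster URL are assumptions, and Vespa would receive an analogous feed through its own document API.

    # Illustrative sync of sorted attribute values into Elasticsearch via PUT /<index>/_doc/<id>.
    import requests

    doc = {
        "category": "Clothing > Shirts",
        "attribute": "Size",
        "sorted_values": ["S", "M", "Large", "XL"],
    }

    resp = requests.put(
        "http://localhost:9200/attribute_sorting/_doc/clothing-shirts-size",  # assumed local cluster
        json=doc,
        timeout=10,
    )
    resp.raise_for_status()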

Practical Transformation: From Chaos to Structure

The pipeline transformed chaotic raw values into consistent, usable sequences:

Attribute | Raw Values                                     | Structured Output
Size      | XL, Small, 12cm, Large, M, S                   | Small, M, Large, XL, 12cm
Color     | RAL 3020, Crimson, Red, Dark Red               | Red, Dark Red, Crimson, RAL 3020
Material  | Steel, Carbon Steel, Stainless, Stainless Steel | Steel, Stainless Steel, Carbon Steel
Numeric   | 5cm, 12cm, 2cm, 20cm                           | 2cm, 5cm, 12cm, 20cm

These examples illustrate how contextual reasoning combined with clear rules leads to readable, logical sequences.

Operational Impact and Business Results

Implementing this attribute management strategy yielded measurable results:

  • Consistent attribute sorting across 3 million+ SKUs
  • Predictable numeric ordering via deterministic fallbacks
  • Continuous merchant control through manual tagging options
  • Much cleaner product pages with more intuitive filters
  • Improved search relevance and ranking quality
  • Increased customer trust and higher conversion rates

The success was not only technical—it directly impacted user experience and business metrics.

Key Takeaways

  • Hybrid pipelines outperform pure AI systems at scale. Constraints and control are essential
  • Contextualization dramatically improves LLM accuracy
  • Offline processing is indispensable for throughput, fault tolerance, and predictable resource usage
  • Human override mechanisms build trust and operational acceptance
  • Data quality is fundamental: clean inputs lead to reliable AI outputs

Conclusion

Managing and standardizing attributes may seem superficially trivial, but it becomes a real engineering challenge when scaled to millions of products. By combining LLM-based reasoning with traceable rules and operational control, a hidden but critical problem was transformed into a scalable, maintainable system. It serves as a reminder that often the greatest business successes stem from solving seemingly “boring” problems—those that are easy to overlook but appear on every product page.
