White paper / Version 1.0 / May 2026

Maycee Retail Dataset

A privacy-safe synthetic Australian retail analytics sandbox for enterprise analytics, data engineering, BI, and AI agent evaluation.

Executive summary

Realistic data for AI work without real customer risk.

Artificial intelligence is changing how organisations analyse data, automate reporting, and interact with enterprise information. Many teams still face a practical barrier: real customer, transaction, and operational data is too sensitive for public demos, training, benchmarking, or early product development.

The Maycee Retail Dataset was developed by SData to close that gap. It simulates an Australian retail business with customers, stores, products, brands, suppliers, promotions, transactions, item-level sales, returns, and calendar dimensions.

Unlike toy datasets, Maycee is intended for enterprise-style workflows: lakehouse pipelines, semantic modelling, dashboard development, text-to-SQL evaluation, RAG applications, analytics agents, data quality testing, dirty-data scenarios, anomaly detection, and AI benchmark design.

SData created Maycee Retail to help teams prototype, train, and evaluate modern analytics and AI systems without exposing real customer or operational data.

Who this is for

Built for data leaders and delivery teams.

Analytics and data team leads

Evaluating privacy-safe datasets for training, demos, and AI readiness.

Analytics engineers

Building lakehouse pipelines, semantic models, and repeatable data quality checks.

BI developers

Creating dashboards, KPI models, and stakeholder-ready reporting examples.

Data scientists and AI/ML teams

Testing text-to-SQL, RAG, and analytics-agent workflows.

Why it matters

AI evaluation has become a data problem.

Organisations can now deploy text-to-SQL agents quickly, but many cannot tell whether those agents work correctly on realistic enterprise schemas until mistakes reach production. The bottleneck is not just model capability. It is repeatable, realistic evaluation data.

Privacy barriers

Production data contains customer, commercial, and behavioural patterns that cannot be freely shared.

Shallow benchmarks

Many public datasets are too small, too clean, or too generic to represent enterprise analytics complexity.

Missing edge cases

Real business data contains returns, duplicates, missing values, delayed records, and changing definitions.

Dataset overview

A retail star schema built for analytics and AI.

Maycee is structured around 10 dimension tables and 3 fact tables. The public free tier covers 2017-2019. Extended timelines and enterprise-specific scenarios are planned; contact SData to discuss availability.

Table group	Tables	Purpose
Dimensions	`date_dim`, `regions`, `districts`, `suppliers`, `brands`, `categories`, `customers`, `stores`, `products`, `promotions`	Calendar, geography, customer, product, store, supplier, brand, category, and campaign context.
Facts	`transactions`, `items`, `returns`	Header-level purchases, line items with pricing and promotion detail, and refund events linked back to original items.

164,968 free-tier transactions

417,907 free-tier line items

3,157 free-tier customers

13 analytics tables

Deliberate data quality scenarios

Scenario	Where it appears	Why it matters
POS phone capture outage	`customers.phone` is null for sign-ups between 2018-07-01 and 2018-09-30.	Tests whether pipelines, dashboards, and AI agents can distinguish a known operational outage from random missing data.
Year-end returns backlog	Some Nov-Dec 2019 purchases have `return_date` in Jan-Feb 2020, after the free-tier window closes.	Tests delayed processing, cross-window reasoning, and correct handling of returns that refer back to earlier purchases.
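
Both scenarios can be turned into repeatable checks. The following is a minimal sketch in plain Python, assuming rows arrive as dictionaries (as they do when iterating a Hugging Face dataset) and dates are ISO-formatted strings. Only `phone`, `return_date`, and the outage dates come from the table above; the `signup_date` column, the ID columns, the helper names, and the sample rows are hypothetical and should be checked against the real schema.

```python
from datetime import date

# Outage window taken from the scenario table above.
OUTAGE_START = date(2018, 7, 1)
OUTAGE_END = date(2018, 9, 30)
# Assumption: the 2017-2019 free-tier window ends on 31 December 2019.
FREE_TIER_END = date(2019, 12, 31)


def classify_missing_phone(customer, signup_col="signup_date"):
    """Label a customer row as 'present', 'outage', or 'unexplained'.

    signup_col is a hypothetical column name, not confirmed by the source.
    """
    if customer.get("phone") is not None:
        return "present"
    signed_up = date.fromisoformat(customer[signup_col])
    if OUTAGE_START <= signed_up <= OUTAGE_END:
        return "outage"  # null explained by the documented POS outage
    return "unexplained"  # null outside the known outage window


def is_backlog_return(return_row):
    """True when the return was recorded after the free-tier window closed."""
    return date.fromisoformat(return_row["return_date"]) > FREE_TIER_END


# Invented sample rows, for illustration only.
customers = [
    {"customer_id": 1, "phone": "0400 000 001", "signup_date": "2018-05-15"},
    {"customer_id": 2, "phone": None, "signup_date": "2018-08-15"},
    {"customer_id": 3, "phone": None, "signup_date": "2019-03-02"},
]
returns = [
    {"return_id": 10, "return_date": "2019-12-20"},
    {"return_id": 11, "return_date": "2020-01-14"},
]

phone_labels = [classify_missing_phone(c) for c in customers]
backlog_ids = [r["return_id"] for r in returns if is_backlog_return(r)]
print(phone_labels)  # ['present', 'outage', 'unexplained']
print(backlog_ids)   # [11]
```

Separating "outage" from "unexplained" nulls is the point of the first check: a pipeline or agent that treats every missing phone number the same way loses the operational context the scenario is designed to test.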

Access and licensing

Public free tier now available.

The 2017-2019 free tier is released under Creative Commons Attribution 4.0 (CC BY 4.0). It can be used freely with attribution.

Load any table with a few lines of Python using the Hugging Face `datasets` library.

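from datasets import load_dataset

# The second argument selects the table to load from the dataset repository.
transactions = load_dataset(
    "SDataPro/maycee-retail-dataset",
    "transactions",
    split="train",
)

# Row count of the free-tier transactions table.
print(transactions.num_rows)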
# 164968
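
To pull every table rather than one, the same call can be looped over the table names listed in the dataset overview. This is a minimal sketch, assuming each table is exposed under its own name with a `train` split, as in the `transactions` example. The helper `load_tables` and its `loader` parameter are hypothetical; the parameter exists only so the loop can be exercised without network access.

```python
# Table names as listed in the dataset overview.
DIMENSION_TABLES = [
    "date_dim", "regions", "districts", "suppliers", "brands",
    "categories", "customers", "stores", "products", "promotions",
]
FACT_TABLES = ["transactions", "items", "returns"]
ALL_TABLES = DIMENSION_TABLES + FACT_TABLES
REPO_ID = "SDataPro/maycee-retail-dataset"


def load_tables(names, loader=None):
    """Return {table_name: dataset} for each requested table.

    loader defaults to datasets.load_dataset; pass a stub to test offline.
    """
    if loader is None:
        # Imported lazily so the module loads even without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return {name: loader(REPO_ID, name, split="train") for name in names}


print(len(ALL_TABLES))  # 13
```

Calling `load_tables(ALL_TABLES)` would download all 13 tables. Each loaded table can then be converted with `.to_pandas()` if pandas is installed.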

AI evaluation

A foundation for text-to-SQL and analytics-agent benchmarks.

Maycee supports realistic evaluation of AI systems that generate SQL, answer business questions, explain anomalies, or reason over enterprise metadata. It includes temporal structure, multi-table joins, fiscal calendar logic, return flows, promotions, and intentional, documented dirty-data scenarios.
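
Fiscal calendar logic is a frequent source of errors in generated queries. As a sketch of what that logic involves, the function below maps a date to a fiscal year under the standard Australian convention of a 1 July to 30 June year, labelled by the calendar year in which it ends. Whether `date_dim` follows this convention is an assumption, and the function name is hypothetical; where the dataset provides its own fiscal columns, those should take precedence.

```python
from datetime import date


def australian_fiscal_year(d):
    """Return the fiscal year label for d, named after the year in which it ends.

    Assumes a fiscal year running from 1 July to 30 June.
    """
    return d.year + 1 if d.month >= 7 else d.year


print(australian_fiscal_year(date(2018, 6, 30)))  # 2018
print(australian_fiscal_year(date(2018, 7, 1)))   # 2019
```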

Benchmark category	Example capability
Text-to-SQL	Generate correct queries across customers, stores, products, dates, items, returns, and promotions.
Temporal reasoning	Handle fiscal years, seasonality, pre/post disruption comparison, and month-level analysis.
Dirty-data robustness	Recognise missing values, duplicates, delayed returns, supplier drift, and intentional operational anomalies.
Business explanation	Translate analytical results into clear explanations for data and business teams.
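
One common way to score text-to-SQL output is execution accuracy: run a reference query and a generated query against the same database, then compare the results. The sketch below does this with an in-memory SQLite database and an order-insensitive result hash. It is not SData's evaluator; the column names (`transaction_id`, `store_id`, `total_amount`), the sample rows, and the helper names are all hypothetical.

```python
import hashlib
import sqlite3


def result_hash(conn, sql):
    """Hash a query's result set, ignoring row order."""
    rows = conn.execute(sql).fetchall()
    canonical = sorted(repr(row) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def execution_match(conn, gold_sql, predicted_sql):
    """True when both queries return the same multiset of rows."""
    try:
        return result_hash(conn, gold_sql) == result_hash(conn, predicted_sql)
    except sqlite3.Error:
        return False  # a query that fails to run counts as a miss


# Tiny invented table standing in for the real transactions data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(transaction_id INTEGER, store_id INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 10, 40.0), (3, 20, 15.5)],
)

gold = "SELECT store_id, SUM(total_amount) FROM transactions GROUP BY store_id"
good = (
    "SELECT store_id, SUM(total_amount) AS revenue "
    "FROM transactions GROUP BY store_id ORDER BY revenue"
)
bad = "SELECT store_id, COUNT(*) FROM transactions GROUP BY store_id"

print(execution_match(conn, gold, good))  # True
print(execution_match(conn, gold, bad))   # False
```

Comparing results rather than SQL text accepts queries that are written differently but return the same rows. Questions that require a specific ordering would need an order-sensitive variant of the hash.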

Roadmap

From dataset to benchmark suite.

v1.0 / May 2026: public free dataset, star schema, documented dirty-data scenarios, data quality checks, and starter AI evaluation corpus.

v1.1 / H2 2026: evaluator, 100 verified question-to-SQL tasks, result hashes, and first model comparison report.

v2.0 / 2027: enterprise AI benchmark with RAG tasks, adversarial scenarios, hidden evaluation sets, and richer model reporting.

About SData

Practical data and AI capability for Australian organisations.

SData is an Australian technology and data consulting firm helping organisations strengthen advanced analytics, information management, data integration, AI/ML, data governance, cybersecurity, and technical training capability.

Maycee Retail reflects SData's practical approach: useful, privacy-safe data assets that help teams build, test, and explain modern analytics and AI systems before they touch sensitive production data.

Reference

Reference Maycee Retail in demos, training, and research.

@dataset{sdata_maycee_retail_dataset_2026,
  title   = {Maycee Retail Dataset},
  author  = {SData},
  year    = {2026},
  version = {1.0},
  url     = {https://huggingface.co/datasets/SDataPro/maycee-retail-dataset},
  note    = {Synthetic Australian retail dataset for analytics and AI evaluation}
}

Build analytics and AI evaluation workflows with privacy-safe data.

Contact SData to discuss Maycee Retail, enterprise dataset extensions, analytics engineering, BI enablement, AI evaluation, or hands-on team training.