White paper / Version 1.0 / May 2026
A privacy-safe synthetic Australian retail analytics sandbox for enterprise analytics, data engineering, BI, and AI agent evaluation.
Executive summary
Artificial intelligence is changing how organisations analyse data, automate reporting, and interact with enterprise information. Many teams still face a practical barrier: real customer, transaction, and operational data is too sensitive for public demos, training, benchmarking, or early product development.
The Maycee Retail Dataset was developed by SData to close that gap. It simulates an Australian retail business with customers, stores, products, brands, suppliers, promotions, transactions, item-level sales, returns, and calendar dimensions.
Unlike toy datasets, Maycee is intended for enterprise-style workflows: lakehouse pipelines, semantic modelling, dashboard development, text-to-SQL evaluation, RAG applications, analytics agents, data quality testing, dirty-data scenarios, anomaly detection, and AI benchmark design.
SData created Maycee Retail to help teams prototype, train, and evaluate modern analytics and AI systems without exposing real customer or operational data.
Who this is for
Evaluating privacy-safe datasets for training, demos, and AI readiness.
Building lakehouse pipelines, semantic models, and repeatable data quality checks.
Creating dashboards, KPI models, and stakeholder-ready reporting examples.
Testing text-to-SQL, RAG, and analytics-agent workflows.
Why it matters
Organisations can now deploy text-to-SQL agents quickly, but many cannot tell whether those agents work correctly on realistic enterprise schemas until mistakes reach production. The bottleneck is not just model capability; it is also the lack of repeatable, realistic evaluation data.
Production data contains customer, commercial, and behavioural patterns that cannot be freely shared.
Many public datasets are too small, too clean, or too generic to represent enterprise analytics complexity.
Real business data contains returns, duplicates, missing values, delayed records, and changing definitions.
Dataset overview
Maycee is structured around 10 dimension tables and 3 fact tables. The public free tier covers 2017-2019. Extended timelines and enterprise-specific scenarios are planned; contact SData to discuss availability.
| Table group | Tables | Purpose |
|---|---|---|
| Dimensions | date_dim, regions, districts, suppliers, brands, categories, customers, stores, products, promotions | Calendar, geography, customer, product, store, supplier, brand, category, and campaign context. |
| Facts | transactions, items, returns | Header-level purchases, line items with pricing and promotion detail, and refund events linked back to original items. |
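
As a rough illustration of how the three fact tables relate, the sketch below builds tiny in-memory stand-ins with `sqlite3` and computes net sales per transaction by linking refunds back to their original line items. Only the table names `transactions`, `items`, and `returns` come from the overview above. Every column name (`transaction_id`, `item_id`, `line_amount_cents`, `refund_amount_cents`) and every row is a hypothetical placeholder, so check the published schema before adapting it.

```python
import sqlite3


def build_demo_db():
    """Create toy fact tables. All columns and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY);
        CREATE TABLE items (
            item_id INTEGER PRIMARY KEY,
            transaction_id INTEGER,
            line_amount_cents INTEGER
        );
        CREATE TABLE returns (
            return_id INTEGER PRIMARY KEY,
            item_id INTEGER,
            refund_amount_cents INTEGER
        );
        INSERT INTO transactions VALUES (1), (2);
        INSERT INTO items VALUES (10, 1, 5000), (11, 1, 3000), (12, 2, 2000);
        INSERT INTO returns VALUES (100, 11, 3000);
    """)
    return conn


def net_sales_by_transaction(conn):
    """Return (transaction_id, gross, refunded, net) rows, in cents."""
    query = """
        SELECT t.transaction_id,
               SUM(i.line_amount_cents) AS gross,
               SUM(COALESCE(r.refund_total, 0)) AS refunded,
               SUM(i.line_amount_cents) - SUM(COALESCE(r.refund_total, 0)) AS net
        FROM transactions AS t
        JOIN items AS i ON i.transaction_id = t.transaction_id
        LEFT JOIN (
            -- Pre-aggregate refunds per item so the join cannot duplicate items.
            SELECT item_id, SUM(refund_amount_cents) AS refund_total
            FROM returns
            GROUP BY item_id
        ) AS r ON r.item_id = i.item_id
        GROUP BY t.transaction_id
        ORDER BY t.transaction_id
    """
    return conn.execute(query).fetchall()


print(net_sales_by_transaction(build_demo_db()))
```

Aggregating refunds per item before the join is a deliberate choice: joining `returns` directly would repeat an item row once per refund and inflate gross sales.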

Documented dirty-data scenarios include:

| Scenario | Where it appears | Why it matters |
|---|---|---|
| POS phone capture outage | customers.phone is null for sign-ups between 2018-07-01 and 2018-09-30. | Tests whether pipelines, dashboards, and AI agents can distinguish a known operational outage from random missing data. |
| Year-end returns backlog | Some Nov-Dec 2019 purchases have return_date in Jan-Feb 2020, after the free-tier window closes. | Tests delayed processing, cross-window reasoning, and correct handling of returns that refer back to earlier purchases. |
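
The two scenarios above can be turned into explicit checks. The following is a minimal sketch, not SData's own data quality suite. The outage dates, `phone`, and `return_date` come from the table. The `signup_date` column name is hypothetical, and the window end of 31 December 2019 is an assumption inferred from the 2017-2019 free-tier range.

```python
from datetime import date

# Outage window taken from the scenario table.
OUTAGE_START = date(2018, 7, 1)
OUTAGE_END = date(2018, 9, 30)
# Assumption: the 2017-2019 free tier ends on 31 December 2019.
FREE_TIER_END = date(2019, 12, 31)


def classify_missing_phone(customer):
    """Label a customer record by why its phone value is (or is not) missing.

    `signup_date` is a hypothetical column name used for illustration.
    """
    if customer["phone"] is not None:
        return "present"
    if OUTAGE_START <= customer["signup_date"] <= OUTAGE_END:
        return "known_outage"
    return "unexplained_missing"


def is_backlog_return(ret):
    """True when the return was recorded after the free-tier window closed."""
    return ret["return_date"] > FREE_TIER_END


# Made-up records for demonstration only.
customers = [
    {"signup_date": date(2018, 8, 15), "phone": None},  # inside the outage
    {"signup_date": date(2019, 3, 2), "phone": None},   # outside the outage
    {"signup_date": date(2018, 8, 20), "phone": "0400 000 000"},
]
print([classify_missing_phone(c) for c in customers])
# ['known_outage', 'unexplained_missing', 'present']
```

Separating "known outage" from "unexplained" nulls lets a pipeline suppress alerts for the documented window while still flagging genuinely unexpected gaps.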
Access and licensing
The 2017-2019 free tier is released under Creative Commons Attribution 4.0 (CC BY 4.0). It can be used freely with attribution.
Load any table in three lines using the Hugging Face datasets library.
```python
from datasets import load_dataset

# Load the "transactions" table from the public free tier.
transactions = load_dataset(
    "SDataPro/maycee-retail-dataset",
    "transactions",
    split="train",
)
print(transactions.num_rows)
# 164968
```
AI evaluation
Maycee supports realistic evaluation of AI systems that generate SQL, answer business questions, explain anomalies, or reason over enterprise metadata. It includes temporal structure, multi-table joins, fiscal calendar logic, return flows, promotions, and intentionally documented dirty-data scenarios.
| Benchmark category | Example capability |
|---|---|
| Text-to-SQL | Generate correct queries across customers, stores, products, dates, items, returns, and promotions. |
| Temporal reasoning | Handle fiscal years, seasonality, pre/post disruption comparison, and month-level analysis. |
| Dirty-data robustness | Recognise missing values, duplicates, delayed returns, supplier drift, and intentional operational anomalies. |
| Business explanation | Translate analytical results into clear explanations for data and business teams. |
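
One common way to score the text-to-SQL category is execution accuracy: run a reference query and a generated query against the same database and compare what they return. The sketch below shows the idea with `sqlite3`. It is not SData's evaluator, and the toy schema (`stores.state`, `transactions.store_id`) and its rows are hypothetical.

```python
import sqlite3
from collections import Counter


def results_match(conn, gold_sql, candidate_sql):
    """Compare two queries by their result multisets, ignoring row order.

    A candidate query that raises a SQL error counts as a mismatch.
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False
    return Counter(gold_rows) == Counter(candidate_rows)


def build_toy_db():
    """Create a two-table toy database. Columns and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stores (store_id INTEGER PRIMARY KEY, state TEXT);
        CREATE TABLE transactions (
            transaction_id INTEGER PRIMARY KEY,
            store_id INTEGER
        );
        INSERT INTO stores VALUES (1, 'NSW'), (2, 'VIC');
        INSERT INTO transactions VALUES (10, 1), (11, 1), (12, 2);
    """)
    return conn


GOLD = """
    SELECT s.state, COUNT(*)
    FROM transactions AS t
    JOIN stores AS s ON s.store_id = t.store_id
    GROUP BY s.state
"""
# Same answer, different phrasing and row order.
EQUIVALENT = """
    SELECT state, COUNT(transaction_id)
    FROM stores JOIN transactions USING (store_id)
    GROUP BY state
    ORDER BY state DESC
"""
# Forgets the join, so it counts stores rather than transactions.
WRONG = "SELECT state, COUNT(*) FROM stores GROUP BY state"

conn = build_toy_db()
print(results_match(conn, GOLD, EQUIVALENT))  # True
print(results_match(conn, GOLD, WRONG))       # False
```

Comparing multisets rather than lists tolerates different row orders while still catching duplicated rows. A question that requires a specific ordering would need an ordered comparison instead.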
Roadmap
v1.0 / May 2026: public free dataset, star schema, documented dirty-data scenarios, data quality checks, and starter AI evaluation corpus.
v1.1 / H2 2026: evaluator, 100 verified question-to-SQL tasks, result hashes, and first model comparison report.
v2.0 / 2027: enterprise AI benchmark with RAG tasks, adversarial scenarios, hidden evaluation sets, and richer model reporting.
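
The roadmap mentions result hashes for verified tasks. One plausible way such a hash could work is sketched below: serialise each result row, optionally sort the rows so order does not matter, and take a SHA-256 digest. This is an assumption for illustration only; the function name `result_hash` and the serialisation format are hypothetical and not a published SData specification.

```python
import hashlib
import json


def result_hash(rows, ordered=False):
    """Return a SHA-256 hex digest of a query result.

    Each row is serialised to compact JSON. Unless `ordered` is True, the
    serialised rows are sorted so that row order does not affect the hash.
    """
    encoded = [
        json.dumps(list(row), separators=(",", ":"), default=str)
        for row in rows
    ]
    if not ordered:
        encoded.sort()
    payload = "\n".join(encoded).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Made-up result rows for demonstration only.
a = [("NSW", 2), ("VIC", 1)]
b = [("VIC", 1), ("NSW", 2)]
print(result_hash(a) == result_hash(b))                              # True
print(result_hash(a, ordered=True) == result_hash(b, ordered=True))  # False
```

Publishing only the digest lets a benchmark verify an answer without distributing the expected rows, which matters for hidden evaluation sets.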
About SData
SData is an Australian technology and data consulting firm helping organisations strengthen advanced analytics, information management, data integration, AI/ML, data governance, cybersecurity, and technical training capability.
Maycee Retail reflects SData's practical approach: useful, privacy-safe data assets that help teams build, test, and explain modern analytics and AI systems before they touch sensitive production data.
Reference
```bibtex
@dataset{sdata_maycee_retail_dataset_2026,
  title   = {Maycee Retail Dataset},
  author  = {SData},
  year    = {2026},
  version = {1.0},
  url     = {https://huggingface.co/datasets/SDataPro/maycee-retail-dataset},
  note    = {Synthetic Australian retail dataset for analytics and AI evaluation}
}
```
Contact SData to discuss Maycee Retail, enterprise dataset extensions, analytics engineering, BI enablement, AI evaluation, or hands-on team training.
