Technical Guide
This page gives the codebase-level view of how the dataset is generated.
For table structure and join fields, use Schema Reference. For event-to-ledger behavior, use GLEntry Posting Reference. For current scope and planned improvements, use Roadmap.
The current system includes demand forecasts, inventory policies, supply recommendations, component-demand planning, and rough-cut capacity tieout. It also includes explicit O2C pricing lineage through price lists, promotions, and override approvals; a separate O2C sales-commission subledger; and the workforce-planning layer that supports the approved daily time-clock model.
Current System at a Glance
The current implementation has six layers:
| Layer | Main content |
|---|---|
| Business and process layer | fictional company operations, source documents, and process flow |
| Operational data layer | O2C, P2P, manufacturing, payroll, time, and master-data tables |
| Accounting layer | JournalEntry, GLEntry, and the chart of accounts |
| Control layer | validations, anomaly injection, validation reporting, and generation logging |
| Delivery layer | SQLite, Excel, CSV zip, report artifacts, and text-log exports |
| Configuration layer | settings, runtime context, fiscal calendar, and validation scopes |
The implemented schema is defined in src/generator_dataset/schema.py through TABLE_COLUMNS.
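As a minimal sketch of the schema-driven approach, the snippet below builds one empty table per schema entry. The two tables and their column names are illustrative assumptions, not the real contents of TABLE_COLUMNS, and plain lists stand in for DataFrames to keep the example dependency-free.

```python
# Hypothetical excerpt: the real TABLE_COLUMNS in schema.py defines the
# actual tables and columns; these two entries are for illustration only.
TABLE_COLUMNS = {
    "JournalEntry": ["JournalEntryID", "PostingDate", "Description"],
    "GLEntry": ["GLEntryID", "JournalEntryID", "AccountID", "Debit", "Credit"],
}


def create_empty_tables(table_columns):
    """Return one empty, column-keyed table per schema entry."""
    return {
        table: {column: [] for column in columns}
        for table, columns in table_columns.items()
    }


tables = create_empty_tables(TABLE_COLUMNS)
```

Keeping the column list in one mapping means generators, validations, and exporters can all read the same definition instead of repeating column names.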
Entrypoints and Runtime Objects
- generate_dataset.py is the simplest repository-root entrypoint.
- src/generator_dataset/main.py orchestrates the full run.
- src/generator_dataset/settings.py defines the runtime settings and the shared generation context.
The two runtime objects that matter most are:
- Settings, which holds fiscal range, scale parameters, export paths, report-export choices, anomaly mode, and logging choices.
- GenerationContext, which carries loaded settings, the random generator, fiscal calendar, generated tables, ID counters, anomaly log, and validation results.
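One way to picture the two runtime objects is as a pair of dataclasses. This is a sketch only: every field name, the next_id helper, and the sample values are assumptions for illustration and are not taken from settings.py.

```python
import random
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Settings:
    # Illustrative subset of fields; all names here are assumptions.
    fiscal_start: str
    fiscal_end: str
    random_seed: int
    anomaly_mode: str
    export_dir: str


@dataclass
class GenerationContext:
    settings: Settings
    rng: random.Random
    tables: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    id_counters: dict[str, int] = field(default_factory=dict)
    anomaly_log: list[dict[str, Any]] = field(default_factory=list)
    validation_results: dict[str, Any] = field(default_factory=dict)

    def next_id(self, table: str) -> int:
        """Return the next sequential ID for a table, starting at 1."""
        self.id_counters[table] = self.id_counters.get(table, 0) + 1
        return self.id_counters[table]


# Sample values are placeholders, not the released configuration.
settings = Settings("2024-01-01", "2026-12-31", 42, "none", "outputs")
context = GenerationContext(settings=settings, rng=random.Random(settings.random_seed))
```

Seeding the random generator from the settings and passing it through the context is what would make a run repeatable in this sketch.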
End-to-End Build Flow
In plain language, the build:
- loads settings and initializes the shared context
- creates empty tables from the schema definition
- generates master, planning, operational, payroll, and journal activity
- posts accounting events into GLEntry
- runs validations, injects anomalies when configured, and exports outputs
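The steps above can be sketched as an ordered list of stages that share one context. The stage functions, their bodies, and the anomaly_mode key are hypothetical placeholders; the actual orchestration lives in main.py and may be organized differently.

```python
def init_tables(ctx):
    ctx["tables"] = {"GLEntry": []}


def generate_activity(ctx):
    # Placeholder for master, planning, operational, payroll, and journal data.
    ctx["events"] = [{"event": "placeholder"}]


def post_to_gl(ctx):
    ctx["tables"]["GLEntry"].extend(ctx["events"])


def run_validations(ctx):
    ctx["validation_results"] = {"gl_rows": len(ctx["tables"]["GLEntry"])}


def inject_anomalies(ctx):
    ctx["anomaly_log"].append("placeholder anomaly")


def export_outputs(ctx):
    ctx["exported"] = True


def build(settings):
    """Run the stages in order; anomaly injection is skipped unless configured."""
    ctx = {"settings": settings, "anomaly_log": [], "steps": []}
    stages = [init_tables, generate_activity, post_to_gl, run_validations]
    if settings.get("anomaly_mode", "none") != "none":
        stages.append(inject_anomalies)
    stages.append(export_outputs)
    for stage in stages:
        stage(ctx)
        ctx["steps"].append(stage.__name__)
    return ctx
```

Calling build({"anomaly_mode": "none"}) runs every stage except inject_anomalies, which mirrors the "when configured" condition in the list.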
Module Responsibilities
| Module | Current role |
|---|---|
| settings.py | Load YAML configuration and initialize the runtime context |
| calendar.py | Build the fiscal calendar |
| schema.py | Define TABLE_COLUMNS and create empty DataFrames |
| master_data.py | Generate accounts, cost centers, employees, warehouses, items, customers, and suppliers, including employee lifecycle and richer item-catalog attributes |
| manufacturing.py | Generate BOMs, work centers, capacity calendars, routings, work orders, schedules, issues, completions, and work-order close activity |
| planning.py | Generate inventory policies, weekly demand forecasts, supply recommendations, component-demand plans, rough-cut capacity rows, and recommendation conversion helpers |
| payroll.py | Generate shifts, assignments, daily rosters, absences, raw punches, approved time clocks, overtime approvals, payroll periods, labor time, payroll registers, payments, remittances, and manufacturing labor helpers |
| budgets.py | Generate opening balances, driver-based BudgetLine detail, summary Budget rows, and pro forma balance roll-forwards |
| o2c.py | Generate price lists, promotions, pricing resolution, orders, shipments, invoices, receipts, applications, returns, credits, refunds, commission accruals, clawbacks, and sales-rep settlements |
| p2p.py | Generate requisitions, purchase orders, receipts, supplier invoices, and disbursements |
| journals.py | Generate recurring journals, accrued-expense activity, reclasses, and year-end close journals |
| posting_engine.py | Convert operational and payroll events into balanced GL entries |
| validations.py | Run document, accounting, payroll, manufacturing, and roll-forward checks |
| anomalies.py | Inject configured anomalies and record them in the anomaly log |
| state_cache.py | Provide shared cache helpers used by generation and validation |
| exporters.py | Write SQLite, dataset Excel, curated report artifacts, and CSV zip outputs |
| utils.py | Support numbering, rounding, and shared helper logic |
| main.py | Orchestrate the full run and write the generation log |
Current Teaching and Analytics Layer
The current teaching and analytics layer includes:
- broader starter SQL coverage across financial, managerial, and audit topics
- case-based walkthroughs under docs/analytics/cases/
- a documentation set that centers the published teaching dataset as the classroom artifact
- workforce-planning detail for rosters, absences, punches, and overtime approvals that supports new attendance and staffing analytics
- weekly planning support for forecast, policy, recommendation, MRP, and rough-cut capacity analysis
- commercial-pricing support for segment and customer price lists, promotions, override approvals, and price-realization analysis
- sales-commission support for invoice-line accruals, credit-memo clawbacks, rep-level payments, and payable rollforward analysis
Planning outputs support normal replenishment activity. The O2C layer also includes formal commercial-pricing resolution, explicit pricing lineage, and sales-commission accounting tied to invoice-line revenue.
Posting, Validation, and Outputs
The posting model is event-based. Operational and payroll events are generated first, then converted into balanced GLEntry rows through posting_engine.py. Use GLEntry Posting Reference when you need the detailed event-to-account mapping.
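To illustrate the event-based idea, the sketch below turns one operational event into a pair of balanced ledger rows. The event types, account names, row fields, and rule table are assumptions for illustration; the real mapping is documented in GLEntry Posting Reference and implemented in posting_engine.py.

```python
from decimal import Decimal

# Hypothetical rules: event type -> (debit account, credit account).
POSTING_RULES = {
    "sales_invoice": ("Accounts Receivable", "Sales Revenue"),
    "customer_receipt": ("Cash", "Accounts Receivable"),
}


def post_event(event):
    """Convert one event into a debit row and a matching credit row."""
    debit_account, credit_account = POSTING_RULES[event["type"]]
    amount = Decimal(event["amount"])
    zero = Decimal("0")
    return [
        {"SourceDocument": event["document"], "Account": debit_account,
         "Debit": amount, "Credit": zero},
        {"SourceDocument": event["document"], "Account": credit_account,
         "Debit": zero, "Credit": amount},
    ]


def is_balanced(rows):
    """A voucher is balanced when total debits equal total credits."""
    return sum(r["Debit"] for r in rows) == sum(r["Credit"] for r in rows)


rows = post_event({"type": "sales_invoice", "document": "INV-1", "amount": "125.50"})
```

Using Decimal rather than float keeps the debit and credit totals exactly comparable, which matters for any balance check run after posting.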
The validation layer checks:
- schema consistency and orphan-row integrity
- header-to-line totals and status consistency
- O2C, P2P, manufacturing, payroll, and time-clock controls
- planning controls for forecast coverage, policy validity, recommendation conversion, MRP reconciliation, and rough-cut capacity availability
- pricing controls for price-list coverage, promotion validity, price-floor compliance, and invoice or credit pricing-lineage consistency
- sales-commission traceability from invoice lines and credit memo lines into payable activity
- master-data controls for employee roles, employment validity, item catalog completeness, and launch-date usage
- voucher balance, trial balance, and control-account roll-forwards
- journal-header-to-GL agreement and close-cycle completeness
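Two of the checks listed above, orphan-row integrity and header-to-line totals, can be sketched as small pure functions. The function names, table shapes, and column names are hypothetical and do not reflect the actual code in validations.py.

```python
from decimal import Decimal


def find_orphans(child_rows, parent_rows, key):
    """Return child rows whose key has no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


def find_total_mismatches(headers, lines, key, header_total, line_amount):
    """Return header keys whose stored total differs from the sum of their lines."""
    totals = {}
    for line in lines:
        totals[line[key]] = totals.get(line[key], Decimal("0")) + line[line_amount]
    return [
        header[key]
        for header in headers
        if totals.get(header[key], Decimal("0")) != header[header_total]
    ]


# Illustrative data: invoice 2 is off by 1.00 and one line points at a missing invoice 3.
headers = [
    {"InvoiceID": 1, "Total": Decimal("30.00")},
    {"InvoiceID": 2, "Total": Decimal("10.00")},
]
lines = [
    {"InvoiceID": 1, "Amount": Decimal("10.00")},
    {"InvoiceID": 1, "Amount": Decimal("20.00")},
    {"InvoiceID": 2, "Amount": Decimal("9.00")},
    {"InvoiceID": 3, "Amount": Decimal("5.00")},
]

orphans = find_orphans(lines, headers, "InvoiceID")
mismatches = find_total_mismatches(headers, lines, "InvoiceID", "Total", "Amount")
```

Returning the offending rows or keys, rather than a bare pass or fail flag, makes the results usable in a validation report.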
Local generation commonly uses these configuration files:
- config/settings.yaml for the released three-year teaching dataset covering 2024 through 2026
- config/settings_validation.yaml for one-year fast validation
- config/settings_perf.yaml for short-horizon performance profiling
The main entrypoint also supports validation scopes:
- core
- operational
- full
The current generator exports:
- SQLite database
- Excel workbook with dataset table sheets only
- curated report artifacts with HTML preview JSON plus Excel and CSV downloads
- CSV zip package with one CSV per dataset table
- text generation log
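As a rough sketch of two of these outputs, the snippet below writes the same in-memory tables to a SQLite database and to a zip archive holding one CSV per table, using only the standard library. The function names, file names, and sample table are assumptions; exporters.py may structure this differently.

```python
import csv
import io
import os
import sqlite3
import tempfile
import zipfile

# Illustrative table: name -> (column names, row tuples). Names are hypothetical.
TABLES = {
    "GLEntry": (
        ["GLEntryID", "AccountID", "Debit", "Credit"],
        [(1, 1100, 125.5, 0.0), (2, 4000, 0.0, 125.5)],
    ),
}


def export_sqlite(tables, path):
    """Create one SQLite table per dataset table and insert its rows."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            for name, (columns, rows) in tables.items():
                column_sql = ", ".join(f'"{c}"' for c in columns)
                placeholders = ", ".join("?" for _ in columns)
                conn.execute(f'CREATE TABLE "{name}" ({column_sql})')
                conn.executemany(
                    f'INSERT INTO "{name}" VALUES ({placeholders})', rows
                )
    finally:
        conn.close()


def export_csv_zip(tables, path):
    """Write one CSV per dataset table into a single zip archive."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, (columns, rows) in tables.items():
            buffer = io.StringIO(newline="")
            writer = csv.writer(buffer)
            writer.writerow(columns)
            writer.writerows(rows)
            archive.writestr(f"{name}.csv", buffer.getvalue())


# Usage: write both outputs into a temporary directory.
with tempfile.TemporaryDirectory() as out_dir:
    export_sqlite(TABLES, os.path.join(out_dir, "dataset.sqlite"))
    export_csv_zip(TABLES, os.path.join(out_dir, "dataset_csv.zip"))
    print(sorted(os.listdir(out_dir)))
```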
Most course users should start with those generated files and the documentation site.