
Technical Guide

This page gives the codebase-level view of how the dataset is generated.

For table structure and join fields, use Schema Reference. For event-to-ledger behavior, use GLEntry Posting Reference. For current scope and planned improvements, use Roadmap.

The current system includes demand forecasts, inventory policies, supply recommendations, component-demand planning, and rough-cut capacity tieout. On the O2C side, it includes explicit pricing lineage through price lists, promotions, and override approvals, plus a separate sales-commission subledger. It also includes the workforce-planning layer that supports the approved daily time-clock model.

Current System at a Glance

The current implementation has six layers:

Layer | Main content
Business and process layer | Fictional company operations, source documents, and process flow
Operational data layer | O2C, P2P, manufacturing, payroll, time, and master-data tables
Accounting layer | JournalEntry, GLEntry, and the chart of accounts
Control layer | Validations, anomaly injection, validation reporting, and generation logging
Delivery layer | SQLite, Excel, CSV zip, report artifacts, and text-log exports
Configuration layer | Settings, runtime context, fiscal calendar, and validation scopes

The implemented schema is defined in src/generator_dataset/schema.py through TABLE_COLUMNS.
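
As a rough illustration of the pattern, the sketch below shows a dictionary that maps table names to column lists and a helper that builds one empty table per entry. The two tables' column names and the create_empty_tables helper are assumptions for illustration only. The real module creates DataFrames, while this sketch uses plain dictionaries so it runs with the standard library alone.

```python
# Hypothetical excerpt: the real TABLE_COLUMNS covers many more tables,
# and these column names are illustrative assumptions.
TABLE_COLUMNS = {
    "JournalEntry": ["JournalEntryID", "PostingDate", "Description"],
    "GLEntry": ["GLEntryID", "JournalEntryID", "AccountID", "Debit", "Credit"],
}


def create_empty_tables(table_columns):
    """Return one empty, column-keyed table per schema entry."""
    return {
        name: {column: [] for column in columns}
        for name, columns in table_columns.items()
    }


tables = create_empty_tables(TABLE_COLUMNS)
print(list(tables["GLEntry"]))
```

Keeping the column lists in one place means generation, validation, and export code can all refer to the same definition.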

Entrypoints and Runtime Objects

  • generate_dataset.py is the simplest repository-root entrypoint.
  • src/generator_dataset/main.py orchestrates the full run.
  • src/generator_dataset/settings.py defines the runtime settings and the shared generation context.

The two runtime objects that matter most are:

  • Settings, which holds fiscal range, scale parameters, export paths, report-export choices, anomaly mode, and logging choices.
  • GenerationContext, which carries loaded settings, the random generator, fiscal calendar, generated tables, ID counters, anomaly log, and validation results.
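
One way to picture these two objects is as a pair of dataclasses. The sketch below is not the code in settings.py: every field name, default value, the next_id helper, the "SalesOrder" table name, and the seed are assumptions chosen to mirror the descriptions above. The date range assumes the fiscal years align with calendar years 2024 through 2026.

```python
import random
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Settings:
    # All field names and defaults are assumptions that mirror the prose.
    fiscal_start: date
    fiscal_end: date
    scale: int = 1
    export_dir: str = "outputs"
    export_reports: bool = True
    anomaly_mode: str = "none"
    log_level: str = "INFO"


@dataclass
class GenerationContext:
    settings: Settings
    rng: random.Random
    fiscal_calendar: list = field(default_factory=list)
    tables: dict = field(default_factory=dict)
    id_counters: dict = field(default_factory=dict)
    anomaly_log: list = field(default_factory=list)
    validation_results: dict = field(default_factory=dict)

    def next_id(self, table_name: str) -> int:
        """Return the next sequential ID for a table (hypothetical helper)."""
        self.id_counters[table_name] = self.id_counters.get(table_name, 0) + 1
        return self.id_counters[table_name]


settings = Settings(fiscal_start=date(2024, 1, 1), fiscal_end=date(2026, 12, 31))
context = GenerationContext(settings=settings, rng=random.Random(42))
first_id = context.next_id("SalesOrder")
second_id = context.next_id("SalesOrder")
```

Passing a single context object through every generator keeps the random state and ID counters shared, which is what makes a run reproducible from its settings.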

End-to-End Build Flow

In plain language, the build:

  1. loads settings and initializes the shared context
  2. creates empty tables from the schema definition
  3. generates master, planning, operational, payroll, and journal activity
  4. posts accounting events into GLEntry
  5. runs validations, injects anomalies when configured, and exports outputs
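
The five steps above can be sketched as an ordered list of functions that share one context. This is only a toy outline of the control flow, not the code in main.py: the step functions, the dictionary-based context, and the sample event are all placeholders invented for illustration.

```python
def load_settings(context):
    context["settings"] = {"anomaly_mode": "none"}  # placeholder settings


def create_tables(context):
    context["tables"] = {"GLEntry": []}


def generate_activity(context):
    context["events"] = [{"type": "invoice", "amount": 100}]


def post_to_gl(context):
    # Stand-in for posting: copy each event into the GLEntry table.
    for event in context["events"]:
        context["tables"]["GLEntry"].append(event)


def validate_and_export(context):
    context["validation_passed"] = len(context["tables"]["GLEntry"]) > 0
    if context["settings"]["anomaly_mode"] != "none":
        context["anomalies_injected"] = True
    context["exported"] = True


BUILD_STEPS = [
    ("load_settings", load_settings),
    ("create_tables", create_tables),
    ("generate_activity", generate_activity),
    ("post_to_gl", post_to_gl),
    ("validate_and_export", validate_and_export),
]


def run_build(steps):
    """Run each build step in order and log the steps that completed."""
    context = {"log": []}
    for name, step in steps:
        step(context)
        context["log"].append(name)
    return context


result = run_build(BUILD_STEPS)
```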

Module Responsibilities

Module | Current role
settings.py | Load YAML configuration and initialize the runtime context
calendar.py | Build the fiscal calendar
schema.py | Define TABLE_COLUMNS and create empty DataFrames
master_data.py | Generate accounts, cost centers, employees, warehouses, items, customers, and suppliers, including employee lifecycle and richer item-catalog attributes
manufacturing.py | Generate BOMs, work centers, capacity calendars, routings, work orders, schedules, issues, completions, and work-order close activity
planning.py | Generate inventory policies, weekly demand forecasts, supply recommendations, component-demand plans, rough-cut capacity rows, and recommendation conversion helpers
payroll.py | Generate shifts, assignments, daily rosters, absences, raw punches, approved time clocks, overtime approvals, payroll periods, labor time, payroll registers, payments, remittances, and manufacturing labor helpers
budgets.py | Generate opening balances, driver-based BudgetLine detail, summary Budget rows, and pro forma balance roll-forwards
o2c.py | Generate price lists, promotions, pricing resolution, orders, shipments, invoices, receipts, applications, returns, credits, refunds, commission accruals, clawbacks, and sales-rep settlements
p2p.py | Generate requisitions, purchase orders, receipts, supplier invoices, and disbursements
journals.py | Generate recurring journals, accrued-expense activity, reclasses, and year-end close journals
posting_engine.py | Convert operational and payroll events into balanced GL entries
validations.py | Run document, accounting, payroll, manufacturing, and roll-forward checks
anomalies.py | Inject configured anomalies and record them in the anomaly log
state_cache.py | Provide shared cache helpers used by generation and validation
exporters.py | Write SQLite, dataset Excel, curated report artifacts, and CSV zip outputs
utils.py | Support numbering, rounding, and shared helper logic
main.py | Orchestrate the full run and write the generation log

Current Teaching and Analytics Layer

The current teaching and analytics layer includes:

  • broader starter SQL coverage across financial, managerial, and audit topics
  • case-based walkthroughs under docs/analytics/cases/
  • a documentation set that centers the published teaching dataset as the classroom artifact
  • workforce-planning detail for rosters, absences, punches, and overtime approvals that supports new attendance and staffing analytics
  • weekly planning support for forecast, policy, recommendation, MRP, and rough-cut capacity analysis
  • commercial-pricing support for segment and customer price lists, promotions, override approvals, and price-realization analysis
  • sales-commission support for invoice-line accruals, credit-memo clawbacks, rep-level payments, and payable rollforward analysis

Planning outputs support normal replenishment activity. The O2C layer also includes formal commercial-pricing resolution, explicit pricing lineage, and sales-commission accounting tied to invoice-line revenue.

Posting, Validation, and Outputs

The posting model is event-based. Operational and payroll events are generated first, then converted into balanced GLEntry rows through posting_engine.py. Use GLEntry Posting Reference when you need the detailed event-to-account mapping.
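
To make the event-based idea concrete, here is a minimal sketch of turning events into balanced rows. It is not the logic in posting_engine.py: the account codes, event types, row fields, and the post_event and is_balanced helpers are hypothetical, and the real event-to-account mapping is the one documented in GLEntry Posting Reference.

```python
from decimal import Decimal

# Hypothetical rules: each event type maps to (account code, side) pairs.
POSTING_RULES = {
    "SalesInvoice": [("1100", "debit"), ("4000", "credit")],
    "CustomerReceipt": [("1000", "debit"), ("1100", "credit")],
}


def post_event(event):
    """Convert one event into GL rows using its posting rule."""
    rows = []
    for account, side in POSTING_RULES[event["type"]]:
        rows.append({
            "SourceDocument": event["document_id"],
            "Account": account,
            "Debit": event["amount"] if side == "debit" else Decimal("0"),
            "Credit": event["amount"] if side == "credit" else Decimal("0"),
        })
    return rows


def is_balanced(rows):
    """Return True when total debits equal total credits."""
    return sum(r["Debit"] for r in rows) == sum(r["Credit"] for r in rows)


events = [
    {"type": "SalesInvoice", "document_id": "INV-1", "amount": Decimal("250.00")},
    {"type": "CustomerReceipt", "document_id": "RCPT-1", "amount": Decimal("250.00")},
]
gl_rows = [row for event in events for row in post_event(event)]
```

Using Decimal rather than float avoids rounding drift when debits and credits are compared for equality.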

The validation layer checks:

  • schema consistency and orphan-row integrity
  • header-to-line totals and status consistency
  • O2C, P2P, manufacturing, payroll, and time-clock controls
  • planning controls for forecast coverage, policy validity, recommendation conversion, MRP reconciliation, and rough-cut capacity availability
  • pricing controls for price-list coverage, promotion validity, price-floor compliance, and invoice or credit pricing-lineage consistency
  • sales-commission traceability from invoice lines and credit memo lines into payable activity
  • master-data controls for employee roles, employment validity, item catalog completeness, and launch-date usage
  • voucher balance, trial balance, and control-account roll-forwards
  • journal-header-to-GL agreement and close-cycle completeness
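
Two of the checks listed above, voucher balance and orphan-row integrity, can be sketched in a few lines. The function names, the VoucherID and SalesOrderID fields, and the sample rows are assumptions for illustration, not the actual interface of validations.py.

```python
from collections import defaultdict
from decimal import Decimal


def find_unbalanced_vouchers(gl_rows):
    """Return the voucher IDs whose debits and credits differ."""
    net = defaultdict(Decimal)  # a missing voucher starts at Decimal("0")
    for row in gl_rows:
        net[row["VoucherID"]] += row["Debit"] - row["Credit"]
    return sorted(voucher for voucher, amount in net.items() if amount != 0)


def find_orphan_rows(child_rows, parent_rows, key):
    """Return child rows whose key matches no parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


sample_gl = [
    {"VoucherID": "V1", "Debit": Decimal("100.00"), "Credit": Decimal("0")},
    {"VoucherID": "V1", "Debit": Decimal("0"), "Credit": Decimal("100.00")},
    {"VoucherID": "V2", "Debit": Decimal("50.00"), "Credit": Decimal("0")},
    {"VoucherID": "V2", "Debit": Decimal("0"), "Credit": Decimal("45.00")},
]
orders = [{"SalesOrderID": 1}]
order_lines = [
    {"SalesOrderID": 1, "LineNo": 1},
    {"SalesOrderID": 9, "LineNo": 1},  # no matching order header
]

unbalanced = find_unbalanced_vouchers(sample_gl)
orphans = find_orphan_rows(order_lines, orders, "SalesOrderID")
```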

Local generation commonly uses these configuration files:

  • config/settings.yaml for the released three-year teaching dataset covering 2024 through 2026
  • config/settings_validation.yaml for one-year fast validation
  • config/settings_perf.yaml for short-horizon performance profiling

The main entrypoint also supports validation scopes:

  • core
  • operational
  • full
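
The source names these scopes but does not say which checks each one runs. The sketch below therefore assumes a nested design in which each wider scope adds checks on top of the narrower ones. The check names, their grouping, and the checks_for_scope helper are all hypothetical.

```python
# Hypothetical grouping: the scope names come from the guide,
# the check names and their assignment to scopes do not.
SCOPE_CHECKS = {
    "core": ["schema", "voucher_balance", "trial_balance"],
    "operational": ["o2c", "p2p", "manufacturing", "payroll"],
    "full": ["planning", "pricing", "commissions", "roll_forwards"],
}
SCOPE_ORDER = ["core", "operational", "full"]


def checks_for_scope(scope):
    """Return the checks for a scope, including every narrower scope."""
    if scope not in SCOPE_ORDER:
        raise ValueError(f"Unknown validation scope: {scope!r}")
    included = SCOPE_ORDER[: SCOPE_ORDER.index(scope) + 1]
    return [check for name in included for check in SCOPE_CHECKS[name]]


core_checks = checks_for_scope("core")
full_checks = checks_for_scope("full")
```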

The current generator exports:

  • SQLite database
  • Excel workbook with dataset table sheets only
  • curated report artifacts with HTML preview JSON plus Excel and CSV downloads
  • CSV zip package with one CSV per dataset table
  • text generation log
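
As an illustration of the SQLite and CSV zip outputs, the sketch below writes a small set of tables to both formats using only the standard library. It is not the code in exporters.py: the function names, the sample tables, and the row-dictionary representation are assumptions, and the sketch expects every table to contain at least one row.

```python
import csv
import io
import sqlite3
import zipfile


def write_csv_zip(tables, target):
    """Write one CSV per table into a zip archive."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            buffer = io.StringIO()
            # Column order is taken from the first row (assumes non-empty tables).
            writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
            archive.writestr(f"{name}.csv", buffer.getvalue())


def write_sqlite(tables, connection):
    """Create one SQLite table per dataset table and insert its rows."""
    for name, rows in tables.items():
        columns = list(rows[0])
        column_sql = ", ".join(f'"{column}"' for column in columns)
        placeholders = ", ".join("?" for _ in columns)
        connection.execute(f'CREATE TABLE "{name}" ({column_sql})')
        connection.executemany(
            f'INSERT INTO "{name}" VALUES ({placeholders})',
            [tuple(row[column] for column in columns) for row in rows],
        )
    connection.commit()


# Illustrative sample data, not rows from the published dataset.
sample_tables = {
    "Customer": [{"CustomerID": 1, "Name": "Acme"}],
    "Item": [{"ItemID": 10, "Description": "Widget"}],
}

archive_bytes = io.BytesIO()  # in-memory stand-in for a file on disk
write_csv_zip(sample_tables, archive_bytes)

connection = sqlite3.connect(":memory:")
write_sqlite(sample_tables, connection)
```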

Most course users should start with those generated files and the documentation site.