Source: data_layer/docs/ORGANIZATION_STRATEGY_COMPLETE.md
Complete Organization Strategy Analysis
Date: 2025-01-16
Purpose: Determine optimal folder organization: lifecycle vs slice vs scenario vs hybrid
Current State Assessment
What You Have Now
data_fabric/
├── prompts/          # Lifecycle-ish (generation logic)
├── storage/          # Lifecycle (runtime operations)
├── knowledge/        # Lifecycle (intelligence operations)
├── kb_catalog/       # Mixed (business rules + config)
└── output-styles/    # Scenario-based (onboarding pipeline)
    ├── config/       # ⚠️ Doesn't fit "output-styles"
    ├── onboarding/   # ✅ Scenario-based stages
    └── schemas/      # ⚠️ Duplicates exist elsewhere

Current Organization: MIXED (70% lifecycle, 30% scenario)
Organization Philosophies Explained
1️⃣ Lifecycle Stage Organization
Definition: Organize by WHERE data exists in its transformation journey
data_fabric/
├── definitions/   # BIRTH: Canonical sources
├── weave/         # LIFE: Active processing
└── views/         # DEATH: Materialized outputs

Mental Model: Assembly Line
- Raw materials → Processing → Finished goods
Pros:
- ✅ Clear data flow (source → runtime → output)
- ✅ Separation of concerns (immutable vs mutable)
- ✅ Git-friendly (know what to track vs ignore)
- ✅ Scalable (easy to add new lifecycle stages)
- ✅ DRY (single source of truth enforced)
Cons:
- ⚠️ Cross-cutting features span multiple stages
- ⚠️ Harder to navigate for feature-focused work
- ⚠️ Requires understanding of data lineage
Best For:
- Data engineering teams
- Systems with clear ETL pipelines
- Multi-storage architectures
- Version-controlled configuration
2️⃣ Slice/Domain Organization
Definition: Organize by WHAT business capability/domain it serves
data_fabric/
├── pricing/          # Everything pricing-related
│   ├── schemas/
│   ├── config/
│   ├── examples/
│   └── runtime/
├── scoring/
├── contracts/
└── questionnaires/

Mental Model: Vertical Slices
- Each slice is self-contained
Pros:
- ✅ Feature co-location (everything for X in one place)
- ✅ Team ownership (clear boundaries)
- ✅ Easier feature work (don't jump directories)
- ✅ Microservice-ready (can extract slices)
Cons:
- ⚠️ Cross-domain code duplication risk
- ⚠️ Shared infrastructure unclear placement
- ⚠️ Inconsistent structure across slices
- ⚠️ Harder to see system-wide patterns
Best For:
- Product-focused teams
- Domain-driven design
- Microservices architecture
- Teams with ownership boundaries
3️⃣ Scenario/Workflow Organization
Definition: Organize by WHICH business process/workflow it supports
data_fabric/
├── onboarding/          # Everything for onboarding
│   ├── 01-ingest/
│   ├── 02-classify/
│   └── 03-contract/
├── analytics/
├── real_time_betting/
└── reporting/

Mental Model: User Journeys
- Follow the business process
Pros:
- ✅ Business alignment (mirrors operations)
- ✅ Easy for stakeholders to understand
- ✅ Clear entry points for workflows
- ✅ Process optimization visibility
Cons:
- ⚠️ Massive duplication across scenarios
- ⚠️ Shared code unclear placement
- ⚠️ Rigid (hard to support new scenarios)
- ⚠️ Doesn't reflect code reuse
Best For:
- Business-driven projects
- Single workflow focus
- Prototypes/MVPs
- Process documentation
4️⃣ Hybrid Organization (RECOMMENDED)
Definition: Lifecycle at top, domain/scenario within stages
data_fabric/
├── definitions/       # LIFECYCLE (immutable, git-tracked)
│   ├── schemas/       # Sliced by domain
│   ├── config/        # Sliced by domain
│   ├── templates/     # Sliced by scenario
│   └── examples/      # Sliced by scenario
│
├── weave/             # LIFECYCLE (runtime, operational)
│   ├── knowledge/     # Slice (intelligence)
│   ├── storage/       # Slice (persistence)
│   └── prompts/       # Slice (generation)
│
└── views/             # LIFECYCLE (outputs, gitignored)
    ├── onboarding/    # Scenario-based
    ├── contracts/     # Scenario-based
    └── analytics/     # Scenario-based

Mental Model: Layered Cake with Flavors
- Layers = lifecycle stages
- Flavors = domains/scenarios within
Pros:
- ✅ Best of both worlds (clear flow + feature co-location)
- ✅ Flexible (choose organization per layer)
- ✅ Intuitive (lifecycle for infra, domain for business)
- ✅ Scalable (add slices without restructuring)
Cons:
- ⚠️ More complex (two organization principles)
- ⚠️ Requires discipline (don't mix metaphors)
Best For:
- Complex systems with multiple concerns
- Mixed technical/business focus
- Growing teams
- YOUR SYSTEM ✅
Decision Matrix for YOUR System
Your System Characteristics
| Characteristic | Reality | Org Implication |
|---|---|---|
| Multi-storage | PostgreSQL + Redis + Vector | → Lifecycle (separate runtime) |
| Multiple workflows | Onboarding, analytics, contracts | → Scenario (within outputs) |
| Business domains | Pricing, scoring, sports | → Slice (within config) |
| Auto-generation | Config → Examples, Schema → Adapters | → Lifecycle (source vs derived) |
| Team size | Small/Growing | → Hybrid (room to evolve) |
| Git management | Version control critical | → Lifecycle (immutable vs gitignore) |
Conclusion: Hybrid Organization (Lifecycle + Domain/Scenario)
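The "lifecycle first, domain/scenario second" rule can be pictured as a small lookup. This is only a sketch: the `PLACEMENT` table, the artifact kinds, and the `place` function are hypothetical names chosen for illustration and do not exist in the codebase.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: artifact kind -> (lifecycle stage, sub-folder).
# An empty sub-folder means the slice sits directly under the stage.
PLACEMENT = {
    "schema": ("definitions", "schemas"),
    "config": ("definitions", "config"),
    "template": ("definitions", "templates"),
    "example": ("definitions", "examples"),
    "code": ("weave", ""),
    "output": ("views", ""),
}


def place(kind: str, slice_name: str, root: str = "data_fabric") -> PurePosixPath:
    """Pick the lifecycle stage first, then append the domain or scenario slice."""
    stage, subdir = PLACEMENT[kind]
    parts = [part for part in (root, stage, subdir, slice_name) if part]
    return PurePosixPath(*parts)
```

For example, `place("config", "business/pricing")` resolves under `definitions/`, while `place("output", "onboarding")` resolves under `views/`.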
Recommended Structure (Complete)
Top-Level: Lifecycle Stages
data_fabric/
├── definitions/   # Git-tracked, immutable, canonical
├── weave/         # Python modules, operational code
├── views/         # Generated outputs, gitignored
├── docs/          # Documentation (lifecycle-agnostic)
├── scripts/       # Maintenance utilities
└── tests/         # Testing (lifecycle-agnostic)

Level 1: definitions/ - SOURCE OF TRUTH
Organization: DOMAIN-SLICED (by business capability)
data_fabric/
└── definitions/                      # All canonical data
    │
    ├── schemas/                      # ✅ Keep current structure
    │   ├── domain/v1/                # Domain models
    │   │   ├── league/
    │   │   ├── sports/
    │   │   ├── contract/
    │   │   ├── pricing/
    │   │   └── questionnaire/
    │   │
    │   ├── generated/                # Auto-generated adapters
    │   │   ├── drizzle/              # TypeScript/Drizzle
    │   │   ├── pydantic/             # Python/Pydantic
    │   │   └── typescript/           # TypeScript interfaces
    │   │
    │   └── README.md
    │
    ├── config/                       # Domain-specific business rules
    │   ├── business/
    │   │   ├── pricing/              # ← MOVE FROM output-styles/config
    │   │   │   ├── tier_presets.v1.json
    │   │   │   ├── combat.pricing.v1.json
    │   │   │   ├── default.pricing.v1.json
    │   │   │   └── README.md
    │   │   │
    │   │   ├── scoring/              # ← MOVE FROM output-styles/config
    │   │   │   ├── scoring_model.v1.json
    │   │   │   ├── weights.v1.json
    │   │   │   └── README.md
    │   │   │
    │   │   └── contracts/            # NEW: Contract templates config
    │   │       ├── template_mappings.json
    │   │       ├── clause_library.json
    │   │       └── README.md
    │   │
    │   ├── sports/                   # Sport-specific configs
    │   │   ├── archetypes.json
    │   │   ├── betting_markets.json
    │   │   ├── data_requirements.json
    │   │   └── README.md
    │   │
    │   ├── pipeline/                 # Pipeline stage configs
    │   │   ├── onboarding_stages.json
    │   │   ├── validation_rules.json
    │   │   └── README.md
    │   │
    │   └── README.md                 # Config governance
    │
    ├── templates/                    # SCENARIO-ORGANIZED (by workflow)
    │   ├── prompts/                  # AI prompt templates
    │   │   ├── onboarding/
    │   │   │   ├── extract_questionnaire.j2
    │   │   │   ├── classify_sport.j2
    │   │   │   └── suggest_tier.j2
    │   │   │
    │   │   ├── contracts/
    │   │   │   ├── generate_terms.j2
    │   │   │   └── assemble_document.j2
    │   │   │
    │   │   ├── components/           # Reusable fragments
    │   │   │   ├── system_instructions/
    │   │   │   ├── output_formats/
    │   │   │   └── few_shot/
    │   │   │
    │   │   └── README.md
    │   │
    │   └── contracts/                # Document templates
    │       ├── term_sheet.md.j2
    │       ├── msa.md.j2
    │       └── README.md
    │
    └── examples/                     # SCENARIO-ORGANIZED (training data)
        ├── onboarding/
        │   ├── questionnaire_extraction/
        │   │   ├── examples.jsonl    # Manual examples
        │   │   ├── metadata.json
        │   │   └── README.md
        │   │
        │   ├── tier_classification/
        │   │   ├── examples.jsonl    # Manual examples
        │   │   ├── generated.jsonl   # ← AUTO-GENERATED from config
        │   │   ├── generator.py      # ← Generation script
        │   │   └── README.md
        │   │
        │   └── contract_assembly/
        │       ├── examples.jsonl
        │       └── README.md
        │
        ├── sports_classification/
        │   ├── by_archetype.jsonl
        │   ├── by_market_readiness.jsonl
        │   └── README.md
        │
        └── README.md                 # Example governance

Why Domain-Sliced Here:
- ✅ Config naturally groups by domain (pricing, scoring)
- ✅ Schemas already domain-organized
- ✅ Templates group by use case (scenario)
- ✅ Examples group by training task (scenario)
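The tree above marks `generated.jsonl` as derived from config. A minimal sketch of that derivation might look like the following. The preset shape (`tiers` entries with `name` and `min_score`) is an assumption made for illustration, since the real layout of `tier_presets.v1.json` is not shown here, and both function names are hypothetical.

```python
import json
from pathlib import Path


def examples_from_presets(presets: dict) -> list[dict]:
    """Turn each tier preset into one training example (assumed preset shape)."""
    rows = []
    for tier in presets.get("tiers", []):
        rows.append(
            {
                "input": {"score": tier["min_score"]},
                "output": {"tier": tier["name"]},
                "source": "generated",  # distinguishes derived rows from manual ones
            }
        )
    return rows


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, sort_keys=True) + "\n")
```

Keeping generated rows in a separate file from `examples.jsonl` means the derived data can be regenerated at any time without touching the manually curated examples.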
Level 2: weave/ - OPERATIONAL RUNTIME
Organization: TECHNICAL-SLICED (by system capability)
data_fabric/
└── weave/                             # All runtime operations
    │
    ├── knowledge/                     # ✅ Keep structure (AI operations)
    │   ├── __init__.py
    │   ├── embeddings/                # Vector generation
    │   │   ├── __init__.py
    │   │   ├── service.py
    │   │   └── config.py
    │   │
    │   ├── intent/                    # Query classification
    │   │   ├── __init__.py
    │   │   ├── classifier.py
    │   │   └── patterns.py
    │   │
    │   ├── retrieval/                 # RAG operations
    │   │   ├── __init__.py
    │   │   ├── rag_service.py
    │   │   ├── query_builder.py
    │   │   └── reranker.py
    │   │
    │   ├── storage/                   # Vector DB interface
    │   │   ├── __init__.py
    │   │   ├── langmem_client.py
    │   │   └── vector_store.py
    │   │
    │   └── templates/                 # Dynamic prompt assembly
    │       ├── __init__.py
    │       ├── prompt_builder.py
    │       └── template_loader.py
    │
    ├── storage/                       # ✅ Keep structure (persistence)
    │   ├── __init__.py
    │   ├── examples/                  # ⚠️ This is CODE, not data!
    │   │   ├── __init__.py
    │   │   ├── retriever.py           # Example retrieval system
    │   │   ├── matcher.py             # Example matching logic
    │   │   ├── cache.py               # Runtime example cache
    │   │   └── data/                  # .gitignore runtime cache
    │   │
    │   ├── postgres/                  # PostgreSQL operations
    │   │   ├── __init__.py
    │   │   ├── client.py
    │   │   └── models/
    │   │
    │   ├── redis/                     # Cache layer
    │   │   ├── __init__.py
    │   │   └── client.py
    │   │
    │   └── supabase/                  # Supabase operations
    │       ├── __init__.py
    │       └── client.py
    │
    ├── prompts/                       # ✅ Enhance (generation logic)
    │   ├── __init__.py
    │   ├── builders/                  # Prompt construction
    │   │   ├── __init__.py
    │   │   ├── onboarding_prompts.py
    │   │   ├── classification_prompts.py
    │   │   ├── contract_prompts.py
    │   │   └── base.py
    │   │
    │   ├── registry/                  # Prompt metadata
    │   │   ├── __init__.py
    │   │   └── catalog.json
    │   │
    │   └── README.md
    │
    ├── generators/                    # NEW: Data generation pipelines
    │   ├── __init__.py
    │   ├── config_to_examples.py      # Config → Examples
    │   ├── schema_to_adapters.py      # Schema → Pydantic/TS
    │   └── contract_assembler.py      # Data → Contracts
    │
    └── validators/                    # NEW: Validation logic
        ├── __init__.py
        ├── schema_validator.py
        ├── config_validator.py
        └── example_validator.py

Why Technical-Sliced Here:
- ✅ Python modules are technical capabilities
- ✅ Clear separation of concerns (knowledge vs storage vs generation)
- ✅ Easy to test (mock boundaries)
- ✅ Reusable across scenarios
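To make "mock boundaries" concrete, here is a sketch of a prompt builder that depends only on a narrow storage interface, so a test can swap in an in-memory fake. `ExampleStore`, `InMemoryExampleStore`, and `build_few_shot_prompt` are hypothetical names, not the actual interfaces of `weave/storage` or `weave/prompts`.

```python
from typing import Protocol


class ExampleStore(Protocol):
    """The only thing the prompt code needs from the storage slice."""

    def fetch(self, task: str, limit: int) -> list[dict]: ...


class InMemoryExampleStore:
    """Test double that satisfies ExampleStore without touching a database."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def fetch(self, task: str, limit: int) -> list[dict]:
        return [row for row in self._rows if row["task"] == task][:limit]


def build_few_shot_prompt(
    store: ExampleStore, task: str, question: str, limit: int = 2
) -> str:
    """Render stored examples as Q/A pairs, then append the open question."""
    blocks = [f"Q: {row['input']}\nA: {row['output']}" for row in store.fetch(task, limit)]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Because the builder sees only the protocol, the same function could be reused by any scenario that has a store with a matching `fetch` method.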
Level 3: views/ - MATERIALIZED OUTPUTS
Organization: SCENARIO-BASED (by business workflow)
data_fabric/
└── views/                                  # ⚠️ .gitignore entire directory
    │
    ├── onboarding/                         # Onboarding pipeline outputs
    │   ├── 02-ingest-validate-questionnaire/
    │   │   ├── example_seeds/              # Input seeds
    │   │   ├── validated/                  # Validation results
    │   │   └── metadata/                   # Processing metadata
    │   │
    │   ├── 03-enhance-documents/
    │   │   ├── enriched/
    │   │   └── metadata/
    │   │
    │   ├── 04-classify-and-score/
    │   │   ├── classifications/
    │   │   ├── scores/
    │   │   └── recommendations/
    │   │
    │   ├── 05-upsert-and-crossref/
    │   │   ├── upserted/
    │   │   └── relationships/
    │   │
    │   ├── 06-suggest-tiers-and-terms/
    │   │   ├── tier_suggestions/
    │   │   ├── term_suggestions/
    │   │   └── pricing_recommendations/
    │   │
    │   ├── 07-assemble-contract/
    │   │   ├── drafts/
    │   │   ├── final/
    │   │   └── metadata/
    │   │
    │   ├── 07a-output-contract-export/
    │   │   ├── pdf/
    │   │   ├── docx/
    │   │   └── markdown/
    │   │
    │   ├── 07b-output-gamekeeper-scorekeeper-ui/
    │   │   ├── configs/
    │   │   └── data/
    │   │
    │   └── 07c-output-marketing-nxt-onboarding-materials/
    │       ├── presentations/
    │       └── assets/
    │
    ├── analytics/                          # Analytics pipeline outputs
    │   ├── reports/
    │   ├── dashboards/
    │   └── exports/
    │
    ├── contracts/                          # Generated contracts (all workflows)
    │   ├── term_sheets/
    │   ├── msas/
    │   └── amendments/
    │
    └── uploads/                            # User-uploaded files
        ├── questionnaires/
        └── documents/

Why Scenario-Based Here:
- ✅ Business workflows are scenarios
- ✅ Each pipeline stage produces artifacts
- ✅ Easy to clean up (rm -rf views/), provided uploads/ is backed up first, since user-uploaded files cannot be regenerated
- ✅ Gitignored (generated files are not tracked)
Comparison: Current vs Recommended
Current Structure Issues
data_fabric/
├── output-styles/       # ❌ Mixed metaphor
│   ├── config/          # ❌ Should be in definitions/
│   ├── onboarding/      # ✅ Good (scenario-based)
│   └── schemas/         # ❌ Duplicate of schemas/
│
├── prompts/             # ⚠️ Mixed (templates + code)
│   ├── components/      # ✅ Should be in definitions/
│   └── builders/        # ✅ Should stay (code)
│
├── kb_catalog/          # ⚠️ Unclear purpose
│   ├── constants/       # ✅ Good (business rules)
│   └── manifests/       # ⚠️ What's this?
│
└── storage/examples/    # ❌ Confusing (code or data?)

Problems:
- Mixed lifecycle stages (source + runtime + output)
- Duplicate schemas (schemas/ and output-styles/schemas/)
- Unclear metaphors ("output-styles" but has config?)
- Code vs data confusion (storage/examples/ is code!)
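Before consolidating, it may help to confirm which schema files are byte-for-byte duplicates. The sketch below groups files by SHA-256 digest across any number of directories. `find_duplicate_files` is a hypothetical helper, and it detects only exact copies, not schemas that have drifted apart.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(*roots: Path) -> list[list[Path]]:
    """Group files under the given roots by content hash; keep groups of 2+."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for root in roots:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                by_digest[digest].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

A call such as `find_duplicate_files(Path("data_fabric/schemas"), Path("data_fabric/output-styles/schemas"))` would list the pairs that can be collapsed into a single canonical copy.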
Recommended Structure Benefits
data_fabric/
├── definitions/       # ✅ Clear: "source of truth"
│   ├── schemas/       # ✅ Only place for schemas
│   ├── config/        # ✅ Only place for business config
│   ├── templates/     # ✅ Only place for templates
│   └── examples/      # ✅ Only place for training data
│
├── weave/             # ✅ Clear: "operational code"
│   ├── knowledge/     # ✅ AI operations
│   ├── storage/       # ✅ Persistence operations
│   ├── prompts/       # ✅ Generation code
│   └── generators/    # ✅ Transformation code
│
└── views/             # ✅ Clear: "generated outputs"
    ├── onboarding/    # ✅ Scenario-based
    └── analytics/     # ✅ Scenario-based

Benefits:
- ✅ Single source of truth (no duplicates)
- ✅ Clear lifecycle (definitions → weave → views)
- ✅ Git-friendly (track definitions, ignore views)
- ✅ Domain-sliced where it matters (config, schemas)
- ✅ Scenario-sliced where it matters (pipelines, examples)
Migration Strategy
Phase 1: Non-Breaking Additions (Week 1)
# Create new structure without deleting old
mkdir -p data_fabric/definitions/{schemas,config,templates,examples}
mkdir -p data_fabric/definitions/config/{business,sports,pipeline}
mkdir -p data_fabric/definitions/templates/{prompts,contracts}
mkdir -p data_fabric/definitions/examples/onboarding
mkdir -p data_fabric/weave/{knowledge,storage,prompts,generators,validators}
mkdir -p data_fabric/views/{onboarding,analytics,contracts,uploads}

Phase 2: Copy (Don't Move) Critical Files (Week 1)
# Config files (keep originals as backup)
cp -r data_fabric/output-styles/config/business/* data_fabric/definitions/config/business/
# Prompt templates (Phase 1 does not create components/, so create it before copying)
mkdir -p data_fabric/definitions/templates/prompts/components
cp -r data_fabric/prompts/components/* data_fabric/definitions/templates/prompts/components/
# Examples (if any exist outside storage/)
# ... identify and copy

Phase 3: Update Import Paths (Week 2)
# OLD
from database.output_styles.config.business.pricing import tier_presets
# NEW
from data_fabric.definitions.config.business.pricing import tier_presets

# Find all references (the dot is escaped so it matches literally)
grep -r "output_styles\.config" data_fabric/ --include="*.py"
grep -r "from database" data_fabric/ --include="*.py"
# Automated replacement (sed -i '' is the BSD/macOS form; with GNU sed use -i without the empty argument)
find data_fabric -name "*.py" -type f -exec sed -i '' \
's/from database\.output_styles\.config/from data_fabric.definitions.config/g' {} +

Phase 4: Test & Validate (Week 2)
# Run all tests
python -m pytest data_fabric/tests/
# Validate imports
python -c "from data_fabric.definitions.config.business.pricing import tier_presets"
# Check for broken imports
python scripts/check_imports.py

Phase 5: Delete Old Structure (Week 3)
# Only after confirming everything works!
git rm -r data_fabric/output-styles/config/
git rm -r data_fabric/prompts/components/ # Already copied to definitions/templates/prompts/components/ in Phase 2
# Update .gitignore
echo "data_fabric/views/*" >> .gitignore
echo "!data_fabric/views/README.md" >> .gitignore

Special Considerations
1. kb_catalog/ - Where Does It Go?
Current Location: Top-level (unclear)
Options:
Option A: Merge into definitions/config/
definitions/
└── config/
    ├── business/            # Business rules
    ├── sports/              # Sports config
    └── system/              # NEW: System-level config
        ├── constants.py     # ← FROM kb_catalog/constants/
        └── registry.json    # ← FROM kb_catalog/registry/

Option B: Keep as definitions/catalog/
definitions/
├── config/          # Operational config
└── catalog/         # System inventory
    ├── constants/   # Enum-like data
    ├── registry/    # Component registry
    └── manifests/   # Auto-generated inventories

Recommendation: Option B if catalog is auto-generated inventory.
Rationale: Catalogs are metadata ABOUT the system, not config FOR the system.
2. storage/examples/ - Code or Data?
Current Reality: It's CODE (retriever.py, matcher.py)
Decision: Keep in weave/storage/examples/ as a code module
Clarify with README:
# weave/storage/examples/README.md
This is a **Python module** for runtime example retrieval, NOT a data directory.
Training examples live in: `data_fabric/definitions/examples/`

3. Generated Schemas - Where?
Current: schemas/generated/
Proposed: definitions/schemas/generated/
Rationale: Generated FROM canonical, so still "definitions"
Alternative View: Move to views/schemas/ since they're derived
Recommendation: Keep in definitions/schemas/generated/
- These are source code (imported by apps)
- They're checked into git (not gitignored)
- They're versioned (breaking changes matter)
4. Pipeline Stage Configs - Where?
Question: Should each pipeline stage have its own config?
Current: Global config in output-styles/config/
Recommendation: Centralized in definitions/config/
definitions/
└── config/
    ├── business/    # Domain config (pricing, scoring)
    ├── pipeline/    # Pipeline-wide settings
    │   ├── onboarding_stages.json
    │   └── validation_rules.json
    └── sports/      # Sport-specific config

Rationale:
- ✅ Single source of truth
- ✅ Easier to version
- ✅ Avoids duplication across stages
- ✅ Pipeline stages READ config, don't OWN it
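One way to picture "stages READ config, don't OWN it" is a loader that resolves a versioned file under `definitions/config/` and hands back a read-only view. This is a sketch under stated assumptions: `config_path` and `load_config` are hypothetical helpers, and the `<name>.v<version>.json` pattern is inferred from file names such as `tier_presets.v1.json`.

```python
import json
from pathlib import Path
from types import MappingProxyType


def config_path(root: Path, group: str, name: str, version: int) -> Path:
    """Build the path to a versioned config file under definitions/config/."""
    return root / "definitions" / "config" / group / f"{name}.v{version}.json"


def load_config(root: Path, group: str, name: str, version: int = 1) -> MappingProxyType:
    """Load a config file and return a read-only top-level mapping."""
    with config_path(root, group, name, version).open(encoding="utf-8") as fh:
        return MappingProxyType(json.load(fh))
```

A stage would call something like `load_config(Path("data_fabric"), "business/pricing", "tier_presets")`. Note that `MappingProxyType` blocks assignment only at the top level; nested dicts and lists remain mutable, so this is a guard rail rather than a guarantee.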
Final Recommendation Summary
✅ Organization Strategy: HYBRID
- Level 1 (Lifecycle): definitions/ → weave/ → views/
- Level 2 (Within definitions/): Domain-sliced (pricing, scoring, sports)
- Level 3 (Within views/): Scenario-sliced (onboarding, analytics)
✅ Directory Structure
data_fabric/
├── definitions/       # Lifecycle Stage 1: Source of truth
│   ├── schemas/       # Domain-organized
│   ├── config/        # Domain-organized (business, sports, pipeline)
│   ├── templates/     # Scenario-organized (prompts, contracts)
│   └── examples/      # Scenario-organized (training data)
│
├── weave/             # Lifecycle Stage 2: Runtime operations
│   ├── knowledge/     # Technical slice (AI)
│   ├── storage/       # Technical slice (persistence)
│   ├── prompts/       # Technical slice (generation)
│   ├── generators/    # Technical slice (transformation)
│   └── validators/    # Technical slice (validation)
│
├── views/             # Lifecycle Stage 3: Generated outputs
│   ├── onboarding/    # Scenario-organized
│   ├── analytics/     # Scenario-organized
│   ├── contracts/     # Scenario-organized
│   └── uploads/       # Scenario-organized
│
├── docs/              # Documentation (lifecycle-agnostic)
├── scripts/           # Utilities (lifecycle-agnostic)
└── tests/             # Testing (lifecycle-agnostic)

✅ Migration Priority
- Week 1: Create structure, copy (don't move) files
- Week 2: Update imports, test thoroughly
- Week 3: Delete old structure, update docs
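The Week 2 import check could be approached by parsing each file rather than grepping text. The sketch below uses the standard `ast` module to list imports that still point at the old package. It is not the contents of `scripts/check_imports.py`, which this document does not show; `find_stale_imports` and `OLD_PREFIX` are hypothetical names.

```python
import ast

# Old package prefix taken from the Phase 3 example imports.
OLD_PREFIX = "database.output_styles"


def find_stale_imports(source: str, old_prefix: str = OLD_PREFIX) -> list[tuple[int, str]]:
    """Return (line number, module name) for every import under old_prefix."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue  # not an import, or a relative "from . import x"
        for name in names:
            if name == old_prefix or name.startswith(old_prefix + "."):
                hits.append((node.lineno, name))
    return sorted(hits)
```

Unlike a text search, parsing ignores matches inside comments and string literals, at the cost of failing on files that do not parse.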
✅ Why This Works
| Concern | Solution |
|---|---|
| "Where does X go?" | Lifecycle first → domain/scenario second |
| "Too many directories" | Only 3 lifecycle directories (definitions, weave, views), plus docs/, scripts/, and tests/ |
| "Hard to navigate" | IDE search + clear README in each |
| "Breaking changes" | Copy-then-migrate strategy |
| "Team confusion" | Visual diagram + onboarding doc |
Teaching the System
Create: data_fabric/README.md
# Data Fabric Architecture
This directory uses a **hybrid lifecycle + domain organization**.
## Top-Level Structure
- `definitions/` - **Source of truth** (git-tracked, immutable)
- `weave/` - **Operational code** (Python modules, runtime logic)
- `views/` - **Generated outputs** (.gitignored, materialized views)
## Finding What You Need
**Looking for business rules?** → `definitions/config/business/`
**Looking for AI prompts?** → `definitions/templates/prompts/`
**Looking for schemas?** → `definitions/schemas/domain/`
**Looking for runtime code?** → `weave/{knowledge,storage,prompts}/`
**Looking for pipeline outputs?** → `views/onboarding/`
## Learn More
- [Lifecycle Guide](docs/LIFECYCLE_GUIDE.md)
- [Domain Guide](docs/DOMAIN_GUIDE.md)
- [Scenario Guide](docs/SCENARIO_GUIDE.md)

Bottom Line: Use HYBRID organization with lifecycle at the top level, domain slicing for config/schemas, and scenario slicing for workflows/examples.