Source: data_layer/docs/DATA_FABRIC_ARCHITECTURE.md
# Data Fabric Architecture - Complete System Design

Version: 2.0
Date: 2025-10-16
Status: Production Architecture
Purpose: Unified, intelligent data architecture supporting multi-storage retrieval, schema-driven validation, prompt composition, and AI-powered generation pipelines.

## Core Concept: The Complete Data Flow
```text
┌────────────────────────────────────────────────────────────────┐
│ DEFINITIONS (Source of Truth)                                  │
│ Git-tracked, version-controlled, single source of truth        │
│                                                                │
│   Schemas    Configs    Prompts    Examples    Seeds           │
│   (shape)    (values)   (instr)    (train)     (synth)         │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ WEAVE (Transformation)                                         │
│ Python modules that BUILD, COMPOSE, EMBED, GENERATE            │
│                                                                │
│   Builders             Generators            Embedders         │
│   • Prompt Composer    • Examples from       • Vector          │
│   • Schema Generator     Config                Embeddings      │
│   • Config Loader      • Pydantic from JSON  • Semantic Index  │
│                        • TypeScript from     • LangMem Sync    │
│                          JSON                                  │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ VIEWS (Materialized)                                           │
│ Multi-storage, optimized for specific access patterns          │
│                                                                │
│   PostgreSQL    LangMem     Redis         Supabase             │
│   (JSONB)       (Vector)    (Cache)       (Auth)               │
│   • Query       • RAG       • Hot Configs • User State         │
│   • Join        • Semantic  • Session     • Realtime           │
│   • Version       Search                                       │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ APPLICATION (Consumers)                                        │
│ FastAPI + LangGraph + MCP Servers + Next.js Frontend           │
│                                                                │
│   LLM Pipeline          Validation          Frontend           │
│   • Prompt with         • Pydantic          • Zod validates    │
│     embedded examples     validates backend   frontend         │
│   • Generate with       • Enforce schema    • TypeScript types │
│     constraints         • Return JSON       • UI safety        │
│   • Retrieve                                                   │
│     semantically                                               │
└────────────────────────────────────────────────────────────────┘
```

## Complete Directory Structure
```text
data_layer/                              # The unified data fabric
│
├── README.md                            # This file
├── DATA_FABRIC_ARCHITECTURE.md          # Architecture overview
│
├── definitions/                         # TIER 1: Source of Truth
│   │                                    # Git-tracked, canonical definitions
│   ├── schemas/                         # JSON Schema (canonical)
│   │   ├── canonical/                   # Draft 2020-12 JSON Schema
│   │   │   ├── contract-terms.schema.json
│   │   │   ├── questionnaire.schema.json
│   │   │   ├── tier-classification.schema.json
│   │   │   └── README.md
│   │   ├── generated/                   # AUTO-GENERATED from canonical
│   │   │   ├── pydantic/                # Python validation
│   │   │   │   ├── contract_terms.py
│   │   │   │   ├── questionnaire.py
│   │   │   │   └── __init__.py
│   │   │   ├── typescript/              # Frontend types
│   │   │   │   ├── contract-terms.ts
│   │   │   │   ├── questionnaire.ts
│   │   │   │   └── index.ts
│   │   │   ├── zod/                     # Frontend validation
│   │   │   │   ├── contract-terms.zod.ts
│   │   │   │   ├── questionnaire.zod.ts
│   │   │   │   └── index.ts
│   │   │   └── drizzle/                 # ORM schemas
│   │   │       ├── contract-terms.ts
│   │   │       ├── questionnaire.ts
│   │   │       └── index.ts
│   │   ├── generate_all.py              # Master generator script
│   │   └── README.md                    # Schema governance
│   ├── config/                          # Business configuration
│   │   ├── business/
│   │   │   ├── pricing/
│   │   │   │   ├── tier_presets.v1.json       # Tier pricing & terms
│   │   │   │   ├── combat.pricing.v1.json     # Combat vertical pricing
│   │   │   │   ├── standard.pricing.v1.json   # Standard pricing
│   │   │   │   └── README.md
│   │   │   ├── scoring/
│   │   │   │   ├── scoring_model.v1.json      # Scoring weights & thresholds
│   │   │   │   ├── tier_thresholds.v1.json
│   │   │   │   └── README.md
│   │   │   ├── rules/
│   │   │   │   ├── validation_rules.json
│   │   │   │   ├── business_logic.json
│   │   │   │   └── README.md
│   │   │   └── README.md
│   │   ├── sports/
│   │   │   ├── archetypes.json          # Sport classifications
│   │   │   ├── betting_markets.json     # Market definitions
│   │   │   ├── stat_mappings.json       # Sport-specific stats
│   │   │   └── README.md
│   │   ├── workflows/
│   │   │   ├── onboarding.config.json
│   │   │   ├── contract_generation.config.json
│   │   │   └── README.md
│   │   └── README.md                    # Config governance
│   ├── prompts/                         # Static prompt definitions
│   │   ├── templates/                   # Jinja2/Mustache templates
│   │   │   ├── onboarding/
│   │   │   │   ├── questionnaire_extraction.j2
│   │   │   │   ├── enhancement.j2
│   │   │   │   └── classification.j2
│   │   │   ├── contract/
│   │   │   │   ├── tier_1_template.j2
│   │   │   │   ├── tier_2_template.j2
│   │   │   │   └── variable_sections.j2
│   │   │   └── README.md
│   │   ├── components/                  # Reusable prompt blocks
│   │   │   ├── system_instructions/
│   │   │   │   ├── base_agent.md
│   │   │   │   ├── tier_classifier.md
│   │   │   │   └── contract_assembler.md
│   │   │   ├── few_shot_patterns/
│   │   │   │   ├── classification_pattern.md
│   │   │   │   └── extraction_pattern.md
│   │   │   ├── output_formats/
│   │   │   │   ├── json_structure.md
│   │   │   │   └── markdown_contract.md
│   │   │   └── README.md
│   │   └── README.md                    # Prompt template guide
│   ├── examples/                        # Training & reference data
│   │   ├── seeds/                       # Hand-curated golden examples
│   │   │   ├── onboarding/
│   │   │   │   ├── questionnaire-extraction.jsonl
│   │   │   │   ├── enhancement.jsonl
│   │   │   │   ├── classification.jsonl
│   │   │   │   └── tier-suggestion.jsonl
│   │   │   ├── contract-generation/
│   │   │   │   ├── tier-1-examples.jsonl
│   │   │   │   ├── tier-2-examples.jsonl
│   │   │   │   └── combat-examples.jsonl
│   │   │   └── README.md
│   │   ├── generated/                   # AUTO-GENERATED from configs
│   │   │   ├── pricing-examples.jsonl         # From tier_presets
│   │   │   ├── scoring-examples.jsonl         # From scoring_model
│   │   │   ├── sport-classification.jsonl     # From archetypes
│   │   │   └── README.md                      # Generation docs
│   │   ├── validation/                  # Edge cases & tests
│   │   │   ├── edge-cases.jsonl
│   │   │   ├── negative-examples.jsonl
│   │   │   └── README.md
│   │   └── README.md                    # Example governance
│   └── kb_catalog/                      # Business intelligence
│       ├── constants/                   # Python constants
│       │   ├── __init__.py
│       │   ├── business_rules.py        # Importable rules
│       │   ├── sport_classifications.py
│       │   ├── field_mappings.py
│       │   └── validation_rules.py
│       ├── registry/                    # Manual registries
│       │   ├── core_schemas_registry.json
│       │   ├── workflow_registry.json
│       │   └── triage_rules.json
│       ├── manifests/                   # Auto-generated catalogs
│       │   ├── agents.json              # System agent inventory
│       │   ├── tools.json               # MCP tools catalog
│       │   └── services.json            # Service registry
│       └── README.md                    # KB catalog guide
│
├── weave/                               # TIER 2: Transformation
│   │                                    # Python code for integration
│   ├── builders/                        # Composition engines
│   │   ├── prompts/
│   │   │   ├── __init__.py
│   │   │   ├── base_builder.py          # Base prompt builder
│   │   │   ├── onboarding_builder.py    # Builds onboarding prompts
│   │   │   ├── classification_builder.py  # Builds classification prompts
│   │   │   ├── contract_builder.py      # Builds contract prompts
│   │   │   └── README.md
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── pydantic_generator.py    # JSON → Pydantic
│   │   │   ├── typescript_generator.py  # JSON → TypeScript
│   │   │   ├── zod_generator.py         # JSON → Zod
│   │   │   ├── drizzle_generator.py     # JSON → Drizzle
│   │   │   └── README.md
│   │   └── examples/
│   │       ├── __init__.py
│   │       ├── config_to_examples.py    # Config → Training examples
│   │       ├── synthetic_generator.py   # Synthetic data generation
│   │       └── README.md
│   ├── embedders/                       # Vector generation
│   │   ├── __init__.py
│   │   ├── prompt_embedder.py           # Embed prompts for retrieval
│   │   ├── example_embedder.py          # Embed examples for RAG
│   │   ├── config_embedder.py           # Embed configs as knowledge
│   │   └── README.md
│   ├── retrievers/                      # Intelligent retrieval
│   │   ├── __init__.py
│   │   ├── prompt_retriever.py          # Retrieve similar prompts
│   │   ├── example_retriever.py         # Retrieve relevant examples
│   │   ├── semantic_matcher.py          # Semantic similarity
│   │   └── README.md
│   ├── knowledge/                       # Intelligence layer
│   │   ├── __init__.py
│   │   ├── intent/                      # Intent classification
│   │   │   ├── classifier.py
│   │   │   └── router.py
│   │   ├── retrieval/                   # RAG operations
│   │   │   ├── rag_engine.py
│   │   │   └── context_builder.py
│   │   └── templates/                   # Dynamic templates
│   │       ├── template_engine.py
│   │       └── variable_injector.py
│   ├── storage/                         # Multi-storage abstraction
│   │   ├── __init__.py
│   │   ├── postgres_client.py           # PostgreSQL operations
│   │   ├── langmem_client.py            # LangMem operations
│   │   ├── redis_client.py              # Redis operations
│   │   ├── supabase_client.py           # Supabase operations
│   │   └── README.md
│   └── README.md                        # Weave layer guide
│
├── views/                               # TIER 3: Materialized
│   │                                    # Generated outputs, queryable
│   ├── prompts/                         # Generated final prompts
│   │   ├── agents/
│   │   │   ├── tier-classifier.v2.md    # AUTO-GENERATED
│   │   │   ├── contract-assembler.v3.md
│   │   │   └── questionnaire-extractor.v1.md
│   │   ├── workflows/
│   │   │   ├── onboarding-workflow.v1.md
│   │   │   └── contract-generation.v2.md
│   │   └── README.md                    # Usage: Don't edit!
│   ├── onboarding/                      # Pipeline materialized views
│   │   ├── 02-ingest-validate/
│   │   │   ├── outputs/                 # Generated outputs
│   │   │   ├── cache/                   # Processed cache
│   │   │   └── README.md
│   │   ├── 06-suggest-tiers/
│   │   │   ├── outputs/
│   │   │   │   ├── tier-suggestions.json
│   │   │   │   └── scoring-results.json
│   │   │   └── README.md
│   │   └── 07-assemble-contract/
│   │       ├── outputs/
│   │       │   ├── contracts/           # Generated PDFs
│   │       │   └── markdown/            # Markdown versions
│   │       └── README.md
│   ├── embeddings/                      # Vector stores (runtime)
│   │   ├── prompt_vectors/              # Embedded prompts
│   │   ├── example_vectors/             # Embedded examples
│   │   ├── config_vectors/              # Embedded configs
│   │   └── README.md
│   └── README.md                        # Views layer guide
│
├── scripts/                             # Orchestration scripts
│   ├── sync/
│   │   ├── sync_to_postgresql.py        # Config → PostgreSQL JSONB
│   │   ├── sync_to_langmem.py           # Examples → LangMem vectors
│   │   ├── sync_to_redis.py             # Hot configs → Redis cache
│   │   └── sync_all.py                  # Master sync script
│   ├── generate/
│   │   ├── generate_schemas.py          # JSON → Pydantic/TS/Zod/Drizzle
│   │   ├── generate_examples.py         # Config → Training examples
│   │   ├── generate_prompts.py          # Components → Final prompts
│   │   └── generate_all.py              # Master generation script
│   ├── embed/
│   │   ├── embed_prompts.py             # Prompts → Vectors
│   │   ├── embed_examples.py            # Examples → Vectors
│   │   ├── embed_configs.py             # Configs → Vectors
│   │   └── embed_all.py                 # Master embedding script
│   └── README.md                        # Scripts usage guide
│
├── tests/                               # Testing infrastructure
│   ├── test_builders.py                 # Test prompt/schema builders
│   ├── test_generators.py               # Test example generation
│   ├── test_embeddings.py               # Test vector operations
│   ├── test_retrieval.py                # Test RAG pipeline
│   └── README.md
│
└── docs/                                # Documentation
    ├── ARCHITECTURE.md                  # This file (symlink)
    ├── QUICK_START.md                   # Developer onboarding
    ├── API_REFERENCE.md                 # Code API docs
    └── WORKFLOWS.md                     # Common workflows
```

## The Complete Data Flow (Your Vision Realized)
### Flow 1: Schema-Driven Validation Pipeline

```python
# 1. CANONICAL SCHEMA (definitions/schemas/canonical/)
#    contract-terms.schema.json (JSON Schema Draft 2020-12)

# 2. GENERATE VALIDATORS (weave/builders/schemas/)
#    python weave/builders/schemas/generate_all.py
#    → Creates Pydantic, TypeScript, Zod, Drizzle

# 3. BACKEND VALIDATION (Application Layer)
from data_layer.definitions.schemas.generated.pydantic import ContractTerms

contract = ContractTerms(**llm_output)  # Pydantic validates
```

```typescript
// 4. FRONTEND VALIDATION (Application Layer)
import { contractTermsSchema } from '@/data_layer/definitions/schemas/generated/zod'

const validated = contractTermsSchema.parse(apiResponse)  // Zod validates
```

### Flow 2: Config-Driven Example Generation
```python
# 1. BUSINESS CONFIG (definitions/config/business/)
#    tier_presets.v1.json contains actual pricing values

# 2. GENERATE EXAMPLES (weave/builders/examples/)
from weave.builders.examples import config_to_examples

examples = config_to_examples(
    config_path="definitions/config/business/pricing/tier_presets.v1.json",
    output_path="definitions/examples/generated/pricing-examples.jsonl"
)
# Creates 50+ training examples in JSONL format

# 3. EMBED EXAMPLES (weave/embedders/)
from weave.embedders import example_embedder

example_embedder.embed_all(
    input_path="definitions/examples/generated/pricing-examples.jsonl",
    namespace="pricing-examples"
)
# Stores in LangMem for RAG retrieval

# 4. RETRIEVE IN CONTEXT (Application Layer)
from weave.retrievers import example_retriever

relevant_examples = example_retriever.get_similar(
    query="What tier for a combat league with $2M revenue?",
    namespace="pricing-examples",
    k=5
)
# Returns the 5 most relevant examples for few-shot prompting
```

### Flow 3: Prompt Component Composition
```python
# 1. PROMPT COMPONENTS (definitions/prompts/components/)
#    system_instructions/tier_classifier.md
#    few_shot_patterns/classification_pattern.md
#    output_formats/json_structure.md

# 2. LOAD BUSINESS CONFIG (definitions/config/)
from data_layer.definitions.config.business import load_config

scoring_weights = load_config("business/scoring/scoring_model.v1.json")

# 3. BUILD DYNAMIC PROMPT (weave/builders/prompts/)
from weave.builders.prompts import classification_builder

prompt = classification_builder.build(
    components=[
        "system_instructions/tier_classifier.md",
        "few_shot_patterns/classification_pattern.md"
    ],
    config=scoring_weights,      # Inject actual weights
    examples=relevant_examples   # From retrieval
)

# 4. EMBED FOR FUTURE RETRIEVAL (weave/embedders/)
from weave.embedders import prompt_embedder

prompt_embedder.embed(
    prompt_text=prompt,
    metadata={
        "type": "classification",
        "version": "2.0",
        "config_version": scoring_weights['version']
    }
)

# 5. RETRIEVE SIMILAR PROMPTS LATER
from weave.retrievers import prompt_retriever

similar_prompts = prompt_retriever.get_similar(
    query="Need to classify a new league type",
    k=3
)
# Returns the 3 most similar historical prompts for reference
```

### Flow 4: Multi-Storage Retrieval Strategy
```python
# APPLICATION NEEDS: Get a tier recommendation with reasoning

# 1. RETRIEVE FROM REDIS (Hot Cache)
from weave.storage import redis_client

cached_tier = redis_client.get(f"tier:league:{league_id}")
if cached_tier:
    return cached_tier  # Fast path: < 5 ms

# 2. RETRIEVE FROM POSTGRESQL (Structured Query)
from weave.storage import postgres_client

tier_config = postgres_client.query("""
    SELECT config_data->'tiers'->'tier_1' AS tier_1
    FROM business_config
    WHERE config_type = 'tier_presets' AND version = 1
""")

# 3. RETRIEVE FROM LANGMEM (Semantic Search)
from weave.storage import langmem_client

relevant_examples = langmem_client.query(
    query=f"Tier recommendation for {league_characteristics}",
    namespace="pricing-examples",
    filters={"type": "tier_recommendation"},
    k=5
)

# 4. COMPOSE FINAL PROMPT WITH ALL CONTEXT
from weave.builders.prompts import classification_builder

final_prompt = classification_builder.build(
    system_instructions="tier_classifier.md",
    business_config=tier_config,              # From PostgreSQL
    few_shot_examples=relevant_examples,      # From LangMem
    output_schema=tier_classification_schema  # From definitions/schemas/
)

# 5. LLM GENERATES, with Pydantic validation
from langchain_openai import ChatOpenAI
from data_layer.definitions.schemas.generated.pydantic import TierClassification

llm = ChatOpenAI(model="gpt-4")
structured_llm = llm.with_structured_output(TierClassification)
result = structured_llm.invoke(final_prompt)
# Returns a validated Pydantic model

# 6. CACHE RESULT
redis_client.set(
    f"tier:league:{league_id}",
    result.model_dump_json(),
    ex=3600  # 1-hour TTL
)

# 7. SEND TO FRONTEND (Zod validates there)
# Frontend receives JSON, validates with the Zod schema
```

## Key Design Patterns
### Pattern 1: Single Source, Multiple Views

```text
tier_presets.v1.json (SINGLE SOURCE)
        │
        ├── PostgreSQL JSONB (queryable)
        ├── LangMem vectors (semantic)
        ├── Redis JSON (cached)
        ├── Training examples JSONL (few-shot)
        └── API response templates (runtime)
```

Benefit: Update once, propagates everywhere.
### Pattern 2: Schema-Driven Everything

```text
contract-terms.schema.json (CANONICAL)
        │
        ├── Pydantic model (backend validation)
        ├── TypeScript types (frontend types)
        ├── Zod schema (frontend validation)
        ├── Drizzle schema (database ORM)
        └── Documentation (auto-generated)
```

Benefit: Type safety across the entire stack.
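To make the generation step concrete, here is a deliberately tiny sketch of the kind of mapping a generator like `pydantic_generator.py` performs: JSON Schema scalar types become Python annotations, and non-required fields become optional. Real generators (e.g. the off-the-shelf `datamodel-code-generator` tool) additionally handle `$ref`, nesting, enums, and formats; `schema_to_fields` is a hypothetical name for illustration only.

```python
JSON_TO_PY = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}

def schema_to_fields(schema: dict) -> list[str]:
    """Render each JSON Schema property as a Python field annotation string."""
    required = set(schema.get("required", []))
    lines = []
    for name, spec in schema.get("properties", {}).items():
        py = JSON_TO_PY.get(spec.get("type"), "Any")
        if name not in required:
            py = f"{py} | None = None"  # optional field with a default
        lines.append(f"{name}: {py}")
    return lines

fields = schema_to_fields({
    "properties": {"tier": {"type": "string"}, "annual_fee": {"type": "integer"}},
    "required": ["tier"],
})
# fields -> ["tier: str", "annual_fee: int | None = None"]
```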
Pattern 3: Component-Based Prompt Assembly
# Components (small, reusable)
system_instruction = load("system_instructions/tier_classifier.md")
few_shot_pattern = load("few_shot_patterns/classification.md")
output_format = load("output_formats/json_structure.md")
# Config (actual values)
weights = load_config("business/scoring/scoring_model.v1.json")
# Examples (context)
examples = retrieve_examples(
query="tier classification",
k=5
)
# BUILD final prompt
final_prompt = compose(
system_instruction,
inject_weights(few_shot_pattern, weights),
inject_examples(few_shot_pattern, examples),
output_format
)Benefit: Prompts are dynamic, data-driven, testable
Pattern 4: Embedded Retrieval Everywhere
# Everything can be retrieved semantically:
# 1. Retrieve similar prompts
similar_prompts = retrieve_prompts(
"How to classify combat sports?",
namespace="prompts"
)
# 2. Retrieve relevant examples
relevant_examples = retrieve_examples(
"Tier 1 combat league pricing",
namespace="examples"
)
# 3. Retrieve business rules
business_context = retrieve_configs(
"Combat sports pricing rules",
namespace="business-rules"
)
# 4. Compose everything into final prompt
final_prompt = compose_with_retrieval(
query="Classify new MMA league",
prompt_template=similar_prompts[0],
examples=relevant_examples[:5],
config=business_context
)Benefit: AI has intelligent access to all knowledge
## Implementation Scripts

### Master Sync Script

```python
# scripts/sync/sync_all.py
"""
Master orchestration: SOURCE_OF_TRUTH → RUNTIME SYSTEMS
"""
import asyncio
from pathlib import Path

from weave.storage import postgres_client, langmem_client, redis_client

async def sync_all():
    """Sync everything from definitions/ to runtime."""
    print("Starting multi-storage sync...")

    # 1. Sync configs to PostgreSQL (JSONB)
    print("  Syncing to PostgreSQL...")
    await postgres_client.sync_configs(
        source_dir=Path("data_layer/definitions/config")
    )

    # 2. Sync examples to LangMem (vectors)
    print("  Syncing to LangMem...")
    await langmem_client.sync_examples(
        source_dir=Path("data_layer/definitions/examples")
    )

    # 3. Cache hot configs in Redis
    print("  Caching in Redis...")
    await redis_client.cache_hot_configs(
        configs=["tier_presets.v1", "scoring_model.v1"]
    )

    # 4. Embed prompts for retrieval
    print("  Embedding prompts...")
    from weave.embedders import prompt_embedder
    await prompt_embedder.embed_all(
        source_dir=Path("data_layer/views/prompts")
    )

    print("Sync complete!")

if __name__ == "__main__":
    asyncio.run(sync_all())
```

### Master Generation Script
```python
# scripts/generate/generate_all.py
"""
Generate all derived artifacts from SOURCE_OF_TRUTH
"""
from pathlib import Path

from weave.builders import schemas, examples, prompts

def generate_all():
    """Generate schemas, examples, and prompts."""
    print("Generating all artifacts...")

    # 1. Generate schema adapters
    print("  Generating schemas...")
    schemas.pydantic_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/pydantic")
    )
    schemas.typescript_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/typescript")
    )
    schemas.zod_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/zod")
    )
    schemas.drizzle_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/drizzle")
    )

    # 2. Generate examples from configs
    print("  Generating examples...")
    examples.config_to_examples.generate_from_configs(
        config_dir=Path("data_layer/definitions/config/business"),
        output_dir=Path("data_layer/definitions/examples/generated")
    )

    # 3. Build final prompts from components
    print("  Building prompts...")
    prompts.onboarding_builder.build_all(
        components_dir=Path("data_layer/definitions/prompts/components"),
        config_dir=Path("data_layer/definitions/config"),
        output_dir=Path("data_layer/views/prompts")
    )

    print("Generation complete!")

if __name__ == "__main__":
    generate_all()
```

## Usage Examples
### Example 1: Complete LLM Pipeline with Validation

```python
from langchain_openai import ChatOpenAI

from data_layer.weave.builders.prompts import classification_builder
from data_layer.weave.retrievers import example_retriever
from data_layer.definitions.schemas.generated.pydantic import TierClassification

async def classify_league(league_data: dict) -> TierClassification:
    """
    Complete classification pipeline:
    1. Retrieve relevant examples (embedded)
    2. Build prompt (component composition)
    3. Generate with LLM
    4. Validate with Pydantic
    5. Return type-safe result
    """
    # 1. Retrieve similar examples
    relevant_examples = await example_retriever.get_similar(
        query=f"Classify {league_data['sport']} league",
        namespace="tier-classification",
        k=5
    )

    # 2. Build prompt with components + config + examples
    prompt = classification_builder.build(
        system_instructions="tier_classifier.md",
        few_shot_examples=relevant_examples,
        config_weights=load_config("scoring_model.v1.json"),
        input_data=league_data
    )

    # 3. Generate with structured output
    llm = ChatOpenAI(model="gpt-4")
    structured_llm = llm.with_structured_output(TierClassification)
    result = structured_llm.invoke(prompt)

    # Returns a validated Pydantic model!
    return result  # Type-safe TierClassification object
```

### Example 2: Frontend Receives Validated Data
```typescript
// Frontend receives API response
import { contractTermsSchema } from '@/data_layer/definitions/schemas/generated/zod';
import type { ContractTerms } from '@/data_layer/definitions/schemas/generated/typescript';

async function fetchContract(leagueId: string): Promise<ContractTerms> {
  const response = await fetch(`/api/contracts/${leagueId}`);
  const data = await response.json();

  // Zod validates at runtime
  const validated = contractTermsSchema.parse(data);

  // TypeScript types ensure compile-time safety
  return validated; // Type: ContractTerms
}
```

### Example 3: Retrieve Prompts from Embedded Space
```python
from weave.retrievers import prompt_retriever

# Find similar prompts for a new task
similar_prompts = await prompt_retriever.get_similar(
    query="Need to extract racing event data from PDF",
    namespace="prompts",
    filters={"category": "extraction"},
    k=3
)

# Use as a reference or starting point
for prompt in similar_prompts:
    print(f"Similar prompt: {prompt.metadata['title']}")
    print(f"Similarity: {prompt.score}")
    print(f"Content preview: {prompt.content[:200]}...")
```

## Storage Strategy Matrix
| Data Type | Source Location | Generated To | Queryable Via | Use Case |
|---|---|---|---|---|
| Config Files | definitions/config/ | PostgreSQL (JSONB) | SQL queries | Business rules lookup |
| Config Files | definitions/config/ | LangMem (vectors) | Semantic search | RAG context |
| Config Files | definitions/config/ | Redis (JSON) | Key-value | Hot data cache |
| Examples | definitions/examples/seeds/ | LangMem (vectors) | Semantic search | Few-shot learning |
| Examples | definitions/examples/generated/ | LangMem (vectors) | Semantic search | Training data |
| Prompts | views/prompts/ | LangMem (vectors) | Semantic search | Prompt retrieval |
| Schemas | definitions/schemas/canonical/ | Git | File system | Single source |
| Pydantic | definitions/schemas/generated/pydantic/ | Git | Python import | Backend validation |
| Zod | definitions/schemas/generated/zod/ | Git | TypeScript import | Frontend validation |
| Drizzle | definitions/schemas/generated/drizzle/ | Git | TypeScript import | ORM operations |
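The matrix lends itself to a programmatic lookup. This is a hedged sketch: `STORAGE_MATRIX` and `backends_for` are hypothetical names encoding the table above, not an API the repo defines.

```python
# Each data type maps to (backend, access pattern) pairs, mirroring the table.
STORAGE_MATRIX = {
    "config":   [("postgresql", "sql"), ("langmem", "semantic"), ("redis", "key-value")],
    "examples": [("langmem", "semantic")],
    "prompts":  [("langmem", "semantic")],
    "schemas":  [("git", "filesystem")],
}

def backends_for(data_type: str) -> list[str]:
    """Return the backends a data type is materialized to, in matrix order."""
    return [backend for backend, _ in STORAGE_MATRIX.get(data_type, [])]

# backends_for("config") -> ["postgresql", "langmem", "redis"]
```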
## Governance & Best Practices

### Versioning Strategy

All source files use semantic versioning:

- tier_presets.v1.json → v2.json (breaking changes)
- scoring_model.v1.1.json (minor improvements)
- archetypes.v1.0.1.json (patches)

The version lives in the filename AND inside the JSON:

```json
{
  "version": "1.2.0",
  "schemaVersion": "draft-2020-12",
  "lastUpdated": "2025-10-16"
}
```
### Change Management

```shell
# When you update a config file:

# 1. Edit SOURCE_OF_TRUTH
vim data_layer/definitions/config/business/pricing/tier_presets.v1.json

# 2. Increment the version
#    "version": 5 → "version": 6

# 3. Run generators
python data_layer/scripts/generate/generate_all.py

# 4. Run sync
python data_layer/scripts/sync/sync_all.py

# 5. Verify in each system
psql -c "SELECT version FROM business_config WHERE config_type='tier_presets'"
redis-cli GET config:tier_presets:version
# Check the LangMem dashboard for new embeddings

# 6. Git commit
git add data_layer/definitions/config/business/pricing/tier_presets.v1.json
git commit -m "feat(pricing): Update tier 1 pricing to \$150k (v6)"
```

## Testing Strategy
### Unit Tests: Test Each Layer

```python
# tests/test_builders.py
def test_prompt_builder_uses_live_config():
    """Ensure prompts load actual config values."""
    from weave.builders.prompts import classification_builder

    prompt = classification_builder.build("tier_classifier")

    # Should contain actual weights from config
    assert "0.25" in prompt  # market_potential weight
    assert "0.20" in prompt  # data_quality weight

def test_example_generation_from_config():
    """Ensure generated examples match the config."""
    from weave.builders.examples import config_to_examples

    examples = config_to_examples("tier_presets.v1.json")

    # Should have an example for each tier
    assert len(examples) >= 4  # tier_1 through tier_4

    # Should contain actual pricing
    tier_1_example = [e for e in examples if 'tier_1' in str(e)][0]
    assert "$25000" in tier_1_example['output'] or 25000 in tier_1_example['output']
```

### Integration Tests: Test Data Flow
```python
# tests/test_retrieval.py
async def test_end_to_end_retrieval():
    """Test the complete retrieval flow."""
    # 1. Sync data
    from data_layer.scripts.sync import sync_all
    await sync_all.sync_all()

    # 2. Retrieve from LangMem
    from weave.retrievers import example_retriever
    examples = await example_retriever.get_similar(
        query="Tier 1 combat league",
        k=3
    )
    assert len(examples) == 3
    assert all('tier_1' in str(e) or 'combat' in str(e) for e in examples)

    # 3. Use in a prompt
    from weave.builders.prompts import classification_builder
    prompt = classification_builder.build(
        examples=examples
    )
    assert len(prompt) > 1000  # Substantial prompt
```

## Developer Workflows
### Workflow 1: Add New Business Rule

```shell
# 1. Create the config file
cat > data_layer/definitions/config/business/new_rule.v1.json << 'EOF'
{
  "version": "1.0.0",
  "rule_type": "validation",
  "rules": {
    "minimum_revenue": 100000
  }
}
EOF

# 2. Generate examples
python data_layer/scripts/generate/generate_examples.py --config=new_rule.v1.json

# 3. Sync to runtime
python data_layer/scripts/sync/sync_all.py

# 4. Verify
psql -c "SELECT * FROM business_config WHERE config_type='new_rule'"
```

### Workflow 2: Update Prompt Component
```shell
# 1. Edit the component
vim data_layer/definitions/prompts/components/system_instructions/my_agent.md

# 2. Rebuild prompts that use it
python data_layer/scripts/generate/generate_prompts.py --component=my_agent

# 3. Re-embed for retrieval
python data_layer/scripts/embed/embed_prompts.py

# 4. Test retrieval
python -c "
from weave.retrievers import prompt_retriever
prompts = prompt_retriever.get_similar('task for my_agent', k=1)
print(prompts[0].content[:200])
"
```

### Workflow 3: Add Training Example
```shell
# 1. Add to seeds (manual)
cat >> data_layer/definitions/examples/seeds/onboarding/tier-classification.jsonl << 'EOF'
{"input": "What tier for Premier Lacrosse League?", "output": "Tier 1 - High revenue, established brand", "metadata": {"tier": "tier_1", "sport": "lacrosse"}}
EOF

# 2. Embed into LangMem
python data_layer/scripts/embed/embed_examples.py --file=tier-classification.jsonl

# 3. Test retrieval
python -c "
from weave.retrievers import example_retriever
examples = example_retriever.get_similar('tier for lacrosse league', k=1)
print(examples[0].content)
"
```

## Architecture Principles
### 1. Single Source of Truth

- All canonical data lives in `definitions/`
- Never edit `views/` or runtime systems directly
- Always regenerate from source

### 2. Everything is Retrievable

- Configs → embedded for semantic search
- Examples → embedded for RAG
- Prompts → embedded for reuse
- All have metadata for filtering

### 3. Type Safety Everywhere

- JSON Schema → Pydantic (backend)
- JSON Schema → Zod (frontend)
- JSON Schema → TypeScript (types)
- JSON Schema → Drizzle (ORM)

### 4. Generation Over Duplication

- Don't copy, generate
- Don't hardcode, compose
- Don't scatter; centralize, then distribute

### 5. Multi-Storage Optimization

- PostgreSQL for structured queries
- LangMem for semantic search
- Redis for speed
- Supabase for auth/realtime
## Quick Reference Card

```text
I NEED TO...

Add pricing rule       → definitions/config/business/pricing/
Add scoring weight     → definitions/config/business/scoring/
Add JSON Schema        → definitions/schemas/canonical/
Add training example   → definitions/examples/seeds/
Add prompt component   → definitions/prompts/components/

Build prompt           → weave/builders/prompts/
Generate examples      → weave/builders/examples/
Generate Pydantic      → weave/builders/schemas/
Embed for RAG          → weave/embedders/
Retrieve examples      → weave/retrievers/

Query business rules   → views/ → PostgreSQL
Semantic search        → views/ → LangMem
Fast access            → views/ → Redis

Sync everything        → scripts/sync/sync_all.py
Generate everything    → scripts/generate/generate_all.py
Embed everything       → scripts/embed/embed_all.py
```

## Success Criteria
After full implementation:

- ✅ Discoverability: any developer finds source data in < 30 seconds
- ✅ Consistency: zero manual edits to runtime systems
- ✅ Type Safety: 100% schema coverage (Pydantic + Zod)
- ✅ Retrieval: < 100 ms semantic search across all data
- ✅ Validation: backend (Pydantic) + frontend (Zod) from the same source
- ✅ Prompts: dynamic composition with live config injection
- ✅ Examples: embedded for intelligent few-shot selection
- ✅ Caching: hot paths < 10 ms via Redis

Next Steps: See MIGRATION_GUIDE_PRACTICAL.md for step-by-step implementation.