Source: data_layer/docs/DATA_FABRIC_ARCHITECTURE.md
# Data Fabric Architecture - Complete System Design

Version: 2.0
Date: 2025-10-16
Status: Production Architecture
Purpose: Unified, intelligent data architecture supporting multi-storage retrieval, schema-driven validation, prompt composition, and AI-powered generation pipelines.

## Core Concept: The Complete Data Flow
```text
┌────────────────────────────────────────────────────────────────┐
│ DEFINITIONS (Source of Truth)                                  │
│ Git-tracked, version-controlled, single source of truth        │
│                                                                │
│   Schemas    Configs    Prompts    Examples    Seeds           │
│   (shape)    (values)   (instr)    (train)     (synth)         │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ WEAVE (Transformation)                                         │
│ Python modules that BUILD, COMPOSE, EMBED, GENERATE            │
│                                                                │
│   Builders             Generators            Embedders         │
│   • Prompt Composer    • Examples from       • Vector          │
│   • Schema Generator     Config                Embeddings      │
│   • Config Loader      • Pydantic from JSON  • Semantic Index  │
│                        • TypeScript from     • LangMem Sync    │
│                          JSON                                  │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ VIEWS (Materialized)                                           │
│ Multi-storage, optimized for specific access patterns          │
│                                                                │
│   PostgreSQL    LangMem     Redis         Supabase             │
│   (JSONB)       (Vector)    (Cache)       (Auth)               │
│   • Query       • RAG       • Hot Configs • User State         │
│   • Join        • Semantic  • Session     • Realtime           │
│   • Version       Search                                       │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│ APPLICATION (Consumers)                                        │
│ FastAPI + LangGraph + MCP Servers + Next.js Frontend           │
│                                                                │
│   LLM Pipeline          Validation          Frontend           │
│   • Prompt with         • Pydantic          • Zod validates    │
│     embedded examples     validates backend   frontend         │
│   • Generate with       • Enforce schema    • TypeScript types │
│     constraints         • Return JSON       • UI safety        │
│   • Retrieve                                                   │
│     semantically                                               │
└────────────────────────────────────────────────────────────────┘
```

## Complete Directory Structure
```text
data_layer/                              # The unified data fabric
│
├── README.md                            # This file
├── DATA_FABRIC_ARCHITECTURE.md          # Architecture overview
│
├── definitions/                         # TIER 1: Source of Truth
│   │                                    # Git-tracked, canonical definitions
│   ├── schemas/                         # JSON Schema (canonical)
│   │   ├── canonical/                   # Draft 2020-12 JSON Schema
│   │   │   ├── contract-terms.schema.json
│   │   │   ├── questionnaire.schema.json
│   │   │   ├── tier-classification.schema.json
│   │   │   └── README.md
│   │   ├── generated/                   # AUTO-GENERATED from canonical
│   │   │   ├── pydantic/                # Python validation
│   │   │   │   ├── contract_terms.py
│   │   │   │   ├── questionnaire.py
│   │   │   │   └── __init__.py
│   │   │   ├── typescript/              # Frontend types
│   │   │   │   ├── contract-terms.ts
│   │   │   │   ├── questionnaire.ts
│   │   │   │   └── index.ts
│   │   │   ├── zod/                     # Frontend validation
│   │   │   │   ├── contract-terms.zod.ts
│   │   │   │   ├── questionnaire.zod.ts
│   │   │   │   └── index.ts
│   │   │   └── drizzle/                 # ORM schemas
│   │   │       ├── contract-terms.ts
│   │   │       ├── questionnaire.ts
│   │   │       └── index.ts
│   │   ├── generate_all.py              # Master generator script
│   │   └── README.md                    # Schema governance
│   ├── config/                          # Business configuration
│   │   ├── business/
│   │   │   ├── pricing/
│   │   │   │   ├── tier_presets.v1.json       # Tier pricing & terms
│   │   │   │   ├── combat.pricing.v1.json     # Combat vertical pricing
│   │   │   │   ├── standard.pricing.v1.json   # Standard pricing
│   │   │   │   └── README.md
│   │   │   ├── scoring/
│   │   │   │   ├── scoring_model.v1.json      # Scoring weights & thresholds
│   │   │   │   ├── tier_thresholds.v1.json
│   │   │   │   └── README.md
│   │   │   ├── rules/
│   │   │   │   ├── validation_rules.json
│   │   │   │   ├── business_logic.json
│   │   │   │   └── README.md
│   │   │   └── README.md
│   │   ├── sports/
│   │   │   ├── archetypes.json          # Sport classifications
│   │   │   ├── betting_markets.json     # Market definitions
│   │   │   ├── stat_mappings.json       # Sport-specific stats
│   │   │   └── README.md
│   │   ├── workflows/
│   │   │   ├── onboarding.config.json
│   │   │   ├── contract_generation.config.json
│   │   │   └── README.md
│   │   └── README.md                    # Config governance
│   ├── prompts/                         # Static prompt definitions
│   │   ├── templates/                   # Jinja2/Mustache templates
│   │   │   ├── onboarding/
│   │   │   │   ├── questionnaire_extraction.j2
│   │   │   │   ├── enhancement.j2
│   │   │   │   └── classification.j2
│   │   │   ├── contract/
│   │   │   │   ├── tier_1_template.j2
│   │   │   │   ├── tier_2_template.j2
│   │   │   │   └── variable_sections.j2
│   │   │   └── README.md
│   │   ├── components/                  # Reusable prompt blocks
│   │   │   ├── system_instructions/
│   │   │   │   ├── base_agent.md
│   │   │   │   ├── tier_classifier.md
│   │   │   │   └── contract_assembler.md
│   │   │   ├── few_shot_patterns/
│   │   │   │   ├── classification_pattern.md
│   │   │   │   └── extraction_pattern.md
│   │   │   ├── output_formats/
│   │   │   │   ├── json_structure.md
│   │   │   │   └── markdown_contract.md
│   │   │   └── README.md
│   │   └── README.md                    # Prompt template guide
│   ├── examples/                        # Training & reference data
│   │   ├── seeds/                       # Hand-curated golden examples
│   │   │   ├── onboarding/
│   │   │   │   ├── questionnaire-extraction.jsonl
│   │   │   │   ├── enhancement.jsonl
│   │   │   │   ├── classification.jsonl
│   │   │   │   └── tier-suggestion.jsonl
│   │   │   ├── contract-generation/
│   │   │   │   ├── tier-1-examples.jsonl
│   │   │   │   ├── tier-2-examples.jsonl
│   │   │   │   └── combat-examples.jsonl
│   │   │   └── README.md
│   │   ├── generated/                   # AUTO-GENERATED from configs
│   │   │   ├── pricing-examples.jsonl         # From tier_presets
│   │   │   ├── scoring-examples.jsonl         # From scoring_model
│   │   │   ├── sport-classification.jsonl     # From archetypes
│   │   │   └── README.md                      # Generation docs
│   │   ├── validation/                  # Edge cases & tests
│   │   │   ├── edge-cases.jsonl
│   │   │   ├── negative-examples.jsonl
│   │   │   └── README.md
│   │   └── README.md                    # Example governance
│   └── kb_catalog/                      # Business intelligence
│       ├── constants/                   # Python constants
│       │   ├── __init__.py
│       │   ├── business_rules.py        # Importable rules
│       │   ├── sport_classifications.py
│       │   ├── field_mappings.py
│       │   └── validation_rules.py
│       ├── registry/                    # Manual registries
│       │   ├── core_schemas_registry.json
│       │   ├── workflow_registry.json
│       │   └── triage_rules.json
│       ├── manifests/                   # Auto-generated catalogs
│       │   ├── agents.json              # System agent inventory
│       │   ├── tools.json               # MCP tools catalog
│       │   └── services.json            # Service registry
│       └── README.md                    # KB catalog guide
│
├── weave/                               # TIER 2: Transformation
│   │                                    # Python code for integration
│   ├── builders/                        # Composition engines
│   │   ├── prompts/
│   │   │   ├── __init__.py
│   │   │   ├── base_builder.py          # Base prompt builder
│   │   │   ├── onboarding_builder.py    # Builds onboarding prompts
│   │   │   ├── classification_builder.py  # Builds classification prompts
│   │   │   ├── contract_builder.py      # Builds contract prompts
│   │   │   └── README.md
│   │   ├── schemas/
│   │   │   ├── __init__.py
│   │   │   ├── pydantic_generator.py    # JSON → Pydantic
│   │   │   ├── typescript_generator.py  # JSON → TypeScript
│   │   │   ├── zod_generator.py         # JSON → Zod
│   │   │   ├── drizzle_generator.py     # JSON → Drizzle
│   │   │   └── README.md
│   │   └── examples/
│   │       ├── __init__.py
│   │       ├── config_to_examples.py    # Config → Training examples
│   │       ├── synthetic_generator.py   # Synthetic data generation
│   │       └── README.md
│   ├── embedders/                       # Vector generation
│   │   ├── __init__.py
│   │   ├── prompt_embedder.py           # Embed prompts for retrieval
│   │   ├── example_embedder.py          # Embed examples for RAG
│   │   ├── config_embedder.py           # Embed configs as knowledge
│   │   └── README.md
│   ├── retrievers/                      # Intelligent retrieval
│   │   ├── __init__.py
│   │   ├── prompt_retriever.py          # Retrieve similar prompts
│   │   ├── example_retriever.py         # Retrieve relevant examples
│   │   ├── semantic_matcher.py          # Semantic similarity
│   │   └── README.md
│   ├── knowledge/                       # Intelligence layer
│   │   ├── __init__.py
│   │   ├── intent/                      # Intent classification
│   │   │   ├── classifier.py
│   │   │   └── router.py
│   │   ├── retrieval/                   # RAG operations
│   │   │   ├── rag_engine.py
│   │   │   └── context_builder.py
│   │   └── templates/                   # Dynamic templates
│   │       ├── template_engine.py
│   │       └── variable_injector.py
│   ├── storage/                         # Multi-storage abstraction
│   │   ├── __init__.py
│   │   ├── postgres_client.py           # PostgreSQL operations
│   │   ├── langmem_client.py            # LangMem operations
│   │   ├── redis_client.py              # Redis operations
│   │   ├── supabase_client.py           # Supabase operations
│   │   └── README.md
│   └── README.md                        # Weave layer guide
│
├── views/                               # TIER 3: Materialized
│   │                                    # Generated outputs, queryable
│   ├── prompts/                         # Generated final prompts
│   │   ├── agents/
│   │   │   ├── tier-classifier.v2.md    # AUTO-GENERATED
│   │   │   ├── contract-assembler.v3.md
│   │   │   └── questionnaire-extractor.v1.md
│   │   ├── workflows/
│   │   │   ├── onboarding-workflow.v1.md
│   │   │   └── contract-generation.v2.md
│   │   └── README.md                    # Usage: Don't edit!
│   ├── onboarding/                      # Pipeline materialized views
│   │   ├── 02-ingest-validate/
│   │   │   ├── outputs/                 # Generated outputs
│   │   │   ├── cache/                   # Processed cache
│   │   │   └── README.md
│   │   ├── 06-suggest-tiers/
│   │   │   ├── outputs/
│   │   │   │   ├── tier-suggestions.json
│   │   │   │   └── scoring-results.json
│   │   │   └── README.md
│   │   └── 07-assemble-contract/
│   │       ├── outputs/
│   │       │   ├── contracts/           # Generated PDFs
│   │       │   └── markdown/            # Markdown versions
│   │       └── README.md
│   ├── embeddings/                      # Vector stores (runtime)
│   │   ├── prompt_vectors/              # Embedded prompts
│   │   ├── example_vectors/             # Embedded examples
│   │   ├── config_vectors/              # Embedded configs
│   │   └── README.md
│   └── README.md                        # Views layer guide
│
├── scripts/                             # Orchestration scripts
│   ├── sync/
│   │   ├── sync_to_postgresql.py        # Config → PostgreSQL JSONB
│   │   ├── sync_to_langmem.py           # Examples → LangMem vectors
│   │   ├── sync_to_redis.py             # Hot configs → Redis cache
│   │   └── sync_all.py                  # Master sync script
│   ├── generate/
│   │   ├── generate_schemas.py          # JSON → Pydantic/TS/Zod/Drizzle
│   │   ├── generate_examples.py         # Config → Training examples
│   │   ├── generate_prompts.py          # Components → Final prompts
│   │   └── generate_all.py              # Master generation script
│   ├── embed/
│   │   ├── embed_prompts.py             # Prompts → Vectors
│   │   ├── embed_examples.py            # Examples → Vectors
│   │   ├── embed_configs.py             # Configs → Vectors
│   │   └── embed_all.py                 # Master embedding script
│   └── README.md                        # Scripts usage guide
│
├── tests/                               # Testing infrastructure
│   ├── test_builders.py                 # Test prompt/schema builders
│   ├── test_generators.py               # Test example generation
│   ├── test_embeddings.py               # Test vector operations
│   ├── test_retrieval.py                # Test RAG pipeline
│   └── README.md
│
└── docs/                                # Documentation
    ├── ARCHITECTURE.md                  # This file (symlink)
    ├── QUICK_START.md                   # Developer onboarding
    ├── API_REFERENCE.md                 # Code API docs
    └── WORKFLOWS.md                     # Common workflows
```

## The Complete Data Flow (Your Vision Realized)
### Flow 1: Schema-Driven Validation Pipeline

```python
# 1. CANONICAL SCHEMA (definitions/schemas/canonical/)
#    contract-terms.schema.json (JSON Schema Draft 2020-12)

# 2. GENERATE VALIDATORS (weave/builders/schemas/)
#    python weave/builders/schemas/generate_all.py
#    → Creates Pydantic, TypeScript, Zod, Drizzle

# 3. BACKEND VALIDATION (Application Layer)
from data_layer.definitions.schemas.generated.pydantic import ContractTerms

contract = ContractTerms(**llm_output)  # Pydantic validates
```

```typescript
// 4. FRONTEND VALIDATION (Application Layer)
import { contractTermsSchema } from '@/data_layer/definitions/schemas/generated/zod'

const validated = contractTermsSchema.parse(apiResponse)  // Zod validates
```

### Flow 2: Config-Driven Example Generation
```python
# 1. BUSINESS CONFIG (definitions/config/business/)
#    tier_presets.v1.json contains actual pricing values

# 2. GENERATE EXAMPLES (weave/builders/examples/)
from weave.builders.examples import config_to_examples

examples = config_to_examples(
    config_path="definitions/config/business/pricing/tier_presets.v1.json",
    output_path="definitions/examples/generated/pricing-examples.jsonl"
)
# Creates 50+ training examples in JSONL format

# 3. EMBED EXAMPLES (weave/embedders/)
from weave.embedders import example_embedder

example_embedder.embed_all(
    input_path="definitions/examples/generated/pricing-examples.jsonl",
    namespace="pricing-examples"
)
# Stores in LangMem for RAG retrieval

# 4. RETRIEVE IN CONTEXT (Application Layer)
from weave.retrievers import example_retriever

relevant_examples = example_retriever.get_similar(
    query="What tier for a combat league with $2M revenue?",
    namespace="pricing-examples",
    k=5
)
# Returns the 5 most relevant examples for few-shot prompting
```

### Flow 3: Prompt Component Composition
```python
# 1. PROMPT COMPONENTS (definitions/prompts/components/)
#    system_instructions/tier_classifier.md
#    few_shot_patterns/classification_pattern.md
#    output_formats/json_structure.md

# 2. LOAD BUSINESS CONFIG (definitions/config/)
from data_layer.definitions.config.business import load_config

scoring_weights = load_config("business/scoring/scoring_model.v1.json")

# 3. BUILD DYNAMIC PROMPT (weave/builders/prompts/)
from weave.builders.prompts import classification_builder

prompt = classification_builder.build(
    components=[
        "system_instructions/tier_classifier.md",
        "few_shot_patterns/classification_pattern.md"
    ],
    config=scoring_weights,      # Inject actual weights
    examples=relevant_examples   # From retrieval
)

# 4. EMBED FOR FUTURE RETRIEVAL (weave/embedders/)
from weave.embedders import prompt_embedder

prompt_embedder.embed(
    prompt_text=prompt,
    metadata={
        "type": "classification",
        "version": "2.0",
        "config_version": scoring_weights['version']
    }
)

# 5. RETRIEVE SIMILAR PROMPTS LATER
from weave.retrievers import prompt_retriever

similar_prompts = prompt_retriever.get_similar(
    query="Need to classify a new league type",
    k=3
)
# Returns the 3 most similar historical prompts for reference
```

### Flow 4: Multi-Storage Retrieval Strategy
```python
# APPLICATION NEEDS: Get a tier recommendation with reasoning

# 1. RETRIEVE FROM REDIS (Hot Cache)
from weave.storage import redis_client

cached_tier = redis_client.get(f"tier:league:{league_id}")
if cached_tier:
    return cached_tier  # Fast path: < 5 ms

# 2. RETRIEVE FROM POSTGRESQL (Structured Query)
from weave.storage import postgres_client

tier_config = postgres_client.query("""
    SELECT config_data->'tiers'->'tier_1' AS tier_1
    FROM business_config
    WHERE config_type = 'tier_presets' AND version = 1
""")

# 3. RETRIEVE FROM LANGMEM (Semantic Search)
from weave.storage import langmem_client

relevant_examples = langmem_client.query(
    query=f"Tier recommendation for {league_characteristics}",
    namespace="pricing-examples",
    filters={"type": "tier_recommendation"},
    k=5
)

# 4. COMPOSE FINAL PROMPT WITH ALL CONTEXT
from weave.builders.prompts import classification_builder

final_prompt = classification_builder.build(
    system_instructions="tier_classifier.md",
    business_config=tier_config,              # From PostgreSQL
    few_shot_examples=relevant_examples,      # From LangMem
    output_schema=tier_classification_schema  # From definitions/schemas/
)

# 5. LLM GENERATES, with Pydantic validation
from langchain_openai import ChatOpenAI
from data_layer.definitions.schemas.generated.pydantic import TierClassification

llm = ChatOpenAI(model="gpt-4")
structured_llm = llm.with_structured_output(TierClassification)
result = structured_llm.invoke(final_prompt)
# Returns a validated Pydantic model

# 6. CACHE RESULT
redis_client.set(
    f"tier:league:{league_id}",
    result.model_dump_json(),
    ex=3600  # 1-hour TTL
)

# 7. SEND TO FRONTEND (Zod validates there)
# Frontend receives JSON, validates with the Zod schema
```

## Key Design Patterns
### Pattern 1: Single Source, Multiple Views

```text
tier_presets.v1.json (SINGLE SOURCE)
        │
        ├── PostgreSQL JSONB (queryable)
        ├── LangMem vectors (semantic)
        ├── Redis JSON (cached)
        ├── Training examples JSONL (few-shot)
        └── API response templates (runtime)
```

Benefit: Update once, propagates everywhere.
### Pattern 2: Schema-Driven Everything

```text
contract-terms.schema.json (CANONICAL)
        │
        ├── Pydantic model (backend validation)
        ├── TypeScript types (frontend types)
        ├── Zod schema (frontend validation)
        ├── Drizzle schema (database ORM)
        └── Documentation (auto-generated)
```

Benefit: Type safety across the entire stack.
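To make the generation step concrete, here is a deliberately tiny sketch of the kind of mapping a generator like `pydantic_generator.py` performs: JSON Schema scalar types become Python annotations, and non-required fields become optional. Real generators (e.g. the off-the-shelf `datamodel-code-generator` tool) additionally handle `$ref`, nesting, enums, and formats; `schema_to_fields` is a hypothetical name for illustration only.

```python
JSON_TO_PY = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}

def schema_to_fields(schema: dict) -> list[str]:
    """Render each JSON Schema property as a Python field annotation string."""
    required = set(schema.get("required", []))
    lines = []
    for name, spec in schema.get("properties", {}).items():
        py = JSON_TO_PY.get(spec.get("type"), "Any")
        if name not in required:
            py = f"{py} | None = None"  # optional field with a default
        lines.append(f"{name}: {py}")
    return lines

fields = schema_to_fields({
    "properties": {"tier": {"type": "string"}, "annual_fee": {"type": "integer"}},
    "required": ["tier"],
})
# fields -> ["tier: str", "annual_fee: int | None = None"]
```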
Pattern 3: Component-Based Prompt Assembly
# Components (small, reusable)
system_instruction = load("system_instructions/tier_classifier.md")
few_shot_pattern = load("few_shot_patterns/classification.md")
output_format = load("output_formats/json_structure.md")
# Config (actual values)
weights = load_config("business/scoring/scoring_model.v1.json")
# Examples (context)
examples = retrieve_examples(
query="tier classification",
k=5
)
# BUILD final prompt
final_prompt = compose(
system_instruction,
inject_weights(few_shot_pattern, weights),
inject_examples(few_shot_pattern, examples),
output_format
)Benefit: Prompts are dynamic, data-driven, testable
Pattern 4: Embedded Retrieval Everywhere
# Everything can be retrieved semantically:
# 1. Retrieve similar prompts
similar_prompts = retrieve_prompts(
"How to classify combat sports?",
namespace="prompts"
)
# 2. Retrieve relevant examples
relevant_examples = retrieve_examples(
"Tier 1 combat league pricing",
namespace="examples"
)
# 3. Retrieve business rules
business_context = retrieve_configs(
"Combat sports pricing rules",
namespace="business-rules"
)
# 4. Compose everything into final prompt
final_prompt = compose_with_retrieval(
query="Classify new MMA league",
prompt_template=similar_prompts[0],
examples=relevant_examples[:5],
config=business_context
)Benefit: AI has intelligent access to all knowledge
## Implementation Scripts

### Master Sync Script

```python
# scripts/sync/sync_all.py
"""
Master orchestration: SOURCE_OF_TRUTH → RUNTIME SYSTEMS
"""
import asyncio
from pathlib import Path

from weave.storage import postgres_client, langmem_client, redis_client

async def sync_all():
    """Sync everything from definitions/ to runtime."""
    print("Starting multi-storage sync...")

    # 1. Sync configs to PostgreSQL (JSONB)
    print("  Syncing to PostgreSQL...")
    await postgres_client.sync_configs(
        source_dir=Path("data_layer/definitions/config")
    )

    # 2. Sync examples to LangMem (vectors)
    print("  Syncing to LangMem...")
    await langmem_client.sync_examples(
        source_dir=Path("data_layer/definitions/examples")
    )

    # 3. Cache hot configs in Redis
    print("  Caching in Redis...")
    await redis_client.cache_hot_configs(
        configs=["tier_presets.v1", "scoring_model.v1"]
    )

    # 4. Embed prompts for retrieval
    print("  Embedding prompts...")
    from weave.embedders import prompt_embedder
    await prompt_embedder.embed_all(
        source_dir=Path("data_layer/views/prompts")
    )

    print("Sync complete!")

if __name__ == "__main__":
    asyncio.run(sync_all())
```

### Master Generation Script
```python
# scripts/generate/generate_all.py
"""
Generate all derived artifacts from SOURCE_OF_TRUTH
"""
from pathlib import Path

from weave.builders import schemas, examples, prompts

def generate_all():
    """Generate schemas, examples, and prompts."""
    print("Generating all artifacts...")

    # 1. Generate schema adapters
    print("  Generating schemas...")
    schemas.pydantic_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/pydantic")
    )
    schemas.typescript_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/typescript")
    )
    schemas.zod_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/zod")
    )
    schemas.drizzle_generator.generate_all(
        source=Path("data_layer/definitions/schemas/canonical"),
        output=Path("data_layer/definitions/schemas/generated/drizzle")
    )

    # 2. Generate examples from configs
    print("  Generating examples...")
    examples.config_to_examples.generate_from_configs(
        config_dir=Path("data_layer/definitions/config/business"),
        output_dir=Path("data_layer/definitions/examples/generated")
    )

    # 3. Build final prompts from components
    print("  Building prompts...")
    prompts.onboarding_builder.build_all(
        components_dir=Path("data_layer/definitions/prompts/components"),
        config_dir=Path("data_layer/definitions/config"),
        output_dir=Path("data_layer/views/prompts")
    )

    print("Generation complete!")

if __name__ == "__main__":
    generate_all()
```

## Usage Examples
### Example 1: Complete LLM Pipeline with Validation

```python
from langchain_openai import ChatOpenAI

from data_layer.weave.builders.prompts import classification_builder
from data_layer.weave.retrievers import example_retriever
from data_layer.definitions.schemas.generated.pydantic import TierClassification

async def classify_league(league_data: dict) -> TierClassification:
    """
    Complete classification pipeline:
    1. Retrieve relevant examples (embedded)
    2. Build prompt (component composition)
    3. Generate with LLM
    4. Validate with Pydantic
    5. Return type-safe result
    """
    # 1. Retrieve similar examples
    relevant_examples = await example_retriever.get_similar(
        query=f"Classify {league_data['sport']} league",
        namespace="tier-classification",
        k=5
    )

    # 2. Build prompt with components + config + examples
    prompt = classification_builder.build(
        system_instructions="tier_classifier.md",
        few_shot_examples=relevant_examples,
        config_weights=load_config("scoring_model.v1.json"),
        input_data=league_data
    )

    # 3. Generate with structured output
    llm = ChatOpenAI(model="gpt-4")
    structured_llm = llm.with_structured_output(TierClassification)
    result = structured_llm.invoke(prompt)

    # Returns a validated Pydantic model!
    return result  # Type-safe TierClassification object
```

### Example 2: Frontend Receives Validated Data
```typescript
// Frontend receives API response
import { contractTermsSchema } from '@/data_layer/definitions/schemas/generated/zod';
import type { ContractTerms } from '@/data_layer/definitions/schemas/generated/typescript';

async function fetchContract(leagueId: string): Promise<ContractTerms> {
  const response = await fetch(`/api/contracts/${leagueId}`);
  const data = await response.json();

  // Zod validates at runtime
  const validated = contractTermsSchema.parse(data);

  // TypeScript types ensure compile-time safety
  return validated; // Type: ContractTerms
}
```

### Example 3: Retrieve Prompts from Embedded Space
```python
from weave.retrievers import prompt_retriever

# Find similar prompts for a new task
similar_prompts = await prompt_retriever.get_similar(
    query="Need to extract racing event data from PDF",
    namespace="prompts",
    filters={"category": "extraction"},
    k=3
)

# Use as a reference or starting point
for prompt in similar_prompts:
    print(f"Similar prompt: {prompt.metadata['title']}")
    print(f"Similarity: {prompt.score}")
    print(f"Content preview: {prompt.content[:200]}...")
```

## Storage Strategy Matrix
| Data Type | Source Location | Generated To | Queryable Via | Use Case |
|---|---|---|---|---|
| Config Files | definitions/config/ | PostgreSQL (JSONB) | SQL queries | Business rules lookup |
| Config Files | definitions/config/ | LangMem (vectors) | Semantic search | RAG context |
| Config Files | definitions/config/ | Redis (JSON) | Key-value | Hot data cache |
| Examples | definitions/examples/seeds/ | LangMem (vectors) | Semantic search | Few-shot learning |
| Examples | definitions/examples/generated/ | LangMem (vectors) | Semantic search | Training data |
| Prompts | views/prompts/ | LangMem (vectors) | Semantic search | Prompt retrieval |
| Schemas | definitions/schemas/canonical/ | Git | File system | Single source |
| Pydantic | definitions/schemas/generated/pydantic/ | Git | Python import | Backend validation |
| Zod | definitions/schemas/generated/zod/ | Git | TypeScript import | Frontend validation |
| Drizzle | definitions/schemas/generated/drizzle/ | Git | TypeScript import | ORM operations |
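The matrix lends itself to a programmatic lookup. This is a hedged sketch: `STORAGE_MATRIX` and `backends_for` are hypothetical names encoding the table above, not an API the repo defines.

```python
# Each data type maps to (backend, access pattern) pairs, mirroring the table.
STORAGE_MATRIX = {
    "config":   [("postgresql", "sql"), ("langmem", "semantic"), ("redis", "key-value")],
    "examples": [("langmem", "semantic")],
    "prompts":  [("langmem", "semantic")],
    "schemas":  [("git", "filesystem")],
}

def backends_for(data_type: str) -> list[str]:
    """Return the backends a data type is materialized to, in matrix order."""
    return [backend for backend, _ in STORAGE_MATRIX.get(data_type, [])]

# backends_for("config") -> ["postgresql", "langmem", "redis"]
```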
## Governance & Best Practices

### Versioning Strategy

All source files use semantic versioning:

- tier_presets.v1.json → v2.json (breaking changes)
- scoring_model.v1.1.json (minor improvements)
- archetypes.v1.0.1.json (patches)

The version lives in the filename AND inside the JSON:

```json
{
  "version": "1.2.0",
  "schemaVersion": "draft-2020-12",
  "lastUpdated": "2025-10-16"
}
```
### Change Management

```shell
# When you update a config file:

# 1. Edit SOURCE_OF_TRUTH
vim data_layer/definitions/config/business/pricing/tier_presets.v1.json

# 2. Increment the version
#    "version": 5 → "version": 6

# 3. Run generators
python data_layer/scripts/generate/generate_all.py

# 4. Run sync
python data_layer/scripts/sync/sync_all.py

# 5. Verify in each system
psql -c "SELECT version FROM business_config WHERE config_type='tier_presets'"
redis-cli GET config:tier_presets:version
# Check the LangMem dashboard for new embeddings

# 6. Git commit
git add data_layer/definitions/config/business/pricing/tier_presets.v1.json
git commit -m "feat(pricing): Update tier 1 pricing to \$150k (v6)"
```

## Testing Strategy
### Unit Tests: Test Each Layer

```python
# tests/test_builders.py
def test_prompt_builder_uses_live_config():
    """Ensure prompts load actual config values."""
    from weave.builders.prompts import classification_builder

    prompt = classification_builder.build("tier_classifier")

    # Should contain actual weights from config
    assert "0.25" in prompt  # market_potential weight
    assert "0.20" in prompt  # data_quality weight

def test_example_generation_from_config():
    """Ensure generated examples match the config."""
    from weave.builders.examples import config_to_examples

    examples = config_to_examples("tier_presets.v1.json")

    # Should have an example for each tier
    assert len(examples) >= 4  # tier_1 through tier_4

    # Should contain actual pricing
    tier_1_example = [e for e in examples if 'tier_1' in str(e)][0]
    assert "$25000" in tier_1_example['output'] or 25000 in tier_1_example['output']
```

### Integration Tests: Test Data Flow
```python
# tests/test_retrieval.py
async def test_end_to_end_retrieval():
    """Test the complete retrieval flow."""
    # 1. Sync data
    from data_layer.scripts.sync import sync_all
    await sync_all.sync_all()

    # 2. Retrieve from LangMem
    from weave.retrievers import example_retriever
    examples = await example_retriever.get_similar(
        query="Tier 1 combat league",
        k=3
    )
    assert len(examples) == 3
    assert all('tier_1' in str(e) or 'combat' in str(e) for e in examples)

    # 3. Use in a prompt
    from weave.builders.prompts import classification_builder
    prompt = classification_builder.build(
        examples=examples
    )
    assert len(prompt) > 1000  # Substantial prompt
```

## Developer Workflows
### Workflow 1: Add New Business Rule

```shell
# 1. Create the config file
cat > data_layer/definitions/config/business/new_rule.v1.json << 'EOF'
{
  "version": "1.0.0",
  "rule_type": "validation",
  "rules": {
    "minimum_revenue": 100000
  }
}
EOF

# 2. Generate examples
python data_layer/scripts/generate/generate_examples.py --config=new_rule.v1.json

# 3. Sync to runtime
python data_layer/scripts/sync/sync_all.py

# 4. Verify
psql -c "SELECT * FROM business_config WHERE config_type='new_rule'"
```

### Workflow 2: Update Prompt Component
```shell
# 1. Edit the component
vim data_layer/definitions/prompts/components/system_instructions/my_agent.md

# 2. Rebuild prompts that use it
python data_layer/scripts/generate/generate_prompts.py --component=my_agent

# 3. Re-embed for retrieval
python data_layer/scripts/embed/embed_prompts.py

# 4. Test retrieval
python -c "
from weave.retrievers import prompt_retriever
prompts = prompt_retriever.get_similar('task for my_agent', k=1)
print(prompts[0].content[:200])
"
```

### Workflow 3: Add Training Example
```shell
# 1. Add to seeds (manual)
cat >> data_layer/definitions/examples/seeds/onboarding/tier-classification.jsonl << 'EOF'
{"input": "What tier for Premier Lacrosse League?", "output": "Tier 1 - High revenue, established brand", "metadata": {"tier": "tier_1", "sport": "lacrosse"}}
EOF

# 2. Embed into LangMem
python data_layer/scripts/embed/embed_examples.py --file=tier-classification.jsonl

# 3. Test retrieval
python -c "
from weave.retrievers import example_retriever
examples = example_retriever.get_similar('tier for lacrosse league', k=1)
print(examples[0].content)
"
```

## Architecture Principles
### 1. Single Source of Truth

- All canonical data lives in `definitions/`
- Never edit `views/` or runtime systems directly
- Always regenerate from source

### 2. Everything is Retrievable

- Configs → embedded for semantic search
- Examples → embedded for RAG
- Prompts → embedded for reuse
- All have metadata for filtering

### 3. Type Safety Everywhere

- JSON Schema → Pydantic (backend)
- JSON Schema → Zod (frontend)
- JSON Schema → TypeScript (types)
- JSON Schema → Drizzle (ORM)

### 4. Generation Over Duplication

- Don't copy, generate
- Don't hardcode, compose
- Don't scatter; centralize, then distribute

### 5. Multi-Storage Optimization

- PostgreSQL for structured queries
- LangMem for semantic search
- Redis for speed
- Supabase for auth/realtime
## Quick Reference Card

```text
I NEED TO...

Add pricing rule       → definitions/config/business/pricing/
Add scoring weight     → definitions/config/business/scoring/
Add JSON Schema        → definitions/schemas/canonical/
Add training example   → definitions/examples/seeds/
Add prompt component   → definitions/prompts/components/

Build prompt           → weave/builders/prompts/
Generate examples      → weave/builders/examples/
Generate Pydantic      → weave/builders/schemas/
Embed for RAG          → weave/embedders/
Retrieve examples      → weave/retrievers/

Query business rules   → views/ → PostgreSQL
Semantic search        → views/ → LangMem
Fast access            → views/ → Redis

Sync everything        → scripts/sync/sync_all.py
Generate everything    → scripts/generate/generate_all.py
Embed everything       → scripts/embed/embed_all.py
```

## Success Criteria
After full implementation:

- ✅ Discoverability: any developer finds source data in < 30 seconds
- ✅ Consistency: zero manual edits to runtime systems
- ✅ Type Safety: 100% schema coverage (Pydantic + Zod)
- ✅ Retrieval: < 100 ms semantic search across all data
- ✅ Validation: backend (Pydantic) + frontend (Zod) from the same source
- ✅ Prompts: dynamic composition with live config injection
- ✅ Examples: embedded for intelligent few-shot selection
- ✅ Caching: hot paths < 10 ms via Redis

Next Steps: See MIGRATION_GUIDE_PRACTICAL.md for step-by-step implementation.