Source: data_layer/docs/DATA_ARCHITECTURE_GUIDE.md

Data Architecture Guide

🏗️ Three-Tier Data System

Overview

┌─────────────────────────────────────────────────────────────┐
│                    DEVELOPMENT LAYER                         │
│              database/schemas/seeds/                         │
│           Individual JSON files (240+ files)                 │
│        Purpose: Testing, dev, version control               │
└─────────────────────────────────────────────────────────────┘
                            ↓
                    (Optional Migration)
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                    PRODUCTION LAYER                          │
│            database/storage/examples/                        │
│              JSONL files → Prisma DB                        │
│    Purpose: AI few-shot learning, retrieval, caching       │
└─────────────────────────────────────────────────────────────┘
                            ↓
                    (Query Interface)
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                   APPLICATION LAYER                          │
│         Intelligent Retrieval API + Embeddings              │
│    Purpose: Production queries with semantic matching       │
└─────────────────────────────────────────────────────────────┘

📁 Directory Structure

1. Development Seeds (`database/schemas/seeds/`)

Purpose: Granular, version-controlled test data

seeds/
├── leagues/          # 101 individual league definitions
├── questionnaires/   # 53 questionnaire examples
├── schemas/          # 105 schema seed definitions
├── samples/          # 5 small sample datasets
└── synthetic_emails/ # 34 generated test emails

When to Use:

✅ Development and testing
✅ Bootstrapping new tests
✅ Manual data curation
✅ Version control of examples
❌ Production queries (too slow)

Access Pattern:

from database.schemas.seeds import load_seed
league = load_seed("leagues/mltt.seed.json")

2. Production Few-Shot KB (`database/storage/examples/`)

Purpose: High-performance few-shot learning system

examples/
├── data/                          # JSONL source files (11 files)
│   ├── triage.jsonl              # Email classification
│   ├── contract-generation.jsonl # Contract examples
│   ├── questionnaires.jsonl      # Questionnaire processing
│   ├── league_examples.jsonl     # League data examples
│   ├── schema_definitions.jsonl  # Schema examples
│   ├── legacy_seeds.jsonl        # ⚠️  Migrated historical data
│   └── ...
├── api.py                         # High-level API
├── retriever.py                   # Core retrieval logic
├── matcher.py                     # Semantic similarity
└── cache.py                       # LRU caching

When to Use:

✅ Production AI queries
✅ Semantic similarity matching
✅ Context-aware example selection
✅ High-performance retrieval
❌ Manual editing (edit JSONL then reseed)

Access Pattern:

from database.storage.examples import FewShotExamplesAPI
 
api = FewShotExamplesAPI()
examples = await api.get_examples_for_prompt(
    prompt_text="Premium basketball partnership...",
    prompt_type="triage",
    max_examples=3
)

3. Schema Examples (`database/schemas/examples/`)

Purpose: Reserved for future schema documentation

examples/
└── few_shot/    # Empty - reserved for schema patterns

Potential Future Uses:

Schema definition examples
Documentation of schema compliance
Pattern libraries

🔄 Data Lifecycle

Path 1: Development Data (Most Common)

1. Create/edit seed file
   └─> database/schemas/seeds/leagues/new-league.seed.json

2. Use in development
   └─> load_seed("leagues/new-league.seed.json")

3. Version control
   └─> git add + commit

Path 2: Production Examples (AI/ML Use)

1. Curate examples
   └─> Edit database/storage/examples/data/triage.jsonl

2. Seed database
   └─> uv run python scripts/seed.examples.py --category triage

3. Query in production
   └─> FewShotExamplesAPI().get_examples_for_prompt(...)

4. Track usage & improve
   └─> Analytics update quality_score and usage_count

Path 3: Migration (Seeds → Production)

1. Identify high-quality seed
   └─> Review database/schemas/seeds/leagues/example.json

2. Convert to JSONL format
   └─> Add to appropriate .jsonl file with metadata:
       {
         "id": "unique_id",
         "scenario": "partnership",
         "sport": "basketball",
         "tier": "premium",
         "quality_score": 0.95,
         "input": {...},
         "output": {...}
       }

3. Seed to database
   └─> uv run python scripts/seed.examples.py

4. Keep original seed
   └─> Don't delete from seeds/ (different purpose)

🎯 Decision Matrix

Question	Use Seeds	Use Examples	Use Both
Running unit tests?	✅	❌	-
Development iteration?	✅	❌	-
Production AI queries?	❌	✅	-
Semantic search needed?	❌	✅	-
Manual data curation?	✅	Edit JSONL	-
Version control examples?	✅	✅ JSONL	-
Need fast retrieval?	❌	✅	-
Bootstrap new feature?	✅	Maybe	✅

🛠️ Common Workflows

Workflow 1: Add Test Data

# 1. Create seed file
cat > database/schemas/seeds/leagues/new-league.seed.json << EOF
{
  "league_id": "new-001",
  "name": "New League",
  ...
}
EOF
 
# 2. Use in tests
python -c "from database.schemas.seeds import load_seed; print(load_seed('leagues/new-league.seed.json'))"
 
# 3. Commit
git add database/schemas/seeds/leagues/new-league.seed.json
git commit -m "feat: add new league test data"

Workflow 2: Add Production Example

# 1. Edit JSONL (source of truth)
vim database/storage/examples/data/triage.jsonl
# Add line: {"id": "triage_new", "scenario": "partnership", ...}
 
# 2. Seed database
uv run python scripts/seed.examples.py --category triage
 
# 3. Verify in production code
python -c "
from database.storage.examples import FewShotExamplesAPI
api = FewShotExamplesAPI()
# Test retrieval
"
 
# 4. Commit JSONL
git add database/storage/examples/data/triage.jsonl
git commit -m "feat: add new triage example"

Workflow 3: Update Example Quality

# 1. Edit JSONL file
vim database/storage/examples/data/triage.jsonl
# Update quality_score or fix issues
 
# 2. Reseed database (upserts existing)
uv run python scripts/seed.examples.py --category triage
 
# 3. Verify
psql $DATABASE_URL -c "SELECT example_id, quality_score FROM \"FewShotExample\" WHERE example_id = 'triage_001';"

⚠️ Important Conventions

DO ✅

Edit JSONL files as source of truth for production examples
Use seed script to populate database (never edit DB directly)
Query via Prisma/API for production use
Keep seeds for testing even if migrated to examples
Track usage in database for quality improvement
Version both seeds and JSONL files in git

DON'T ❌

Don't edit database directly - edit JSONL then reseed
Don't query JSONL in production - use API/Prisma
Don't delete seeds when migrating to examples
Don't mix purposes - keep layers separate
Don't skip seed script - maintains consistency
Don't create duplicate files - use versioning instead

🔧 Maintenance Tasks

Weekly

# Check example quality
psql $DATABASE_URL -c "
SELECT category, COUNT(*), AVG(quality_score), AVG(usage_count)
FROM \"FewShotExample\"
GROUP BY category
ORDER BY AVG(quality_score) ASC;
"
 
# Identify unused examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score, usage_count
FROM \"FewShotExample\"
WHERE usage_count = 0
  AND created_at < NOW() - INTERVAL '30 days'
LIMIT 10;
"

Monthly

# Review and improve low-quality examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score
FROM \"FewShotExample\"
WHERE quality_score < 0.70
ORDER BY usage_count DESC
LIMIT 20;
"
 
# Update JSONL files based on findings
# Then reseed: uv run python scripts/seed.examples.py

Quarterly

# Archive old legacy data if no longer needed
# Review database/storage/examples/data/legacy_seeds.jsonl
# Consider removing if fully migrated
 
# Audit for duplicates
python database/scripts/audit_duplicates.py

📊 Metrics & Analytics

Production Health

-- Example usage by category
SELECT 
    category,
    COUNT(*) as total,
    AVG(quality_score) as avg_quality,
    AVG(usage_count) as avg_usage,
    MAX(usage_count) as max_usage
FROM "FewShotExample"
GROUP BY category
ORDER BY avg_usage DESC;
 
-- Find high-value examples
SELECT 
    example_id,
    category,
    quality_score,
    usage_count,
    (quality_score * LN(usage_count + 1)) as value_score
FROM "FewShotExample"
WHERE usage_count > 0
ORDER BY value_score DESC
LIMIT 20;

🐛 Troubleshooting

Seeds not loading

# Check file exists
ls -lh database/schemas/seeds/leagues/mltt.seed.json
 
# Check format
python -c "import json; json.load(open('database/schemas/seeds/leagues/mltt.seed.json'))"

Examples not retrieving

# Check database connection
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\";"
 
# Check if seeded
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\" WHERE category = 'triage';"
 
# Reseed if needed
uv run python scripts/seed.examples.py --clear --category triage

Cache not working

from database.storage.examples import ExampleCache
 
cache = ExampleCache()
print(f"Cache size: {cache.size()}")
print(f"Hit rate: {cache.hit_rate()}")
cache.clear()  # Reset if needed

📚 Related Documentation

Seeds README: database/schemas/seeds/README.md
Examples README: database/storage/examples/README.md
Retrieval System: database/storage/RETRIEVAL_SYSTEM_README.md
Migration Guide: database/storage/MIGRATION_GUIDE.md
Seed Script: scripts/seed.examples.py

🎓 Summary

Layer	Purpose	Format	Speed	Use Case
Seeds	Dev/test data	Individual JSON	Slow	Testing, development
Examples	Production KB	JSONL → DB	Fast	AI queries, retrieval
API	Intelligent retrieval	Python API	Fast	Semantic search

Bottom Line:

Seeds for development 📝
Examples for production 🚀
API for intelligence 🧠

Last Updated: 2025-01-14 Maintained by: Database Team

Database Architecture Strategy Data Fabric Architecture

Data Architecture Guide

🏗️ Three-Tier Data System

Overview

📁 Directory Structure

1. Development Seeds (`database/schemas/seeds/`)

2. Production Few-Shot KB (`database/storage/examples/`)

3. Schema Examples (`database/schemas/examples/`)

🔄 Data Lifecycle

Path 1: Development Data (Most Common)

Path 2: Production Examples (AI/ML Use)

Path 3: Migration (Seeds → Production)

🎯 Decision Matrix

🛠️ Common Workflows

Workflow 1: Add Test Data

Workflow 2: Add Production Example

Workflow 3: Update Example Quality

⚠️ Important Conventions

DO ✅

DON'T ❌

🔧 Maintenance Tasks

Weekly

Monthly

Quarterly

📊 Metrics & Analytics

Production Health

🐛 Troubleshooting

Seeds not loading

Examples not retrieving

Cache not working

📚 Related Documentation

🎓 Summary

Platform

Documentation

Community

Support

Data Architecture Guide

🏗️ Three-Tier Data System

Overview

📁 Directory Structure

1. Development Seeds (database/schemas/seeds/)

2. Production Few-Shot KB (database/storage/examples/)

3. Schema Examples (database/schemas/examples/)

🔄 Data Lifecycle

Path 1: Development Data (Most Common)

Path 2: Production Examples (AI/ML Use)

Path 3: Migration (Seeds → Production)

🎯 Decision Matrix

🛠️ Common Workflows

Workflow 1: Add Test Data

Workflow 2: Add Production Example

Workflow 3: Update Example Quality

⚠️ Important Conventions

DO ✅

DON'T ❌

🔧 Maintenance Tasks

Weekly

Monthly

Quarterly

📊 Metrics & Analytics

Production Health

🐛 Troubleshooting

Seeds not loading

Examples not retrieving

Cache not working

📚 Related Documentation

🎓 Summary

Platform

Documentation

Community

Support

1. Development Seeds (`database/schemas/seeds/`)

2. Production Few-Shot KB (`database/storage/examples/`)

3. Schema Examples (`database/schemas/examples/`)