Architecture
Data Architecture Guide

Source: data_layer/docs/DATA_ARCHITECTURE_GUIDE.md

Data Architecture Guide

πŸ—οΈ Three-Tier Data System

Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DEVELOPMENT LAYER                         β”‚
β”‚              database/schemas/seeds/                         β”‚
β”‚           Individual JSON files (240+ files)                 β”‚
β”‚        Purpose: Testing, dev, version control               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↓
                    (Optional Migration)
                            ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    PRODUCTION LAYER                          β”‚
β”‚            database/storage/examples/                        β”‚
β”‚              JSONL files β†’ Prisma DB                        β”‚
β”‚    Purpose: AI few-shot learning, retrieval, caching       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            ↓
                    (Query Interface)
                            ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   APPLICATION LAYER                          β”‚
β”‚         Intelligent Retrieval API + Embeddings              β”‚
β”‚    Purpose: Production queries with semantic matching       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“ Directory Structure

1. Development Seeds (database/schemas/seeds/)

Purpose: Granular, version-controlled test data

seeds/
β”œβ”€β”€ leagues/          # 101 individual league definitions
β”œβ”€β”€ questionnaires/   # 53 questionnaire examples
β”œβ”€β”€ schemas/          # 105 schema seed definitions
β”œβ”€β”€ samples/          # 5 small sample datasets
└── synthetic_emails/ # 34 generated test emails

When to Use:

  • βœ… Development and testing
  • βœ… Bootstrapping new tests
  • βœ… Manual data curation
  • βœ… Version control of examples
  • ❌ Production queries (too slow)

Access Pattern:

from database.schemas.seeds import load_seed
league = load_seed("leagues/mltt.seed.json")

2. Production Few-Shot KB (database/storage/examples/)

Purpose: High-performance few-shot learning system

examples/
β”œβ”€β”€ data/                          # JSONL source files (11 files)
β”‚   β”œβ”€β”€ triage.jsonl              # Email classification
β”‚   β”œβ”€β”€ contract-generation.jsonl # Contract examples
β”‚   β”œβ”€β”€ questionnaires.jsonl      # Questionnaire processing
β”‚   β”œβ”€β”€ league_examples.jsonl     # League data examples
β”‚   β”œβ”€β”€ schema_definitions.jsonl  # Schema examples
β”‚   β”œβ”€β”€ legacy_seeds.jsonl        # ⚠️  Migrated historical data
β”‚   └── ...
β”œβ”€β”€ api.py                         # High-level API
β”œβ”€β”€ retriever.py                   # Core retrieval logic
β”œβ”€β”€ matcher.py                     # Semantic similarity
└── cache.py                       # LRU caching

When to Use:

  • βœ… Production AI queries
  • βœ… Semantic similarity matching
  • βœ… Context-aware example selection
  • βœ… High-performance retrieval
  • ❌ Manual editing (edit JSONL then reseed)

Access Pattern:

from database.storage.examples import FewShotExamplesAPI
 
api = FewShotExamplesAPI()
examples = await api.get_examples_for_prompt(
    prompt_text="Premium basketball partnership...",
    prompt_type="triage",
    max_examples=3
)

3. Schema Examples (database/schemas/examples/)

Purpose: Reserved for future schema documentation

examples/
└── few_shot/    # Empty - reserved for schema patterns

Potential Future Uses:

  • Schema definition examples
  • Documentation of schema compliance
  • Pattern libraries

πŸ”„ Data Lifecycle

Path 1: Development Data (Most Common)

1. Create/edit seed file
   └─> database/schemas/seeds/leagues/new-league.seed.json

2. Use in development
   └─> load_seed("leagues/new-league.seed.json")

3. Version control
   └─> git add + commit

Path 2: Production Examples (AI/ML Use)

1. Curate examples
   └─> Edit database/storage/examples/data/triage.jsonl

2. Seed database
   └─> uv run python scripts/seed.examples.py --category triage

3. Query in production
   └─> FewShotExamplesAPI().get_examples_for_prompt(...)

4. Track usage & improve
   └─> Analytics update quality_score and usage_count

Path 3: Migration (Seeds β†’ Production)

1. Identify high-quality seed
   └─> Review database/schemas/seeds/leagues/example.json

2. Convert to JSONL format
   └─> Add to appropriate .jsonl file with metadata:
       {
         "id": "unique_id",
         "scenario": "partnership",
         "sport": "basketball",
         "tier": "premium",
         "quality_score": 0.95,
         "input": {...},
         "output": {...}
       }

3. Seed to database
   └─> uv run python scripts/seed.examples.py

4. Keep original seed
   └─> Don't delete from seeds/ (different purpose)

🎯 Decision Matrix

QuestionUse SeedsUse ExamplesUse Both
Running unit tests?βœ…βŒ-
Development iteration?βœ…βŒ-
Production AI queries?βŒβœ…-
Semantic search needed?βŒβœ…-
Manual data curation?βœ…Edit JSONL-
Version control examples?βœ…βœ… JSONL-
Need fast retrieval?βŒβœ…-
Bootstrap new feature?βœ…Maybeβœ…

πŸ› οΈ Common Workflows

Workflow 1: Add Test Data

# 1. Create seed file
cat > database/schemas/seeds/leagues/new-league.seed.json << EOF
{
  "league_id": "new-001",
  "name": "New League",
  ...
}
EOF
 
# 2. Use in tests
python -c "from database.schemas.seeds import load_seed; print(load_seed('leagues/new-league.seed.json'))"
 
# 3. Commit
git add database/schemas/seeds/leagues/new-league.seed.json
git commit -m "feat: add new league test data"

Workflow 2: Add Production Example

# 1. Edit JSONL (source of truth)
vim database/storage/examples/data/triage.jsonl
# Add line: {"id": "triage_new", "scenario": "partnership", ...}
 
# 2. Seed database
uv run python scripts/seed.examples.py --category triage
 
# 3. Verify in production code
python -c "
from database.storage.examples import FewShotExamplesAPI
api = FewShotExamplesAPI()
# Test retrieval
"
 
# 4. Commit JSONL
git add database/storage/examples/data/triage.jsonl
git commit -m "feat: add new triage example"

Workflow 3: Update Example Quality

# 1. Edit JSONL file
vim database/storage/examples/data/triage.jsonl
# Update quality_score or fix issues
 
# 2. Reseed database (upserts existing)
uv run python scripts/seed.examples.py --category triage
 
# 3. Verify
psql $DATABASE_URL -c "SELECT example_id, quality_score FROM \"FewShotExample\" WHERE example_id = 'triage_001';"

⚠️ Important Conventions

DO βœ…

  1. Edit JSONL files as source of truth for production examples
  2. Use seed script to populate database (never edit DB directly)
  3. Query via Prisma/API for production use
  4. Keep seeds for testing even if migrated to examples
  5. Track usage in database for quality improvement
  6. Version both seeds and JSONL files in git

DON'T ❌

  1. Don't edit database directly - edit JSONL then reseed
  2. Don't query JSONL in production - use API/Prisma
  3. Don't delete seeds when migrating to examples
  4. Don't mix purposes - keep layers separate
  5. Don't skip seed script - maintains consistency
  6. Don't create duplicate files - use versioning instead

πŸ”§ Maintenance Tasks

Weekly

# Check example quality
psql $DATABASE_URL -c "
SELECT category, COUNT(*), AVG(quality_score), AVG(usage_count)
FROM \"FewShotExample\"
GROUP BY category
ORDER BY AVG(quality_score) ASC;
"
 
# Identify unused examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score, usage_count
FROM \"FewShotExample\"
WHERE usage_count = 0
  AND created_at < NOW() - INTERVAL '30 days'
LIMIT 10;
"

Monthly

# Review and improve low-quality examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score
FROM \"FewShotExample\"
WHERE quality_score < 0.70
ORDER BY usage_count DESC
LIMIT 20;
"
 
# Update JSONL files based on findings
# Then reseed: uv run python scripts/seed.examples.py

Quarterly

# Archive old legacy data if no longer needed
# Review database/storage/examples/data/legacy_seeds.jsonl
# Consider removing if fully migrated
 
# Audit for duplicates
python database/scripts/audit_duplicates.py

πŸ“Š Metrics & Analytics

Production Health

-- Example usage by category
SELECT 
    category,
    COUNT(*) as total,
    AVG(quality_score) as avg_quality,
    AVG(usage_count) as avg_usage,
    MAX(usage_count) as max_usage
FROM "FewShotExample"
GROUP BY category
ORDER BY avg_usage DESC;
 
-- Find high-value examples
SELECT 
    example_id,
    category,
    quality_score,
    usage_count,
    (quality_score * LN(usage_count + 1)) as value_score
FROM "FewShotExample"
WHERE usage_count > 0
ORDER BY value_score DESC
LIMIT 20;

πŸ› Troubleshooting

Seeds not loading

# Check file exists
ls -lh database/schemas/seeds/leagues/mltt.seed.json
 
# Check format
python -c "import json; json.load(open('database/schemas/seeds/leagues/mltt.seed.json'))"

Examples not retrieving

# Check database connection
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\";"
 
# Check if seeded
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\" WHERE category = 'triage';"
 
# Reseed if needed
uv run python scripts/seed.examples.py --clear --category triage

Cache not working

from database.storage.examples import ExampleCache
 
cache = ExampleCache()
print(f"Cache size: {cache.size()}")
print(f"Hit rate: {cache.hit_rate()}")
cache.clear()  # Reset if needed

πŸ“š Related Documentation

  • Seeds README: database/schemas/seeds/README.md
  • Examples README: database/storage/examples/README.md
  • Retrieval System: database/storage/RETRIEVAL_SYSTEM_README.md
  • Migration Guide: database/storage/MIGRATION_GUIDE.md
  • Seed Script: scripts/seed.examples.py

πŸŽ“ Summary

LayerPurposeFormatSpeedUse Case
SeedsDev/test dataIndividual JSONSlowTesting, development
ExamplesProduction KBJSONL β†’ DBFastAI queries, retrieval
APIIntelligent retrievalPython APIFastSemantic search

Bottom Line:

  • Seeds for development πŸ“
  • Examples for production πŸš€
  • API for intelligence 🧠

Last Updated: 2025-01-14 Maintained by: Database Team

Platform

Documentation

Community

Support

partnership@altsportsdata.comdev@altsportsleagues.ai

2025 Β© AltSportsLeagues.ai. Powered by AI-driven sports business intelligence.

πŸ€– AI-Enhancedβ€’πŸ“Š Data-Drivenβ€’βš‘ Real-Time