Source: data_layer/docs/DATA_ARCHITECTURE_GUIDE.md
Data Architecture Guide
ποΈ Three-Tier Data System
Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DEVELOPMENT LAYER β
β database/schemas/seeds/ β
β Individual JSON files (240+ files) β
β Purpose: Testing, dev, version control β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
(Optional Migration)
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PRODUCTION LAYER β
β database/storage/examples/ β
β JSONL files β Prisma DB β
β Purpose: AI few-shot learning, retrieval, caching β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
(Query Interface)
β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β APPLICATION LAYER β
β Intelligent Retrieval API + Embeddings β
β Purpose: Production queries with semantic matching β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββπ Directory Structure
1. Development Seeds (database/schemas/seeds/)
Purpose: Granular, version-controlled test data
seeds/
βββ leagues/ # 101 individual league definitions
βββ questionnaires/ # 53 questionnaire examples
βββ schemas/ # 105 schema seed definitions
βββ samples/ # 5 small sample datasets
βββ synthetic_emails/ # 34 generated test emailsWhen to Use:
- β Development and testing
- β Bootstrapping new tests
- β Manual data curation
- β Version control of examples
- β Production queries (too slow)
Access Pattern:
from database.schemas.seeds import load_seed
league = load_seed("leagues/mltt.seed.json")2. Production Few-Shot KB (database/storage/examples/)
Purpose: High-performance few-shot learning system
examples/
βββ data/ # JSONL source files (11 files)
β βββ triage.jsonl # Email classification
β βββ contract-generation.jsonl # Contract examples
β βββ questionnaires.jsonl # Questionnaire processing
β βββ league_examples.jsonl # League data examples
β βββ schema_definitions.jsonl # Schema examples
β βββ legacy_seeds.jsonl # β οΈ Migrated historical data
β βββ ...
βββ api.py # High-level API
βββ retriever.py # Core retrieval logic
βββ matcher.py # Semantic similarity
βββ cache.py # LRU cachingWhen to Use:
- β Production AI queries
- β Semantic similarity matching
- β Context-aware example selection
- β High-performance retrieval
- β Manual editing (edit JSONL then reseed)
Access Pattern:
from database.storage.examples import FewShotExamplesAPI
api = FewShotExamplesAPI()
examples = await api.get_examples_for_prompt(
prompt_text="Premium basketball partnership...",
prompt_type="triage",
max_examples=3
)3. Schema Examples (database/schemas/examples/)
Purpose: Reserved for future schema documentation
examples/
βββ few_shot/ # Empty - reserved for schema patternsPotential Future Uses:
- Schema definition examples
- Documentation of schema compliance
- Pattern libraries
π Data Lifecycle
Path 1: Development Data (Most Common)
1. Create/edit seed file
ββ> database/schemas/seeds/leagues/new-league.seed.json
2. Use in development
ββ> load_seed("leagues/new-league.seed.json")
3. Version control
ββ> git add + commitPath 2: Production Examples (AI/ML Use)
1. Curate examples
ββ> Edit database/storage/examples/data/triage.jsonl
2. Seed database
ββ> uv run python scripts/seed.examples.py --category triage
3. Query in production
ββ> FewShotExamplesAPI().get_examples_for_prompt(...)
4. Track usage & improve
ββ> Analytics update quality_score and usage_countPath 3: Migration (Seeds β Production)
1. Identify high-quality seed
ββ> Review database/schemas/seeds/leagues/example.json
2. Convert to JSONL format
ββ> Add to appropriate .jsonl file with metadata:
{
"id": "unique_id",
"scenario": "partnership",
"sport": "basketball",
"tier": "premium",
"quality_score": 0.95,
"input": {...},
"output": {...}
}
3. Seed to database
ββ> uv run python scripts/seed.examples.py
4. Keep original seed
ββ> Don't delete from seeds/ (different purpose)π― Decision Matrix
| Question | Use Seeds | Use Examples | Use Both |
|---|---|---|---|
| Running unit tests? | β | β | - |
| Development iteration? | β | β | - |
| Production AI queries? | β | β | - |
| Semantic search needed? | β | β | - |
| Manual data curation? | β | Edit JSONL | - |
| Version control examples? | β | β JSONL | - |
| Need fast retrieval? | β | β | - |
| Bootstrap new feature? | β | Maybe | β |
π οΈ Common Workflows
Workflow 1: Add Test Data
# 1. Create seed file
cat > database/schemas/seeds/leagues/new-league.seed.json << EOF
{
"league_id": "new-001",
"name": "New League",
...
}
EOF
# 2. Use in tests
python -c "from database.schemas.seeds import load_seed; print(load_seed('leagues/new-league.seed.json'))"
# 3. Commit
git add database/schemas/seeds/leagues/new-league.seed.json
git commit -m "feat: add new league test data"Workflow 2: Add Production Example
# 1. Edit JSONL (source of truth)
vim database/storage/examples/data/triage.jsonl
# Add line: {"id": "triage_new", "scenario": "partnership", ...}
# 2. Seed database
uv run python scripts/seed.examples.py --category triage
# 3. Verify in production code
python -c "
from database.storage.examples import FewShotExamplesAPI
api = FewShotExamplesAPI()
# Test retrieval
"
# 4. Commit JSONL
git add database/storage/examples/data/triage.jsonl
git commit -m "feat: add new triage example"Workflow 3: Update Example Quality
# 1. Edit JSONL file
vim database/storage/examples/data/triage.jsonl
# Update quality_score or fix issues
# 2. Reseed database (upserts existing)
uv run python scripts/seed.examples.py --category triage
# 3. Verify
psql $DATABASE_URL -c "SELECT example_id, quality_score FROM \"FewShotExample\" WHERE example_id = 'triage_001';"β οΈ Important Conventions
DO β
- Edit JSONL files as source of truth for production examples
- Use seed script to populate database (never edit DB directly)
- Query via Prisma/API for production use
- Keep seeds for testing even if migrated to examples
- Track usage in database for quality improvement
- Version both seeds and JSONL files in git
DON'T β
- Don't edit database directly - edit JSONL then reseed
- Don't query JSONL in production - use API/Prisma
- Don't delete seeds when migrating to examples
- Don't mix purposes - keep layers separate
- Don't skip seed script - maintains consistency
- Don't create duplicate files - use versioning instead
π§ Maintenance Tasks
Weekly
# Check example quality
psql $DATABASE_URL -c "
SELECT category, COUNT(*), AVG(quality_score), AVG(usage_count)
FROM \"FewShotExample\"
GROUP BY category
ORDER BY AVG(quality_score) ASC;
"
# Identify unused examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score, usage_count
FROM \"FewShotExample\"
WHERE usage_count = 0
AND created_at < NOW() - INTERVAL '30 days'
LIMIT 10;
"Monthly
# Review and improve low-quality examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score
FROM \"FewShotExample\"
WHERE quality_score < 0.70
ORDER BY usage_count DESC
LIMIT 20;
"
# Update JSONL files based on findings
# Then reseed: uv run python scripts/seed.examples.pyQuarterly
# Archive old legacy data if no longer needed
# Review database/storage/examples/data/legacy_seeds.jsonl
# Consider removing if fully migrated
# Audit for duplicates
python database/scripts/audit_duplicates.pyπ Metrics & Analytics
Production Health
-- Example usage by category
SELECT
category,
COUNT(*) as total,
AVG(quality_score) as avg_quality,
AVG(usage_count) as avg_usage,
MAX(usage_count) as max_usage
FROM "FewShotExample"
GROUP BY category
ORDER BY avg_usage DESC;
-- Find high-value examples
SELECT
example_id,
category,
quality_score,
usage_count,
(quality_score * LN(usage_count + 1)) as value_score
FROM "FewShotExample"
WHERE usage_count > 0
ORDER BY value_score DESC
LIMIT 20;π Troubleshooting
Seeds not loading
# Check file exists
ls -lh database/schemas/seeds/leagues/mltt.seed.json
# Check format
python -c "import json; json.load(open('database/schemas/seeds/leagues/mltt.seed.json'))"Examples not retrieving
# Check database connection
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\";"
# Check if seeded
psql $DATABASE_URL -c "SELECT COUNT(*) FROM \"FewShotExample\" WHERE category = 'triage';"
# Reseed if needed
uv run python scripts/seed.examples.py --clear --category triageCache not working
from database.storage.examples import ExampleCache
cache = ExampleCache()
print(f"Cache size: {cache.size()}")
print(f"Hit rate: {cache.hit_rate()}")
cache.clear() # Reset if neededπ Related Documentation
- Seeds README:
database/schemas/seeds/README.md - Examples README:
database/storage/examples/README.md - Retrieval System:
database/storage/RETRIEVAL_SYSTEM_README.md - Migration Guide:
database/storage/MIGRATION_GUIDE.md - Seed Script:
scripts/seed.examples.py
π Summary
| Layer | Purpose | Format | Speed | Use Case |
|---|---|---|---|---|
| Seeds | Dev/test data | Individual JSON | Slow | Testing, development |
| Examples | Production KB | JSONL β DB | Fast | AI queries, retrieval |
| API | Intelligent retrieval | Python API | Fast | Semantic search |
Bottom Line:
- Seeds for development π
- Examples for production π
- API for intelligence π§
Last Updated: 2025-01-14 Maintained by: Database Team