Examples Consolidation Plan

Source: data_layer/docs/CONSOLIDATION_PLAN.md

🎯 Goal

Consolidate example data under output-styles/examples/ while keeping retrieval infrastructure in few_shot_examples_training_data/.

📋 Current State

Data Location 1: output-styles/examples/

✅ Structured JSON examples (by-scenario/, edge-cases/)
✅ Master index (examples_index.json)
✅ Validation framework
❌ No JSONL seed files
❌ No embedding storage

Data Location 2: few_shot_examples_training_data/data/

✅ JSONL seed files (triage.jsonl, contract-generation.jsonl, etc.)
✅ Source-of-truth for database seeding
❌ Separated from structured examples
❌ No clear organization by scenario

Infrastructure: few_shot_examples_training_data/ (Python modules)

✅ api.py - High-level API interface
✅ retriever.py - Core retrieval logic
✅ matcher.py - Semantic similarity matching
✅ cache.py - LRU caching system
✅ example_manager.py - JSONL utilities

🚀 Proposed Structure

output-styles/
└── examples/
    ├── README.md                          # Master documentation
    ├── examples_index.json                # Existing index (keep)
    ├── validation-framework.json          # Existing validation (keep)
    │
    ├── by-scenario/                       # Existing (keep)
    │   ├── triage.json
    │   ├── response-generation.json
    │   ├── pdf-processing.json
    │   ├── contract-generation.json
    │   ├── onboarding.json
    │   └── workflow-chain.json
    │
    ├── edge-cases/                        # Existing (keep)
    │   └── edge-cases.json
    │
    ├── seeds/                             # NEW: JSONL files for DB seeding
    │   ├── README.md                      # JSONL format documentation
    │   ├── triage.jsonl                   # Move from few_shot/data/
    │   ├── contract-generation.jsonl
    │   ├── pdf-processing.jsonl
    │   ├── response-generation.jsonl
    │   ├── onboarding-response.jsonl
    │   ├── league_examples.jsonl
    │   ├── questionnaires.jsonl
    │   └── schema_definitions.jsonl
    │
    └── embeddings/                        # NEW: Vector embeddings storage
        ├── README.md
        ├── triage_embeddings.npy
        ├── contract_embeddings.npy
        ├── response_embeddings.npy
        └── metadata.json

database/few_shot_examples_training_data/    # Keep infrastructure
├── __init__.py
├── api.py                                   # Keep
├── retriever.py                             # Keep (update paths)
├── matcher.py                               # Keep
├── cache.py                                 # Keep
├── example_manager.py                       # Keep (update paths)
├── README.md                                # Update with new paths
└── MIGRATION.md                             # Migration notes

πŸ“ Step-by-Step Migration

Phase 1: Set Up New Structure ✅

# Create new directories
mkdir -p output-styles/examples/seeds
mkdir -p output-styles/examples/embeddings
 
# Create documentation
touch output-styles/examples/seeds/README.md
touch output-styles/examples/embeddings/README.md

Phase 2: Move JSONL Files ✅

# Move JSONL files to seeds/
mv database/few_shot_examples_training_data/data/*.jsonl \
   output-styles/examples/seeds/
 
# Move subdirectories if needed
mv database/few_shot_examples_training_data/data/contract_sections \
   output-styles/examples/seeds/

Phase 3: Update Infrastructure Code 🔄

# In example_manager.py, update default path:
DEFAULT_DATA_DIR = Path("output-styles/examples/seeds")
 
# In retriever.py, update paths:
EXAMPLES_DIR = Path("output-styles/examples")
SEEDS_DIR = EXAMPLES_DIR / "seeds"
EMBEDDINGS_DIR = EXAMPLES_DIR / "embeddings"

Phase 4: Update Seed Script 🔄

# In scripts/seed.examples.py
DATA_DIR = Path("output-styles/examples/seeds")

Phase 5: Clean Up Old Structure ✅

# Remove old data directory (after verifying migration)
rm -rf database/few_shot_examples_training_data/data/
 
# Update README files to reflect new structure

🔄 Update Required Files

1. few_shot_examples_training_data/example_manager.py

# OLD:
DEFAULT_DATA_DIR = Path(__file__).parent / "data"
 
# NEW:
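# parents[2] is the repository root for database/few_shot_examples_training_data/example_manager.py
# (assumes output-styles/ sits at the repository root, as in the migration commands)
DEFAULT_DATA_DIR = Path(__file__).resolve().parents[2] / "output-styles" / "examples" / "seeds"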

2. few_shot_examples_training_data/retriever.py

# Add configuration for example locations
class ExamplesConfig:
    EXAMPLES_ROOT = Path("output-styles/examples")
    SEEDS_DIR = EXAMPLES_ROOT / "seeds"
    STRUCTURED_DIR = EXAMPLES_ROOT / "by-scenario"
    EDGE_CASES_DIR = EXAMPLES_ROOT / "edge-cases"
    EMBEDDINGS_DIR = EXAMPLES_ROOT / "embeddings"

3. scripts/seed.examples.py

# Update to use new seeds location
JSONL_DIR = Path("output-styles/examples/seeds")
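Once the paths point at the new seeds directory, a quick way to confirm that the files are readable is to parse them. The sketch below assumes only that each seed file is standard JSONL (one JSON value per line); the helper names `iter_jsonl` and `count_seed_records` are hypothetical and are not part of example_manager.py or the seed script.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the file and line so a bad record is easy to locate.
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc


def count_seed_records(seeds_dir: Path) -> dict[str, int]:
    """Map each *.jsonl file name in seeds_dir to its record count."""
    return {
        seed_file.name: sum(1 for _ in iter_jsonl(seed_file))
        for seed_file in sorted(seeds_dir.glob("*.jsonl"))
    }
```

Calling `count_seed_records(Path("output-styles/examples/seeds"))` from the repository root would give a per-file record count that can be compared against the number of rows the seed script inserts.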

🎯 Benefits

✅ Unified Data Location

  • All examples in one place: output-styles/examples/
  • Clear separation: structured JSON vs JSONL seeds vs embeddings
  • Single source of truth for documentation

✅ Clearer Purpose

  • output-styles/examples/: All example data (JSON, JSONL, embeddings)
  • few_shot_examples_training_data/: Pure infrastructure (API, retrieval, caching)

✅ Better Organization

output-styles/examples/
├── seeds/              # Database seeding (JSONL)
├── by-scenario/        # Human-readable reference (JSON)
├── edge-cases/         # Edge case examples (JSON)
└── embeddings/         # Vector embeddings (NPY/JSON)

✅ Retrieval Flexibility

  • Structured JSON: Direct file reading for documentation/reference
  • JSONL → Database: Fast indexed queries via Prisma
  • Embeddings: Semantic similarity search
  • All three approaches work from same unified location
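To make the embedding route concrete, here is a minimal sketch of a semantic similarity search: rank stored vectors by cosine similarity to a query vector and keep the best k. It is written in plain Python so it runs without dependencies; in practice the vectors would presumably be loaded from the .npy files. The function names are hypothetical, and nothing here is taken from matcher.py, whose actual scoring method the plan does not describe.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 5,
) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(index, cosine_similarity(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]
```

The returned indices would then be mapped back to example records, for instance through a lookup such as the planned metadata.json, whose schema is not specified here.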

🚨 Migration Risks & Mitigation

Risk 1: Broken Import Paths

Mitigation: Update all imports systematically, test each module
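One way to make the update systematic is to scan the code base for leftover references to the old data directory. The sketch below is an assumption about how such a check could look, not an existing project tool; `find_stale_references` is a hypothetical name. It only catches literal occurrences of the old path string, so constructions such as `Path(__file__).parent / "data"` still need a manual review.

```python
from pathlib import Path

# Old data location named in this plan; any literal mention of it is suspect after migration.
OLD_FRAGMENT = "few_shot_examples_training_data/data"


def find_stale_references(root: Path, needle: str = OLD_FRAGMENT) -> list[tuple[str, int]]:
    """Return (relative file path, line number) for every line in *.py files containing needle."""
    hits: list[tuple[str, int]] = []
    for py_file in sorted(root.rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((py_file.relative_to(root).as_posix(), line_number))
    return hits
```

An empty result from `find_stale_references(Path("."))` suggests that no hard-coded string still points at the old location.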

Risk 2: Lost Data

Mitigation:

  1. Back up before migration: cp -r database/few_shot_examples_training_data/data backup/
  2. Verify file counts match
  3. Run validation tests
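The file-count check in step 2 can be strengthened by also comparing content hashes, which catches truncated or altered files that a count alone would miss. This is a minimal sketch under assumptions: the helper names are hypothetical, and the backup directory path depends on how the `cp -r` command was run (it may be `backup/` or `backup/data/`).

```python
import hashlib
from pathlib import Path


def file_digests(directory: Path, pattern: str = "*.jsonl") -> dict[str, str]:
    """Map each matching file name in directory to the SHA-256 hex digest of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.glob(pattern))
    }


def compare_directories(backup_dir: Path, migrated_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means every backed-up file arrived unchanged."""
    before = file_digests(backup_dir)
    after = file_digests(migrated_dir)
    problems = [f"missing after migration: {name}" for name in sorted(before.keys() - after.keys())]
    problems += [
        f"content changed: {name}"
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    ]
    return problems
```

Running `compare_directories(Path("backup"), Path("output-styles/examples/seeds"))` before deleting the old data directory gives a simple go/no-go signal for Phase 5.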

Risk 3: Database Seed Script Fails

Mitigation:

  1. Update seed script path first
  2. Test on single category
  3. Run full seed after verification

✅ Validation Checklist

  • All JSONL files moved to output-styles/examples/seeds/
  • Infrastructure code updated with new paths
  • Seed script works with new location
  • API can retrieve examples from new location
  • Retriever can load from both JSON and JSONL
  • Tests pass
  • Documentation updated
  • Old data directory removed
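The structural items on this checklist can be checked mechanically. The sketch below reports which of the expected entries under output-styles/examples/ are absent; the list of expected paths is taken from the proposed structure in this plan, while the function name `missing_paths` is hypothetical. It does not cover the behavioural items (seed script, API, tests), which still need to be run.

```python
from pathlib import Path

# Entries the proposed structure expects directly under output-styles/examples/.
EXPECTED_PATHS = (
    "examples_index.json",
    "validation-framework.json",
    "by-scenario",
    "edge-cases",
    "seeds",
    "embeddings",
)


def missing_paths(examples_root: Path, expected: tuple[str, ...] = EXPECTED_PATHS) -> list[str]:
    """Return the expected entries that do not exist under examples_root, in declared order."""
    return [relative for relative in expected if not (examples_root / relative).exists()]
```

For example, `missing_paths(Path("output-styles/examples"))` returning an empty list indicates the directory layout matches the plan.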

📚 Post-Migration

Update Documentation

  1. output-styles/examples/README.md - Master examples documentation
  2. output-styles/examples/seeds/README.md - JSONL format guide
  3. few_shot_examples_training_data/README.md - Update infrastructure docs

Test All Retrieval Methods

# Test 1: Direct JSON retrieval
import json
from pathlib import Path
# read_text() closes the file for us, unlike a bare open()
examples = json.loads(Path("output-styles/examples/by-scenario/triage.json").read_text(encoding="utf-8"))

# Test 2: JSONL database seeding (shell command, not Python)
uv run python scripts/seed.examples.py --category triage

# Test 3: API retrieval (await must run inside an async function or an asyncio-aware REPL)
from database.few_shot_examples_training_data import FewShotExamplesAPI
api = FewShotExamplesAPI()
examples = await api.get_examples_for_prompt("partnership inquiry", "triage")

# Test 4: Embedding-based retrieval (future; same async requirement as Test 3)
from database.few_shot_examples_training_data import SemanticMatcher
matcher = SemanticMatcher()
similar = await matcher.find_similar("partnership inquiry", top_k=5)

🎓 Summary

Before:

❌ Two separate data locations
❌ Unclear which is source of truth
❌ No embedding storage
❌ Infrastructure mixed with data

After:

✅ Unified data location: output-styles/examples/
✅ Clear separation: data vs infrastructure
✅ Multiple retrieval strategies: JSON, JSONL→DB, embeddings
✅ Clean, maintainable structure

Next Steps:

  1. Execute Phase 1-5 migration
  2. Run validation checklist
  3. Update all documentation
  4. Archive old structure
  5. Implement embedding generation pipeline (future enhancement)
