Source: data_layer/docs/CONSOLIDATION_PLAN.md
Examples Consolidation Plan
Goal
Consolidate example data under output-styles/examples/ while keeping retrieval infrastructure in few_shot_examples_training_data/.
Current State
Data Location 1: output-styles/examples/
- ✅ Structured JSON examples (by-scenario/, edge-cases/)
- ✅ Master index (examples_index.json)
- ✅ Validation framework
- ❌ No JSONL seed files
- ❌ No embedding storage
Data Location 2: few_shot_examples_training_data/data/
- ✅ JSONL seed files (triage.jsonl, contract-generation.jsonl, etc.)
- ✅ Source-of-truth for database seeding
- ❌ Separated from structured examples
- ❌ No clear organization by scenario
Infrastructure: few_shot_examples_training_data/ (Python modules)
- ✅ api.py - High-level API interface
- ✅ retriever.py - Core retrieval logic
- ✅ matcher.py - Semantic similarity matching
- ✅ cache.py - LRU caching system
- ✅ example_manager.py - JSONL utilities
Proposed Structure
output-styles/
└── examples/
    ├── README.md                      # Master documentation
    ├── examples_index.json            # Existing index (keep)
    ├── validation-framework.json      # Existing validation (keep)
    │
    ├── by-scenario/                   # Existing (keep)
    │   ├── triage.json
    │   ├── response-generation.json
    │   ├── pdf-processing.json
    │   ├── contract-generation.json
    │   ├── onboarding.json
    │   └── workflow-chain.json
    │
    ├── edge-cases/                    # Existing (keep)
    │   └── edge-cases.json
    │
    ├── seeds/                         # NEW: JSONL files for DB seeding
    │   ├── README.md                  # JSONL format documentation
    │   ├── triage.jsonl               # Move from few_shot/data/
    │   ├── contract-generation.jsonl
    │   ├── pdf-processing.jsonl
    │   ├── response-generation.jsonl
    │   ├── onboarding-response.jsonl
    │   ├── league_examples.jsonl
    │   ├── questionnaires.jsonl
    │   └── schema_definitions.jsonl
    │
    └── embeddings/                    # NEW: Vector embeddings storage
        ├── README.md
        ├── triage_embeddings.npy
        ├── contract_embeddings.npy
        ├── response_embeddings.npy
        └── metadata.json
database/few_shot_examples_training_data/   # Keep infrastructure
├── __init__.py
├── api.py                # Keep
├── retriever.py          # Keep (update paths)
├── matcher.py            # Keep
├── cache.py              # Keep
├── example_manager.py    # Keep (update paths)
├── README.md             # Update with new paths
└── MIGRATION.md          # Migration notes
Step-by-Step Migration
Phase 1: Set Up New Structure
# Create new directories
mkdir -p output-styles/examples/seeds
mkdir -p output-styles/examples/embeddings
# Create documentation
touch output-styles/examples/seeds/README.md
touch output-styles/examples/embeddings/README.md
Phase 2: Move JSONL Files
# Move JSONL files to seeds/
mv database/few_shot_examples_training_data/data/*.jsonl \
    output-styles/examples/seeds/
# Move subdirectories if needed
mv database/few_shot_examples_training_data/data/contract_sections \
    output-styles/examples/seeds/
Phase 3: Update Infrastructure Code
# In example_manager.py, update default path:
DEFAULT_DATA_DIR = Path("output-styles/examples/seeds")
# In retriever.py, update paths:
EXAMPLES_DIR = Path("output-styles/examples")
SEEDS_DIR = EXAMPLES_DIR / "seeds"
EMBEDDINGS_DIR = EXAMPLES_DIR / "embeddings"
Phase 4: Update Seed Script
# In scripts/seed.examples.py
DATA_DIR = Path("output-styles/examples/seeds")
Phase 5: Clean Up Old Structure
# Remove old data directory (after verifying migration)
rm -rf database/few_shot_examples_training_data/data/
# Update README files to reflect new structure
Update Required Files
1. few_shot_examples_training_data/example_manager.py
# OLD:
DEFAULT_DATA_DIR = Path(__file__).parent / "data"
# NEW: parents[2] is the directory that holds both database/ and output-styles/,
# assuming the layout shown under Proposed Structure.
DEFAULT_DATA_DIR = Path(__file__).resolve().parents[2] / "output-styles" / "examples" / "seeds"
2. few_shot_examples_training_data/retriever.py
# Add configuration for example locations
class ExamplesConfig:
    EXAMPLES_ROOT = Path("output-styles/examples")
    SEEDS_DIR = EXAMPLES_ROOT / "seeds"
    STRUCTURED_DIR = EXAMPLES_ROOT / "by-scenario"
    EDGE_CASES_DIR = EXAMPLES_ROOT / "edge-cases"
    EMBEDDINGS_DIR = EXAMPLES_ROOT / "embeddings"
3. scripts/seed.examples.py
# Update to use new seeds location
JSONL_DIR = Path("output-styles/examples/seeds")
Benefits
Unified Data Location
- All examples in one place: output-styles/examples/
- Clear separation: structured JSON vs JSONL seeds vs embeddings
- Single source of truth for documentation
Clearer Purpose
- output-styles/examples/: All example data (JSON, JSONL, embeddings)
- few_shot_examples_training_data/: Pure infrastructure (API, retrieval, caching)
Better Organization
output-styles/examples/
├── seeds/          # Database seeding (JSONL)
├── by-scenario/    # Human-readable reference (JSON)
├── edge-cases/     # Edge case examples (JSON)
└── embeddings/     # Vector embeddings (NPY/JSON)
Retrieval Flexibility
- Structured JSON: Direct file reading for documentation/reference
- JSONL → Database: Fast indexed queries via Prisma
- Embeddings: Semantic similarity search
- All three approaches work from same unified location
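The embedding-based path is listed as a future enhancement, so this plan does not say how similarity would be computed. As a minimal sketch, assuming embeddings are plain float vectors keyed by example ID and that cosine similarity is the ranking metric (both are assumptions, and `top_k_similar` is a hypothetical helper, not part of matcher.py), ranking could look like this:

```python
from __future__ import annotations

import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    candidates: dict[str, Sequence[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Return the top_k (example_id, score) pairs, highest similarity first."""
    scored = [
        (example_id, cosine_similarity(query, vector))
        for example_id, vector in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A pure-Python version like this is only practical for small example sets; vectors stored as .npy files would more likely be compared with a vectorized library.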
Migration Risks & Mitigation
Risk 1: Broken Import Paths
Mitigation: Update all imports systematically, test each module
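One way to make the path update systematic is to scan the code for references to the old location before and after editing. The sketch below is illustrative only: `find_stale_references` is a hypothetical helper, and the marker strings are assumptions based on the old paths shown in this plan.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: these substrings identify references to the old data location.
STALE_MARKERS = ("few_shot_examples_training_data/data", 'parent / "data"')


def find_stale_references(
    root: Path, markers: tuple[str, ...] = STALE_MARKERS
) -> list[tuple[Path, int, str]]:
    """Return (file, line_number, line) for every .py line that mentions a stale marker."""
    hits = []
    for py_file in sorted(root.rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if any(marker in line for marker in markers):
                hits.append((py_file, line_number, line.strip()))
    return hits


if __name__ == "__main__":
    for path, line_number, line in find_stale_references(Path(".")):
        print(f"{path}:{line_number}: {line}")
```

If the script is saved inside the scanned tree it will report its own marker strings, so it is simplest to keep it outside the directories being checked.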
Risk 2: Lost Data
Mitigation:
- Back up before migration: cp -r database/few_shot_examples_training_data/data backup/
- Verify file counts match
- Run validation tests
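The backup and count check can also be scripted so that the comparison covers file contents, not just counts. This is a rough sketch under the assumption that the data directory is small enough to hash in memory; `file_digests` and `backup_and_verify` are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def file_digests(directory: Path) -> dict[str, str]:
    """Map each file's path (relative to `directory`) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            relative = path.relative_to(directory).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup_and_verify(source: Path, backup: Path) -> int:
    """Copy `source` to `backup`, confirm the copy is byte-identical, return the file count."""
    shutil.copytree(source, backup)  # raises FileExistsError if `backup` already exists
    original = file_digests(source)
    copied = file_digests(backup)
    if original != copied:
        raise RuntimeError("backup does not match source")
    return len(original)
```

The same `file_digests` comparison could be reused after Phase 2 to check the moved files against the backup.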
Risk 3: Database Seed Script Fails
Mitigation:
- Update seed script path first
- Test on single category
- Run full seed after verification
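Before seeding a single category, it may help to confirm that its JSONL file parses cleanly from the new location. The sketch below assumes each non-blank line is one JSON object, which this plan does not state explicitly; `validate_jsonl` is a hypothetical helper and does not check any field names.

```python
from __future__ import annotations

import json
from pathlib import Path


def validate_jsonl(path: Path) -> tuple[int, list[tuple[int, str]]]:
    """Parse every non-blank line; return (valid_record_count, [(line_number, error), ...])."""
    valid = 0
    errors = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_number, exc.msg))
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                errors.append((line_number, "record is not a JSON object"))
    return valid, errors
```

For example, `validate_jsonl(Path("output-styles/examples/seeds/triage.jsonl"))` would report any malformed lines before the seed script touches the database.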
Validation Checklist
- All JSONL files moved to output-styles/examples/seeds/
- Infrastructure code updated with new paths
- Seed script works with new location
- API can retrieve examples from new location
- Retriever can load from both JSON and JSONL
- Tests pass
- Documentation updated
- Old data directory removed
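The file-system items on this checklist can be checked mechanically; the remaining items (seed script, API, tests, documentation) still need to be run or reviewed separately. A minimal sketch, with `check_migration` as a hypothetical helper and the directory arguments supplied by the caller:

```python
from __future__ import annotations

from pathlib import Path


def check_migration(seeds_dir: Path, old_data_dir: Path) -> dict[str, bool]:
    """Evaluate the file-system checklist items; True means the item is satisfied."""
    old_has_jsonl = old_data_dir.is_dir() and any(old_data_dir.rglob("*.jsonl"))
    return {
        "seeds directory contains JSONL files": seeds_dir.is_dir()
        and any(seeds_dir.glob("*.jsonl")),
        "no JSONL files left in old data directory": not old_has_jsonl,
        "old data directory removed": not old_data_dir.exists(),
    }


if __name__ == "__main__":
    results = check_migration(
        Path("output-styles/examples/seeds"),
        Path("database/few_shot_examples_training_data/data"),
    )
    for item, passed in results.items():
        print(f"{'PASS' if passed else 'FAIL'}: {item}")
```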
Post-Migration
Update Documentation
- output-styles/examples/README.md - Master examples documentation
- output-styles/examples/seeds/README.md - JSONL format guide
- few_shot_examples_training_data/README.md - Update infrastructure docs
Test All Retrieval Methods
# Test 1: Direct JSON retrieval
import json
from pathlib import Path
examples = json.loads(
    Path("output-styles/examples/by-scenario/triage.json").read_text(encoding="utf-8")
)
# Test 2: JSONL database seeding (shell command, run from the project root)
uv run python scripts/seed.examples.py --category triage
# Tests 3 and 4 use await, so they must run inside a coroutine
import asyncio
from database.few_shot_examples_training_data import FewShotExamplesAPI, SemanticMatcher
async def test_retrieval():
    # Test 3: API retrieval
    api = FewShotExamplesAPI()
    examples = await api.get_examples_for_prompt("partnership inquiry", "triage")
    # Test 4: Embedding-based retrieval (future)
    matcher = SemanticMatcher()
    similar = await matcher.find_similar("partnership inquiry", top_k=5)
asyncio.run(test_retrieval())
Summary
Before:
- ❌ Two separate data locations
- ❌ Unclear which is source of truth
- ❌ No embedding storage
- ❌ Infrastructure mixed with data
After:
- ✅ Unified data location: output-styles/examples/
- ✅ Clear separation: data vs infrastructure
- ✅ Multiple retrieval strategies: JSON, JSONL→DB, embeddings
- ✅ Clean, maintainable structure
Next Steps:
- Execute Phase 1-5 migration
- Run validation checklist
- Update all documentation
- Archive old structure
- Implement embedding generation pipeline (future enhancement)