Source: data_layer/docs/CONSOLIDATION_PLAN.md

Examples Consolidation Plan

🎯 Goal

Consolidate example data under output-styles/examples/ while keeping retrieval infrastructure in few_shot_examples_training_data/.

📋 Current State

Data Location 1: `output-styles/examples/`

✅ Structured JSON examples (by-scenario/, edge-cases/)
✅ Master index (examples_index.json)
✅ Validation framework
❌ No JSONL seed files
❌ No embedding storage

Data Location 2: `few_shot_examples_training_data/data/`

✅ JSONL seed files (triage.jsonl, contract-generation.jsonl, etc.)
✅ Source-of-truth for database seeding
❌ Separated from structured examples
❌ No clear organization by scenario

Infrastructure: `few_shot_examples_training_data/` (Python modules)

✅ api.py - High-level API interface
✅ retriever.py - Core retrieval logic
✅ matcher.py - Semantic similarity matching
✅ cache.py - LRU caching system
✅ example_manager.py - JSONL utilities

🚀 Proposed Structure

output-styles/
└── examples/
    ├── README.md                          # Master documentation
    ├── examples_index.json                # Existing index (keep)
    ├── validation-framework.json          # Existing validation (keep)
    │
    ├── by-scenario/                       # Existing (keep)
    │   ├── triage.json
    │   ├── response-generation.json
    │   ├── pdf-processing.json
    │   ├── contract-generation.json
    │   ├── onboarding.json
    │   └── workflow-chain.json
    │
    ├── edge-cases/                        # Existing (keep)
    │   └── edge-cases.json
    │
    ├── seeds/                             # NEW: JSONL files for DB seeding
    │   ├── README.md                      # JSONL format documentation
    │   ├── triage.jsonl                   # Move from few_shot_examples_training_data/data/
    │   ├── contract-generation.jsonl
    │   ├── pdf-processing.jsonl
    │   ├── response-generation.jsonl
    │   ├── onboarding-response.jsonl
    │   ├── league_examples.jsonl
    │   ├── questionnaires.jsonl
    │   └── schema_definitions.jsonl
    │
    └── embeddings/                        # NEW: Vector embeddings storage
        ├── README.md
        ├── triage_embeddings.npy
        ├── contract_embeddings.npy
        ├── response_embeddings.npy
        └── metadata.json

database/few_shot_examples_training_data/    # Keep infrastructure
├── __init__.py
├── api.py                                   # Keep
├── retriever.py                             # Keep (update paths)
├── matcher.py                               # Keep
├── cache.py                                 # Keep
├── example_manager.py                       # Keep (update paths)
├── README.md                                # Update with new paths
└── MIGRATION.md                             # Migration notes

📝 Step-by-Step Migration

Phase 1: Setup New Structure ✅

# Create new directories
mkdir -p output-styles/examples/seeds
mkdir -p output-styles/examples/embeddings
 
# Create documentation
touch output-styles/examples/seeds/README.md
touch output-styles/examples/embeddings/README.md

Phase 2: Move JSONL Files ✅

# Move JSONL files to seeds/
mv database/few_shot_examples_training_data/data/*.jsonl \
   output-styles/examples/seeds/
 
# Move subdirectories if needed
mv database/few_shot_examples_training_data/data/contract_sections \
   output-styles/examples/seeds/
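
The move can also be scripted so that a rerun never overwrites files that were already migrated. The following is a minimal sketch, not part of the existing tooling: `move_seed_files` is a hypothetical helper, and it only handles top-level `*.jsonl` files (subdirectories such as `contract_sections` would still be moved separately).

```python
import shutil
from pathlib import Path


def move_seed_files(old_dir: Path, new_dir: Path) -> list[str]:
    """Move every *.jsonl file from old_dir into new_dir; return the moved names."""
    new_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(old_dir.glob("*.jsonl")):
        dest = new_dir / src.name
        if dest.exists():
            # Refuse to overwrite so a rerun cannot clobber migrated data.
            raise FileExistsError(f"{dest} already exists")
        shutil.move(str(src), str(dest))
        moved.append(src.name)
    return moved


# Example call, run from the repository root (paths taken from Phase 2):
# move_seed_files(
#     Path("database/few_shot_examples_training_data/data"),
#     Path("output-styles/examples/seeds"),
# )
```

Returning the moved names makes it easy to compare the result against the file list expected in `seeds/`.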

Phase 3: Update Infrastructure Code 🔄

# In example_manager.py, update default path:
DEFAULT_DATA_DIR = Path("output-styles/examples/seeds")
 
# In retriever.py, update paths:
EXAMPLES_DIR = Path("output-styles/examples")
SEEDS_DIR = EXAMPLES_DIR / "seeds"
EMBEDDINGS_DIR = EXAMPLES_DIR / "embeddings"

Phase 4: Update Seed Script 🔄

# In scripts/seed.examples.py
DATA_DIR = Path("output-styles/examples/seeds")

Phase 5: Clean Up Old Structure ✅

# Remove old data directory (after verifying migration)
rm -rf database/few_shot_examples_training_data/data/
 
# Update README files to reflect new structure

🔄 Update Required Files

1. `few_shot_examples_training_data/example_manager.py`

# OLD:
DEFAULT_DATA_DIR = Path(__file__).parent / "data"
 
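# NEW (parents[2] is the directory containing both database/ and output-styles/,
# assuming this module lives at database/few_shot_examples_training_data/):
DEFAULT_DATA_DIR = Path(__file__).resolve().parents[2] / "output-styles" / "examples" / "seeds"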

2. `few_shot_examples_training_data/retriever.py`

# Add configuration for example locations
from pathlib import Path  # if not already imported in retriever.py

class ExamplesConfig:
    EXAMPLES_ROOT = Path("output-styles/examples")
    SEEDS_DIR = EXAMPLES_ROOT / "seeds"
    STRUCTURED_DIR = EXAMPLES_ROOT / "by-scenario"
    EDGE_CASES_DIR = EXAMPLES_ROOT / "edge-cases"
    EMBEDDINGS_DIR = EXAMPLES_ROOT / "embeddings"
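
Because the paths in `ExamplesConfig` are relative, they resolve against the current working directory. One possible alternative is to pass the examples root in explicitly and check which directories exist before retrieval starts. The sketch below is an assumption for illustration only: `ExamplesPaths` and its `missing()` method are hypothetical names, not part of the current retriever.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExamplesPaths:
    """Hypothetical variant of ExamplesConfig that takes an explicit root."""

    root: Path

    @property
    def seeds(self) -> Path:
        return self.root / "seeds"

    @property
    def structured(self) -> Path:
        return self.root / "by-scenario"

    @property
    def edge_cases(self) -> Path:
        return self.root / "edge-cases"

    @property
    def embeddings(self) -> Path:
        return self.root / "embeddings"

    def missing(self) -> list[Path]:
        """Return the expected directories that do not exist yet."""
        expected = (self.seeds, self.structured, self.edge_cases, self.embeddings)
        return [p for p in expected if not p.is_dir()]


# Example: paths = ExamplesPaths(Path("output-styles/examples")); paths.missing()
```

An empty list from `missing()` indicates that all four directories from the proposed structure are in place.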

3. `scripts/seed.examples.py`

# Update to use new seeds location
JSONL_DIR = Path("output-styles/examples/seeds")
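
For reference, reading one category from the new seeds location could look like the sketch below. It assumes each line of a seed file holds one JSON object and that files are named `<category>.jsonl`; `iter_jsonl` and `load_category` are hypothetical helpers, and the actual seed script may read its files differently.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-blank line, naming the line on bad JSON."""
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc.msg})") from exc
            yield record


def load_category(seeds_dir: Path, category: str) -> list[dict]:
    """Load all records for one category, e.g. 'triage' -> seeds_dir/triage.jsonl."""
    return list(iter_jsonl(seeds_dir / f"{category}.jsonl"))


# Example: load_category(Path("output-styles/examples/seeds"), "triage")
```

Reporting the file and line number on a parse error helps locate a damaged record after the move.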

🎯 Benefits

✅ Unified Data Location

All examples in one place: output-styles/examples/
Clear separation: structured JSON vs JSONL seeds vs embeddings
Single source of truth for documentation

✅ Clearer Purpose

output-styles/examples/: All example data (JSON, JSONL, embeddings)
few_shot_examples_training_data/: Pure infrastructure (API, retrieval, caching)

✅ Better Organization

output-styles/examples/
├── seeds/              # Database seeding (JSONL)
├── by-scenario/        # Human-readable reference (JSON)
├── edge-cases/         # Edge case examples (JSON)
└── embeddings/         # Vector embeddings (NPY/JSON)

✅ Retrieval Flexibility

Structured JSON: Direct file reading for documentation/reference
JSONL → Database: Fast indexed queries via Prisma
Embeddings: Semantic similarity search
All three approaches work from same unified location
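
To make the embedding-based option concrete, semantic similarity search usually ranks stored vectors by cosine similarity to a query vector. The sketch below uses plain Python lists and hypothetical function names; it is not the implementation in `matcher.py`, and a real pipeline would more likely load the `.npy` files with NumPy.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Return the k (example_id, score) pairs most similar to the query."""
    scored = [(example_id, cosine(query, vec)) for example_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Here the dictionary keys stand in for example IDs, which in the proposed structure would presumably come from `embeddings/metadata.json`.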

🚨 Migration Risks & Mitigation

Risk 1: Broken Import Paths

Mitigation: Update all imports systematically, test each module
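
One way to make the update systematic is to scan the Python sources for leftover references to the old data path. This is a rough sketch with a hypothetical helper name; it only finds literal occurrences of the path string, so constructs such as `Path(__file__).parent / "data"` still need a manual check.

```python
from pathlib import Path

# Path fragment that should no longer appear once the migration is complete.
OLD_FRAGMENT = "few_shot_examples_training_data/data"


def find_stale_references(
    root: Path, fragment: str = OLD_FRAGMENT
) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line text) for each line containing fragment."""
    hits = []
    for py_file in sorted(root.rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if fragment in line:
                hits.append((py_file, lineno, line.strip()))
    return hits


# Example: for path, lineno, line in find_stale_references(Path(".")): print(path, lineno, line)
```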

Risk 2: Lost Data

Mitigation:

Back up before migration: cp -r database/few_shot_examples_training_data/data backup/
Verify that file counts match between the backup and output-styles/examples/seeds/
Run validation tests
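
The backup and verification steps can be combined so that the copy is checked file by file rather than by count alone. The sketch below is one possible approach under stated assumptions: `checksums` and `backup_and_verify` are hypothetical helpers, and SHA-256 is chosen here only as a convenient content check.

```python
import hashlib
import shutil
from pathlib import Path


def checksums(directory: Path) -> dict[str, str]:
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    return {
        p.relative_to(directory).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def backup_and_verify(src: Path, backup: Path) -> int:
    """Copy src to backup, confirm every file arrived unchanged, return the file count."""
    shutil.copytree(src, backup)  # raises FileExistsError if backup already exists
    before, after = checksums(src), checksums(backup)
    if before != after:
        raise RuntimeError(f"backup of {src} does not match the source")
    return len(before)


# Example: backup_and_verify(
#     Path("database/few_shot_examples_training_data/data"), Path("backup/data"))
```

The same `checksums` function can be run on the old data directory before the move and on `seeds/` afterwards to confirm nothing was lost.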

Risk 3: Database Seed Script Fails

Mitigation:

Update seed script path first
Test on single category
Run full seed after verification

✅ Validation Checklist

All JSONL files moved to output-styles/examples/seeds/
Infrastructure code updated with new paths
Seed script works with new location
API can retrieve examples from new location
Retriever can load from both JSON and JSONL
Tests pass
Documentation updated
Old data directory removed
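
The file-system items on this checklist can be checked automatically. The following sketch covers only those items (seed files present and non-empty, old directory gone); `check_migration` is a hypothetical helper, and the remaining items still rely on the test suite and manual review.

```python
from pathlib import Path


def check_migration(seeds_dir: Path, old_data_dir: Path) -> dict[str, bool]:
    """Evaluate the file-system checklist items; True means the item passes."""
    seed_files = sorted(seeds_dir.glob("*.jsonl")) if seeds_dir.is_dir() else []
    return {
        "JSONL files present in seeds/": bool(seed_files),
        "no seed file is empty": bool(seed_files)
        and all(p.stat().st_size > 0 for p in seed_files),
        "old data directory removed": not old_data_dir.exists(),
    }


# Example: check_migration(
#     Path("output-styles/examples/seeds"),
#     Path("database/few_shot_examples_training_data/data"),
# )
```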

📚 Post-Migration

Update Documentation

output-styles/examples/README.md - Master examples documentation
output-styles/examples/seeds/README.md - JSONL format guide
few_shot_examples_training_data/README.md - Update infrastructure docs

Test All Retrieval Methods

# Test 1: Direct JSON retrieval
import json
from pathlib import Path

with Path("output-styles/examples/by-scenario/triage.json").open() as f:
    examples = json.load(f)

# Test 2: JSONL database seeding (shell command, run from the repository root)
#   uv run python scripts/seed.examples.py --category triage

# Tests 3 and 4 await coroutines, so they run inside an async function
import asyncio
from database.few_shot_examples_training_data import FewShotExamplesAPI, SemanticMatcher

async def main():
    # Test 3: API retrieval
    api = FewShotExamplesAPI()
    examples = await api.get_examples_for_prompt("partnership inquiry", "triage")

    # Test 4: Embedding-based retrieval (future)
    matcher = SemanticMatcher()
    similar = await matcher.find_similar("partnership inquiry", top_k=5)

asyncio.run(main())

🎓 Summary

Before:

❌ Two separate data locations
❌ Unclear which is source of truth
❌ No embedding storage
❌ Infrastructure mixed with data

After:

✅ Unified data location: output-styles/examples/
✅ Clear separation: data vs infrastructure
✅ Multiple retrieval strategies: JSON, JSONL→DB, embeddings
✅ Clean, maintainable structure

Next Steps:

Execute Phase 1-5 migration
Run validation checklist
Update all documentation
Archive old structure
Implement embedding generation pipeline (future enhancement)
