Source: data_layer/docs/CLEANUP_PLAN.md
Data Cleanup & Optimization Plan
π― Objective
Optimize the data architecture for maximum clarity and performance while maintaining all valuable data.
π Current State Analysis
Seeds Directory
- β 240+ individual files - Well organized
- β Clear categories - leagues, questionnaires, schemas, samples, emails
- β
Active use - Referenced in code via
load_seed() - β Version controlled - Good for git tracking
Storage/Examples Directory
- β 11 JSONL files - Production-ready format
- β 349 total lines - Manageable size
- β οΈ
legacy_seeds.jsonl- 156 lines of migrated data - β Full retrieval stack - API, cache, matcher working
Schema/Examples Directory
- β
few_shot/subdirectory - Reserved for future use - β README added - Purpose documented
π§Ή Cleanup Tasks
Priority 1: Resolve Legacy Seeds (High Impact)
Issue: database/storage/examples/data/legacy_seeds.jsonl contains 156 migrated schema definitions
Options:
Option A: Archive (Recommended if no longer used)
# 1. Check if referenced in code
grep -r "legacy_seeds" database/ app/ --include="*.py"
# 2. If not used, archive
mkdir -p database/storage/examples/data/_archive
mv database/storage/examples/data/legacy_seeds.jsonl \
database/storage/examples/data/_archive/legacy_seeds.jsonl.bak
# 3. Document
echo "Archived 2025-01-14: Legacy seeds migrated, no longer in production use" > \
database/storage/examples/data/_archive/README.mdOption B: Integrate (Recommended if actively used)
# 1. Split into appropriate categories
python scripts/split_legacy_seeds.py
# 2. Merge with existing JSONL files
python scripts/merge_examples.py --source legacy_seeds.jsonl
# 3. Remove original after verification
rm database/storage/examples/data/legacy_seeds.jsonlOption C: Keep as Historical Reference
# Rename to make purpose clear
mv database/storage/examples/data/legacy_seeds.jsonl \
database/storage/examples/data/_historical_schemas.jsonl
# Update documentation
vim database/storage/examples/data/README.mdRecommendation: Choose Option A if not actively used, Option B if integrated into queries.
Priority 2: Resolve Duplication Audit Issues (Medium Impact)
From DUPLICATION_AUDIT_REPORT.md:
# 1. Fix exact duplicates
# Merge knowledge/ schemas into domain/v1/
python scripts/consolidate_schemas.py --dry-run
python scripts/consolidate_schemas.py --execute
# 2. Update references
# Update 16 agent prompts to reference new locations
python scripts/update_prompt_references.py
# 3. Clean up
rm -rf database/schemas/knowledge/schemas/definitions/Priority 3: Optimize JSONL Files (Low Impact, High Value)
Goal: Improve retrieval performance and quality
# 1. Audit example quality
python scripts/audit_example_quality.py
# 2. Identify low-quality examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score, usage_count
FROM \"FewShotExample\"
WHERE quality_score < 0.70
ORDER BY usage_count DESC;
"
# 3. Review and improve or remove
# Edit JSONL files, then reseed
# 4. Verify improvement
python scripts/test_retrieval_accuracy.pyPriority 4: Add Missing Documentation (Low Effort, High Clarity)
# 1. Create seed conversion guide
cp database/storage/examples/JSONL_MIGRATION.md \
database/SEED_TO_EXAMPLE_GUIDE.md
# 2. Add workflow diagrams
# (Already added in DATA_ARCHITECTURE_GUIDE.md)
# 3. Create troubleshooting guide
# (Already added in QUICK_REFERENCE.md)π Implementation Timeline
Week 1: Legacy Seeds Resolution
- Day 1: Audit
legacy_seeds.jsonlusage - Day 2: Execute chosen option (A, B, or C)
- Day 3: Verify no broken references
- Day 4: Update documentation
- Day 5: Deploy and monitor
Week 2: Duplication Cleanup
- Day 1: Test consolidation script
- Day 2: Execute schema consolidation
- Day 3: Update prompt references
- Day 4: Clean up old files
- Day 5: Verify and test
Week 3: Quality Optimization
- Day 1: Audit example quality
- Day 2-3: Improve low-quality examples
- Day 4: Reseed database
- Day 5: Run accuracy tests
Week 4: Documentation & Polish
- Day 1-2: Create remaining guides
- Day 3: Add workflow diagrams
- Day 4: Team review
- Day 5: Final cleanup
β Success Criteria
Technical Metrics
- No duplicate schema definitions
- All JSONL files have quality_score β₯ 0.80 average
- Retrieval accuracy β₯ 90%
- Query performance < 100ms (p95)
- Zero broken references in code
Organizational Metrics
- Clear documentation for all three layers
- Developer onboarding time reduced
- Zero confusion about which system to use
- All team members understand architecture
π¨ Risks & Mitigation
Risk 1: Breaking Production Queries
Mitigation:
- Run full test suite before/after
- Monitor production logs for errors
- Keep backups of all JSONL files
- Deploy during low-traffic window
Risk 2: Data Loss
Mitigation:
- Archive before deleting (don't
rm -f) - Keep git history
- Database backups before reseeding
- Dry-run all scripts first
Risk 3: Performance Regression
Mitigation:
- Benchmark before changes
- Monitor query performance
- A/B test if possible
- Rollback plan ready
π Rollback Plan
# If issues arise:
# 1. Restore JSONL files from git
git checkout HEAD -- database/storage/examples/data/
# 2. Reseed database
uv run python scripts/seed.examples.py --clear
# 3. Restart services
docker-compose restart backend
# 4. Verify
python scripts/smoke_test.pyπ Monitoring
During Cleanup
# Watch for errors
tail -f logs/backend.log | grep -i error
# Monitor query performance
psql $DATABASE_URL -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%FewShotExample%'
ORDER BY mean_exec_time DESC;
"Post-Cleanup
# Weekly quality check
python scripts/weekly_quality_check.py
# Monthly optimization
python scripts/optimize_examples.py --autoπ Lessons Learned (To Document)
After completion, document:
- What worked well
- What could be improved
- Unexpected issues encountered
- Time estimates vs actual
- Updates needed to this plan
π Related Documentation
DATA_ARCHITECTURE_GUIDE.md- Full architecture overviewQUICK_REFERENCE.md- Common commandsDUPLICATION_AUDIT_REPORT.md- Known duplication issuesstorage/MIGRATION_GUIDE.md- Historical migration context
Status: π Planning Phase
Owner: Database Team
Review Date: 2025-01-21
Target Completion: 2025-02-14
Last Updated: 2025-01-14