Data Cleanup & Optimization Plan

Source: data_layer/docs/CLEANUP_PLAN.md

🎯 Objective

Optimize the data architecture for maximum clarity and performance while maintaining all valuable data.

📊 Current State Analysis

Seeds Directory

  • ✅ 240+ individual files - Well organized
  • ✅ Clear categories - leagues, questionnaires, schemas, samples, emails
  • ✅ Active use - Referenced in code via load_seed()
  • ✅ Version controlled - Good for git tracking
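The `load_seed()` helper referenced above could look roughly like this. This is a minimal sketch, not the actual implementation: the `database/seeds/` root, the `(category, name)` signature, and the per-category `.json` layout are assumptions.

```python
import json
from pathlib import Path

SEEDS_ROOT = Path("database/seeds")  # assumed layout, e.g. database/seeds/leagues/*.json

def load_seed(category: str, name: str, root: Path = SEEDS_ROOT) -> dict:
    """Load one version-controlled seed file, e.g. load_seed("leagues", "demo")."""
    path = root / category / f"{name}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

Keeping each seed in its own small JSON file is what makes the directory git-friendly: diffs stay local to one record.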

Storage/Examples Directory

  • ✅ 11 JSONL files - Production-ready format
  • ✅ 349 total lines - Manageable size
  • ⚠️ legacy_seeds.jsonl - 156 lines of migrated data
  • ✅ Full retrieval stack - API, cache, matcher working

Schema/Examples Directory

  • ✅ few_shot/ subdirectory - Reserved for future use
  • ✅ README added - Purpose documented

🧹 Cleanup Tasks

Priority 1: Resolve Legacy Seeds (High Impact)

Issue: database/storage/examples/data/legacy_seeds.jsonl contains 156 migrated schema definitions

Options:

Option A: Archive (Recommended if no longer used)

```bash
# 1. Check if referenced in code
grep -r "legacy_seeds" database/ app/ --include="*.py"

# 2. If not used, archive
mkdir -p database/storage/examples/data/_archive
mv database/storage/examples/data/legacy_seeds.jsonl \
   database/storage/examples/data/_archive/legacy_seeds.jsonl.bak

# 3. Document
echo "Archived 2025-01-14: Legacy seeds migrated, no longer in production use" > \
   database/storage/examples/data/_archive/README.md
```

Option B: Integrate (Recommended if actively used)

```bash
# 1. Split into appropriate categories
python scripts/split_legacy_seeds.py

# 2. Merge with existing JSONL files
python scripts/merge_examples.py --source legacy_seeds.jsonl

# 3. Remove original after verification
rm database/storage/examples/data/legacy_seeds.jsonl
```
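The splitting step above likely reduces to grouping JSONL records by a category field. A sketch of that core logic, assuming each record carries a `category` key (the field name is an assumption, not confirmed by the file format):

```python
import json
from collections import defaultdict

def split_by_category(jsonl_lines, key="category", default="uncategorized"):
    """Group JSONL records into per-category buckets, one future file each."""
    buckets = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the source file
        record = json.loads(line)
        buckets[record.get(key, default)].append(record)
    return dict(buckets)
```

Each bucket would then be written out as its own JSONL file before the merge step runs.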

Option C: Keep as Historical Reference

```bash
# Rename to make purpose clear
mv database/storage/examples/data/legacy_seeds.jsonl \
   database/storage/examples/data/_historical_schemas.jsonl

# Update documentation
vim database/storage/examples/data/README.md
```

Recommendation: Choose Option A if the legacy seeds are no longer referenced anywhere, Option B if they are still integrated into production queries.

Priority 2: Resolve Duplication Audit Issues (Medium Impact)

From DUPLICATION_AUDIT_REPORT.md:

```bash
# 1. Fix exact duplicates
# Merge knowledge/ schemas into domain/v1/
python scripts/consolidate_schemas.py --dry-run
python scripts/consolidate_schemas.py --execute

# 2. Update references
# Update 16 agent prompts to reference new locations
python scripts/update_prompt_references.py

# 3. Clean up
rm -rf database/schemas/knowledge/schemas/definitions/
```
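A consolidation script like the one above typically starts by detecting byte-identical files before attempting any merge. One possible sketch using content hashes (not the actual `consolidate_schemas.py` logic):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(paths):
    """Return {sha256: [files]} for groups of byte-identical schema files."""
    by_hash = defaultdict(list)
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        by_hash[digest].append(str(p))
    return {h: files for h, files in by_hash.items() if len(files) > 1}
```

Exact-duplicate groups are safe to collapse mechanically; near-duplicates still need a human review pass.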

Priority 3: Optimize JSONL Files (Low Impact, High Value)

Goal: Improve retrieval performance and quality

```bash
# 1. Audit example quality
python scripts/audit_example_quality.py

# 2. Identify low-quality examples
psql $DATABASE_URL -c "
SELECT example_id, category, quality_score, usage_count
FROM \"FewShotExample\"
WHERE quality_score < 0.70
ORDER BY usage_count DESC;
"

# 3. Review and improve or remove
# Edit JSONL files, then reseed

# 4. Verify improvement
python scripts/test_retrieval_accuracy.py
```
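The audit in step 1 could be approximated directly against the JSONL files, flagging records below the same 0.70 threshold the SQL uses. A sketch only: the `quality_score` and `example_id` field names mirror the query above, but the actual `audit_example_quality.py` may differ.

```python
import json

def audit_quality(jsonl_lines, threshold=0.70):
    """Return (average score, ids below threshold) for JSONL example records."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    scores = [r["quality_score"] for r in records]
    average = sum(scores) / len(scores) if scores else 0.0
    low = [r["example_id"] for r in records if r["quality_score"] < threshold]
    return average, low
```

Running this per file also gives the per-file averages needed for the ≥ 0.80 success criterion below.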

Priority 4: Add Missing Documentation (Low Effort, High Clarity)

```bash
# 1. Create seed conversion guide
cp database/storage/examples/JSONL_MIGRATION.md \
   database/SEED_TO_EXAMPLE_GUIDE.md

# 2. Add workflow diagrams
# (Already added in DATA_ARCHITECTURE_GUIDE.md)

# 3. Create troubleshooting guide
# (Already added in QUICK_REFERENCE.md)
```

📅 Implementation Timeline

Week 1: Legacy Seeds Resolution

  • Day 1: Audit legacy_seeds.jsonl usage
  • Day 2: Execute chosen option (A, B, or C)
  • Day 3: Verify no broken references
  • Day 4: Update documentation
  • Day 5: Deploy and monitor

Week 2: Duplication Cleanup

  • Day 1: Test consolidation script
  • Day 2: Execute schema consolidation
  • Day 3: Update prompt references
  • Day 4: Clean up old files
  • Day 5: Verify and test

Week 3: Quality Optimization

  • Day 1: Audit example quality
  • Day 2-3: Improve low-quality examples
  • Day 4: Reseed database
  • Day 5: Run accuracy tests

Week 4: Documentation & Polish

  • Day 1-2: Create remaining guides
  • Day 3: Add workflow diagrams
  • Day 4: Team review
  • Day 5: Final cleanup

✅ Success Criteria

Technical Metrics

  • No duplicate schema definitions
  • All JSONL files have quality_score ≥ 0.80 average
  • Retrieval accuracy ≥ 90%
  • Query performance < 100ms (p95)
  • Zero broken references in code
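The "zero broken references" criterion can be checked mechanically rather than by eye, e.g. by scanning for `load_seed(...)` calls whose seed file no longer exists. A sketch under assumptions: the two-string call shape and the per-category `.json` layout are not confirmed by the source.

```python
import re
from pathlib import Path

# Matches load_seed("category", "name") with single or double quotes (assumed call shape)
SEED_CALL = re.compile(r'load_seed\(\s*["\'](\w+)["\']\s*,\s*["\'](\w+)["\']')

def find_broken_seed_refs(code_root: Path, seeds_root: Path):
    """List (file, category, name) for load_seed() calls with no backing seed file."""
    broken = []
    for py in code_root.rglob("*.py"):
        for category, name in SEED_CALL.findall(py.read_text(encoding="utf-8")):
            if not (seeds_root / category / f"{name}.json").exists():
                broken.append((str(py), category, name))
    return broken
```

A check like this could run in CI so the criterion stays enforced after the cleanup, not just verified once.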

Organizational Metrics

  • Clear documentation for all three layers
  • Developer onboarding time reduced
  • Zero confusion about which system to use
  • All team members understand architecture

🚨 Risks & Mitigation

Risk 1: Breaking Production Queries

Mitigation:

  • Run full test suite before/after
  • Monitor production logs for errors
  • Keep backups of all JSONL files
  • Deploy during low-traffic window

Risk 2: Data Loss

Mitigation:

  • Archive before deleting (don't rm -f)
  • Keep git history
  • Database backups before reseeding
  • Dry-run all scripts first

Risk 3: Performance Regression

Mitigation:

  • Benchmark before changes
  • Monitor query performance
  • A/B test if possible
  • Rollback plan ready

🔄 Rollback Plan

```bash
# If issues arise:

# 1. Restore JSONL files from git
git checkout HEAD -- database/storage/examples/data/

# 2. Reseed database
uv run python scripts/seed.examples.py --clear

# 3. Restart services
docker-compose restart backend

# 4. Verify
python scripts/smoke_test.py
```

📊 Monitoring

During Cleanup

```bash
# Watch for errors
tail -f logs/backend.log | grep -i error

# Monitor query performance
psql $DATABASE_URL -c "
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%FewShotExample%'
ORDER BY mean_exec_time DESC;
"
```

Post-Cleanup

```bash
# Weekly quality check
python scripts/weekly_quality_check.py

# Monthly optimization
python scripts/optimize_examples.py --auto
```

🎓 Lessons Learned (To Document)

After completion, document:

  1. What worked well
  2. What could be improved
  3. Unexpected issues encountered
  4. Time estimates vs actual
  5. Updates needed to this plan

📚 Related Documentation

  • DATA_ARCHITECTURE_GUIDE.md - Full architecture overview
  • QUICK_REFERENCE.md - Common commands
  • DUPLICATION_AUDIT_REPORT.md - Known duplication issues
  • storage/MIGRATION_GUIDE.md - Historical migration context

Status: 📋 Planning Phase
Owner: Database Team
Review Date: 2025-01-21
Target Completion: 2025-02-14

Last Updated: 2025-01-14
