LangMem Indexing & Semantic Search Setup

Source: data_layer/docs/LANGMEM_SETUP.md

Date: October 18, 2025
Purpose: Enable fast semantic search and natural language prompt retrieval


🎯 Overview

The LangMem indexing system creates semantic embeddings for all prompts, enabling:

  • ✅ Natural Language Queries: "Find prompts for league onboarding and database upsert"
  • ✅ Fast Retrieval: < 100ms response time for searches
  • ✅ Semantic Understanding: Matches intent, not just keywords
  • ✅ Confidence Filtering: Only retrieve high-quality prompts
  • ✅ Type Filtering: Search within specific prompt types

📋 Prerequisites

1. Install LangMem

pip install langmem

Or add to requirements.txt:

langmem>=0.1.0
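
A quick way to confirm the install succeeded before indexing (a minimal sketch using only the standard library; the version shown will vary):

from importlib.metadata import PackageNotFoundError, version
 
try:
    import langmem  # noqa: F401
    print(f"langmem {version('langmem')} is installed")
except (ImportError, PackageNotFoundError):
    print("langmem is not installed - run: pip install langmem")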

2. OpenAI API Key

LangMem uses OpenAI embeddings by default:

# Add to your .env file (or export in your shell)
export OPENAI_API_KEY="your-openai-api-key"
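
Since the embedding client reads this key from the environment, it helps to fail fast if it is missing (a minimal sketch; OPENAI_API_KEY is the standard variable name used above):

import os
 
# Fail early with a clear message if the embedding key is not configured
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set - add it to .env or export it before indexing")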

3. Complete Phase 1 & 2

Ensure you have (a quick pre-flight check is sketched after this list):

  • ✅ Prompt registry built (scan_prompts.py)
  • ✅ Enriched docs generated (generate_prompt_docs.py)
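
A pre-flight check for both artifacts (a sketch; the registry path matches the architecture diagram below, while the enriched docs location is an assumption - adjust to your layout):

from pathlib import Path
 
registry = Path("kb_catalog/manifests/prompt_registry.json")   # built by scan_prompts.py
docs_dir = Path("data_layer/prompts")                          # assumed location of enriched docs
 
if not registry.exists():
    raise SystemExit("Prompt registry missing - run scan_prompts.py first")
if not any(docs_dir.glob("*.md")):
    raise SystemExit("No prompt docs found - run generate_prompt_docs.py first")
print("Phase 1 & 2 artifacts found - ready to index")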

🚀 Quick Start

First Time Indexing

# Index all prompts in LangMem
python data_layer/scripts/index_prompts.py
 
# Output:
# πŸ” Indexing prompts in LangMem
#
# Total prompts: 116
# Namespace: altsports_prompts
#
# βœ… Indexed: specs.contracts.tier-1-partnership
# βœ… Indexed: workflows.email-to-contract-pipeline
# ... (116 prompts indexed)
#
# ============================================================
# INDEXING SUMMARY
# ============================================================
# βœ… Indexed: 116
# ⏭️  Skipped: 0
# ❌ Errors: 0
# πŸ“ Total: 116
# ============================================================

Search with Natural Language

# Search for prompts
python data_layer/scripts/index_prompts.py \
    --search "league questionnaire extraction and database upsert"
 
# Output:
# πŸ” Searching: 'league questionnaire extraction and database upsert'
#    Top K: 5
#
# Found 5 results:
#
# 1. Enhanced Data Processor
#    ID: altsportsdata.104-enhanced-data-processor
#    Type: agent
#    Tags: extraction, processing, data
#    Confidence: 70%
#    Relevance: 0.892
#    Description: Process emails with structured JSON/JSONL output...
#
# 2. Document Processor
#    ID: document-processor
#    Type: agent
#    Tags: extraction, ocr, questionnaire
#    Confidence: 70%
#    Relevance: 0.875
#    Description: League questionnaire processing with comprehensive...
#
# ... (3 more results)

Filter by Type

# Search only workflow prompts
python data_layer/scripts/index_prompts.py \
    --search "contract generation" \
    --type workflow
 
# Search only contract templates
python data_layer/scripts/index_prompts.py \
    --search "tier 1 premium partnership" \
    --type contract_template

Check Statistics

python data_layer/scripts/index_prompts.py --stats
 
# Output:
# 📊 Indexing Statistics
#
# Namespace: altsports_prompts
# Total Prompts: 116
# Indexed: 116
# Last Indexed: 2025-10-18T14:30:45.123456
# Version: 1.0.0

💻 Programmatic Usage

Basic Search

from data_layer.scripts.index_prompts import PromptRetriever
 
# Initialize retriever
retriever = PromptRetriever()
 
# Search with natural language
results = retriever.find_prompts_for_task(
    task_description="league onboarding and database upsert",
    top_k=5
)
 
# Process results
for result in results:
    print(f"Title: {result['title']}")
    print(f"ID: {result['id']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Relevance: {result['score']}")
    print()

Find Specific Types

# Find workflow prompts
workflows = retriever.find_workflow("email processing pipeline")
 
# Find contract templates
contracts = retriever.find_contract_template("tier 1 partnership")
 
# Find agent prompts
agents = retriever.find_agent_prompt("document extraction")
 
# Get specific prompt by ID
prompt = retriever.get_by_id("specs.contracts.tier-1-partnership")
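
Depending on the lookup implementation, an unknown ID may come back empty, so a defensive wrapper can help (a sketch that assumes get_by_id returns None when nothing matches):

# Defensive lookup (assumes get_by_id returns None for unknown IDs)
prompt = retriever.get_by_id("specs.contracts.tier-1-partnership")
 
if prompt is None:
    # Fall back to semantic search when the exact ID is not indexed
    fallback = retriever.find_prompts_for_task("tier 1 partnership contract", top_k=1)
    prompt = retriever.get_by_id(fallback[0]["id"]) if fallback else None
 
if prompt:
    print(f"Loaded: {prompt['title']}")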

Advanced Search with Filters

from data_layer.scripts.index_prompts import PromptIndexer
 
indexer = PromptIndexer()
 
# Search with filters
results = indexer.search(
    query="legal compliance and data privacy",
    top_k=5,
    filter_type="legal_template",
    min_confidence=0.70
)
 
for result in results:
    print(f"{result['title']} - Confidence: {result['confidence']*100:.0f}%")

🔧 Complete Workflow Examples

Use Case 1: League Onboarding & Database Upsert

from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    LeagueQuestionnaireSchema,
    TierClassificationSchema
)
 
# 1. Find relevant prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "league questionnaire extraction database upsert"
)
 
print(f"Found {len(prompts)} relevant prompts:")
for p in prompts[:3]:
    print(f"  β€’ {p['title']} (relevance: {p['score']:.3f})")
 
# 2. Use prompts in workflow
# Prompt 1: Extract questionnaire data
extraction_prompt = retriever.get_by_id(prompts[0]['id'])
 
# Prompt 2: Enrich with market data
enrichment_prompt = retriever.get_by_id(prompts[1]['id'])
 
# Prompt 3: Classify tier
classification_prompt = retriever.get_by_id(prompts[2]['id'])
 
# 3. Execute workflow (pseudo-code)
extracted_data = extract_questionnaire(
    prompt=extraction_prompt,
    input_pdf="./questionnaire.pdf"
)
 
# Validate with Pydantic
validated = LeagueQuestionnaireSchema(**extracted_data)
 
# Enrich data
enriched = enrich_league_data(
    prompt=enrichment_prompt,
    data=validated
)
 
# Classify tier
tier_result = classify_league(
    prompt=classification_prompt,
    data=enriched
)
tier = TierClassificationSchema(**tier_result)
 
# Upsert to database
db_result = upsert_to_firestore(
    league_data=enriched,
    tier=tier
)
 
print(f"βœ… League stored: {db_result.id}")
print(f"   Tier: {tier.tier} - {tier.tier_name}")

Use Case 2: Contract Generation

from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    ContractTermsSchema,
    NegotiationPackageSchema
)
 
# 1. Find contract generation prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "tier 1 premium partnership contract generation"
)
 
print(f"Found {len(prompts)} contract prompts:")
for p in prompts[:3]:
    print(f"  β€’ {p['title']} (relevance: {p['score']:.3f})")
 
# 2. Load league profile
league_profile = load_from_firestore(league_id="elite-soccer-league")
 
# 3. Generate contract terms
contract_terms = generate_contract_terms(
    prompt=retriever.get_by_id(prompts[0]['id']),
    league_profile=league_profile
)
 
# Validate
validated_terms = ContractTermsSchema(**contract_terms)
 
# 4. Create pricing variants
variants = create_pricing_variants(
    prompt=retriever.get_by_id(prompts[1]['id']),
    base_terms=validated_terms
)
 
# 5. Generate contract documents
package = generate_contract_documents(
    prompt=retriever.get_by_id(prompts[2]['id']),
    pricing_variants=variants,
    league_profile=league_profile
)
 
# Validate output
validated_package = NegotiationPackageSchema(**package)
 
print(f"βœ… Contracts generated: {validated_package.output_folder}")
print(f"   Files: {', '.join(validated_package.files_generated)}")
print(f"   Recommended: {validated_package.recommended_variant}")
print(f"   Quality Score: {validated_package.quality_score*100:.0f}%")

πŸ—οΈ Architecture

Embedding Storage

data_layer/storage/embeddings/
└── langmem_index/
    ├── index_metadata.json      # Indexing metadata
    └── [LangMem internal files]  # Vector database files

Index Metadata

{
  "namespace": "altsports_prompts",
  "indexed_at": "2025-10-18T14:30:45.123456",
  "total_prompts": 116,
  "indexed_count": 116,
  "skipped_count": 0,
  "error_count": 0,
  "index_version": "1.0.0"
}
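
The metadata file can double as a freshness check, for example to decide whether a re-index is due (a minimal sketch assuming the path shown under Embedding Storage):

import json
from datetime import datetime, timedelta
from pathlib import Path
 
META = Path("data_layer/storage/embeddings/langmem_index/index_metadata.json")
 
meta = json.loads(META.read_text())
age = datetime.now() - datetime.fromisoformat(meta["indexed_at"])
 
if meta["error_count"] or age > timedelta(days=7):
    print("Index is stale or had errors - re-run index_prompts.py")
else:
    print(f"Index is fresh ({meta['indexed_count']} prompts, {age.days} days old)")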

How It Works

┌──────────────────────────────────────────────────────────────┐
│ SOURCE (Prompt Files)                                        │
├──────────────────────────────────────────────────────────────┤
│ data_layer/prompts/*.md                                      │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [scan_prompts.py]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ REGISTRY (Metadata)                                          │
├──────────────────────────────────────────────────────────────┤
│ kb_catalog/manifests/prompt_registry.json                    │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [index_prompts.py]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ LANGMEM (Vector Database)                                    │
├──────────────────────────────────────────────────────────────┤
│ storage/embeddings/langmem_index/                            │
│                                                              │
│ For each prompt:                                             │
│ • Text: title + description + content + tags                │
│ • Vector: OpenAI embedding (1536 dimensions)                 │
│ • Metadata: type, tags, confidence, schemas, agents          │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [Natural Language Query]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ SEARCH RESULTS (Ranked by Relevance)                         │
├──────────────────────────────────────────────────────────────┤
│ 1. Enhanced Data Processor (score: 0.892)                    │
│ 2. Document Processor (score: 0.875)                         │
│ 3. Email Handler (score: 0.854)                              │
│ ... (top 5 results)                                          │
└──────────────────────────────────────────────────────────────┘
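
In other words, each registry entry is flattened into one searchable text plus a metadata payload before it is embedded. A sketch of that composition step (field names follow the diagram above; the actual index_prompts.py implementation may differ):

def build_index_record(entry: dict) -> dict:
    """Compose the text to embed and the metadata stored alongside the vector."""
    text = " ".join([
        entry.get("title", ""),
        entry.get("description", ""),
        entry.get("content", ""),
        " ".join(entry.get("tags", [])),
    ])
    metadata = {
        "id": entry["id"],
        "type": entry.get("type"),
        "tags": entry.get("tags", []),
        "confidence": entry.get("confidence"),
        "schemas": entry.get("schemas", []),
        "agents": entry.get("agents", []),
    }
    return {"text": text, "metadata": metadata}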

πŸ” Search Query Examples

Task-Based Queries

# League onboarding
"extract league questionnaire and upsert to database"
 
# Contract generation
"generate tier 1 partnership contract with pricing variants"
 
# Email processing
"classify incoming emails and route to appropriate workflow"
 
# Data validation
"validate league data quality and completeness"
 
# Market analysis
"analyze competitive market position and opportunities"

Feature-Based Queries

# By functionality
"document extraction with OCR"
"real-time data streaming"
"payment processing integration"
 
# By data type
"sports betting odds and markets"
"player statistics and rosters"
"event schedules and results"
 
# By regulation
"GDPR compliance and data privacy"
"sports betting licensing requirements"

Technology-Based Queries

"Firebase Firestore database operations"
"Google Cloud Vision API integration"
"OpenAI GPT-4 prompt templates"
"REST API endpoint specifications"

⚑ Performance

Search Performance

  • Query Time: < 100ms for semantic search
  • Indexing Time: ~2-3 minutes for 116 prompts
  • Embedding Dimensions: 1536 (OpenAI text-embedding-3-small)
  • Storage Size: ~5MB for embeddings

Optimization Tips

# Cache frequently used searches
from functools import lru_cache
 
@lru_cache(maxsize=100)
def cached_search(query: str, top_k: int = 5):
    retriever = PromptRetriever()
    return retriever.find_prompts_for_task(query, top_k)
 
# Use appropriate top_k
results = retriever.find_prompts_for_task(
    "league onboarding",
    top_k=3  # Only get what you need
)
 
# Filter by confidence early
results = indexer.search(
    query="contract generation",
    top_k=5,
    min_confidence=0.85  # Only high-quality prompts
)
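
Note that lru_cache only helps when the same (query, top_k) arguments repeat exactly. Continuing the cached_search example above, a quick way to confirm the cache is being hit:

# Two identical calls: the second is served from the cache
cached_search("league onboarding", top_k=3)
cached_search("league onboarding", top_k=3)
print(cached_search.cache_info())  # expect hits=1, misses=1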

🛠️ Maintenance

Re-indexing

Re-index when prompts are added or updated:

# Rebuild registry first
python data_layer/scripts/scan_prompts.py
 
# Regenerate docs
python data_layer/scripts/generate_prompt_docs.py
 
# Re-index (only new/changed prompts)
python data_layer/scripts/index_prompts.py
 
# Force re-index all
python data_layer/scripts/index_prompts.py --force

Monitoring

# Check index health
python data_layer/scripts/index_prompts.py --stats
 
# Test search quality
python data_layer/scripts/index_prompts.py \
    --search "test query" \
    --top-k 5

🎯 Integration with Phases 1-3

Phase 1: Registry

LangMem uses the registry as the source of truth:

  • ✅ Prompt metadata (tags, type, confidence)
  • ✅ Schema requirements
  • ✅ Agent suggestions
  • ✅ Version tracking

Phase 2: Documentation

Enriched docs provide better search context:

  • ✅ Full template content
  • ✅ Schema examples
  • ✅ Usage instructions
  • ✅ Agent descriptions

Phase 3: Google Drive

Search results can link to Drive docs:

result = retriever.find_prompts_for_task("league onboarding")
prompt = retriever.get_by_id(result[0]['id'])
 
if prompt.get('drive_id'):
    drive_url = f"https://drive.google.com/file/d/{prompt['drive_id']}"
    print(f"View in Drive: {drive_url}")

📊 Demo Workflows

See demo_prompt_workflows.py for complete demonstrations:

# Run full demonstration
python data_layer/scripts/demo_prompt_workflows.py
 
# Output shows:
# ✅ System statistics
# ✅ League onboarding workflow (3-5 prompts)
# ✅ Contract generation workflow (3-5 prompts)
# ✅ Natural language search examples
# ✅ Complete execution code samples

✅ Verification

Verify the system is working:

# 1. Check indexing
python data_layer/scripts/index_prompts.py --stats
 
# 2. Test search
python data_layer/scripts/index_prompts.py \
    --search "league questionnaire"
 
# 3. Run demo
python data_layer/scripts/demo_prompt_workflows.py
 
# Expected results:
# ✅ 116 prompts indexed
# ✅ Search returns relevant results
# ✅ Both workflows generate complete execution plans

🚀 What's Next?

With Phase 4 complete, move on to Phase 5: the Enhanced Prompt Builder.

This will integrate:

  • Registry-based lookup
  • LangMem semantic search
  • Dynamic composition
  • Performance tracking
  • Confidence scoring

Phase 4 Status: ✅ COMPLETE
Date: October 18, 2025
Next Phase: Enhanced Prompt Builder (Phase 5)
Overall Progress: 80% (4/5 phases complete)
