LangMem Indexing & Semantic Search Setup

Source: data_layer/docs/LANGMEM_SETUP.md

Date: October 18, 2025
Purpose: Enable fast semantic search and natural language prompt retrieval


🎯 Overview

The LangMem indexing system creates semantic embeddings for all prompts, enabling:

  • ✅ Natural Language Queries: "Find prompts for league onboarding and database upsert"
  • ✅ Fast Retrieval: < 100ms response time for searches
  • ✅ Semantic Understanding: Matches intent, not just keywords
  • ✅ Confidence Filtering: Only retrieve high-quality prompts
  • ✅ Type Filtering: Search within specific prompt types

📋 Prerequisites

1. Install LangMem

pip install langmem

Or add to requirements.txt:

langmem>=0.1.0
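
A quick way to confirm the install succeeded before indexing (a minimal sketch using only the standard library; the version shown will vary):

from importlib.metadata import PackageNotFoundError, version
 
try:
    import langmem  # noqa: F401
    print(f"langmem {version('langmem')} is installed")
except (ImportError, PackageNotFoundError):
    print("langmem is not installed - run: pip install langmem")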

2. OpenAI API Key

LangMem uses OpenAI embeddings by default:

# Add to your .env file (or export in your shell)
export OPENAI_API_KEY="your-openai-api-key"
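
Since the embedding client reads this key from the environment, it helps to fail fast if it is missing (a minimal sketch; OPENAI_API_KEY is the standard variable name used above):

import os
 
# Fail early with a clear message if the embedding key is not configured
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set - add it to .env or export it before indexing")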

3. Complete Phase 1 & 2

Ensure you have (a quick pre-flight check is sketched after this list):

  • ✅ Prompt registry built (scan_prompts.py)
  • ✅ Enriched docs generated (generate_prompt_docs.py)
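
A pre-flight check for both artifacts (a sketch; the registry path matches the architecture diagram below, while the enriched docs location is an assumption - adjust to your layout):

from pathlib import Path
 
registry = Path("kb_catalog/manifests/prompt_registry.json")   # built by scan_prompts.py
docs_dir = Path("data_layer/prompts")                          # assumed location of enriched docs
 
if not registry.exists():
    raise SystemExit("Prompt registry missing - run scan_prompts.py first")
if not any(docs_dir.glob("*.md")):
    raise SystemExit("No prompt docs found - run generate_prompt_docs.py first")
print("Phase 1 & 2 artifacts found - ready to index")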

🚀 Quick Start

First Time Indexing

# Index all prompts in LangMem
python data_layer/scripts/index_prompts.py
 
# Output:
# πŸ” Indexing prompts in LangMem
#
# Total prompts: 116
# Namespace: altsports_prompts
#
# βœ… Indexed: specs.contracts.tier-1-partnership
# βœ… Indexed: workflows.email-to-contract-pipeline
# ... (116 prompts indexed)
#
# ============================================================
# INDEXING SUMMARY
# ============================================================
# βœ… Indexed: 116
# ⏭️  Skipped: 0
# ❌ Errors: 0
# πŸ“ Total: 116
# ============================================================

Search with Natural Language

# Search for prompts
python data_layer/scripts/index_prompts.py \
    --search "league questionnaire extraction and database upsert"
 
# Output:
# πŸ” Searching: 'league questionnaire extraction and database upsert'
#    Top K: 5
#
# Found 5 results:
#
# 1. Enhanced Data Processor
#    ID: altsportsdata.104-enhanced-data-processor
#    Type: agent
#    Tags: extraction, processing, data
#    Confidence: 70%
#    Relevance: 0.892
#    Description: Process emails with structured JSON/JSONL output...
#
# 2. Document Processor
#    ID: document-processor
#    Type: agent
#    Tags: extraction, ocr, questionnaire
#    Confidence: 70%
#    Relevance: 0.875
#    Description: League questionnaire processing with comprehensive...
#
# ... (3 more results)

Filter by Type

# Search only workflow prompts
python data_layer/scripts/index_prompts.py \
    --search "contract generation" \
    --type workflow
 
# Search only contract templates
python data_layer/scripts/index_prompts.py \
    --search "tier 1 premium partnership" \
    --type contract_template

Check Statistics

python data_layer/scripts/index_prompts.py --stats
 
# Output:
# 📊 Indexing Statistics
#
# Namespace: altsports_prompts
# Total Prompts: 116
# Indexed: 116
# Last Indexed: 2025-10-18T14:30:45.123456
# Version: 1.0.0

💻 Programmatic Usage

Basic Search

from data_layer.scripts.index_prompts import PromptRetriever
 
# Initialize retriever
retriever = PromptRetriever()
 
# Search with natural language
results = retriever.find_prompts_for_task(
    task_description="league onboarding and database upsert",
    top_k=5
)
 
# Process results
for result in results:
    print(f"Title: {result['title']}")
    print(f"ID: {result['id']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Relevance: {result['score']}")
    print()

Find Specific Types

# Find workflow prompts
workflows = retriever.find_workflow("email processing pipeline")
 
# Find contract templates
contracts = retriever.find_contract_template("tier 1 partnership")
 
# Find agent prompts
agents = retriever.find_agent_prompt("document extraction")
 
# Get specific prompt by ID
prompt = retriever.get_by_id("specs.contracts.tier-1-partnership")
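
Depending on the lookup implementation, an unknown ID may come back empty, so a defensive wrapper can help (a sketch that assumes get_by_id returns None when nothing matches):

# Defensive lookup (assumes get_by_id returns None for unknown IDs)
prompt = retriever.get_by_id("specs.contracts.tier-1-partnership")
 
if prompt is None:
    # Fall back to semantic search when the exact ID is not indexed
    fallback = retriever.find_prompts_for_task("tier 1 partnership contract", top_k=1)
    prompt = retriever.get_by_id(fallback[0]["id"]) if fallback else None
 
if prompt:
    print(f"Loaded: {prompt['title']}")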

Advanced Search with Filters

from data_layer.scripts.index_prompts import PromptIndexer
 
indexer = PromptIndexer()
 
# Search with filters
results = indexer.search(
    query="legal compliance and data privacy",
    top_k=5,
    filter_type="legal_template",
    min_confidence=0.70
)
 
for result in results:
    print(f"{result['title']} - Confidence: {result['confidence']*100:.0f}%")

🔧 Complete Workflow Examples

Use Case 1: League Onboarding & Database Upsert

from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    LeagueQuestionnaireSchema,
    TierClassificationSchema
)
 
# 1. Find relevant prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "league questionnaire extraction database upsert"
)
 
print(f"Found {len(prompts)} relevant prompts:")
for p in prompts[:3]:
    print(f"  β€’ {p['title']} (relevance: {p['score']:.3f})")
 
# 2. Use prompts in workflow
# Prompt 1: Extract questionnaire data
extraction_prompt = retriever.get_by_id(prompts[0]['id'])
 
# Prompt 2: Enrich with market data
enrichment_prompt = retriever.get_by_id(prompts[1]['id'])
 
# Prompt 3: Classify tier
classification_prompt = retriever.get_by_id(prompts[2]['id'])
 
# 3. Execute workflow (pseudo-code)
extracted_data = extract_questionnaire(
    prompt=extraction_prompt,
    input_pdf="./questionnaire.pdf"
)
 
# Validate with Pydantic
validated = LeagueQuestionnaireSchema(**extracted_data)
 
# Enrich data
enriched = enrich_league_data(
    prompt=enrichment_prompt,
    data=validated
)
 
# Classify tier
tier_result = classify_league(
    prompt=classification_prompt,
    data=enriched
)
tier = TierClassificationSchema(**tier_result)
 
# Upsert to database
db_result = upsert_to_firestore(
    league_data=enriched,
    tier=tier
)
 
print(f"βœ… League stored: {db_result.id}")
print(f"   Tier: {tier.tier} - {tier.tier_name}")

Use Case 2: Contract Generation

from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    ContractTermsSchema,
    NegotiationPackageSchema
)
 
# 1. Find contract generation prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "tier 1 premium partnership contract generation"
)
 
print(f"Found {len(prompts)} contract prompts:")
for p in prompts[:3]:
    print(f"  β€’ {p['title']} (relevance: {p['score']:.3f})")
 
# 2. Load league profile
league_profile = load_from_firestore(league_id="elite-soccer-league")
 
# 3. Generate contract terms
contract_terms = generate_contract_terms(
    prompt=retriever.get_by_id(prompts[0]['id']),
    league_profile=league_profile
)
 
# Validate
validated_terms = ContractTermsSchema(**contract_terms)
 
# 4. Create pricing variants
variants = create_pricing_variants(
    prompt=retriever.get_by_id(prompts[1]['id']),
    base_terms=validated_terms
)
 
# 5. Generate contract documents
package = generate_contract_documents(
    prompt=retriever.get_by_id(prompts[2]['id']),
    pricing_variants=variants,
    league_profile=league_profile
)
 
# Validate output
validated_package = NegotiationPackageSchema(**package)
 
print(f"βœ… Contracts generated: {validated_package.output_folder}")
print(f"   Files: {', '.join(validated_package.files_generated)}")
print(f"   Recommended: {validated_package.recommended_variant}")
print(f"   Quality Score: {validated_package.quality_score*100:.0f}%")

πŸ—οΈ Architecture

Embedding Storage

data_layer/storage/embeddings/
└── langmem_index/
    ├── index_metadata.json      # Indexing metadata
    └── [LangMem internal files]  # Vector database files

Index Metadata

{
  "namespace": "altsports_prompts",
  "indexed_at": "2025-10-18T14:30:45.123456",
  "total_prompts": 116,
  "indexed_count": 116,
  "skipped_count": 0,
  "error_count": 0,
  "index_version": "1.0.0"
}
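
The metadata file can double as a freshness check, for example to decide whether a re-index is due (a minimal sketch assuming the path shown under Embedding Storage):

import json
from datetime import datetime, timedelta
from pathlib import Path
 
META = Path("data_layer/storage/embeddings/langmem_index/index_metadata.json")
 
meta = json.loads(META.read_text())
age = datetime.now() - datetime.fromisoformat(meta["indexed_at"])
 
if meta["error_count"] or age > timedelta(days=7):
    print("Index is stale or had errors - re-run index_prompts.py")
else:
    print(f"Index is fresh ({meta['indexed_count']} prompts, {age.days} days old)")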

How It Works

┌──────────────────────────────────────────────────────────────┐
│ SOURCE (Prompt Files)                                        │
├──────────────────────────────────────────────────────────────┤
│ data_layer/prompts/*.md                                      │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [scan_prompts.py]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ REGISTRY (Metadata)                                          │
├──────────────────────────────────────────────────────────────┤
│ kb_catalog/manifests/prompt_registry.json                    │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [index_prompts.py]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ LANGMEM (Vector Database)                                    │
├──────────────────────────────────────────────────────────────┤
│ storage/embeddings/langmem_index/                            │
│                                                              │
│ For each prompt:                                             │
│ • Text: title + description + content + tags                │
│ • Vector: OpenAI embedding (1536 dimensions)                 │
│ • Metadata: type, tags, confidence, schemas, agents          │
└──────────────────────────────────────────────────────────────┘
                            ↓
                    [Natural Language Query]
                            ↓
┌──────────────────────────────────────────────────────────────┐
│ SEARCH RESULTS (Ranked by Relevance)                         │
├──────────────────────────────────────────────────────────────┤
│ 1. Enhanced Data Processor (score: 0.892)                    │
│ 2. Document Processor (score: 0.875)                         │
│ 3. Email Handler (score: 0.854)                              │
│ ... (top 5 results)                                          │
└──────────────────────────────────────────────────────────────┘
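
In other words, each registry entry is flattened into one searchable text plus a metadata payload before it is embedded. A sketch of that composition step (field names follow the diagram above; the actual index_prompts.py implementation may differ):

def build_index_record(entry: dict) -> dict:
    """Compose the text to embed and the metadata stored alongside the vector."""
    text = " ".join([
        entry.get("title", ""),
        entry.get("description", ""),
        entry.get("content", ""),
        " ".join(entry.get("tags", [])),
    ])
    metadata = {
        "id": entry["id"],
        "type": entry.get("type"),
        "tags": entry.get("tags", []),
        "confidence": entry.get("confidence"),
        "schemas": entry.get("schemas", []),
        "agents": entry.get("agents", []),
    }
    return {"text": text, "metadata": metadata}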

πŸ” Search Query Examples

Task-Based Queries

# League onboarding
"extract league questionnaire and upsert to database"
 
# Contract generation
"generate tier 1 partnership contract with pricing variants"
 
# Email processing
"classify incoming emails and route to appropriate workflow"
 
# Data validation
"validate league data quality and completeness"
 
# Market analysis
"analyze competitive market position and opportunities"

Feature-Based Queries

# By functionality
"document extraction with OCR"
"real-time data streaming"
"payment processing integration"
 
# By data type
"sports betting odds and markets"
"player statistics and rosters"
"event schedules and results"
 
# By regulation
"GDPR compliance and data privacy"
"sports betting licensing requirements"

Technology-Based Queries

"Firebase Firestore database operations"
"Google Cloud Vision API integration"
"OpenAI GPT-4 prompt templates"
"REST API endpoint specifications"

⚑ Performance

Search Performance

  • Query Time: < 100ms for semantic search
  • Indexing Time: ~2-3 minutes for 116 prompts
  • Embedding Dimensions: 1536 (OpenAI text-embedding-3-small)
  • Storage Size: ~5MB for embeddings

Optimization Tips

# Cache frequently used searches
from functools import lru_cache
 
@lru_cache(maxsize=100)
def cached_search(query: str, top_k: int = 5):
    retriever = PromptRetriever()
    return retriever.find_prompts_for_task(query, top_k)
 
# Use appropriate top_k
results = retriever.find_prompts_for_task(
    "league onboarding",
    top_k=3  # Only get what you need
)
 
# Filter by confidence early
results = indexer.search(
    query="contract generation",
    top_k=5,
    min_confidence=0.85  # Only high-quality prompts
)
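
Note that lru_cache only helps when the same (query, top_k) arguments repeat exactly. Continuing the cached_search example above, a quick way to confirm the cache is being hit:

# Two identical calls: the second is served from the cache
cached_search("league onboarding", top_k=3)
cached_search("league onboarding", top_k=3)
print(cached_search.cache_info())  # expect hits=1, misses=1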

🛠️ Maintenance

Re-indexing

Re-index when prompts are added or updated:

# Rebuild registry first
python data_layer/scripts/scan_prompts.py
 
# Regenerate docs
python data_layer/scripts/generate_prompt_docs.py
 
# Re-index (only new/changed prompts)
python data_layer/scripts/index_prompts.py
 
# Force re-index all
python data_layer/scripts/index_prompts.py --force

Monitoring

# Check index health
python data_layer/scripts/index_prompts.py --stats
 
# Test search quality
python data_layer/scripts/index_prompts.py \
    --search "test query" \
    --top-k 5

🎯 Integration with Phases 1-3

Phase 1: Registry

LangMem uses the registry as the source of truth:

  • ✅ Prompt metadata (tags, type, confidence)
  • ✅ Schema requirements
  • ✅ Agent suggestions
  • ✅ Version tracking

Phase 2: Documentation

Enriched docs provide better search context:

  • ✅ Full template content
  • ✅ Schema examples
  • ✅ Usage instructions
  • ✅ Agent descriptions

Phase 3: Google Drive

Search results can link to Drive docs:

result = retriever.find_prompts_for_task("league onboarding")
prompt = retriever.get_by_id(result[0]['id'])
 
if prompt.get('drive_id'):
    drive_url = f"https://drive.google.com/file/d/{prompt['drive_id']}"
    print(f"View in Drive: {drive_url}")

📊 Demo Workflows

See demo_prompt_workflows.py for complete demonstrations:

# Run full demonstration
python data_layer/scripts/demo_prompt_workflows.py
 
# Output shows:
# ✅ System statistics
# ✅ League onboarding workflow (3-5 prompts)
# ✅ Contract generation workflow (3-5 prompts)
# ✅ Natural language search examples
# ✅ Complete execution code samples

✅ Verification

Verify the system is working:

# 1. Check indexing
python data_layer/scripts/index_prompts.py --stats
 
# 2. Test search
python data_layer/scripts/index_prompts.py \
    --search "league questionnaire"
 
# 3. Run demo
python data_layer/scripts/demo_prompt_workflows.py
 
# Expected results:
# ✅ 116 prompts indexed
# ✅ Search returns relevant results
# ✅ Both workflows generate complete execution plans

🚀 What's Next?

With Phase 4 complete, move on to Phase 5: the Enhanced Prompt Builder.

This will integrate:

  • Registry-based lookup
  • LangMem semantic search
  • Dynamic composition
  • Performance tracking
  • Confidence scoring

Phase 4 Status: ✅ COMPLETE
Date: October 18, 2025
Next Phase: Enhanced Prompt Builder (Phase 5)
Overall Progress: 80% (4/5 phases complete)
