Source: data_layer/docs/LANGMEM_SETUP.md

# LangMem Indexing & Semantic Search Setup

**Date:** October 18, 2025
**Purpose:** Enable fast semantic search and natural language prompt retrieval
## Overview

The LangMem indexing system creates semantic embeddings for all prompts, enabling:

- ✅ **Natural Language Queries**: "Find prompts for league onboarding and database upsert"
- ✅ **Fast Retrieval**: < 100 ms response time for searches
- ✅ **Semantic Understanding**: Matches intent, not just keywords
- ✅ **Confidence Filtering**: Only retrieve high-quality prompts
- ✅ **Type Filtering**: Search within specific prompt types
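Confidence and type filtering operate on plain result metadata, before or after the vector search itself. A toy sketch of the filtering semantics — the `filter_results` helper and the sample data below are illustrative, not part of the project:

```python
def filter_results(results, min_confidence=0.70, prompt_type=None):
    """Keep results at or above min_confidence, optionally restricted to one type."""
    return [
        r for r in results
        if r["confidence"] >= min_confidence
        and (prompt_type is None or r["type"] == prompt_type)
    ]

# Hypothetical search results, shaped like the examples later in this guide
results = [
    {"id": "document-processor", "type": "agent", "confidence": 0.70},
    {"id": "tier-1-partnership", "type": "contract_template", "confidence": 0.95},
    {"id": "draft-notes", "type": "agent", "confidence": 0.40},
]

print(filter_results(results, min_confidence=0.60, prompt_type="agent"))
# keeps only the "document-processor" entry
```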
## Prerequisites

### 1. Install LangMem

```bash
pip install langmem
```

Or add to `requirements.txt`:

```
langmem>=0.1.0
```

### 2. OpenAI API Key

LangMem uses OpenAI embeddings by default:

```bash
# Add to .env file
export OPENAI_API_KEY="your-openai-api-key"
```

### 3. Complete Phases 1 & 2

Ensure you have:

- ✅ Prompt registry built (`scan_prompts.py`)
- ✅ Enriched docs generated (`generate_prompt_docs.py`)
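The prerequisites above can be sanity-checked before indexing. A small sketch — the `missing_prerequisites` helper is hypothetical, though the registry path mirrors the one used later in this guide:

```python
import os
from pathlib import Path

def missing_prerequisites(env=os.environ,
                          registry="kb_catalog/manifests/prompt_registry.json"):
    """Return a list of human-readable problems; an empty list means ready to index."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (needed for embeddings)")
    if not Path(registry).exists():
        problems.append(f"{registry} not found (run scan_prompts.py first)")
    return problems

for problem in missing_prerequisites():
    print(f"✗ {problem}")
```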
## Quick Start

### First Time Indexing

```bash
# Index all prompts in LangMem
python data_layer/scripts/index_prompts.py

# Output:
# Indexing prompts in LangMem
#
# Total prompts: 116
# Namespace: altsports_prompts
#
# ✅ Indexed: specs.contracts.tier-1-partnership
# ✅ Indexed: workflows.email-to-contract-pipeline
# ... (116 prompts indexed)
#
# ============================================================
# INDEXING SUMMARY
# ============================================================
# Indexed: 116
# Skipped: 0
# Errors:  0
# Total:   116
# ============================================================
```

### Search with Natural Language
```bash
# Search for prompts
python data_layer/scripts/index_prompts.py \
  --search "league questionnaire extraction and database upsert"

# Output:
# Searching: 'league questionnaire extraction and database upsert'
# Top K: 5
#
# Found 5 results:
#
# 1. Enhanced Data Processor
#    ID: altsportsdata.104-enhanced-data-processor
#    Type: agent
#    Tags: extraction, processing, data
#    Confidence: 70%
#    Relevance: 0.892
#    Description: Process emails with structured JSON/JSONL output...
#
# 2. Document Processor
#    ID: document-processor
#    Type: agent
#    Tags: extraction, ocr, questionnaire
#    Confidence: 70%
#    Relevance: 0.875
#    Description: League questionnaire processing with comprehensive...
#
# ... (3 more results)
```

### Filter by Type
```bash
# Search only workflow prompts
python data_layer/scripts/index_prompts.py \
  --search "contract generation" \
  --type workflow

# Search only contract templates
python data_layer/scripts/index_prompts.py \
  --search "tier 1 premium partnership" \
  --type contract_template
```

### Check Statistics
```bash
python data_layer/scripts/index_prompts.py --stats

# Output:
# Indexing Statistics
#
# Namespace: altsports_prompts
# Total Prompts: 116
# Indexed: 116
# Last Indexed: 2025-10-18T14:30:45.123456
# Version: 1.0.0
```

## Programmatic Usage
### Basic Search

```python
from data_layer.scripts.index_prompts import PromptRetriever

# Initialize retriever
retriever = PromptRetriever()

# Search with natural language
results = retriever.find_prompts_for_task(
    task_description="league onboarding and database upsert",
    top_k=5
)

# Process results
for result in results:
    print(f"Title: {result['title']}")
    print(f"ID: {result['id']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Relevance: {result['score']}")
    print()
```

### Find Specific Types
```python
# Find workflow prompts
workflows = retriever.find_workflow("email processing pipeline")

# Find contract templates
contracts = retriever.find_contract_template("tier 1 partnership")

# Find agent prompts
agents = retriever.find_agent_prompt("document extraction")

# Get a specific prompt by ID
prompt = retriever.get_by_id("specs.contracts.tier-1-partnership")
```

### Advanced Search with Filters
```python
from data_layer.scripts.index_prompts import PromptIndexer

indexer = PromptIndexer()

# Search with filters
results = indexer.search(
    query="legal compliance and data privacy",
    top_k=5,
    filter_type="legal_template",
    min_confidence=0.70
)

for result in results:
    print(f"{result['title']} - Confidence: {result['confidence']*100:.0f}%")
```

## Complete Workflow Examples
### Use Case 1: League Onboarding & Database Upsert

```python
from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    LeagueQuestionnaireSchema,
    TierClassificationSchema
)

# 1. Find relevant prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "league questionnaire extraction database upsert"
)

print(f"Found {len(prompts)} relevant prompts:")
for p in prompts[:3]:
    print(f"  • {p['title']} (relevance: {p['score']:.3f})")

# 2. Use prompts in workflow
# Prompt 1: Extract questionnaire data
extraction_prompt = retriever.get_by_id(prompts[0]['id'])

# Prompt 2: Enrich with market data
enrichment_prompt = retriever.get_by_id(prompts[1]['id'])

# Prompt 3: Classify tier
classification_prompt = retriever.get_by_id(prompts[2]['id'])

# 3. Execute workflow (pseudo-code)
extracted_data = extract_questionnaire(
    prompt=extraction_prompt,
    input_pdf="./questionnaire.pdf"
)

# Validate with Pydantic
validated = LeagueQuestionnaireSchema(**extracted_data)

# Enrich data
enriched = enrich_league_data(
    prompt=enrichment_prompt,
    data=validated
)

# Classify tier
tier_result = classify_league(
    prompt=classification_prompt,
    data=enriched
)
tier = TierClassificationSchema(**tier_result)

# Upsert to database
db_result = upsert_to_firestore(
    league_data=enriched,
    tier=tier
)

print(f"✅ League stored: {db_result.id}")
print(f"   Tier: {tier.tier} - {tier.tier_name}")
```

### Use Case 2: Contract Generation
```python
from data_layer.scripts.index_prompts import PromptRetriever
from data_layer.scripts.generate_adapters import (
    ContractTermsSchema,
    NegotiationPackageSchema
)

# 1. Find contract generation prompts
retriever = PromptRetriever()
prompts = retriever.find_prompts_for_task(
    "tier 1 premium partnership contract generation"
)

print(f"Found {len(prompts)} contract prompts:")
for p in prompts[:3]:
    print(f"  • {p['title']} (relevance: {p['score']:.3f})")

# 2. Load league profile
league_profile = load_from_firestore(league_id="elite-soccer-league")

# 3. Generate contract terms
contract_terms = generate_contract_terms(
    prompt=retriever.get_by_id(prompts[0]['id']),
    league_profile=league_profile
)

# Validate
validated_terms = ContractTermsSchema(**contract_terms)

# 4. Create pricing variants
variants = create_pricing_variants(
    prompt=retriever.get_by_id(prompts[1]['id']),
    base_terms=validated_terms
)

# 5. Generate contract documents
package = generate_contract_documents(
    prompt=retriever.get_by_id(prompts[2]['id']),
    pricing_variants=variants,
    league_profile=league_profile
)

# Validate output
validated_package = NegotiationPackageSchema(**package)

print(f"✅ Contracts generated: {validated_package.output_folder}")
print(f"   Files: {', '.join(validated_package.files_generated)}")
print(f"   Recommended: {validated_package.recommended_variant}")
print(f"   Quality Score: {validated_package.quality_score*100:.0f}%")
```

## Architecture
### Embedding Storage

```
data_layer/storage/embeddings/
└── langmem_index/
    ├── index_metadata.json        # Indexing metadata
    └── [LangMem internal files]   # Vector database files
```

### Index Metadata

```json
{
  "namespace": "altsports_prompts",
  "indexed_at": "2025-10-18T14:30:45.123456",
  "total_prompts": 116,
  "indexed_count": 116,
  "skipped_count": 0,
  "error_count": 0,
  "index_version": "1.0.0"
}
```

### How It Works
```
┌─────────────────────────────────────────────────────────┐
│ SOURCE (Prompt Files)                                   │
├─────────────────────────────────────────────────────────┤
│ data_layer/prompts/*.md                                 │
└─────────────────────────────────────────────────────────┘
                            ↓
                    [scan_prompts.py]
                            ↓
┌─────────────────────────────────────────────────────────┐
│ REGISTRY (Metadata)                                     │
├─────────────────────────────────────────────────────────┤
│ kb_catalog/manifests/prompt_registry.json               │
└─────────────────────────────────────────────────────────┘
                            ↓
                    [index_prompts.py]
                            ↓
┌─────────────────────────────────────────────────────────┐
│ LANGMEM (Vector Database)                               │
├─────────────────────────────────────────────────────────┤
│ storage/embeddings/langmem_index/                       │
│                                                         │
│ For each prompt:                                        │
│  • Text: title + description + content + tags           │
│  • Vector: OpenAI embedding (1536 dimensions)           │
│  • Metadata: type, tags, confidence, schemas, agents    │
└─────────────────────────────────────────────────────────┘
                            ↓
                 [Natural Language Query]
                            ↓
┌─────────────────────────────────────────────────────────┐
│ SEARCH RESULTS (Ranked by Relevance)                    │
├─────────────────────────────────────────────────────────┤
│ 1. Enhanced Data Processor (score: 0.892)               │
│ 2. Document Processor (score: 0.875)                    │
│ 3. Email Handler (score: 0.854)                         │
│ ... (top 5 results)                                     │
└─────────────────────────────────────────────────────────┘
```

## Search Query Examples
### Task-Based Queries

```
# League onboarding
"extract league questionnaire and upsert to database"

# Contract generation
"generate tier 1 partnership contract with pricing variants"

# Email processing
"classify incoming emails and route to appropriate workflow"

# Data validation
"validate league data quality and completeness"

# Market analysis
"analyze competitive market position and opportunities"
```

### Feature-Based Queries
```
# By functionality
"document extraction with OCR"
"real-time data streaming"
"payment processing integration"

# By data type
"sports betting odds and markets"
"player statistics and rosters"
"event schedules and results"

# By regulation
"GDPR compliance and data privacy"
"sports betting licensing requirements"
```

### Technology-Based Queries
"Firebase Firestore database operations"
"Google Cloud Vision API integration"
"OpenAI GPT-4 prompt templates"
"REST API endpoint specifications"β‘ Performance
### Search Performance

- **Query Time**: < 100 ms for semantic search
- **Indexing Time**: ~2-3 minutes for 116 prompts
- **Embedding Dimensions**: 1536 (OpenAI text-embedding-3-small)
- **Storage Size**: ~5 MB for embeddings
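The relevance scores shown throughout this guide (e.g. 0.892) are typically cosine similarities between the query embedding and each prompt embedding. A self-contained sketch of the ranking step, using made-up 3-dimensional vectors in place of real 1536-dimensional OpenAI embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d embeddings; real ones come from the embedding model
prompt_vectors = {
    "enhanced-data-processor": [0.9, 0.1, 0.1],
    "document-processor": [0.8, 0.3, 0.1],
    "email-handler": [0.2, 0.9, 0.2],
}
query_vector = [1.0, 0.2, 0.1]

# Rank prompts by similarity to the query, highest first
ranked = sorted(
    prompt_vectors.items(),
    key=lambda item: cosine(query_vector, item[1]),
    reverse=True,
)
for name, vec in ranked:
    print(f"{name}: {cosine(query_vector, vec):.3f}")
```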
### Optimization Tips

```python
# Cache frequently used searches
from functools import lru_cache

retriever = PromptRetriever()  # reuse one retriever instance across calls

@lru_cache(maxsize=100)
def cached_search(query: str, top_k: int = 5):
    return retriever.find_prompts_for_task(query, top_k)

# Use an appropriate top_k
results = retriever.find_prompts_for_task(
    "league onboarding",
    top_k=3  # only fetch what you need
)

# Filter by confidence early
# (indexer as created in "Advanced Search with Filters" above)
results = indexer.search(
    query="contract generation",
    top_k=5,
    min_confidence=0.85  # only high-quality prompts
)
```

## Maintenance
### Re-indexing

Re-index when prompts are added or updated:

```bash
# Rebuild registry first
python data_layer/scripts/scan_prompts.py

# Regenerate docs
python data_layer/scripts/generate_prompt_docs.py

# Re-index (only new/changed prompts)
python data_layer/scripts/index_prompts.py

# Force re-index of all prompts
python data_layer/scripts/index_prompts.py --force
```

### Monitoring
```bash
# Check index health
python data_layer/scripts/index_prompts.py --stats

# Test search quality
python data_layer/scripts/index_prompts.py \
  --search "test query" \
  --top-k 5
```

## Integration with Phases 1-3
### Phase 1: Registry

LangMem uses the registry as the source of truth for:

- ✅ Prompt metadata (tags, type, confidence)
- ✅ Schema requirements
- ✅ Agent suggestions
- ✅ Version tracking
### Phase 2: Documentation

Enriched docs provide better search context:

- ✅ Full template content
- ✅ Schema examples
- ✅ Usage instructions
- ✅ Agent descriptions
### Phase 3: Google Drive

Search results can link to Drive docs:

```python
result = retriever.find_prompts_for_task("league onboarding")
prompt = retriever.get_by_id(result[0]['id'])

if prompt.get('drive_id'):
    drive_url = f"https://drive.google.com/file/d/{prompt['drive_id']}"
    print(f"View in Drive: {drive_url}")
```

## Demo Workflows
See `demo_prompt_workflows.py` for complete demonstrations:

```bash
# Run the full demonstration
python data_layer/scripts/demo_prompt_workflows.py

# Output shows:
# ✅ System statistics
# ✅ League onboarding workflow (3-5 prompts)
# ✅ Contract generation workflow (3-5 prompts)
# ✅ Natural language search examples
# ✅ Complete execution code samples
```

## Verification
Verify the system is working:

```bash
# 1. Check indexing
python data_layer/scripts/index_prompts.py --stats

# 2. Test search
python data_layer/scripts/index_prompts.py \
  --search "league questionnaire"

# 3. Run demo
python data_layer/scripts/demo_prompt_workflows.py

# Expected results:
# ✅ 116 prompts indexed
# ✅ Search returns relevant results
# ✅ Both workflows generate complete execution plans
```

## What's Next?
With Phase 4 complete, move to Phase 5: Enhanced Prompt Builder. This will integrate:

- Registry-based lookup
- LangMem semantic search
- Dynamic composition
- Performance tracking
- Confidence scoring

**Phase 4 Status:** ✅ COMPLETE
**Date:** October 18, 2025
**Next Phase:** Enhanced Prompt Builder (Phase 5)
**Overall Progress:** 80% (4/5 phases complete)