Source: data_layer/docs/MULTI_DATABASE_ARCHITECTURE.md

Multi-Database Architecture Guide

🎯 Overview

This system implements a unified seeding and querying architecture across three databases, each serving a specific purpose while maintaining data consistency.

JSONL Seeds (Source of Truth)
    ↓
PostgreSQL (Prisma) ← Fast queries, transactions, relational data
    ↓
Neo4j              ← Graph relationships, archetype navigation
    ↓
Pinecone           ← Vector embeddings, semantic search

🏗️ Architecture Principles

1. Single Source of Truth: JSONL Files

Location: database/seeds/ and database/seed.examples-kb/data/

Why JSONL?

✅ Version control friendly (line-by-line diffs)
✅ Easy to edit and curate
✅ Fast sequential processing
✅ Append-only for new examples

Example:

{"id": "racing_001", "league_name": "Formula 1", "sport_name": "Formula 1", "sport_archetype": "racing", "location": "Global"}
{"id": "combat_001", "league_name": "UFC", "sport_name": "MMA", "sport_archetype": "combat", "location": "USA"}

2. PostgreSQL (via Prisma): Primary Database

Purpose: Fast queries, ACID transactions, structured data

Enhanced Schema (schemas/prisma/schema.v2.enhanced.prisma):

model League {
  id String @id @default(uuid())
  league_tag String @unique
  league_name String
 
  // NEW: Archetype integration
  sport_archetype String?  // racing, combat, team_sport, precision, large_field
 
  // NEW: Vector embedding
  embedding_vector Json?   // Stored as JSON array for Pinecone sync
  embedding_model  String?
  embedding_updated_at DateTime?
 
  // NEW: Neo4j sync tracking
  neo4j_node_id    String?
  neo4j_synced_at  DateTime?
  neo4j_sync_status String? @default("pending")
 
  // NEW: Pinecone sync tracking
  pinecone_id       String?
  pinecone_namespace String?
  pinecone_synced_at DateTime?
  pinecone_sync_status String? @default("pending")
 
  // Relationships
  LeagueRelationships LeagueRelationship[] @relation("SourceLeague")
  RelatedLeagues LeagueRelationship[] @relation("TargetLeague")
}
 
// Tracks graph relationships from Neo4j
model LeagueRelationship {
  source_league_id  String
  target_league_id  String
  relationship_type String  // SIMILAR_TO, COMPETES_WITH, SAME_ARCHETYPE
  strength_score    Decimal?
  neo4j_rel_id      String?
 
  SourceLeague League @relation("SourceLeague", ...)
  TargetLeague League @relation("TargetLeague", ...)
}
 
// Sport archetypes as data
model SportArchetype {
  archetype_name    String @unique
  display_name      String
  code_pattern      String  // RaceEvent, FightEvent, TeamMatch
  characteristics   Json    // Betting patterns, data needs
  embedding_vector  Json?
  neo4j_node_id     String?
}

Querying:

from prisma import Prisma
 
db = Prisma()
await db.connect()
 
# Find leagues synced to all databases
leagues = await db.league.find_many(
    where={
        "neo4j_sync_status": "synced",
        "pinecone_sync_status": "synced",
        "sport_archetype": "combat"
    }
)

3. Neo4j: Graph Relationships

Purpose: Navigate relationships, discover patterns, archetype-based queries

Schema (schemas/models/neo4j/init_db.cql):

// Create constraints
CREATE CONSTRAINT league_name_unique IF NOT EXISTS
FOR (l:League) REQUIRE l.league_name IS UNIQUE;
 
CREATE CONSTRAINT archetype_name_unique IF NOT EXISTS
FOR (a:SportArchetype) REQUIRE a.name IS UNIQUE;
 
// Archetypes as first-class nodes
MERGE (racing:SportArchetype {
  name: "racing",
  display_name: "Racing & Speed Sports",
  code_pattern: "RaceEvent"
})
 
MERGE (combat:SportArchetype {
  name: "combat",
  display_name: "Combat Sports",
  code_pattern: "FightEvent"
})
 
// Leagues with archetype relationships
MERGE (f1:League {
  league_tag: "formula_1",
  league_name: "Formula 1",
  tier: "TIER1"
})
 
MERGE (ufc:League {
  league_tag: "ufc",
  league_name: "UFC",
  tier: "TIER1"
})
 
// Relationships enable powerful graph queries
MERGE (f1)-[:BELONGS_TO_ARCHETYPE]->(racing)
MERGE (ufc)-[:BELONGS_TO_ARCHETYPE]->(combat)
 
// Similarity relationships (calculated via embeddings)
MERGE (league1:League)-[:SIMILAR_TO {score: 0.85}]->(league2:League)

Graph Queries:

// Find all combat sports leagues
MATCH (l:League)-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype {name: "combat"})
RETURN l
 
// Find similar leagues within same archetype
MATCH (l:League {league_tag: "ufc"})-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
WHERE similar.league_tag <> "ufc"
RETURN similar
ORDER BY similar.tier
 
// Discover cross-archetype patterns
MATCH (l1:League)-[:SIMILAR_TO]->(l2:League)
WHERE l1.sport_archetype <> l2.sport_archetype
RETURN l1, l2, l1.sport_archetype, l2.sport_archetype

4. Pinecone: Vector Search

Purpose: Semantic similarity, "find leagues like this" queries

Index Structure:

# Three separate indexes for different entity types
indexes = {
    "altsports-archetypes": {
        "dimension": 1536,  # text-embedding-3-small
        "metric": "cosine",
        "namespace": None   # Single namespace
    },
    "altsports-leagues": {
        "dimension": 1536,
        "metric": "cosine",
        "namespace": "default"
    },
    "altsports-examples": {
        "dimension": 1536,
        "metric": "cosine",
        "namespace": "category"  # Namespaced by category (triage, contract_generation, etc.)
    }
}

Vector Search:

from pinecone import Pinecone
 
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
index = pc.Index("altsports-leagues")
 
# Semantic search: "Find leagues similar to Formula 1"
query_embedding = openai.embeddings.create(
    model="text-embedding-3-small",
    input="Formula 1 racing high-speed global competition"
).data[0].embedding
 
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    filter={"status": "active"}
)
 
# Results contain similar leagues with metadata
for match in results['matches']:
    print(f"{match['metadata']['league_name']}: {match['score']}")

🔄 Unified Seeding Workflow

Quick Start

# Install dependencies
uv add prisma neo4j pinecone-client openai
 
# Generate Prisma client
cd database/schemas/prisma
uv run prisma generate --schema=schema.v2.enhanced.prisma
 
# Seed everything
uv run python database/scripts/seed_unified_multi_db.py --all
 
# Or seed selectively
uv run python database/scripts/seed_unified_multi_db.py --archetypes
uv run python database/scripts/seed_unified_multi_db.py --leagues --limit 10
uv run python database/scripts/seed_unified_multi_db.py --examples --category triage

What Happens During Seeding

For each entity (League, FewShotExample, SportArchetype):

Step 1: Generate Embedding (if enabled)

embedding_text = create_league_embedding_text(league)
embedding = openai.embeddings.create(
    model="text-embedding-3-small",
    input=embedding_text
).data[0].embedding

Step 2: Upsert to PostgreSQL

league = await prisma.league.upsert(
    where={"league_tag": league_tag},
    create={
        "league_name": "Formula 1",
        "sport_archetype": "racing",
        "embedding_vector": embedding,
        "embedding_model": "text-embedding-3-small",
        # ... other fields
    },
    update={...}
)

Step 3: Sync to Neo4j (if connected)

with neo4j_driver.session() as session:
    # Create node
    result = session.run("""
        MERGE (l:League {league_tag: $league_tag})
        SET l.league_name = $league_name,
            l.sport_archetype = $archetype
        RETURN elementId(l) as node_id
    """, league_tag=league.league_tag, ...)
 
    # Create archetype relationship
    session.run("""
        MATCH (l:League {league_tag: $league_tag})
        MATCH (a:SportArchetype {name: $archetype})
        MERGE (l)-[:BELONGS_TO_ARCHETYPE]->(a)
    """)
 
    # Update PostgreSQL with Neo4j node ID
    await prisma.league.update(
        where={"id": league.id},
        data={
            "neo4j_node_id": result.single()["node_id"],
            "neo4j_synced_at": datetime.utcnow(),
            "neo4j_sync_status": "synced"
        }
    )

Step 4: Sync to Pinecone (if connected)

index = pinecone_client.Index("altsports-leagues")
 
index.upsert([(
    league.league_tag,              # ID
    embedding,                      # Vector
    {                               # Metadata
        "league_name": league.league_name,
        "sport_archetype": league.sport_archetype,
        "status": league.status
    }
)])
 
await prisma.league.update(
    where={"id": league.id},
    data={
        "pinecone_id": league.league_tag,
        "pinecone_synced_at": datetime.utcnow(),
        "pinecone_sync_status": "synced"
    }
)

🎯 Query Patterns

Pattern 1: PostgreSQL → Neo4j → Pinecone

Use Case: Find similar leagues to UFC using graph structure

# 1. Get league from PostgreSQL
ufc = await db.league.find_unique(where={"league_tag": "ufc"})
 
# 2. Get graph relationships from Neo4j
with neo4j_driver.session() as session:
    result = session.run("""
        MATCH (ufc:League {league_tag: 'ufc'})-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
        MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
        WHERE similar.league_tag <> 'ufc'
        RETURN similar.league_tag as tag, similar.league_name as name
        LIMIT 10
    """)
 
    related_tags = [record["tag"] for record in result]
 
# 3. Get semantic similarity from Pinecone
query_vector = ufc.embedding_vector  # Already have it!
pinecone_results = index.query(
    vector=query_vector,
    top_k=20,
    include_metadata=True
)
 
# 4. Combine results
combined_leagues = set(related_tags) | {m['id'] for m in pinecone_results['matches']}

Pattern 2: Pinecone → PostgreSQL

Use Case: Semantic search with structured filtering

# 1. Vector search in Pinecone
query_text = "High-speed racing competitions with global viewership"
query_embedding = generate_embedding(query_text)
 
matches = index.query(
    vector=query_embedding,
    top_k=50,
    filter={"status": "active"}  # Pinecone metadata filter
)
 
# 2. Get full data from PostgreSQL
league_tags = [m['id'] for m in matches['matches']]
leagues = await db.league.find_many(
    where={"league_tag": {"in": league_tags}},
    include={"Documents": True, "Contracts": True}
)

Pattern 3: Neo4j → PostgreSQL

Use Case: Archetype-based discovery with full metadata

# 1. Graph query for pattern discovery
with neo4j_driver.session() as session:
    result = session.run("""
        MATCH (l:League)-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
        WHERE l.tier = 'TIER1'
        WITH a, count(l) as league_count
        RETURN a.name, league_count
        ORDER BY league_count DESC
    """)
 
    top_archetypes = [record["a.name"] for record in result]
 
# 2. Get leagues from PostgreSQL
for archetype in top_archetypes:
    leagues = await db.league.find_many(
        where={"sport_archetype": archetype, "status": "active"}
    )
    print(f"{archetype}: {len(leagues)} leagues")

📊 When to Use Each Database

Use Case	Database	Why?
Get league by ID	PostgreSQL	Fast indexed lookup
List all combat sports	PostgreSQL or Neo4j	Both work; PostgreSQL for metadata, Neo4j for relationships
"Find leagues like Formula 1"	Pinecone → PostgreSQL	Semantic similarity
Discover archetype patterns	Neo4j	Graph traversal
Filter by tier + contract status	PostgreSQL	Complex filtering
"Which archetypes have most TIER1 leagues?"	Neo4j	Aggregation over graph
"Similar contract examples for soccer/premium"	Pinecone (namespace=contract_generation)	Semantic search with filters
Track sync operations	PostgreSQL (SyncOperation model)	Audit trail

🔧 Environment Setup

Required .env variables:

# PostgreSQL (Required)
DATABASE_URL="postgresql://user:password@localhost:5432/altsports?schema=public"
 
# Neo4j (Optional but recommended)
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your_password"
 
# Pinecone (Optional but recommended)
PINECONE_API_KEY="your_pinecone_key"
 
# OpenAI for embeddings (Required if using Pinecone)
OPENAI_API_KEY="your_openai_key"

🚀 Advanced Features

Feature 1: Automatic Relationship Detection

After seeding, calculate similarity relationships:

# Calculate embedding similarity and create Neo4j relationships
for league1 in leagues:
    # Query Pinecone for similar leagues
    similar = index.query(
        vector=league1.embedding_vector,
        top_k=10,
        filter={"sport_archetype": league1.sport_archetype}
    )
 
    # Create SIMILAR_TO relationships in Neo4j
    for match in similar['matches']:
        if match['score'] > 0.80:  # High similarity threshold
            with neo4j_driver.session() as session:
                session.run("""
                    MATCH (l1:League {league_tag: $tag1})
                    MATCH (l2:League {league_tag: $tag2})
                    MERGE (l1)-[:SIMILAR_TO {score: $score}]->(l2)
                """, tag1=league1.league_tag, tag2=match['id'], score=match['score'])
 
            # Track in PostgreSQL
            await db.leaguerelationship.create({
                "source_league_id": league1.id,
                "target_league_id": match['id'],
                "relationship_type": "SIMILAR_TO",
                "strength_score": Decimal(str(match['score']))
            })

Feature 2: Sync Status Dashboard

# Get sync health across databases
sync_status = await db.league.group_by(
    by=["neo4j_sync_status", "pinecone_sync_status"],
    count=True
)
 
print("Sync Status:")
for status in sync_status:
    print(f"  Neo4j: {status['neo4j_sync_status']}, "
          f"Pinecone: {status['pinecone_sync_status']}, "
          f"Count: {status['_count']}")

Feature 3: Embedding Updates

# Re-generate embeddings when league data changes
league = await db.league.update(
    where={"league_tag": "formula_1"},
    data={"league_name": "Formula 1 World Championship"}
)
 
# Trigger embedding update
embedding_text = create_league_embedding_text(league)
new_embedding = generate_embedding(embedding_text)
 
# Update all databases
await db.league.update(
    where={"id": league.id},
    data={
        "embedding_vector": new_embedding,
        "embedding_updated_at": datetime.utcnow(),
        "pinecone_sync_status": "pending"  # Mark for re-sync
    }
)

📚 Best Practices

✅ DO

Edit JSONL files as source of truth, then re-seed
Use PostgreSQL for primary queries (it's the fastest)
Use Neo4j for relationship discovery and graph patterns
Use Pinecone for semantic search and similarity
Track sync status in PostgreSQL for debugging
Generate embeddings for all user-facing entities

❌ DON'T

Don't edit databases directly - edit JSONL → re-seed
Don't over-use graph queries - PostgreSQL is faster for simple lookups
Don't skip embedding generation - Pinecone requires it
Don't create duplicate data - each database syncs from PostgreSQL
Don't ignore sync errors - check SyncOperation table

🎓 Examples

Example 1: Seed 5 Leagues with Full Multi-DB Sync

uv run python database/scripts/seed_unified_multi_db.py --leagues --limit 5

Output:

🔌 Connecting to databases...
  ✅ PostgreSQL (Prisma)
  ✅ Neo4j
  ✅ Pinecone
  ✅ OpenAI (embeddings)

🏟️  Seeding Leagues...
📁 Found 12 league seed files

  ✨ Created: formula_1
    ✅ Synced to Neo4j
    ✅ Synced to Pinecone
  ✨ Created: ufc
    ✅ Synced to Neo4j
    ✅ Synced to Pinecone
...

📊 MULTI-DATABASE SEEDING SUMMARY
==================================================
Total Processed: 5

PostgreSQL (Prisma):
  ✨ Created: 5
  ✏️  Updated: 0
  ❌ Errors:  0

Neo4j:
  ✅ Synced: 5
  ❌ Errors: 0

Pinecone:
  ✅ Synced: 5
  ❌ Errors: 0
==================================================

Example 2: Semantic Search + Graph Traversal

# Find leagues semantically similar to "elite combat sports with global reach"
query_embedding = generate_embedding("elite combat sports with global reach")
 
# Search Pinecone
pinecone_matches = index.query(vector=query_embedding, top_k=10)
 
# Get full data from PostgreSQL
league_tags = [m['id'] for m in pinecone_matches['matches']]
leagues = await db.league.find_many(
    where={"league_tag": {"in": league_tags}}
)
 
# For each, get Neo4j relationships
for league in leagues:
    with neo4j_driver.session() as session:
        archetype_leagues = session.run("""
            MATCH (l:League {league_tag: $tag})-[:BELONGS_TO_ARCHETYPE]->(a)
            MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
            RETURN similar.league_name as name
            LIMIT 5
        """, tag=league.league_tag)
 
        print(f"{league.league_name}:")
        print(f"  Archetype: {league.sport_archetype}")
        print(f"  Similar: {[r['name'] for r in archetype_leagues]}")

Next Steps:

Run seed_unified_multi_db.py --archetypes to set up sport archetypes
Run seed_unified_multi_db.py --leagues --limit 10 to test league seeding
Run seed_unified_multi_db.py --examples --category triage to seed examples
Build queries using the patterns above!

This architecture gives you the best of all three databases while maintaining a single source of truth in JSONL files. 🚀

🌐 Data Layer Overview Database Architecture Strategy