Source: data_layer/docs/MULTI_DATABASE_ARCHITECTURE.md
Multi-Database Architecture Guide
π― Overview
This system implements a unified seeding and querying architecture across three databases, each serving a specific purpose while maintaining data consistency.
JSONL Seeds (Source of Truth)
β
PostgreSQL (Prisma) β Fast queries, transactions, relational data
β
Neo4j β Graph relationships, archetype navigation
β
Pinecone β Vector embeddings, semantic searchποΈ Architecture Principles
1. Single Source of Truth: JSONL Files
Location: database/seeds/ and database/seed.examples-kb/data/
Why JSONL?
- β Version control friendly (line-by-line diffs)
- β Easy to edit and curate
- β Fast sequential processing
- β Append-only for new examples
Example:
{"id": "racing_001", "league_name": "Formula 1", "sport_name": "Formula 1", "sport_archetype": "racing", "location": "Global"}
{"id": "combat_001", "league_name": "UFC", "sport_name": "MMA", "sport_archetype": "combat", "location": "USA"}2. PostgreSQL (via Prisma): Primary Database
Purpose: Fast queries, ACID transactions, structured data
Enhanced Schema (schemas/prisma/schema.v2.enhanced.prisma):
model League {
id String @id @default(uuid())
league_tag String @unique
league_name String
// NEW: Archetype integration
sport_archetype String? // racing, combat, team_sport, precision, large_field
// NEW: Vector embedding
embedding_vector Json? // Stored as JSON array for Pinecone sync
embedding_model String?
embedding_updated_at DateTime?
// NEW: Neo4j sync tracking
neo4j_node_id String?
neo4j_synced_at DateTime?
neo4j_sync_status String? @default("pending")
// NEW: Pinecone sync tracking
pinecone_id String?
pinecone_namespace String?
pinecone_synced_at DateTime?
pinecone_sync_status String? @default("pending")
// Relationships
LeagueRelationships LeagueRelationship[] @relation("SourceLeague")
RelatedLeagues LeagueRelationship[] @relation("TargetLeague")
}
// Tracks graph relationships from Neo4j
model LeagueRelationship {
source_league_id String
target_league_id String
relationship_type String // SIMILAR_TO, COMPETES_WITH, SAME_ARCHETYPE
strength_score Decimal?
neo4j_rel_id String?
SourceLeague League @relation("SourceLeague", ...)
TargetLeague League @relation("TargetLeague", ...)
}
// Sport archetypes as data
model SportArchetype {
archetype_name String @unique
display_name String
code_pattern String // RaceEvent, FightEvent, TeamMatch
characteristics Json // Betting patterns, data needs
embedding_vector Json?
neo4j_node_id String?
}Querying:
from prisma import Prisma
db = Prisma()
await db.connect()
# Find leagues synced to all databases
leagues = await db.league.find_many(
where={
"neo4j_sync_status": "synced",
"pinecone_sync_status": "synced",
"sport_archetype": "combat"
}
)3. Neo4j: Graph Relationships
Purpose: Navigate relationships, discover patterns, archetype-based queries
Schema (schemas/models/neo4j/init_db.cql):
// Create constraints
CREATE CONSTRAINT league_name_unique IF NOT EXISTS
FOR (l:League) REQUIRE l.league_name IS UNIQUE;
CREATE CONSTRAINT archetype_name_unique IF NOT EXISTS
FOR (a:SportArchetype) REQUIRE a.name IS UNIQUE;
// Archetypes as first-class nodes
MERGE (racing:SportArchetype {
name: "racing",
display_name: "Racing & Speed Sports",
code_pattern: "RaceEvent"
})
MERGE (combat:SportArchetype {
name: "combat",
display_name: "Combat Sports",
code_pattern: "FightEvent"
})
// Leagues with archetype relationships
MERGE (f1:League {
league_tag: "formula_1",
league_name: "Formula 1",
tier: "TIER1"
})
MERGE (ufc:League {
league_tag: "ufc",
league_name: "UFC",
tier: "TIER1"
})
// Relationships enable powerful graph queries
MERGE (f1)-[:BELONGS_TO_ARCHETYPE]->(racing)
MERGE (ufc)-[:BELONGS_TO_ARCHETYPE]->(combat)
// Similarity relationships (calculated via embeddings)
MERGE (league1:League)-[:SIMILAR_TO {score: 0.85}]->(league2:League)Graph Queries:
// Find all combat sports leagues
MATCH (l:League)-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype {name: "combat"})
RETURN l
// Find similar leagues within same archetype
MATCH (l:League {league_tag: "ufc"})-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
WHERE similar.league_tag <> "ufc"
RETURN similar
ORDER BY similar.tier
// Discover cross-archetype patterns
MATCH (l1:League)-[:SIMILAR_TO]->(l2:League)
WHERE l1.sport_archetype <> l2.sport_archetype
RETURN l1, l2, l1.sport_archetype, l2.sport_archetype4. Pinecone: Vector Search
Purpose: Semantic similarity, "find leagues like this" queries
Index Structure:
# Three separate indexes for different entity types
indexes = {
"altsports-archetypes": {
"dimension": 1536, # text-embedding-3-small
"metric": "cosine",
"namespace": None # Single namespace
},
"altsports-leagues": {
"dimension": 1536,
"metric": "cosine",
"namespace": "default"
},
"altsports-examples": {
"dimension": 1536,
"metric": "cosine",
"namespace": "category" # Namespaced by category (triage, contract_generation, etc.)
}
}Vector Search:
from pinecone import Pinecone
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
index = pc.Index("altsports-leagues")
# Semantic search: "Find leagues similar to Formula 1"
query_embedding = openai.embeddings.create(
model="text-embedding-3-small",
input="Formula 1 racing high-speed global competition"
).data[0].embedding
results = index.query(
vector=query_embedding,
top_k=10,
include_metadata=True,
filter={"status": "active"}
)
# Results contain similar leagues with metadata
for match in results['matches']:
print(f"{match['metadata']['league_name']}: {match['score']}")π Unified Seeding Workflow
Quick Start
# Install dependencies
uv add prisma neo4j pinecone-client openai
# Generate Prisma client
cd database/schemas/prisma
uv run prisma generate --schema=schema.v2.enhanced.prisma
# Seed everything
uv run python database/scripts/seed_unified_multi_db.py --all
# Or seed selectively
uv run python database/scripts/seed_unified_multi_db.py --archetypes
uv run python database/scripts/seed_unified_multi_db.py --leagues --limit 10
uv run python database/scripts/seed_unified_multi_db.py --examples --category triageWhat Happens During Seeding
For each entity (League, FewShotExample, SportArchetype):
Step 1: Generate Embedding (if enabled)
embedding_text = create_league_embedding_text(league)
embedding = openai.embeddings.create(
model="text-embedding-3-small",
input=embedding_text
).data[0].embeddingStep 2: Upsert to PostgreSQL
league = await prisma.league.upsert(
where={"league_tag": league_tag},
create={
"league_name": "Formula 1",
"sport_archetype": "racing",
"embedding_vector": embedding,
"embedding_model": "text-embedding-3-small",
# ... other fields
},
update={...}
)Step 3: Sync to Neo4j (if connected)
with neo4j_driver.session() as session:
# Create node
result = session.run("""
MERGE (l:League {league_tag: $league_tag})
SET l.league_name = $league_name,
l.sport_archetype = $archetype
RETURN elementId(l) as node_id
""", league_tag=league.league_tag, ...)
# Create archetype relationship
session.run("""
MATCH (l:League {league_tag: $league_tag})
MATCH (a:SportArchetype {name: $archetype})
MERGE (l)-[:BELONGS_TO_ARCHETYPE]->(a)
""")
# Update PostgreSQL with Neo4j node ID
await prisma.league.update(
where={"id": league.id},
data={
"neo4j_node_id": result.single()["node_id"],
"neo4j_synced_at": datetime.utcnow(),
"neo4j_sync_status": "synced"
}
)Step 4: Sync to Pinecone (if connected)
index = pinecone_client.Index("altsports-leagues")
index.upsert([(
league.league_tag, # ID
embedding, # Vector
{ # Metadata
"league_name": league.league_name,
"sport_archetype": league.sport_archetype,
"status": league.status
}
)])
await prisma.league.update(
where={"id": league.id},
data={
"pinecone_id": league.league_tag,
"pinecone_synced_at": datetime.utcnow(),
"pinecone_sync_status": "synced"
}
)π― Query Patterns
Pattern 1: PostgreSQL β Neo4j β Pinecone
Use Case: Find similar leagues to UFC using graph structure
# 1. Get league from PostgreSQL
ufc = await db.league.find_unique(where={"league_tag": "ufc"})
# 2. Get graph relationships from Neo4j
with neo4j_driver.session() as session:
result = session.run("""
MATCH (ufc:League {league_tag: 'ufc'})-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
WHERE similar.league_tag <> 'ufc'
RETURN similar.league_tag as tag, similar.league_name as name
LIMIT 10
""")
related_tags = [record["tag"] for record in result]
# 3. Get semantic similarity from Pinecone
query_vector = ufc.embedding_vector # Already have it!
pinecone_results = index.query(
vector=query_vector,
top_k=20,
include_metadata=True
)
# 4. Combine results
combined_leagues = set(related_tags) | {m['id'] for m in pinecone_results['matches']}Pattern 2: Pinecone β PostgreSQL
Use Case: Semantic search with structured filtering
# 1. Vector search in Pinecone
query_text = "High-speed racing competitions with global viewership"
query_embedding = generate_embedding(query_text)
matches = index.query(
vector=query_embedding,
top_k=50,
filter={"status": "active"} # Pinecone metadata filter
)
# 2. Get full data from PostgreSQL
league_tags = [m['id'] for m in matches['matches']]
leagues = await db.league.find_many(
where={"league_tag": {"in": league_tags}},
include={"Documents": True, "Contracts": True}
)Pattern 3: Neo4j β PostgreSQL
Use Case: Archetype-based discovery with full metadata
# 1. Graph query for pattern discovery
with neo4j_driver.session() as session:
result = session.run("""
MATCH (l:League)-[:BELONGS_TO_ARCHETYPE]->(a:SportArchetype)
WHERE l.tier = 'TIER1'
WITH a, count(l) as league_count
RETURN a.name, league_count
ORDER BY league_count DESC
""")
top_archetypes = [record["a.name"] for record in result]
# 2. Get leagues from PostgreSQL
for archetype in top_archetypes:
leagues = await db.league.find_many(
where={"sport_archetype": archetype, "status": "active"}
)
print(f"{archetype}: {len(leagues)} leagues")π When to Use Each Database
| Use Case | Database | Why? |
|---|---|---|
| Get league by ID | PostgreSQL | Fast indexed lookup |
| List all combat sports | PostgreSQL or Neo4j | Both work; PostgreSQL for metadata, Neo4j for relationships |
| "Find leagues like Formula 1" | Pinecone β PostgreSQL | Semantic similarity |
| Discover archetype patterns | Neo4j | Graph traversal |
| Filter by tier + contract status | PostgreSQL | Complex filtering |
| "Which archetypes have most TIER1 leagues?" | Neo4j | Aggregation over graph |
| "Similar contract examples for soccer/premium" | Pinecone (namespace=contract_generation) | Semantic search with filters |
| Track sync operations | PostgreSQL (SyncOperation model) | Audit trail |
π§ Environment Setup
Required .env variables:
# PostgreSQL (Required)
DATABASE_URL="postgresql://user:password@localhost:5432/altsports?schema=public"
# Neo4j (Optional but recommended)
NEO4J_URI="bolt://localhost:7687"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your_password"
# Pinecone (Optional but recommended)
PINECONE_API_KEY="your_pinecone_key"
# OpenAI for embeddings (Required if using Pinecone)
OPENAI_API_KEY="your_openai_key"π Advanced Features
Feature 1: Automatic Relationship Detection
After seeding, calculate similarity relationships:
# Calculate embedding similarity and create Neo4j relationships
for league1 in leagues:
# Query Pinecone for similar leagues
similar = index.query(
vector=league1.embedding_vector,
top_k=10,
filter={"sport_archetype": league1.sport_archetype}
)
# Create SIMILAR_TO relationships in Neo4j
for match in similar['matches']:
if match['score'] > 0.80: # High similarity threshold
with neo4j_driver.session() as session:
session.run("""
MATCH (l1:League {league_tag: $tag1})
MATCH (l2:League {league_tag: $tag2})
MERGE (l1)-[:SIMILAR_TO {score: $score}]->(l2)
""", tag1=league1.league_tag, tag2=match['id'], score=match['score'])
# Track in PostgreSQL
await db.leaguerelationship.create({
"source_league_id": league1.id,
"target_league_id": match['id'],
"relationship_type": "SIMILAR_TO",
"strength_score": Decimal(str(match['score']))
})Feature 2: Sync Status Dashboard
# Get sync health across databases
sync_status = await db.league.group_by(
by=["neo4j_sync_status", "pinecone_sync_status"],
count=True
)
print("Sync Status:")
for status in sync_status:
print(f" Neo4j: {status['neo4j_sync_status']}, "
f"Pinecone: {status['pinecone_sync_status']}, "
f"Count: {status['_count']}")Feature 3: Embedding Updates
# Re-generate embeddings when league data changes
league = await db.league.update(
where={"league_tag": "formula_1"},
data={"league_name": "Formula 1 World Championship"}
)
# Trigger embedding update
embedding_text = create_league_embedding_text(league)
new_embedding = generate_embedding(embedding_text)
# Update all databases
await db.league.update(
where={"id": league.id},
data={
"embedding_vector": new_embedding,
"embedding_updated_at": datetime.utcnow(),
"pinecone_sync_status": "pending" # Mark for re-sync
}
)π Best Practices
β DO
- Edit JSONL files as source of truth, then re-seed
- Use PostgreSQL for primary queries (it's the fastest)
- Use Neo4j for relationship discovery and graph patterns
- Use Pinecone for semantic search and similarity
- Track sync status in PostgreSQL for debugging
- Generate embeddings for all user-facing entities
β DON'T
- Don't edit databases directly - edit JSONL β re-seed
- Don't over-use graph queries - PostgreSQL is faster for simple lookups
- Don't skip embedding generation - Pinecone requires it
- Don't create duplicate data - each database syncs from PostgreSQL
- Don't ignore sync errors - check SyncOperation table
π Examples
Example 1: Seed 5 Leagues with Full Multi-DB Sync
uv run python database/scripts/seed_unified_multi_db.py --leagues --limit 5Output:
π Connecting to databases...
β
PostgreSQL (Prisma)
β
Neo4j
β
Pinecone
β
OpenAI (embeddings)
ποΈ Seeding Leagues...
π Found 12 league seed files
β¨ Created: formula_1
β
Synced to Neo4j
β
Synced to Pinecone
β¨ Created: ufc
β
Synced to Neo4j
β
Synced to Pinecone
...
π MULTI-DATABASE SEEDING SUMMARY
==================================================
Total Processed: 5
PostgreSQL (Prisma):
β¨ Created: 5
βοΈ Updated: 0
β Errors: 0
Neo4j:
β
Synced: 5
β Errors: 0
Pinecone:
β
Synced: 5
β Errors: 0
==================================================Example 2: Semantic Search + Graph Traversal
# Find leagues semantically similar to "elite combat sports with global reach"
query_embedding = generate_embedding("elite combat sports with global reach")
# Search Pinecone
pinecone_matches = index.query(vector=query_embedding, top_k=10)
# Get full data from PostgreSQL
league_tags = [m['id'] for m in pinecone_matches['matches']]
leagues = await db.league.find_many(
where={"league_tag": {"in": league_tags}}
)
# For each, get Neo4j relationships
for league in leagues:
with neo4j_driver.session() as session:
archetype_leagues = session.run("""
MATCH (l:League {league_tag: $tag})-[:BELONGS_TO_ARCHETYPE]->(a)
MATCH (similar:League)-[:BELONGS_TO_ARCHETYPE]->(a)
RETURN similar.league_name as name
LIMIT 5
""", tag=league.league_tag)
print(f"{league.league_name}:")
print(f" Archetype: {league.sport_archetype}")
print(f" Similar: {[r['name'] for r in archetype_leagues]}")Next Steps:
- Run
seed_unified_multi_db.py --archetypesto set up sport archetypes - Run
seed_unified_multi_db.py --leagues --limit 10to test league seeding - Run
seed_unified_multi_db.py --examples --category triageto seed examples - Build queries using the patterns above!
This architecture gives you the best of all three databases while maintaining a single source of truth in JSONL files. π