Architecture Best Practices
Proven patterns and practices for maintaining a healthy, scalable, and secure production system based on real-world experience.
Development Best Practices
Always Test Locally First
DO
- Test backend in Docker: `./deploy-local-docker.sh`
- Run frontend dev server: `npm run dev`
- Execute E2E tests: `npm run test:e2e`
- Verify API integration locally (see the smoke-test sketch after this list)
- Check logs for errors before deploying
DON'T
- Deploy directly to production without testing
- Skip local verification
- Assume "it works on my machine"
- Ignore test failures
- Deploy late Friday afternoon (Murphy's Law!)
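To back up "verify API integration locally", a tiny smoke test can hit the locally running backend before you deploy. This is a minimal sketch, assuming the container started by `./deploy-local-docker.sh` listens on `http://localhost:8000` and exposes `/health`; the port and endpoint list are assumptions to adjust for your setup.

```python
# local_smoke_test.py: minimal local smoke test (base URL and endpoints are assumptions)
import sys

import httpx

BASE_URL = "http://localhost:8000"          # wherever the local container listens
ENDPOINTS = ["/health", "/v1/leagues"]      # endpoints to spot-check (examples)

def main() -> int:
    with httpx.Client(base_url=BASE_URL, timeout=10.0) as client:
        for path in ENDPOINTS:
            response = client.get(path)
            if response.status_code != 200:
                print(f"FAIL {path}: HTTP {response.status_code}")
                return 1
            print(f"OK   {path}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```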
Use Version Control Properly
DO
- Commit often with descriptive messages
- Use feature branches for new features
- Tag releases: `git tag v1.0.0`
- Write meaningful commit messages
- Keep `.env` out of Git (add it to `.gitignore`)
DON'T
- Commit API keys or secrets
- Force push to main branch
- Commit untested code
- Use vague messages like "fixed stuff"
- Work directly on main branch
Deployment Best Practices
Incremental Deployment Strategy
Deploy in this order:
1. Backend First

Deploy the backend before the frontend if API changes affect the frontend.

```bash
./deploy-all.sh   # Option 2 (Backend only)
```

Why: Ensures new API endpoints exist before the frontend tries to use them.

2. Test Backend Independently

```bash
curl https://api.altsportsleagues.ai/health
curl https://api.altsportsleagues.ai/v1/new-endpoint
```

3. Then Frontend

```bash
./deploy-all.sh   # Option 3 (Frontend only)
```

Why: The frontend can now safely use the new backend features.

4. Verify Integration

```bash
# Test the frontend calling the backend
curl https://altsportsleagues.ai/api/v1/new-endpoint
```

Use Parallel Deployment Wisely
Parallel is Good For:
- Independent changes (UI only, docs only)
- Bug fixes that don't affect API
- Performance optimizations
- Documentation updates
Avoid Parallel For:
- API breaking changes
- Database schema migrations
- Auth system changes
- New endpoint dependencies
Security Best Practices
API Key Management
DO
- Store keys in Secret Manager (Google Cloud)
- Use environment variables
- Rotate keys every 90 days
- Use different keys for dev/staging/prod
- Restrict key scopes (principle of least privilege)
- Monitor key usage
DON'T
- Hardcode keys in source code
- Commit `.env` files to Git
- Share keys via Slack/email
- Use production keys in development
- Give keys unlimited permissions
- Forget to rotate keys
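One way to follow the "store keys in Secret Manager" guidance above is to resolve keys at startup instead of baking them into images or `.env` files. This is a minimal sketch using the google-cloud-secret-manager client, not the project's actual loader; the project ID, secret name, and env-var fallback are assumptions.

```python
# Hedged sketch: fetch an API key from Google Cloud Secret Manager,
# falling back to an environment variable for local development.
import os

from google.cloud import secretmanager

def load_api_key(project_id: str, secret_id: str, env_var: str) -> str:
    # Local/dev fallback: use the env var if it is set
    value = os.getenv(env_var)
    if value:
        return value

    # Production: read the latest enabled version from Secret Manager
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Example usage (all names are placeholders):
# openai_key = load_api_key("my-gcp-project", "openai-api-key", "OPENAI_API_KEY")
```

Because versions are addressed explicitly, rotating a key is just adding a new secret version; readers of `latest` pick it up on the next restart without a code change.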
CORS Configuration
```python
# ✅ DO: Be specific with allowed origins
allow_origins = [
    "https://altsportsleagues.ai",
    "https://docs.altsportsleagues.ai",
]

# ❌ DON'T: Allow all origins in production
allow_origins = ["*"]  # Security risk!
```
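For context, an allow-list like the one above is typically wired into the FastAPI app through the framework's CORS middleware. This is a sketch rather than the project's actual setup; the credentials, methods, and headers settings are assumptions.

```python
# Hedged sketch: applying the origin allow-list via FastAPI's CORS middleware
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://altsportsleagues.ai",
        "https://docs.altsportsleagues.ai",
    ],
    allow_credentials=True,                          # assumption: auth cookies/headers are needed
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
)
```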
Performance Best Practices

Database Query Optimization
DO
- Add indexes for frequently queried fields
- Use pagination for large result sets
- Cache frequent queries (Redis)
- Run queries in parallel when possible
- Use EXPLAIN to analyze query performance
- Limit SELECT fields (don't use SELECT *)
DON'T
- Load entire tables into memory
- Use N+1 query patterns
- Skip database indexes
- Query databases in loops
- Return unlimited results
- Ignore slow query logs
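As a concrete illustration of the pagination and "limit SELECT fields" points above, a paginated query might look like the sketch below. It uses asyncpg against a Postgres-compatible database; the table and column names are assumptions, not the project's actual schema.

```python
# Hedged sketch: paginated query with an explicit column list (no SELECT *)
import asyncpg

async def list_leagues(conn: asyncpg.Connection, page: int = 1, per_page: int = 20):
    offset = (page - 1) * per_page
    rows = await conn.fetch(
        """
        SELECT id, name, sport      -- only the fields the caller needs
        FROM leagues
        ORDER BY name               -- stable ordering keeps pages consistent
        LIMIT $1 OFFSET $2
        """,
        per_page,
        offset,
    )
    return [dict(row) for row in rows]
```

An index on the columns you filter and order by (here `name`) is what keeps this cheap as the table grows, which is the "add indexes for frequently queried fields" point in practice.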
Example: Parallel Database Queries
```python
import asyncio

# ✅ GOOD: Parallel queries
async def get_league_data(league_id: str):
    # Run both queries simultaneously
    graph_data, relational_data = await asyncio.gather(
        neo4j.query(league_id),
        supabase.query(league_id),
    )
    return combine(graph_data, relational_data)

# ❌ BAD: Sequential queries
async def get_league_data_slow(league_id: str):
    graph_data = await neo4j.query(league_id)          # Wait
    relational_data = await supabase.query(league_id)  # Then wait again
    return combine(graph_data, relational_data)
```

Caching Strategy
Cache Duration Guidelines:
| Data Type | Cache Duration | Rationale |
|---|---|---|
| League metadata | 1 hour | Changes infrequently |
| Live scores | 30 seconds | Frequent updates |
| Historical stats | 24 hours | Never changes |
| User preferences | 5 minutes | May change during session |
| API schema | 1 week | Rarely changes |
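One way to apply these durations is a small cache-aside helper with per-data-type TTLs. This is a minimal sketch using redis-py's asyncio client; the key naming, JSON serialization, and Redis URL are assumptions.

```python
# Hedged sketch: cache-aside helper with TTLs matching the table above
import json

import redis.asyncio as redis

TTL_SECONDS = {
    "league_metadata": 60 * 60,        # 1 hour
    "live_scores": 30,                 # 30 seconds
    "historical_stats": 24 * 60 * 60,  # 24 hours
    "user_preferences": 5 * 60,        # 5 minutes
    "api_schema": 7 * 24 * 60 * 60,    # 1 week
}

cache = redis.from_url("redis://localhost:6379/0")  # assumption: local Redis instance

async def get_cached(data_type: str, key: str, fetch):
    """Return the cached value if present; otherwise fetch, cache with the right TTL, and return it."""
    cache_key = f"{data_type}:{key}"
    cached = await cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    value = await fetch()  # e.g. an awaitable database query
    await cache.set(cache_key, json.dumps(value), ex=TTL_SECONDS[data_type])
    return value
```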
Data Consistency Best Practices
Cross-Database Consistency
Use Transactions:
```python
# ✅ GOOD: All or nothing
async with transaction_manager() as tx:
    await neo4j.create_league(league_data, tx=tx)
    await supabase.create_league(league_data, tx=tx)
    await firebase.notify_update(league_data)
    # Commits all if successful, rolls back if any step fails

# ❌ BAD: Inconsistent state possible
await neo4j.create_league(league_data)
await supabase.create_league(league_data)  # If this fails, Neo4j has orphaned data
```

Event-Driven Updates
Implementation:
```python
# Use event bus pattern
from fastapi import BackgroundTasks

@app.post("/v1/leagues")
async def create_league(
    league: LeagueCreate,
    background_tasks: BackgroundTasks,
):
    # Primary write
    league_id = await supabase.create_league(league)

    # Background sync (eventual consistency)
    background_tasks.add_task(sync_to_neo4j, league_id)
    background_tasks.add_task(notify_firebase, league_id)

    return {"id": league_id}
```

Deployment Best Practices
Pre-Deployment Checklist
Before every production deployment:
- All tests passing locally
- Code reviewed (if team > 1)
- Environment variables documented
- Breaking changes documented
- Rollback plan prepared
- Monitoring alerts configured
- Off-hours deployment scheduled (if possible)
- Team notified (if coordinated deploy)
Blue-Green Deployment Pattern
Cloud Run makes this easy:
```bash
# Deploy new revision (doesn't affect traffic)
gcloud run deploy altsportsleagues-backend \
  --image gcr.io/project/image:v1.1.0 \
  --no-traffic \
  --region us-central1

# Test new revision
curl https://REVISION-URL.run.app/health

# If good, switch traffic
gcloud run services update-traffic altsportsleagues-backend \
  --to-latest \
  --region us-central1

# If bad, rollback instantly (traffic still on old revision)
```

Monitoring Best Practices
The Four Golden Signals
Monitor These:
1. Latency
```python
# Track request duration
import time

@app.middleware("http")
async def add_process_time_header(request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    logger.info(f"Request took {process_time:.3f}s")
    return response
```

2. Traffic
- Requests per second
- Peak vs average load
- Traffic patterns (time of day)
3. Errors
- 4xx rate (client errors)
- 5xx rate (server errors)
- Error types and frequencies
4. Saturation
- CPU utilization
- Memory usage
- Database connection pool
- Disk I/O
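To make the traffic and error signals above concrete, a counting middleware in the same style as the latency example can track request volume and 4xx/5xx rates. This is a sketch that reuses the same `app` and `logger` as above; in production you would export these counters to your monitoring backend instead of only logging them.

```python
# Hedged sketch: count requests and error responses per status class
from collections import Counter

request_counts: Counter = Counter()

@app.middleware("http")
async def count_requests(request, call_next):
    response = await call_next(request)

    request_counts["total"] += 1            # traffic
    if 400 <= response.status_code < 500:
        request_counts["4xx"] += 1          # client errors
    elif response.status_code >= 500:
        request_counts["5xx"] += 1          # server errors

    # Log a snapshot every 100 requests as a simple illustration
    if request_counts["total"] % 100 == 0:
        logger.info(f"traffic/errors: {dict(request_counts)}")

    return response
```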
Avoid:
- Alert Fatigue: too many low-priority alerts cause critical ones to be ignored
- Vanity Metrics: tracking metrics that don't drive action
- No Baselines: you can't detect anomalies without knowing normal behavior
- Reactive Only: waiting for users to report issues
Alert Priority Levels
| Priority | Response Time | Examples |
|---|---|---|
| P0 - Critical | Immediate | Service down, data loss, security breach |
| P1 - High | < 1 hour | High error rate, performance degradation |
| P2 - Medium | < 4 hours | Elevated errors, slow queries |
| P3 - Low | < 24 hours | Minor issues, optimization opportunities |
Code Quality Best Practices
API Design
DO
- Use RESTful conventions
- Version your API (`/v1/`, `/v2/`)
- Return appropriate HTTP status codes
- Provide clear error messages
- Document with OpenAPI/Swagger
- Use pagination for lists
- Validate all inputs
DON'T
- Return 200 OK for errors
- Use verbs in endpoint names
- Expose internal implementation
- Break backward compatibility without versioning
- Return unbounded arrays
- Trust client input without validation
Example: Good API Design
```python
# ✅ GOOD: Clear, RESTful, versioned
import math
from typing import Optional

from fastapi import Query

@router.get("/v1/leagues", response_model=LeagueListResponse)
async def list_leagues(
    page: int = Query(1, ge=1),
    per_page: int = Query(20, ge=1, le=100),
    sport: Optional[str] = Query(None),
):
    """
    List all leagues with pagination.

    - **page**: Page number (starts at 1)
    - **per_page**: Items per page (max 100)
    - **sport**: Filter by sport (optional)
    """
    total = await db.count_leagues(sport=sport)
    leagues = await db.get_leagues(
        skip=(page - 1) * per_page,
        limit=per_page,
        sport=sport,
    )
    return LeagueListResponse(
        leagues=leagues,
        total=total,
        page=page,
        per_page=per_page,
        pages=math.ceil(total / per_page),
    )
```

Testing Best Practices
Test Pyramid
Test Coverage Goals:
| Test Type | Coverage Target | Run Frequency |
|---|---|---|
| Unit Tests | 80%+ | Every commit |
| Integration Tests | 60%+ | Every PR |
| E2E Tests | Critical paths only | Pre-deploy |
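At the base of the pyramid, unit tests for an endpoint like `/v1/leagues` can run entirely in-process with FastAPI's TestClient. A minimal sketch, assuming the FastAPI instance is importable from `app.main` and a test database or mocked data layer is available:

```python
# Hedged sketch: unit tests for the /v1/leagues endpoint using FastAPI's TestClient
from fastapi.testclient import TestClient

from app.main import app  # assumption: where the FastAPI instance lives

client = TestClient(app)

def test_list_leagues_paginates():
    response = client.get("/v1/leagues", params={"page": 1, "per_page": 5})
    assert response.status_code == 200

    body = response.json()
    assert body["page"] == 1
    assert len(body["leagues"]) <= 5   # respects per_page

def test_per_page_is_capped():
    # per_page is declared with le=100, so larger values should be rejected
    response = client.get("/v1/leagues", params={"per_page": 1000})
    assert response.status_code == 422
```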
Testing Before Deploy:
```bash
# Backend unit tests
cd apps/backend
pytest tests/ -v --cov=.

# Frontend E2E tests
cd clients/frontend
npm run test:e2e

# Integration test
./test-local-deployment.sh
```

Documentation Best Practices
Keep Docs in Sync
DO
- Update docs in same PR as code
- Auto-generate API docs (OpenAPI)
- Use schema injection for examples
- Add code examples to docs
- Include troubleshooting sections
- Link related documentation
DON'T
- Let docs drift from code
- Write docs after the fact
- Copy-paste examples (use injection)
- Skip error examples
- Assume prior knowledge
- Write docs that become stale
Our Approach:
```mdx
// Docs use live schema injection
// In MDX file:
{{schema:schemas/league/tier_classification.py}}

// Always shows current schema
// Never gets out of sync
// One source of truth: data_layer/
```

CI/CD Best Practices
Recommended Workflow
GitHub Actions Example:
```yaml
name: Backend CI/CD

on:
  push:
    branches: [main]
    paths:
      - 'apps/backend/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r apps/backend/requirements.txt
      - name: Run tests
        run: pytest apps/backend/tests/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Cloud Run
        run: ./deploy-all.sh
        env:
          GCLOUD_SERVICE_KEY: ${{ secrets.GCLOUD_SERVICE_KEY }}
```

General Best Practices
Environment Management
DO
- Have separate dev, staging, prod environments
- Use `.env.example` as a template
- Document all required env vars
- Validate env vars on startup
- Use type-safe config loading
DON'T
- Use production keys in development
- Use different env var names or sets across environments
- Forget to document new env vars
- Let app start with missing required vars
- Use string parsing for complex config
Config Validation Example:
```python
# ✅ GOOD: Validate on startup
from pydantic import BaseSettings, Field

class Settings(BaseSettings):
    openai_api_key: str = Field(..., env='OPENAI_API_KEY')
    database_url: str = Field(..., env='DATABASE_URL')
    environment: str = Field('development', env='ENV')

    class Config:
        env_file = '.env'

# Fails fast if required vars missing
settings = Settings()
```

Error Handling
DO
- Return descriptive error messages
- Log errors with context
- Use appropriate HTTP status codes
- Handle edge cases explicitly
- Provide recovery suggestions
DON'T
- Return generic "Error" messages
- Expose stack traces to users
- Return 200 OK for errors
- Silently fail
- Assume "it won't happen"
Example:
```python
# ✅ GOOD: Descriptive errors
from fastapi.responses import JSONResponse

@app.exception_handler(LeagueNotFoundError)
async def league_not_found_handler(request, exc):
    return JSONResponse(
        status_code=404,
        content={
            "error": "league_not_found",
            "message": f"League '{exc.league_id}' does not exist",
            "suggestion": "Check the league ID or search for leagues",
            "docs": "https://docs.altsportsleagues.ai/api/reference#leagues",
        },
    )

# ❌ BAD: Generic error
@app.exception_handler(Exception)
async def catch_all(request, exc):
    return JSONResponse(
        status_code=500,
        content={"error": "Something went wrong"},  # Not helpful!
    )
```

Summary: Our Best Practices
Following these practices ensures:
- Faster development cycles
- Fewer production issues
- Easier debugging and maintenance
- Better user experience
- Lower operational costs
- Always test locally before deploying to production
- Know what's happening in production at all times
- Keep docs in sync with code changes