Architecture Best Practices
Proven patterns and practices for maintaining a healthy, scalable, and secure production system based on real-world experience.
Development Best Practices
Always Test Locally First
DO
- Test backend in Docker: `./deploy-local-docker.sh`
- Run frontend dev server: `npm run dev`
- Execute E2E tests: `npm run test:e2e`
- Verify API integration locally (see the smoke-test sketch after this list)
- Check logs for errors before deploying
DON'T
- Deploy directly to production without testing
- Skip local verification
- Assume "it works on my machine"
- Ignore test failures
- Deploy late Friday afternoon (Murphy's Law!)
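To back up "verify API integration locally", a tiny smoke test can hit the locally running backend before you deploy. This is a minimal sketch, assuming the container started by `./deploy-local-docker.sh` listens on `http://localhost:8000` and exposes `/health`; the port and endpoint list are assumptions to adjust for your setup.

```python
# local_smoke_test.py: minimal local smoke test (base URL and endpoints are assumptions)
import sys

import httpx

BASE_URL = "http://localhost:8000"          # wherever the local container listens
ENDPOINTS = ["/health", "/v1/leagues"]      # endpoints to spot-check (examples)

def main() -> int:
    with httpx.Client(base_url=BASE_URL, timeout=10.0) as client:
        for path in ENDPOINTS:
            response = client.get(path)
            if response.status_code != 200:
                print(f"FAIL {path}: HTTP {response.status_code}")
                return 1
            print(f"OK   {path}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```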
Use Version Control Properly
DO
- Commit often with descriptive messages
- Use feature branches for new features
- Tag releases: `git tag v1.0.0`
- Write meaningful commit messages
- Keep `.env` out of Git (add it to `.gitignore`)
DON'T
- Commit API keys or secrets
- Force push to main branch
- Commit untested code
- Use vague messages like "fixed stuff"
- Work directly on main branch
Deployment Best Practices
Incremental Deployment Strategy
Deploy in this order:
1. Backend First

Deploy the backend before the frontend if API changes affect the frontend.

```bash
./deploy-all.sh   # Option 2 (Backend only)
```

Why: Ensures new API endpoints exist before the frontend tries to use them.

2. Test Backend Independently

```bash
curl https://api.altsportsleagues.ai/health
curl https://api.altsportsleagues.ai/v1/new-endpoint
```

3. Then Frontend

```bash
./deploy-all.sh   # Option 3 (Frontend only)
```

Why: The frontend can now safely use the new backend features.

4. Verify Integration

```bash
# Test the frontend calling the backend
curl https://altsportsleagues.ai/api/v1/new-endpoint
```

Use Parallel Deployment Wisely
Parallel is Good For:
- Independent changes (UI only, docs only)
- Bug fixes that don't affect API
- Performance optimizations
- Documentation updates
Avoid Parallel For:
- API breaking changes
- Database schema migrations
- Auth system changes
- New endpoint dependencies
Security Best Practices
API Key Management
DO
- Store keys in Secret Manager (Google Cloud)
- Use environment variables
- Rotate keys every 90 days
- Use different keys for dev/staging/prod
- Restrict key scopes (principle of least privilege)
- Monitor key usage
DON'T
- Hardcode keys in source code
- Commit `.env` files to Git
- Share keys via Slack/email
- Use production keys in development
- Give keys unlimited permissions
- Forget to rotate keys
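One way to follow the "store keys in Secret Manager" guidance above is to resolve keys at startup instead of baking them into images or `.env` files. This is a minimal sketch using the google-cloud-secret-manager client, not the project's actual loader; the project ID, secret name, and env-var fallback are assumptions.

```python
# Hedged sketch: fetch an API key from Google Cloud Secret Manager,
# falling back to an environment variable for local development.
import os

from google.cloud import secretmanager

def load_api_key(project_id: str, secret_id: str, env_var: str) -> str:
    # Local/dev fallback: use the env var if it is set
    value = os.getenv(env_var)
    if value:
        return value

    # Production: read the latest enabled version from Secret Manager
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

# Example usage (all names are placeholders):
# openai_key = load_api_key("my-gcp-project", "openai-api-key", "OPENAI_API_KEY")
```

Because versions are addressed explicitly, rotating a key is just adding a new secret version; readers of `latest` pick it up on the next restart without a code change.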
CORS Configuration
```python
# ✅ DO: Be specific with allowed origins
allow_origins = [
    "https://altsportsleagues.ai",
    "https://docs.altsportsleagues.ai",
]

# ❌ DON'T: Allow all origins in production
allow_origins = ["*"]  # Security risk!
```
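For context, an allow-list like the one above is typically wired into the FastAPI app through the framework's CORS middleware. This is a sketch rather than the project's actual setup; the credentials, methods, and headers settings are assumptions.

```python
# Hedged sketch: applying the origin allow-list via FastAPI's CORS middleware
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://altsportsleagues.ai",
        "https://docs.altsportsleagues.ai",
    ],
    allow_credentials=True,                          # assumption: auth cookies/headers are needed
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
)
```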
Performance Best Practices

Database Query Optimization
DO
- Add indexes for frequently queried fields
- Use pagination for large result sets
- Cache frequent queries (Redis)
- Run queries in parallel when possible
- Use EXPLAIN to analyze query performance
- Limit SELECT fields (don't use SELECT *)
DON'T
- Load entire tables into memory
- Use N+1 query patterns
- Skip database indexes
- Query databases in loops
- Return unlimited results
- Ignore slow query logs
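As a concrete illustration of the pagination and "limit SELECT fields" points above, a paginated query might look like the sketch below. It uses asyncpg against a Postgres-compatible database; the table and column names are assumptions, not the project's actual schema.

```python
# Hedged sketch: paginated query with an explicit column list (no SELECT *)
import asyncpg

async def list_leagues(conn: asyncpg.Connection, page: int = 1, per_page: int = 20):
    offset = (page - 1) * per_page
    rows = await conn.fetch(
        """
        SELECT id, name, sport      -- only the fields the caller needs
        FROM leagues
        ORDER BY name               -- stable ordering keeps pages consistent
        LIMIT $1 OFFSET $2
        """,
        per_page,
        offset,
    )
    return [dict(row) for row in rows]
```

An index on the columns you filter and order by (here `name`) is what keeps this cheap as the table grows, which is the "add indexes for frequently queried fields" point in practice.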
Example: Parallel Database Queries
```python
import asyncio

# ✅ GOOD: Parallel queries
async def get_league_data(league_id: str):
    # Run both queries simultaneously
    graph_data, relational_data = await asyncio.gather(
        neo4j.query(league_id),
        supabase.query(league_id),
    )
    return combine(graph_data, relational_data)

# ❌ BAD: Sequential queries
async def get_league_data_slow(league_id: str):
    graph_data = await neo4j.query(league_id)          # Wait
    relational_data = await supabase.query(league_id)  # Then wait again
    return combine(graph_data, relational_data)
```

Caching Strategy
Cache Duration Guidelines:
| Data Type | Cache Duration | Rationale |
|---|---|---|
| League metadata | 1 hour | Changes infrequently |
| Live scores | 30 seconds | Frequent updates |
| Historical stats | 24 hours | Never changes |
| User preferences | 5 minutes | May change during session |
| API schema | 1 week | Rarely changes |
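One way to apply these durations is a small cache-aside helper with per-data-type TTLs. This is a minimal sketch using redis-py's asyncio client; the key naming, JSON serialization, and Redis URL are assumptions.

```python
# Hedged sketch: cache-aside helper with TTLs matching the table above
import json

import redis.asyncio as redis

TTL_SECONDS = {
    "league_metadata": 60 * 60,        # 1 hour
    "live_scores": 30,                 # 30 seconds
    "historical_stats": 24 * 60 * 60,  # 24 hours
    "user_preferences": 5 * 60,        # 5 minutes
    "api_schema": 7 * 24 * 60 * 60,    # 1 week
}

cache = redis.from_url("redis://localhost:6379/0")  # assumption: local Redis instance

async def get_cached(data_type: str, key: str, fetch):
    """Return the cached value if present; otherwise fetch, cache with the right TTL, and return it."""
    cache_key = f"{data_type}:{key}"
    cached = await cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    value = await fetch()  # e.g. an awaitable database query
    await cache.set(cache_key, json.dumps(value), ex=TTL_SECONDS[data_type])
    return value
```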
Data Consistency Best Practices
Cross-Database Consistency
Use Transactions:
```python
# ✅ GOOD: All or nothing
async with transaction_manager() as tx:
    await neo4j.create_league(league_data, tx=tx)
    await supabase.create_league(league_data, tx=tx)
    await firebase.notify_update(league_data)
    # Commits all if successful, rolls back if any step fails

# ❌ BAD: Inconsistent state possible
await neo4j.create_league(league_data)
await supabase.create_league(league_data)  # If this fails, Neo4j has orphaned data
```

Event-Driven Updates
Implementation:
```python
# Use event bus pattern
from fastapi import BackgroundTasks

@app.post("/v1/leagues")
async def create_league(
    league: LeagueCreate,
    background_tasks: BackgroundTasks,
):
    # Primary write
    league_id = await supabase.create_league(league)

    # Background sync (eventual consistency)
    background_tasks.add_task(sync_to_neo4j, league_id)
    background_tasks.add_task(notify_firebase, league_id)

    return {"id": league_id}
```

Deployment Best Practices
Pre-Deployment Checklist
Before every production deployment:
- All tests passing locally
- Code reviewed (if team > 1)
- Environment variables documented
- Breaking changes documented
- Rollback plan prepared
- Monitoring alerts configured
- Off-hours deployment scheduled (if possible)
- Team notified (if coordinated deploy)
Blue-Green Deployment Pattern
Cloud Run makes this easy:
```bash
# Deploy new revision (doesn't affect traffic)
gcloud run deploy altsportsleagues-backend \
  --image gcr.io/project/image:v1.1.0 \
  --no-traffic \
  --region us-central1

# Test new revision
curl https://REVISION-URL.run.app/health

# If good, switch traffic
gcloud run services update-traffic altsportsleagues-backend \
  --to-latest \
  --region us-central1

# If bad, rollback instantly (traffic still on old revision)
```

Monitoring Best Practices
The Four Golden Signals
Monitor These:
1. Latency
```python
# Track request duration
import time

@app.middleware("http")
async def add_process_time_header(request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    logger.info(f"Request took {process_time:.3f}s")
    return response
```

2. Traffic
- Requests per second
- Peak vs average load
- Traffic patterns (time of day)
3. Errors
- 4xx rate (client errors)
- 5xx rate (server errors)
- Error types and frequencies
4. Saturation
- CPU utilization
- Memory usage
- Database connection pool
- Disk I/O
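To make the traffic and error signals above concrete, a counting middleware in the same style as the latency example can track request volume and 4xx/5xx rates. This is a sketch that reuses the same `app` and `logger` as above; in production you would export these counters to your monitoring backend instead of only logging them.

```python
# Hedged sketch: count requests and error responses per status class
from collections import Counter

request_counts: Counter = Counter()

@app.middleware("http")
async def count_requests(request, call_next):
    response = await call_next(request)

    request_counts["total"] += 1            # traffic
    if 400 <= response.status_code < 500:
        request_counts["4xx"] += 1          # client errors
    elif response.status_code >= 500:
        request_counts["5xx"] += 1          # server errors

    # Log a snapshot every 100 requests as a simple illustration
    if request_counts["total"] % 100 == 0:
        logger.info(f"traffic/errors: {dict(request_counts)}")

    return response
```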
Avoid:
- Alert Fatigue: too many low-priority alerts cause critical ones to be ignored
- Vanity Metrics: tracking metrics that don't drive action
- No Baselines: you can't detect anomalies without knowing normal behavior
- Reactive Only: waiting for users to report issues
Alert Priority Levels
| Priority | Response Time | Examples |
|---|---|---|
| P0 - Critical | Immediate | Service down, data loss, security breach |
| P1 - High | < 1 hour | High error rate, performance degradation |
| P2 - Medium | < 4 hours | Elevated errors, slow queries |
| P3 - Low | < 24 hours | Minor issues, optimization opportunities |
Code Quality Best Practices
API Design
DO
- Use RESTful conventions
- Version your API (`/v1/`, `/v2/`)
- Return appropriate HTTP status codes
- Provide clear error messages
- Document with OpenAPI/Swagger
- Use pagination for lists
- Validate all inputs
DON'T
- Return 200 OK for errors
- Use verbs in endpoint names
- Expose internal implementation
- Break backward compatibility without versioning
- Return unbounded arrays
- Trust client input without validation
Example: Good API Design
```python
# ✅ GOOD: Clear, RESTful, versioned
import math
from typing import Optional

from fastapi import Query

@router.get("/v1/leagues", response_model=LeagueListResponse)
async def list_leagues(
    page: int = Query(1, ge=1),
    per_page: int = Query(20, ge=1, le=100),
    sport: Optional[str] = Query(None),
):
    """
    List all leagues with pagination.

    - **page**: Page number (starts at 1)
    - **per_page**: Items per page (max 100)
    - **sport**: Filter by sport (optional)
    """
    total = await db.count_leagues(sport=sport)
    leagues = await db.get_leagues(
        skip=(page - 1) * per_page,
        limit=per_page,
        sport=sport,
    )
    return LeagueListResponse(
        leagues=leagues,
        total=total,
        page=page,
        per_page=per_page,
        pages=math.ceil(total / per_page),
    )
```

Testing Best Practices
Test Pyramid
Test Coverage Goals:
| Test Type | Coverage Target | Run Frequency |
|---|---|---|
| Unit Tests | 80%+ | Every commit |
| Integration Tests | 60%+ | Every PR |
| E2E Tests | Critical paths only | Pre-deploy |
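At the base of the pyramid, unit tests for an endpoint like `/v1/leagues` can run entirely in-process with FastAPI's TestClient. A minimal sketch, assuming the FastAPI instance is importable from `app.main` and a test database or mocked data layer is available:

```python
# Hedged sketch: unit tests for the /v1/leagues endpoint using FastAPI's TestClient
from fastapi.testclient import TestClient

from app.main import app  # assumption: where the FastAPI instance lives

client = TestClient(app)

def test_list_leagues_paginates():
    response = client.get("/v1/leagues", params={"page": 1, "per_page": 5})
    assert response.status_code == 200

    body = response.json()
    assert body["page"] == 1
    assert len(body["leagues"]) <= 5   # respects per_page

def test_per_page_is_capped():
    # per_page is declared with le=100, so larger values should be rejected
    response = client.get("/v1/leagues", params={"per_page": 1000})
    assert response.status_code == 422
```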
Testing Before Deploy:
```bash
# Backend unit tests
cd apps/backend
pytest tests/ -v --cov=.

# Frontend E2E tests
cd clients/frontend
npm run test:e2e

# Integration test
./test-local-deployment.sh
```

Documentation Best Practices
Keep Docs in Sync
DO
- Update docs in same PR as code
- Auto-generate API docs (OpenAPI)
- Use schema injection for examples
- Add code examples to docs
- Include troubleshooting sections
- Link related documentation
DON'T
- Let docs drift from code
- Write docs after the fact
- Copy-paste examples (use injection)
- Skip error examples
- Assume prior knowledge
- Write docs that become stale
Our Approach:
```mdx
// Docs use live schema injection
// In MDX file:
{{schema:schemas/league/tier_classification.py}}

// Always shows current schema
// Never gets out of sync
// One source of truth: data_layer/
```

CI/CD Best Practices
Recommended Workflow
GitHub Actions Example:
```yaml
name: Backend CI/CD

on:
  push:
    branches: [main]
    paths:
      - 'apps/backend/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r apps/backend/requirements.txt
      - name: Run tests
        run: pytest apps/backend/tests/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Cloud Run
        run: ./deploy-all.sh
        env:
          GCLOUD_SERVICE_KEY: ${{ secrets.GCLOUD_SERVICE_KEY }}
```

General Best Practices
Environment Management
DO
- Have separate dev, staging, prod environments
- Use `.env.example` as a template
- Document all required env vars
- Validate env vars on startup
- Use type-safe config loading
DON'T
- Use production keys in development
- Use different env var names or sets across environments
- Forget to document new env vars
- Let app start with missing required vars
- Use string parsing for complex config
Config Validation Example:
```python
# ✅ GOOD: Validate on startup
from pydantic import BaseSettings, Field

class Settings(BaseSettings):
    openai_api_key: str = Field(..., env='OPENAI_API_KEY')
    database_url: str = Field(..., env='DATABASE_URL')
    environment: str = Field('development', env='ENV')

    class Config:
        env_file = '.env'

# Fails fast if required vars missing
settings = Settings()
```

Error Handling
DO
- Return descriptive error messages
- Log errors with context
- Use appropriate HTTP status codes
- Handle edge cases explicitly
- Provide recovery suggestions
DON'T
- Return generic "Error" messages
- Expose stack traces to users
- Return 200 OK for errors
- Silently fail
- Assume "it won't happen"
Example:
```python
# ✅ GOOD: Descriptive errors
from fastapi.responses import JSONResponse

@app.exception_handler(LeagueNotFoundError)
async def league_not_found_handler(request, exc):
    return JSONResponse(
        status_code=404,
        content={
            "error": "league_not_found",
            "message": f"League '{exc.league_id}' does not exist",
            "suggestion": "Check the league ID or search for leagues",
            "docs": "https://docs.altsportsleagues.ai/api/reference#leagues",
        },
    )

# ❌ BAD: Generic error
@app.exception_handler(Exception)
async def catch_all(request, exc):
    return JSONResponse(
        status_code=500,
        content={"error": "Something went wrong"},  # Not helpful!
    )
```

Summary: Our Best Practices
Following these practices ensures:
- Faster development cycles
- Fewer production issues
- Easier debugging and maintenance
- Better user experience
- Lower operational costs
- Always test locally before deploying to production
- Know what's happening in production at all times
- Keep docs in sync with code changes