AWS Redshift Data Exploration Setup

Source: docs/guides/README_DATA_EXPLORATION.md

This guide helps you set up comprehensive AWS Redshift data exploration using MCP (Model Context Protocol) tools, interactive chat interfaces, and Jupyter notebooks.

🚀 Quick Start

1. Test Your Connection

python client.py --connect-redshift

2. Start Interactive Exploration

python client.py --data-chat

3. Open Jupyter Notebook

python client.py --notebook

📋 What's Included

Core Components

| Component | File | Description |
| --- | --- | --- |
| MCP Tools | apps/backend/tools/aws_redshift_mcp_tool.py | Core Redshift integration tools |
| Interactive Chat | enhanced_mcp_chat_repl.py | Command-line data exploration interface |
| Jupyter Notebook | data_exploration_redshift.ipynb | Advanced analysis and visualization |
| CLI Integration | client.py (enhanced) | Integrated command-line access |
| Test Suite | apps/backend/test_aws_redshift_mcp.py | Comprehensive testing framework |

Available Tools

  1. Connection Management: Test and verify database connectivity
  2. Schema Discovery: Explore databases, schemas, and table structures
  3. Data Analysis: Analyze table statistics and column distributions
  4. Query Execution: Run SQL queries with performance monitoring
  5. Data Export: Export query results to pandas/CSV for analysis
  6. Performance Testing: Monitor and optimize query performance
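
These tools can also be called directly from Python. A minimal sketch, assuming the function names listed under Architecture below are importable from the tool module (the import path, exact signatures, and return shapes are assumptions):

import asyncio

# Hypothetical import path; adjust to how apps/backend is packaged
from tools.aws_redshift_mcp_tool import (
    redshift_test_connection,
    redshift_execute_query,
)

async def main() -> None:
    # Verify connectivity first, mirroring `python client.py --connect-redshift`
    print(await redshift_test_connection())

    # Run a small, bounded query and inspect the result payload
    print(await redshift_execute_query("SELECT current_date"))

asyncio.run(main())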

🔧 Setup Instructions

Prerequisites

  1. AWS Redshift Access: Cluster endpoint, database credentials
  2. Python Environment: Python 3.8+ with required packages
  3. Network Access: Firewall/security group access to Redshift

Step 1: Configure Environment

Create or update your .env file:

# Required AWS Redshift Settings
REDSHIFT_HOST=your-cluster.region.redshift.amazonaws.com
REDSHIFT_PORT=5439
REDSHIFT_DATABASE=your_database
REDSHIFT_USER=your_username
REDSHIFT_PASSWORD=your_password
REDSHIFT_SSL_MODE=require
 
# Optional Settings
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-east-1
AWS_S3_BUCKET=your-s3-bucket-for-exports
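
A minimal sketch of how these settings become a live connection, using the SQLAlchemy + psycopg2 stack described under Architecture below. It assumes python-dotenv is installed; only the variable names come from this guide:

import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv()  # pull the .env settings into the process environment

# Assemble a psycopg2-style URL from the required settings above
url = (
    f"postgresql+psycopg2://{os.getenv('REDSHIFT_USER')}:{os.getenv('REDSHIFT_PASSWORD')}"
    f"@{os.getenv('REDSHIFT_HOST')}:{os.getenv('REDSHIFT_PORT', '5439')}"
    f"/{os.getenv('REDSHIFT_DATABASE')}"
)

engine = create_engine(
    url,
    connect_args={"sslmode": os.getenv("REDSHIFT_SSL_MODE", "require")},
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # expect: 1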

Step 2: Install Dependencies

# Core backend dependencies
pip install -r apps/backend/requirements.txt
 
# Optional: Jupyter and visualization tools
pip install jupyter jupyterlab plotly seaborn matplotlib dash

Step 3: Test Your Setup

# Test the MCP tools
python apps/backend/test_aws_redshift_mcp.py
 
# Test connection via CLI
python client.py --connect-redshift

💻 Usage Examples

Interactive Data Chat

Start the enhanced chat interface:

python client.py --data-chat

Example session:

πŸ’ [public] > connect
βœ… Connection successful!

πŸ’ [public] > schemas
πŸ“ Found 3 schemas: public (12 tables), analytics (8 tables), staging (5 tables)

πŸ’ [public] > tables analytics
πŸ“‹ Tables in 'analytics' schema:
   πŸ“„ user_events (15 columns)
   πŸ“„ revenue_daily (6 columns)

πŸ’ [public] > query SELECT COUNT(*) FROM analytics.user_events WHERE date >= CURRENT_DATE - 7
βœ… Query successful - 1 rows returned
πŸ“Š Results:
count
2450000

πŸ’ [public] > export SELECT user_id, event_type, created_at FROM analytics.user_events WHERE date >= CURRENT_DATE - 1 LIMIT 10000
βœ… Exported 10000 rows to export_1640995200.csv
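
The exported CSV drops straight into pandas for follow-up analysis; a quick sketch using the filename and columns from the session above:

import pandas as pd

# Load the file written by the `export` command above
df = pd.read_csv("export_1640995200.csv", parse_dates=["created_at"])

print(df.shape)                                # (10000, 3)
print(df["event_type"].value_counts().head())  # most common event types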

Guided Workflow

Run the guided exploration workflow:

python client.py --explore-data

This will:

  1. Test your database connection
  2. Discover available schemas and tables
  3. Provide next steps for exploration

Jupyter Notebook Analysis

Open the comprehensive analysis notebook:

python client.py --notebook

The notebook includes:

  • Connection testing and setup
  • Schema and table discovery
  • Data distribution analysis
  • Query performance testing
  • Pandas integration for advanced analysis
  • Visualization examples with plotly and matplotlib

🎯 Use Cases

1. Data Quality Assessment

# Start data chat
python client.py --data-chat
 
# Check data quality
> describe users
> stats users
> analyze users email
> query SELECT COUNT(DISTINCT email), COUNT(*) FROM users
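
The last query's two counts translate directly into a duplicate rate; a quick back-of-envelope sketch (the numbers are illustrative):

# Suppose the query returned 9,850 distinct emails over 10,000 rows
distinct_emails, total_rows = 9_850, 10_000
duplicate_rate = 1 - distinct_emails / total_rows
print(f"{duplicate_rate:.1%} of rows reuse an email")  # 1.5%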

2. Performance Monitoring

# Test query performance
> performance SELECT COUNT(*) FROM large_table
> performance SELECT * FROM large_table WHERE sort_key_column = 'value'
 
# Explain query execution
> query EXPLAIN SELECT * FROM table1 JOIN table2 ON table1.id = table2.user_id

3. Data Export and Analysis

# Export data for analysis
> export SELECT * FROM users WHERE signup_date >= '2024-01-01' LIMIT 5000
 
# Export aggregated metrics
> export SELECT DATE_TRUNC('day', created_at) as date, COUNT(*) as signups FROM users GROUP BY date ORDER BY date
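
Once exported, the daily signup counts chart in a few lines; a sketch assuming matplotlib is installed and using a hypothetical filename (substitute whatever the export command reports):

import pandas as pd
import matplotlib.pyplot as plt

signups = pd.read_csv("export_signups.csv", parse_dates=["date"])  # hypothetical filename
signups.plot(x="date", y="signups", title="Daily signups")
plt.tight_layout()
plt.show()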

4. Schema Discovery

# Explore database structure
> schemas
> tables public
> tables analytics
> describe important_table
> stats important_table

πŸ—οΈ Architecture

MCP Tool Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Client Applications                      │
├─────────────────────────────────────────────────────────────┤
│  Enhanced Chat REPL  │  Jupyter Notebook  │  CLI Commands   │
├─────────────────────────────────────────────────────────────┤
│                     MCP Tool Layer                          │
├─────────────────────────────────────────────────────────────┤
│           AWS Redshift MCP Tool Functions                   │
│  • redshift_test_connection                                 │
│  • redshift_execute_query                                   │
│  • redshift_get_schema_info                                 │
│  • redshift_get_table_stats                                 │
│  • redshift_analyze_data_distribution                       │
│  • redshift_get_query_performance                           │
│  • redshift_export_to_pandas                                │
├─────────────────────────────────────────────────────────────┤
│              Database Connection Layer                      │
├─────────────────────────────────────────────────────────────┤
│     SQLAlchemy + psycopg2 + Connection Pooling              │
├─────────────────────────────────────────────────────────────┤
│                   AWS Redshift                              │
└─────────────────────────────────────────────────────────────┘

Data Flow

  1. User Input: Commands via chat, CLI, or notebook
  2. MCP Tool Processing: Route to the appropriate Redshift tool (see the dispatch sketch after this list)
  3. SQL Execution: Connect to Redshift, execute queries
  4. Result Processing: Format, analyze, and return results
  5. User Output: Display results in chat, notebook, or export files
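
A minimal sketch of the routing step: the dispatch table and calling convention are illustrative, but the tool names are the ones in the diagram above:

import asyncio

from tools.aws_redshift_mcp_tool import (  # hypothetical import path
    redshift_test_connection,
    redshift_get_schema_info,
    redshift_execute_query,
)

# Map chat commands to tool coroutines; unknown commands fail fast
DISPATCH = {
    "connect": redshift_test_connection,
    "schemas": redshift_get_schema_info,
    "query": redshift_execute_query,
}

async def route(command: str, args: str = ""):
    tool = DISPATCH.get(command)
    if tool is None:
        raise ValueError(f"Unknown command: {command!r}")
    # Assumed convention: tools take the raw argument string when one is given
    return await tool(args) if args else await tool()

# e.g. asyncio.run(route("connect"))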

πŸ› οΈ Customization

Adding Custom Analysis

Extend the MCP tools by adding functions to aws_redshift_mcp_tool.py:

from typing import Any, Dict

async def custom_analysis_function(table_name: str, analysis_type: str) -> Dict[str, Any]:
    """Run a custom analysis against a table and return the query result."""
    # Build the analysis query. Both values are interpolated into SQL,
    # so validate or allow-list them first.
    query = f"SELECT custom_analysis({analysis_type}) FROM {table_name}"
    # redshift_execute_query is defined in this same module
    return await redshift_execute_query(query)
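
Once defined, the function can be awaited like any other tool. Because both arguments are interpolated into SQL, validate or allow-list them before using this beyond a sketch.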

Custom Chat Commands

Add commands to the enhanced chat REPL in enhanced_mcp_chat_repl.py:

# Add to __init__ method commands dictionary
self.commands['custom'] = self.custom_command
 
async def custom_command(self, args: str = ""):
    """Your custom command"""
    # Implementation here
    pass
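
After registration, the command is available in the REPL as custom <args>, alongside built-in commands such as schemas, tables, and query.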

Notebook Extensions

Customize the Jupyter notebook by:

  1. Adding new analysis cells
  2. Importing additional visualization libraries
  3. Creating custom data processing functions

πŸ” Troubleshooting

Common Issues

| Issue | Symptoms | Solution |
| --- | --- | --- |
| Connection Failed | "Connection failed" error | Check .env credentials and security groups |
| Import Errors | ModuleNotFoundError | Run pip install -r apps/backend/requirements.txt |
| SSL Issues | SSL/TLS errors | Try REDSHIFT_SSL_MODE=prefer or REDSHIFT_SSL_MODE=disable |
| Timeout Errors | Connection timeouts | Check network connectivity and firewall rules |
| Permission Denied | Access denied errors | Verify database user permissions |

Debug Commands

# Test environment variables
python -c "import os; print('Host:', os.getenv('REDSHIFT_HOST'))"
 
# Run comprehensive tests
python apps/backend/test_aws_redshift_mcp.py
 
# Enable debug mode
python client.py --data-chat --verbose
 
# Test basic connectivity
python -c "import psycopg2; print('psycopg2 available')"

Log Analysis

Enable debug logging by setting:

import logging
logging.basicConfig(level=logging.DEBUG)

📈 Performance Optimization

Query Optimization

  1. Use LIMIT: Always bound large result sets
  2. Sort Keys: Redshift has no conventional indexes, so filter on sort key columns where possible
  3. Column Selection: Select specific columns instead of SELECT * (Redshift is columnar, so this directly cuts I/O)
  4. WHERE Clauses: Filter data as early as possible in queries
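
Applied together, these guidelines turn a full-table scan into a bounded, filtered query; a hypothetical before/after in the chat interface:

# Unoptimized: every column, every row
> query SELECT * FROM analytics.user_events

# Optimized: specific columns, early filter, bounded result set
> query SELECT user_id, event_type FROM analytics.user_events WHERE date >= CURRENT_DATE - 7 LIMIT 1000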

Connection Optimization

  1. Connection Pooling: Automatic in MCP tools
  2. SSL Overhead: Adjust SSL mode based on security needs
  3. Query Batching: Group related queries together

Memory Management

  1. Result Streaming: Large results are processed efficiently
  2. Export Limits: Use reasonable limits for data exports
  3. Cache Usage: Leverage cached results in chat interface
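
When an export outgrows a single DataFrame, results can be streamed in chunks through the same SQLAlchemy stack shown in the Step 1 sketch; the chunk size and query below are illustrative:

import os

import pandas as pd
from sqlalchemy import create_engine

# Same URL assembly as the Step 1 sketch, condensed
url = (
    f"postgresql+psycopg2://{os.getenv('REDSHIFT_USER')}:{os.getenv('REDSHIFT_PASSWORD')}"
    f"@{os.getenv('REDSHIFT_HOST')}:5439/{os.getenv('REDSHIFT_DATABASE')}"
)
engine = create_engine(url)

# Stream the result set in 50k-row chunks instead of loading it whole
query = "SELECT user_id, event_type, created_at FROM analytics.user_events"
for i, chunk in enumerate(pd.read_sql(query, engine, chunksize=50_000)):
    chunk.to_csv(f"user_events_part_{i}.csv", index=False)  # one file per chunk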

🚀 Advanced Features

Integration with Other Systems

The MCP tools can be integrated with:

  • BI Tools: Export data for Tableau, PowerBI, etc.
  • Machine Learning: Export to pandas for ML workflows
  • APIs: Build REST endpoints using the MCP tools (see the sketch after this list)
  • Automation: Schedule data exports and analysis
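
For the API case, a tool function can be wrapped in a thin HTTP endpoint. The sketch below uses FastAPI purely as an illustration — it is not part of this guide's stack — and assumes the hypothetical tool import path:

from fastapi import FastAPI, HTTPException

from tools.aws_redshift_mcp_tool import redshift_execute_query  # hypothetical path

app = FastAPI()

@app.get("/query")
async def run_query(sql: str):
    """Execute a read-only query and return the tool's result payload."""
    # Crude guard: only allow SELECT statements through this endpoint
    if not sql.lstrip().lower().startswith("select"):
        raise HTTPException(status_code=400, detail="Only SELECT statements are allowed")
    return await redshift_execute_query(sql)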

Scaling for Teams

  1. MCP Server Deployment: Deploy for team access
  2. Shared Notebooks: Version control Jupyter notebooks
  3. Configuration Management: Centralize environment configs
  4. Access Control: Implement user-based access controls

📚 Resources

Documentation

  • Data Discovery Commands: data_discovery_commands.md
  • MCP Tool Reference: apps/backend/tools/aws_redshift_mcp_tool.py
  • Environment Config: .env.sample

Examples

  • Test Suite: apps/backend/test_aws_redshift_mcp.py
  • Chat Interface: enhanced_mcp_chat_repl.py
  • Jupyter Notebook: data_exploration_redshift.ipynb

Support

For additional help:

  1. Review the troubleshooting section
  2. Check existing issues in the repository
  3. Run the test suite to validate your setup
  4. Enable debug mode for detailed logging

Ready to explore your data? Start with python client.py --explore-data and follow the guided workflow!
