Source: docs/guides/README_DATA_EXPLORATION.md
# AWS Redshift Data Exploration Setup
This guide helps you set up comprehensive AWS Redshift data exploration using MCP (Model Context Protocol) tools, interactive chat interfaces, and Jupyter notebooks.
## Quick Start

### 1. Test Your Connection

```bash
python client.py --connect-redshift
```

### 2. Start Interactive Exploration

```bash
python client.py --data-chat
```

### 3. Open Jupyter Notebook

```bash
python client.py --notebook
```

## What's Included

### Core Components
| Component | File | Description |
|---|---|---|
| MCP Tools | `apps/backend/tools/aws_redshift_mcp_tool.py` | Core Redshift integration tools |
| Interactive Chat | `enhanced_mcp_chat_repl.py` | Command-line data exploration interface |
| Jupyter Notebook | `data_exploration_redshift.ipynb` | Advanced analysis and visualization |
| CLI Integration | `client.py` (enhanced) | Integrated command-line access |
| Test Suite | `apps/backend/test_aws_redshift_mcp.py` | Comprehensive testing framework |
### Available Tools
- Connection Management: Test and verify database connectivity
- Schema Discovery: Explore databases, schemas, and table structures
- Data Analysis: Analyze table statistics and column distributions
- Query Execution: Run SQL queries with performance monitoring
- Data Export: Export query results to pandas/CSV for analysis
- Performance Testing: Monitor and optimize query performance
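These tools can also be called directly from Python. Here is a minimal sketch, assuming the module is importable from the repository root and that the tools accept a plain SQL string (check the tool reference for the exact signatures):

```python
import asyncio

# Import path follows the file location listed above; adjust if the module
# is packaged differently in your environment.
from apps.backend.tools.aws_redshift_mcp_tool import (
    redshift_execute_query,
    redshift_test_connection,
)

async def main():
    # Verify connectivity before anything else
    print(await redshift_test_connection())
    # Run a simple query through the same tool layer the chat REPL uses
    print(await redshift_execute_query("SELECT CURRENT_DATE"))

asyncio.run(main())
```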
## Setup Instructions

### Prerequisites
- AWS Redshift Access: Cluster endpoint, database credentials
- Python Environment: Python 3.8+ with required packages
- Network Access: Firewall/security group access to Redshift
### Step 1: Configure Environment

Create or update your `.env` file:

```bash
# Required AWS Redshift settings
REDSHIFT_HOST=your-cluster.region.redshift.amazonaws.com
REDSHIFT_PORT=5439
REDSHIFT_DATABASE=your_database
REDSHIFT_USER=your_username
REDSHIFT_PASSWORD=your_password
REDSHIFT_SSL_MODE=require

# Optional settings
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-east-1
AWS_S3_BUCKET=your-s3-bucket-for-exports
```
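As a sanity check, these settings map onto a SQLAlchemy connection like so. A minimal sketch, assuming `python-dotenv` and SQLAlchemy are installed (both are typical for this stack; Redshift speaks the PostgreSQL wire protocol, hence the `postgresql+psycopg2` driver):

```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv()  # read the .env file from the current directory

url = (
    f"postgresql+psycopg2://{os.getenv('REDSHIFT_USER')}:{os.getenv('REDSHIFT_PASSWORD')}"
    f"@{os.getenv('REDSHIFT_HOST')}:{os.getenv('REDSHIFT_PORT', '5439')}"
    f"/{os.getenv('REDSHIFT_DATABASE')}"
)
engine = create_engine(url, connect_args={"sslmode": os.getenv("REDSHIFT_SSL_MODE", "require")})

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 on success
```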
### Step 2: Install Dependencies

```bash
# Core backend dependencies
pip install -r apps/backend/requirements.txt

# Optional: Jupyter and visualization tools
pip install jupyter jupyterlab plotly seaborn matplotlib dash
```

### Step 3: Test Your Setup
```bash
# Test the MCP tools
python apps/backend/test_aws_redshift_mcp.py

# Test connection via the CLI
python client.py --connect-redshift
```
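If both checks fail, a bare `psycopg2` connection can isolate whether the problem is network or credentials rather than the tooling. A minimal sketch using the same `.env` variables:

```python
import os

import psycopg2

conn = psycopg2.connect(
    host=os.getenv("REDSHIFT_HOST"),
    port=int(os.getenv("REDSHIFT_PORT", "5439")),
    dbname=os.getenv("REDSHIFT_DATABASE"),
    user=os.getenv("REDSHIFT_USER"),
    password=os.getenv("REDSHIFT_PASSWORD"),
    sslmode=os.getenv("REDSHIFT_SSL_MODE", "require"),
    connect_timeout=10,  # fail fast on network/firewall issues
)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()
```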
## Usage Examples

### Interactive Data Chat

Start the enhanced chat interface:

```bash
python client.py --data-chat
```

Example session:
```text
[public] > connect
✓ Connection successful!

[public] > schemas
Found 3 schemas: public (12 tables), analytics (8 tables), staging (5 tables)

[public] > tables analytics
Tables in 'analytics' schema:
  - user_events (15 columns)
  - revenue_daily (6 columns)

[public] > query SELECT COUNT(*) FROM analytics.user_events WHERE date >= CURRENT_DATE - 7
✓ Query successful - 1 rows returned
Results:
  count
  2450000

[public] > export SELECT user_id, event_type, created_at FROM analytics.user_events WHERE date >= CURRENT_DATE - 1 LIMIT 10000
✓ Exported 10000 rows to export_1640995200.csv
```

### Guided Workflow
Run the guided exploration workflow:

```bash
python client.py --explore-data
```

This will:
- Test your database connection
- Discover available schemas and tables
- Provide next steps for exploration
### Jupyter Notebook Analysis

Open the comprehensive analysis notebook:

```bash
python client.py --notebook
```

The notebook includes:
- Connection testing and setup
- Schema and table discovery
- Data distribution analysis
- Query performance testing
- Pandas integration for advanced analysis
- Visualization examples with plotly and matplotlib
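For orientation, a typical notebook cell looks roughly like this. A sketch assuming `redshift_export_to_pandas` accepts a SQL string and returns a pandas DataFrame (check the tool reference for the exact return shape), and relying on Jupyter's support for top-level `await`:

```python
import plotly.express as px

from apps.backend.tools.aws_redshift_mcp_tool import redshift_export_to_pandas

# Pull a small aggregate into a DataFrame and plot it inline
df = await redshift_export_to_pandas(
    "SELECT DATE_TRUNC('day', created_at) AS day, COUNT(*) AS events "
    "FROM analytics.user_events GROUP BY 1 ORDER BY 1"
)
px.line(df, x="day", y="events", title="Daily event volume").show()
```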
## Use Cases

### 1. Data Quality Assessment

```text
# Start data chat
python client.py --data-chat

# Check data quality
> describe users
> stats users
> analyze users email
> query SELECT COUNT(DISTINCT email), COUNT(*) FROM users
```
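The same duplicate-email check can be finished offline after an `export`. A minimal pandas sketch; the file name is hypothetical (exports are timestamped):

```python
import pandas as pd

df = pd.read_csv("export_1640995200.csv")  # hypothetical export of the users table
total = len(df)
unique = df["email"].nunique(dropna=True)
print(f"{total - unique} duplicate emails out of {total} rows ({unique / total:.1%} unique)")
```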
### 2. Performance Monitoring

```text
# Test query performance
> performance SELECT COUNT(*) FROM large_table
> performance SELECT * FROM large_table WHERE indexed_column = 'value'

# Explain query execution
> query EXPLAIN SELECT * FROM table1 JOIN table2 ON table1.id = table2.user_id
```
### 3. Data Export and Analysis

```text
# Export data for analysis
> export SELECT * FROM users WHERE signup_date >= '2024-01-01' LIMIT 5000

# Export aggregated metrics
> export SELECT DATE_TRUNC('day', created_at) AS date, COUNT(*) AS signups FROM users GROUP BY date ORDER BY date
```
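Exported files drop straight into pandas for follow-up analysis. A sketch for the aggregated-metrics export above, substituting the timestamped file name from your own export output:

```python
import pandas as pd

# Column names (date, signups) come from the aliases in the export query
df = pd.read_csv("export_1640995200.csv", parse_dates=["date"])
df = df.set_index("date").sort_index()
df["signups_7d_avg"] = df["signups"].rolling(7).mean()  # smooth daily noise
print(df.tail())
```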
### 4. Schema Discovery

```text
# Explore database structure
> schemas
> tables public
> tables analytics
> describe important_table
> stats important_table
```

## Architecture
### MCP Tool Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                    Client Applications                      │
├─────────────────────────────────────────────────────────────┤
│  Enhanced Chat REPL │ Jupyter Notebook │ CLI Commands       │
├─────────────────────────────────────────────────────────────┤
│                      MCP Tool Layer                         │
├─────────────────────────────────────────────────────────────┤
│             AWS Redshift MCP Tool Functions                 │
│   • redshift_test_connection                                │
│   • redshift_execute_query                                  │
│   • redshift_get_schema_info                                │
│   • redshift_get_table_stats                                │
│   • redshift_analyze_data_distribution                      │
│   • redshift_get_query_performance                          │
│   • redshift_export_to_pandas                               │
├─────────────────────────────────────────────────────────────┤
│                 Database Connection Layer                   │
├─────────────────────────────────────────────────────────────┤
│         SQLAlchemy + psycopg2 + Connection Pooling          │
├─────────────────────────────────────────────────────────────┤
│                       AWS Redshift                          │
└─────────────────────────────────────────────────────────────┘
```

### Data Flow
1. User Input: Commands via chat, CLI, or notebook
2. MCP Tool Processing: Route to the appropriate Redshift tool
3. SQL Execution: Connect to Redshift and execute queries
4. Result Processing: Format, analyze, and return results
5. User Output: Display results in the chat or notebook, or write export files
## Customization

### Adding Custom Analysis

Extend the MCP tools by adding functions to `aws_redshift_mcp_tool.py`:
```python
from typing import Any, Dict

async def custom_analysis_function(table_name: str, analysis_type: str) -> Dict[str, Any]:
    """Custom analysis function"""
    # Your custom logic here; only interpolate identifiers from trusted input,
    # since f-string SQL is not protected against injection
    query = f"SELECT custom_analysis({analysis_type}) FROM {table_name}"
    return await redshift_execute_query(query)
```

### Custom Chat Commands
Add commands to the enhanced chat REPL in `enhanced_mcp_chat_repl.py`:
```python
# Register the command in the __init__ method's commands dictionary:
self.commands['custom'] = self.custom_command

# Then define the handler on the same class:
async def custom_command(self, args: str = ""):
    """Your custom command"""
    # Implementation here
    pass
```

### Notebook Extensions
Customize the Jupyter notebook by:
- Adding new analysis cells
- Importing additional visualization libraries
- Creating custom data processing functions
## Troubleshooting

### Common Issues
| Issue | Symptoms | Solution |
|---|---|---|
| Connection failed | "Connection failed" error | Check `.env` credentials and security groups |
| Import errors | `ModuleNotFoundError` | Run `pip install -r apps/backend/requirements.txt` |
| SSL issues | SSL/TLS errors | Try `REDSHIFT_SSL_MODE=prefer` or `disable` |
| Timeout errors | Connection timeout | Check network connectivity and firewall rules |
| Permission denied | "Access denied" errors | Verify database user permissions |
### Debug Commands

```bash
# Test environment variables
python -c "import os; print('Host:', os.getenv('REDSHIFT_HOST'))"

# Run comprehensive tests
python apps/backend/test_aws_redshift_mcp.py

# Enable debug mode
python client.py --data-chat --verbose

# Test basic connectivity
python -c "import psycopg2; print('psycopg2 available')"
```

### Log Analysis
Enable debug logging by setting:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

## Performance Optimization
### Query Optimization
- Use LIMIT: Always limit large result sets
- Index Usage: Query on indexed columns when possible
- Column Selection: Use specific columns instead of SELECT *
- WHERE Clauses: Filter data early in queries
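To see these tips pay off, compare both query shapes with the performance tool. A sketch; the single-SQL-string call signature is an assumption, and the table and column names are illustrative:

```python
import asyncio

from apps.backend.tools.aws_redshift_mcp_tool import redshift_get_query_performance

SLOW = "SELECT * FROM analytics.user_events"
FAST = """
    SELECT user_id, event_type, created_at      -- specific columns, not *
    FROM analytics.user_events
    WHERE created_at >= CURRENT_DATE - 7        -- filter early
    LIMIT 10000                                 -- cap the result set
"""

async def compare():
    for label, sql in (("slow", SLOW), ("fast", FAST)):
        print(label, await redshift_get_query_performance(sql))

asyncio.run(compare())
```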
### Connection Optimization
- Connection Pooling: Automatic in MCP tools
- SSL Overhead: Adjust SSL mode based on security needs
- Query Batching: Group related queries together
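For reference, the kind of pooling the MCP tools rely on looks roughly like this in plain SQLAlchemy; the pool sizes here are illustrative starting points, not tuned values:

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@host:5439/dbname",  # placeholder DSN
    pool_size=5,         # persistent connections kept open
    max_overflow=5,      # extra connections allowed under load
    pool_pre_ping=True,  # validate connections before reuse
    pool_recycle=3600,   # recycle connections after an hour
)
```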
### Memory Management
- Result Streaming: Large results are processed efficiently
- Export Limits: Use reasonable limits for data exports
- Cache Usage: Leverage cached results in chat interface
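When a result is too big for one DataFrame, chunked reads keep memory flat. A sketch using pandas' `chunksize` over a SQLAlchemy engine (the DSN is a placeholder):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@host:5439/dbname")

total = 0
for chunk in pd.read_sql("SELECT user_id FROM analytics.user_events", engine, chunksize=50_000):
    total += len(chunk)  # process each chunk, then let it be garbage-collected
print(f"Processed {total} rows without holding them all in memory")
```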
## Advanced Features

### Integration with Other Systems
The MCP tools can be integrated with:
- BI Tools: Export data for Tableau, PowerBI, etc.
- Machine Learning: Export to pandas for ML workflows
- APIs: Build REST endpoints using the MCP tools (see the sketch after this list)
- Automation: Schedule data exports and analysis
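For the REST case, a minimal sketch with FastAPI (the framework choice and route shape are illustrative, not part of this repo); it simply forwards a SQL string to the query tool:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from apps.backend.tools.aws_redshift_mcp_tool import redshift_execute_query

app = FastAPI()

class QueryRequest(BaseModel):
    sql: str

@app.post("/query")
async def run_query(req: QueryRequest):
    # Delegates to the same tool layer used by the chat REPL and notebook;
    # add auth and query validation before exposing this beyond localhost
    return await redshift_execute_query(req.sql)
```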
### Scaling for Teams
- MCP Server Deployment: Deploy for team access
- Shared Notebooks: Version control Jupyter notebooks
- Configuration Management: Centralize environment configs
- Access Control: Implement user-based access controls
## Resources

### Documentation

- Data Discovery Commands: `data_discovery_commands.md`
- MCP Tool Reference: `apps/backend/tools/aws_redshift_mcp_tool.py`
- Environment Config: `.env.sample`

### Examples

- Test Suite: `apps/backend/test_aws_redshift_mcp.py`
- Chat Interface: `enhanced_mcp_chat_repl.py`
- Jupyter Notebook: `data_exploration_redshift.ipynb`
### Support
For additional help:
- Review the troubleshooting section
- Check existing issues in the repository
- Run the test suite to validate your setup
- Enable debug mode for detailed logging
Ready to explore your data? Start with `python client.py --explore-data` and follow the guided workflow!