AWS Data Discovery & PII Detection Agent
A comprehensive MCP server for automated AWS data discovery, PII detection, and data governance using Lake Formation.
Built on the Model Context Protocol (MCP), this server provides automated data discovery, PII classification, and governance workflows. It features 14+ operational tools for discovering AWS data sources, creating and running Glue crawlers, applying Lake Formation tags, cataloging data, detecting sensitive data, launching interactive dashboards, and generating compliance documentation.
Available MCP Tools
Data Discovery & Orchestration
- orchestrate_data_discovery: Complete data discovery workflow with S3, DynamoDB, Glue cataloging, and PII detection
- discover_aws_data_sources: Discover S3 buckets and DynamoDB tables across AWS regions
- get_dashboard_data: Run data discovery workflow and prepare data for dashboard display
- launch_data_discovery_dashboard: Launch interactive Streamlit dashboard at http://localhost:8501
Data Cataloging & Classification
- catalog_with_glue: Create and run Glue crawlers to catalog S3 and DynamoDB data sources
- classify_and_tag_data: Classify data and apply Lake Formation tags for governance
- generate_architecture_diagram: Generate AWS architecture diagrams for discovered infrastructure
AWS Labs MCP Integration
- list_s3_buckets: List S3 buckets using s3-tables-mcp-server
- manage_aws_glue_databases: Create Glue databases using aws-dataprocessing-mcp-server
- list_dynamodb_tables: List DynamoDB tables using dynamodb-mcp-server
Glue Crawler Operations
- create_glue_crawler: Create Glue crawlers for S3 and DynamoDB targets
- start_glue_crawler: Start/run Glue crawlers to catalog data
- get_glue_crawler_status: Monitor crawler execution status (RUNNING, SUCCEEDED, FAILED)
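The status values listed for get_glue_crawler_status can be derived from the Glue GetCrawler response, which reports a crawler state (READY, RUNNING, STOPPING) and, separately, the outcome of the last crawl. The following is a minimal sketch of that mapping, not the server's actual implementation. The function name `crawler_status` is hypothetical, and `glue` is assumed to be a boto3 Glue client (or any object with a compatible `get_crawler` method).

```python
def crawler_status(glue, name):
    """Collapse a Glue get_crawler response into a single status string.

    `glue` is assumed to behave like boto3.client("glue").
    """
    crawler = glue.get_crawler(Name=name)["Crawler"]
    state = crawler["State"]  # READY, RUNNING, or STOPPING
    if state != "READY":
        return state
    # An idle crawler reports the outcome of its most recent run, if any
    # (SUCCEEDED, CANCELLED, or FAILED). A crawler that never ran stays READY.
    return crawler.get("LastCrawl", {}).get("Status", "READY")


# Usage sketch (requires boto3 and AWS credentials; the crawler name is made up):
#   import boto3
#   glue = boto3.client("glue", region_name="us-west-2")
#   glue.start_crawler(Name="my-discovery-crawler")
#   print(crawler_status(glue, "my-discovery-crawler"))
```

Because the function only needs a `get_crawler` method, it can be exercised with a stub client in tests without touching AWS.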
Lake Formation Integration
- create_lf_tags: Create Lake Formation tag definitions for data governance
- register_s3_with_lakeformation: Register S3 locations with Lake Formation
- register_table_with_lakeformation: Register Glue tables with Lake Formation
- apply_lf_tags: Apply Lake Formation tags to resources based on PII detection
Available MCP Resources
- discovery://s3/buckets: List of discovered S3 buckets
- discovery://dynamodb/tables: List of discovered DynamoDB tables
- catalog://glue/databases: Cataloged databases in Glue
- classification://pii/results: Data classification and PII detection results
- lakeformation://tags/definitions: Lake Formation tag definitions for governance
- lakeformation://resources/registered: S3 locations and tables registered with Lake Formation
- lakeformation://tags/applied: Applied Lake Formation tags by resource
Available MCP Prompts
- classify_data_sensitivity: Classify data sensitivity based on content analysis
- generate_compliance_tags: Generate Lake Formation tags for compliance requirements
- create_data_governance_policy: Create data governance policy based on discovered data
- setup_lakeformation_governance: Set up complete Lake Formation governance for discovered resources
Instructions
The MCP Server for AWS data discovery and classification provides a comprehensive set of tools for discovering, cataloging, and classifying sensitive data across AWS environments.
To use these tools, configure AWS credentials with permissions for S3, DynamoDB, Glue, Lake Formation, and Comprehend operations. The server automatically uses credentials from environment variables or other standard AWS credential sources.
All tools accept an optional region parameter that specifies which AWS region to operate in. If it is omitted, the server uses the AWS_REGION environment variable, falling back to 'us-west-2'.
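The region fallback order described above (explicit parameter, then AWS_REGION, then 'us-west-2') can be sketched as follows. This is an illustration of the stated behavior, not the server's code; the function name `resolve_region` and its `env` argument are assumptions introduced here.

```python
import os

DEFAULT_REGION = "us-west-2"  # default stated in this README


def resolve_region(region=None, env=None):
    """Return the explicit region, else AWS_REGION, else the default."""
    env = os.environ if env is None else env
    return region or env.get("AWS_REGION") or DEFAULT_REGION


print(resolve_region("eu-west-1", env={}))               # explicit parameter wins
print(resolve_region(env={"AWS_REGION": "us-east-1"}))   # environment variable
print(resolve_region(env={}))                            # documented default
```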
Features
- Automated Data Discovery: Discover S3 buckets and DynamoDB tables across AWS regions
- PII Detection: Identify sensitive data using AWS Comprehend and pattern matching
- Data Cataloging: Create and manage AWS Glue databases and tables
- Lake Formation Integration: Complete governance with automated tagging and permissions
- Interactive Dashboard: Real-time visualization with Streamlit
- Architecture Diagrams: Auto-generate AWS architecture documentation
- MCP Integration: Leverages official AWS Labs MCP servers
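To illustrate the pattern-matching half of the PII Detection feature, here is a minimal regex-based sketch. The patterns, the `PII_PATTERNS` table, and the `detect_pii_types` function are all assumptions made for illustration; the project's real patterns are not shown in this README, and these simplified expressions only cover common US-style formats.

```python
import re

# Illustrative patterns only; labels reuse the PIIType values from this README.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii_types(values):
    """Return the sorted PII types whose pattern matches any sampled value."""
    found = set()
    for value in values:
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                found.add(pii_type)
    return sorted(found)


sample = ["alice@example.com", "123-45-6789", "555-867-5309", "hello"]
print(detect_pii_types(sample))
```

Regexes like these are cheap to run over sampled column values, which is presumably why they are paired with Comprehend: the managed service can catch free-text entities (names, addresses) that fixed patterns miss.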
Architecture
┌─────────────────────────────────────┐
│     FastMCP Orchestrator Server     │
│    (data-discovery-orchestrator)    │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│        AWS Labs MCP Servers         │
│  • aws-dataprocessing-mcp-server    │
│  • dynamodb-mcp-server              │
│  • s3-tables-mcp-server             │
│  • aws-diagram-mcp-server           │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│            AWS Services             │
│  • Amazon S3                        │
│  • Amazon DynamoDB                  │
│  • AWS Glue                         │
│  • AWS Lake Formation               │
│  • Amazon Comprehend                │
└─────────────────────────────────────┘
Prerequisites
- Python 3.8+
- Node.js 18+ (required for AWS Labs MCP servers)
- AWS CLI configured with appropriate permissions:
  - S3: ListBucket, GetObject
  - DynamoDB: ListTables, DescribeTable, Scan
  - Glue: CreateDatabase, CreateTable, GetDatabase, GetTable
  - Lake Formation: RegisterResource, AddLFTagsToResource
  - Comprehend: DetectPiiEntities
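The permissions listed above can be expressed as an IAM policy document. The sketch below builds one from exactly those actions. The `build_policy` helper is hypothetical, and `"Resource": "*"` is a placeholder assumption that should be narrowed to specific buckets, tables, and databases in practice. The crawler tools may need further Glue actions that this README does not list.

```python
import json

# Actions taken verbatim from the prerequisites list, keyed by IAM service prefix.
REQUIRED_ACTIONS = {
    "s3": ["ListBucket", "GetObject"],
    "dynamodb": ["ListTables", "DescribeTable", "Scan"],
    "glue": ["CreateDatabase", "CreateTable", "GetDatabase", "GetTable"],
    "lakeformation": ["RegisterResource", "AddLFTagsToResource"],
    "comprehend": ["DetectPiiEntities"],
}


def build_policy(actions_by_service):
    """Build a single-statement IAM policy allowing the given actions."""
    actions = [
        f"{service}:{name}"
        for service, names in actions_by_service.items()
        for name in names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # "*" is a placeholder; scope resources down before real use.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```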
Installation
1. Install AWS Labs MCP Servers
npm install -g @awslabs/aws-dataprocessing-mcp-server
npm install -g @awslabs/dynamodb-mcp-server
npm install -g @awslabs/s3-tables-mcp-server
npm install -g @awslabs/aws-diagram-mcp-server
2. Install Python Dependencies
git clone <repository-url>
cd aws-data-discovery-agent
pip install -r requirements.txt
3. Configure AWS Credentials
aws configure
export AWS_REGION=us-west-2
Quick Start
Run Complete Data Discovery Workflow
python servers/run_data_discovery_agent.py
This will:
- Discover S3 buckets and DynamoDB tables
- Create Glue databases for cataloging
- Run Glue crawlers to catalog data
- Detect PII in cataloged data
- Register resources with Lake Formation
- Apply governance tags based on PII detection
- Generate architecture diagrams
Launch Interactive Dashboard
streamlit run servers/pii_dashboard.py
Access at http://localhost:8501 to view:
- Real-time data discovery metrics
- PII classification results
- Lake Formation governance status
- Risk assessments and compliance tracking
MCP Server Configuration
Add to your MCP client configuration (e.g., ~/.aws/amazonq/mcp.json):
{
  "mcpServers": {
    "aws-data-discovery-agent": {
      "command": "python",
      "args": ["~/aws-data-discovery-agent/servers/mcp_server_orchestrator.py", "--allow-write"],
      "env": {
        "AWS_REGION": "us-west-2",
        "AWS_PROFILE": "default"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
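If your MCP client launches the command without a shell, the `~` in the script path may not be expanded. Whether that applies to a given client is an assumption to verify. The sketch below generates the same configuration with an absolute path; the `build_mcp_config` helper is hypothetical and not part of this project.

```python
import json
import os


def build_mcp_config(script_path, region="us-west-2", profile="default"):
    """Return the MCP client config shown above, with '~' expanded."""
    return {
        "mcpServers": {
            "aws-data-discovery-agent": {
                "command": "python",
                # expanduser turns a leading "~" into the home directory path
                "args": [os.path.expanduser(script_path), "--allow-write"],
                "env": {"AWS_REGION": region, "AWS_PROFILE": profile},
                "disabled": False,
                "autoApprove": [],
            }
        }
    }


config = build_mcp_config(
    "~/aws-data-discovery-agent/servers/mcp_server_orchestrator.py"
)
print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the client configuration file in place of the hand-written entry.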
Available MCP Tools
Data Discovery & Orchestration
- orchestrate_data_discovery: Complete workflow with S3, DynamoDB, Glue, and PII detection
- discover_aws_data_sources: Discover data sources across AWS regions
- get_dashboard_data: Prepare data for dashboard display
- launch_data_discovery_dashboard: Launch Streamlit dashboard
Data Cataloging & Classification
- catalog_with_glue: Create and run Glue crawlers
- classify_and_tag_data: Classify data and apply Lake Formation tags
- generate_architecture_diagram: Generate AWS architecture diagrams
Lake Formation Integration
- create_lf_tags: Create Lake Formation tag definitions
- register_s3_with_lakeformation: Register S3 locations
- register_table_with_lakeformation: Register Glue tables
- apply_lf_tags: Apply tags based on PII detection
AWS Labs MCP Integration
- list_s3_buckets: List S3 buckets via s3-tables-mcp-server
- manage_aws_glue_databases: Manage Glue databases via aws-dataprocessing-mcp-server
- list_dynamodb_tables: List DynamoDB tables via dynamodb-mcp-server
Lake Formation Governance
Automated Tag Definitions
- PIIType: EMAIL, SSN, PHONE, NAME, ADDRESS, CREDIT_CARD, DATE_OF_BIRTH, SALARY, AGE, NONE
- DataClassification: NO_RISK, LOW_RISK, MEDIUM_RISK, HIGH_RISK, CRITICAL_RISK
- AccessLevel: PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED, TOP_SECRET
- DataGovernance: PII_DETECTED, REQUIRES_MASKING, ACCESS_RESTRICTED, PUBLIC
- PIIClassification: SENSITIVE, HIGHLY_SENSITIVE, CONFIDENTIAL
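The tag definitions above map naturally onto Lake Formation's `create_lf_tag` call. The following sketch creates any definitions that do not exist yet. It is an illustration under assumptions: `create_missing_lf_tags` and `existing_tag_keys` are hypothetical names, `lakeformation` is assumed to be a boto3 Lake Formation client, and keys are compared case-insensitively in case the service normalizes case.

```python
LF_TAG_DEFINITIONS = {
    "PIIType": ["EMAIL", "SSN", "PHONE", "NAME", "ADDRESS", "CREDIT_CARD",
                "DATE_OF_BIRTH", "SALARY", "AGE", "NONE"],
    "DataClassification": ["NO_RISK", "LOW_RISK", "MEDIUM_RISK", "HIGH_RISK",
                           "CRITICAL_RISK"],
    "AccessLevel": ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED",
                    "TOP_SECRET"],
    "DataGovernance": ["PII_DETECTED", "REQUIRES_MASKING", "ACCESS_RESTRICTED",
                       "PUBLIC"],
    "PIIClassification": ["SENSITIVE", "HIGHLY_SENSITIVE", "CONFIDENTIAL"],
}


def existing_tag_keys(lakeformation):
    """Collect lowercased LF-tag keys, following NextToken pagination."""
    keys, token = set(), None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = lakeformation.list_lf_tags(**kwargs)
        keys.update(tag["TagKey"].lower() for tag in page.get("LFTags", []))
        token = page.get("NextToken")
        if not token:
            return keys


def create_missing_lf_tags(lakeformation, definitions=LF_TAG_DEFINITIONS):
    """Create each tag definition that is not already present; return new keys."""
    present = existing_tag_keys(lakeformation)
    created = []
    for key, values in definitions.items():
        if key.lower() in present:
            continue  # leave existing definitions untouched
        lakeformation.create_lf_tag(TagKey=key, TagValues=values)
        created.append(key)
    return created


# Usage sketch (requires boto3 and Lake Formation admin permissions):
#   import boto3
#   print(create_missing_lf_tags(boto3.client("lakeformation")))
```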
Resource Registration
- S3 locations automatically registered with Lake Formation
- Glue tables registered with Lake Formation
- Handles existing registrations gracefully
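One way to handle existing registrations gracefully is to treat Lake Formation's `AlreadyExistsException` from `register_resource` as a non-error. The sketch below shows that approach; it is not necessarily how this server does it. The function name `register_s3_location` and its return strings are assumptions, `lakeformation` is assumed to be a boto3 Lake Formation client, and using the service-linked role is an illustrative choice.

```python
def register_s3_location(lakeformation, bucket, prefix=""):
    """Register an S3 location, treating an existing registration as success."""
    arn = f"arn:aws:s3:::{bucket}/{prefix}".rstrip("/")
    try:
        lakeformation.register_resource(
            ResourceArn=arn,
            UseServiceLinkedRole=True,  # assumption: no custom role is supplied
        )
        return "registered"
    except lakeformation.exceptions.AlreadyExistsException:
        # The location was registered earlier; nothing more to do.
        return "already_registered"


# Usage sketch (requires boto3; the bucket name is made up):
#   import boto3
#   lf = boto3.client("lakeformation")
#   print(register_s3_location(lf, "my-data-bucket", "raw/"))
```

Returning a status string instead of raising keeps repeated discovery runs idempotent: a second run over the same buckets reports `already_registered` rather than failing.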
Risk-Based Tagging
- Tags applied based on actual PII detection results
- Column-level and table-level tagging
- Automated access control classification
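A rough sketch of risk-based tagging at column and table level follows. The mapping from PIIType to DataClassification in `RISK_BY_PII_TYPE` is entirely an assumption made for illustration, since this README does not state the project's actual rules. `classify_table` is a hypothetical helper that tags each column and rolls the table up to the highest column risk.

```python
# Ordered from least to most severe; values come from the DataClassification tag.
RISK_ORDER = ["NO_RISK", "LOW_RISK", "MEDIUM_RISK", "HIGH_RISK", "CRITICAL_RISK"]

# Assumed mapping, for illustration only.
RISK_BY_PII_TYPE = {
    "NONE": "NO_RISK",
    "AGE": "LOW_RISK",
    "NAME": "LOW_RISK",
    "EMAIL": "MEDIUM_RISK",
    "PHONE": "MEDIUM_RISK",
    "ADDRESS": "MEDIUM_RISK",
    "DATE_OF_BIRTH": "HIGH_RISK",
    "SALARY": "HIGH_RISK",
    "CREDIT_CARD": "CRITICAL_RISK",
    "SSN": "CRITICAL_RISK",
}


def classify_table(column_pii):
    """Map {column: PIIType} to per-column tags plus a table-level risk."""
    column_tags = {
        column: {"PIIType": pii, "DataClassification": RISK_BY_PII_TYPE[pii]}
        for column, pii in column_pii.items()
    }
    # The table inherits the most severe classification among its columns.
    table_risk = max(
        (tags["DataClassification"] for tags in column_tags.values()),
        key=RISK_ORDER.index,
        default="NO_RISK",
    )
    return column_tags, table_risk


columns, risk = classify_table({"id": "NONE", "email": "EMAIL", "ssn": "SSN"})
print(risk)
print(columns["email"])
```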
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: Report bugs and feature requests via GitHub Issues
- Documentation: See the docs/ directory for detailed documentation
Roadmap
- [ ] Support for additional AWS data sources (RDS, Redshift)
- [ ] Enhanced PII detection with custom models
- [ ] Integration with AWS Config for compliance monitoring
- [ ] Multi-account support
- [ ] Advanced data lineage tracking
- [ ] Custom governance policies