DEA-C01
AWS Certified Data Engineer - Associate
The AWS Certified Data Engineer - Associate (DEA-C01) validates the ability to design, build, secure, and maintain data pipelines and analytics solutions on the AWS platform. Launched in March 2024, this is one of AWS's newest associate-level certifications, created in response to the growing demand for data engineering skills.
The exam covers four domains: Data Ingestion and Transformation (34%), Data Store Management (26%), Data Operations and Support (22%), and Data Security and Governance (18%). Candidates should have experience with AWS data services including Amazon S3, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Kinesis, Amazon EMR, AWS Lake Formation, Amazon DynamoDB, Amazon RDS, and AWS Step Functions.
Key skills tested include implementing data ingestion solutions with batch and streaming patterns, transforming and modeling data to meet requirements, configuring and managing data stores for performance and cost optimization, orchestrating data pipelines, monitoring data quality, and applying data governance and security best practices. This certification is recommended for professionals with 2-3 years of experience working with AWS data services in a data engineering role.
DEA-C01 Practice Exam 1
Comprehensive 65-question practice exam covering all four DEA-C01 domains: data ingestion and transformation, data store management, data operations and support, and data security and governance.
DEA-C01 Practice Exam 2
Comprehensive 65-question practice exam covering all four DEA-C01 domains with focus on troubleshooting data pipeline issues.
DEA-C01 Practice Exam 3
This practice exam challenges your understanding of multi-service data architecture decisions across the AWS analytics ecosystem. Covering 65 questions on advanced ingestion patterns, transformation pipelines, storage optimization, and governance strategies, it tests your ability to design cohesive data engineering solutions that span multiple AWS services working in concert.
DEA-C01 Practice Exam 4
This advanced practice exam covers complex configurations, edge cases, and nuanced scenarios across all domains of the AWS Data Engineer Associate certification. It tests your ability to handle sophisticated real-world data engineering challenges including advanced streaming patterns, complex ETL transformations, and intricate security and optimization configurations.
DEA-C01 Practice Exam 5
Comprehensive 65-question practice exam covering all four DEA-C01 domains with focus on cost optimization and performance tuning.
DEA-C01 Practice Exam 6
Comprehensive 65-question practice exam covering all four DEA-C01 domains with focus on security-focused data governance and compliance.
DEA-C01 Cheat Sheet
Quick reference guide - 6 sections
AWS Certified Data Engineer - Associate (DEA-C01)
The DEA-C01 exam validates your ability to implement, automate, and optimize data pipelines, design and maintain data stores, and operationalize the data lifecycle on AWS. This certification is intended for individuals who perform data engineering roles with 2-3 years of hands-on experience.
Exam Details
| Item | Detail |
|---|---|
| Exam Code | DEA-C01 |
| Duration | 130 minutes |
| Number of Questions | 65 questions (50 scored + 15 unscored) |
| Passing Score | 720 / 1000 |
| Cost | $150 USD |
| Validity | 3 years |
| Question Types | Multiple choice (single & multiple select) |
| Testing Options | Pearson VUE testing center or online proctored |
Domain Weights
| Domain | Weight |
|---|---|
| Domain 1: Data Ingestion & Transformation | 34% |
| Domain 2: Data Store Management | 26% |
| Domain 3: Data Operations & Support | 22% |
| Domain 4: Data Security & Governance | 18% |
Study Tips
- Focus heavily on AWS Glue and Kinesis as they dominate Domain 1 (34% of the exam)
- Understand S3 storage classes, lifecycle policies, and data lake architecture thoroughly
- Practice choosing between Redshift, DynamoDB, and Aurora for different use cases
- Know Lake Formation permissions, LF-tags, and cross-account data sharing
- Understand orchestration tools: Step Functions vs MWAA vs Glue Workflows
- Study data formats (Parquet, ORC, Avro) and when to use each
- Review encryption options: SSE-S3, SSE-KMS, CSE, and in-transit encryption
- Practice with hands-on labs to reinforce theoretical knowledge
Question Strategy Tips
- Read the question stem carefully and identify what is being asked before looking at answers
- Look for keywords like "most cost-effective", "least operational overhead", or "near real-time"
- Eliminate obviously wrong answers first to narrow down your choices
- For multi-select questions, identify how many answers are required before selecting
- Flag difficult questions and return to them later rather than spending too much time
- Managed services are generally preferred over self-managed options in AWS exams
- When two options seem correct, choose the one that is more specific to the use case
- Pace yourself: 130 minutes for 65 questions works out to exactly 2 minutes per question, so aim for slightly less than that to leave review time
Domain 1: Data Ingestion & Transformation (34%)
This is the largest domain on the exam. It covers how to ingest data from various sources, transform it, and prepare it for downstream analytics. AWS Glue and Kinesis are the most heavily tested services in this domain.
AWS Glue
AWS Glue is a fully managed serverless ETL (Extract, Transform, Load) service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- Glue Jobs: Support Spark (Python/Scala) and Python Shell job types. Spark jobs are for large-scale data processing; Python Shell jobs are for lightweight tasks like API calls or small transformations
- DPUs (Data Processing Units): Each DPU provides 4 vCPUs and 16 GB of memory. Standard (legacy) and G.1X workers use 1 DPU each (4 vCPUs, 16 GB); G.2X workers use 2 DPUs each (8 vCPUs, 32 GB) for memory-intensive workloads. Flex execution is cheaper for non-urgent jobs that can tolerate running on spare capacity
- Glue Crawlers: Automatically discover data schema and create/update table definitions in the Glue Data Catalog. Crawlers can read from S3, JDBC, DynamoDB, and other sources. Schedule crawlers to run periodically to detect schema changes
- Glue Data Catalog: Central metadata repository compatible with Apache Hive Metastore. Stores table definitions, partition information, and connection details. Used by Athena, Redshift Spectrum, and EMR for schema discovery
- Job Bookmarks: Track previously processed data to avoid reprocessing. Essential for incremental ETL jobs. Enable bookmarks to process only new data since the last run
- Schema Registry: Manage and enforce schemas for streaming data. Supports Avro, JSON Schema, and Protobuf formats. Integrates with Kinesis Data Streams, MSK, and Apache Kafka
- Glue DataBrew: Visual data preparation tool for cleaning and normalizing data without writing code. Provides 250+ built-in transformations. Good for data analysts who prefer a GUI
- Glue Studio: Visual ETL authoring interface for creating Glue jobs using a drag-and-drop canvas. Generates Apache Spark code automatically
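Since Glue Spark jobs are billed by DPU-hour, a rough cost estimate follows from the worker type, worker count, and run time. The sketch below is illustrative only: the function name is hypothetical and the default rate of 0.44 USD per DPU-hour is an assumed example value, so check current regional pricing before relying on it.

```python
# DPUs per worker for the worker types named above (G.1X = 1 DPU, G.2X = 2 DPUs).
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def glue_job_cost(worker_type: str, workers: int, minutes: float,
                  price_per_dpu_hour: float = 0.44) -> float:
    """Estimate the cost of one Glue Spark job run.

    price_per_dpu_hour is an assumed example rate, not a quoted price.
    """
    dpu_hours = DPU_PER_WORKER[worker_type] * workers * (minutes / 60)
    return round(dpu_hours * price_per_dpu_hour, 2)


# 10 G.1X workers for 30 minutes consume 5 DPU-hours.
print(glue_job_cost("G.1X", 10, 30))
# The same job on G.2X workers consumes twice the DPU-hours.
print(glue_job_cost("G.2X", 10, 30))
```

Moving from G.1X to G.2X doubles the DPU-hours for the same worker count and duration, so it usually pays off only when the larger workers fix a memory problem or shorten the run.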
Amazon Kinesis Family
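Kinesis is the primary AWS service for real-time streaming data processing. Understanding the differences between its components is critical for the exam. AWS has since renamed Kinesis Data Firehose to Amazon Data Firehose and Kinesis Data Analytics to Amazon Managed Service for Apache Flink; the older names are used here because they still appear widely in study material.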
| Feature | Kinesis Data Streams | Kinesis Data Firehose | Kinesis Data Analytics |
|---|---|---|---|
| Purpose | Custom real-time processing | Deliver to destinations | SQL/Flink on streams |
| Latency | ~200ms (real-time) | 60+ seconds (near real-time) | ~1 second |
| Scaling | Provisioned (manual shard management) or on-demand | Automatic | Automatic (KPU-based) |
| Data Retention | 1-365 days | No retention (delivery only) | No retention |
| Consumers | Custom apps, Lambda, KCL | S3, Redshift, OpenSearch, Splunk, HTTP | Output to KDS or Firehose |
| Transforms | Consumer-side processing | Lambda transform, format conversion | SQL queries or Apache Flink |
| Pricing | Per shard-hour + PUT payload | Per GB ingested | Per KPU-hour |
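For provisioned Kinesis Data Streams, shard count is driven by whichever per-shard limit is hit first. The sketch below assumes the standard per-shard limits (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads shared across consumers without enhanced fan-out); the function name and parameters are hypothetical.

```python
import math

# Per-shard limits for provisioned-mode Kinesis Data Streams.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0  # shared by all consumers without enhanced fan-out


def required_shards(ingest_mb_s: float, records_s: float, consumers: int = 1) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    by_bytes = ingest_mb_s / WRITE_MB_PER_SEC
    by_records = records_s / WRITE_RECORDS_PER_SEC
    # Every shared-throughput consumer reads the full stream.
    by_reads = ingest_mb_s * consumers / READ_MB_PER_SEC
    return max(1, math.ceil(max(by_bytes, by_records, by_reads)))


print(required_shards(5, 3000))               # write bytes dominate
print(required_shards(0.5, 8000))             # record count dominates
print(required_shards(5, 3000, consumers=4))  # shared reads dominate
```

When the read limit dominates, enhanced fan-out gives each consumer its own read throughput per shard, which is often a better fix than adding shards.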
Amazon MSK (Managed Streaming for Apache Kafka)
- MSK Provisioned: You manage broker count and instance types. Best for predictable workloads where you need full control over cluster sizing
- MSK Serverless: Fully managed, auto-scaling Kafka. No broker management needed. Best for variable or unpredictable workloads
- MSK Connect: Deploy Kafka Connect connectors as managed services to stream data in/out of Kafka topics
- When to choose MSK over Kinesis: When migrating existing Kafka workloads, when you need Kafka-specific features (topic compaction, consumer groups), or when you need higher throughput per partition
AWS DMS (Database Migration Service)
- Full Load: Migrates all existing data from source to target. Use for initial one-time migration of entire database contents
- CDC (Change Data Capture): Captures ongoing changes after full load. Enables continuous replication from source to target with minimal downtime
- Full Load + CDC: Combines both modes. Performs full load first, then continues with CDC for ongoing replication
- SCT (Schema Conversion Tool): Converts database schema from one engine to another (e.g., Oracle to Aurora PostgreSQL). Required for heterogeneous migrations
- DMS can target: RDS, Aurora, Redshift, S3, DynamoDB, OpenSearch, Kinesis Data Streams, and Kafka
Data Formats Comparison
| Format | Type | Columnar | Splittable | Best For |
|---|---|---|---|---|
| Parquet | Binary | Yes | Yes | Analytics, Athena, Redshift Spectrum |
| ORC | Binary | Yes | Yes | Hive workloads, EMR, heavy writes |
| Avro | Binary | No (row) | Yes | Schema evolution, streaming, Kafka |
| JSON | Text | No (row) | Yes (JSONL) | APIs, semi-structured, flexibility |
| CSV | Text | No (row) | Yes | Simple data exchange, legacy systems |
Compression Comparison
| Algorithm | Speed | Ratio | Splittable | Best For |
|---|---|---|---|---|
| Snappy | Very Fast | Moderate | Yes (with container) | Real-time, Spark default, speed priority |
| GZIP | Slow | High | No | Storage savings, archival, cold data |
| ZSTD | Fast | High | No | Best balance of speed & ratio, Redshift |
| LZO | Fast | Moderate | Yes (when indexed) | MapReduce workloads, splittable needs |
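The speed-versus-ratio trade-off in the table can be observed with Python's standard `gzip` module. This is a minimal sketch on synthetic data, so the actual ratios will differ for real datasets; the sample payload and function name are made up for illustration.

```python
import gzip

# Repetitive CSV-like sample; real data will compress differently.
sample = "\n".join(
    f"{i},sensor-{i % 5},{20 + i % 7}" for i in range(5000)
).encode()


def gzip_ratio(data: bytes, level: int) -> float:
    """Return original size divided by compressed size at the given level."""
    return len(data) / len(gzip.compress(data, compresslevel=level))


fast = gzip_ratio(sample, 1)   # favors speed
small = gzip_ratio(sample, 9)  # favors ratio
print(f"level 1 ratio: {fast:.1f}x, level 9 ratio: {small:.1f}x")

# Round trip: decompression restores the exact bytes.
restored = gzip.decompress(gzip.compress(sample))
print(restored == sample)
```

A plain GZIP file is a single compressed stream, which is why a reader cannot start in the middle of it. Container formats such as Parquet avoid this by compressing each internal block independently.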
Batch vs Streaming Decision Guide
| Criteria | Batch Processing | Stream Processing |
|---|---|---|
| Latency Requirement | Minutes to hours acceptable | Seconds to milliseconds needed |
| Data Arrival | Periodic, scheduled loads | Continuous, event-driven |
| AWS Services | Glue ETL, EMR, Athena | Kinesis, MSK, Flink |
| Use Cases | Daily reports, data warehouse loads | Fraud detection, live dashboards |
| Cost Model | Pay per job run | Pay for continuous capacity |
Domain 2: Data Store Management (26%)
This domain covers choosing and configuring appropriate data stores for analytics workloads. You must understand the strengths, limitations, and ideal use cases for each AWS data storage service, with a strong focus on S3, Redshift, and DynamoDB.
Amazon S3 Storage Classes
| Storage Class | Availability | Min Duration | Retrieval | Use Case |
|---|---|---|---|---|
| S3 Standard | 99.99% | None | Instant | Frequently accessed data |
| S3 Intelligent-Tiering | 99.9% | None | Instant | Unknown or changing access patterns |
| S3 Standard-IA | 99.9% | 30 days | Instant | Infrequent access, rapid retrieval |
| S3 One Zone-IA | 99.5% | 30 days | Instant | Reproducible infrequent data |
| S3 Glacier Instant | 99.9% | 90 days | Milliseconds | Archive with instant access |
| S3 Glacier Flexible | 99.99% | 90 days | Minutes to 12 hours | Long-term archive, flexible retrieval |
| S3 Glacier Deep Archive | 99.99% | 180 days | 12-48 hours | Lowest cost, compliance archives |
- Lifecycle Policies: Automate transitions between storage classes based on age. Use expiration rules to delete objects after retention period
- S3 Select / Glacier Select: Query data in-place using SQL expressions to retrieve only needed data, reducing data transfer costs
- S3 Versioning: Protect against accidental deletes. Enable MFA Delete for additional security on version deletion
- S3 Object Lock: WORM (Write Once Read Many) model for compliance. Governance mode allows special permissions; Compliance mode cannot be overridden
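A lifecycle policy is a set of rules with transitions and an optional expiration. The sketch below builds one rule as a plain dictionary in the shape accepted by the S3 lifecycle configuration API; the prefix, day counts, and helper name are hypothetical, and nothing is sent to AWS.

```python
import json


def lifecycle_rule(prefix: str, ia_days: int, glacier_days: int,
                   expire_days: int) -> dict:
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> delete."""
    # Objects must be at least 30 days old before moving to Standard-IA.
    if not (30 <= ia_days < glacier_days < expire_days):
        raise ValueError("expected 30 <= ia_days < glacier_days < expire_days")
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_days},
    }


# Hypothetical prefix and retention schedule.
config = {"Rules": [lifecycle_rule("raw/events/", 30, 90, 365)]}
print(json.dumps(config, indent=2))
```

With boto3, a dictionary like `config` would typically be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.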
Amazon Redshift
Fully managed petabyte-scale cloud data warehouse optimized for OLAP (Online Analytical Processing) workloads using columnar storage and massively parallel processing (MPP).
- Distribution Styles:
  - KEY: Rows with the same key value go to the same node. Best for large fact tables joined on a specific column
  - ALL: Full copy on every node. Best for small dimension tables (under 3M rows) that are frequently joined
  - EVEN: Round-robin distribution. Best when no clear join key exists
  - AUTO: The default distribution style. Redshift chooses the distribution based on table size, starting as ALL and switching to EVEN as the table grows
- Sort Keys:
  - Compound: Ordered prefix-based sorting. Best when queries filter on leading columns of the sort key
  - Interleaved: Equal weight to each column. Best when queries filter on any column in the sort key (more maintenance overhead)
- Redshift Spectrum: Query data directly in S3 without loading into Redshift. Uses external tables defined in Glue Data Catalog. Scales independently with dedicated Spectrum nodes
- COPY Command: Most efficient way to load data into Redshift. Supports S3, DynamoDB, EMR, and remote hosts. Use MANIFEST files for reliable S3 loads. Automatically compresses and distributes data
- UNLOAD Command: Export query results from Redshift to S3 in parallel. Supports Parquet format for analytics. Use with partitioning for organized output
- Redshift Serverless: Auto-scaling compute without managing clusters. Pay for RPU (Redshift Processing Units) consumed. Best for intermittent or unpredictable workloads
- Concurrency Scaling: Automatically adds cluster capacity when queries queue up. Pay per-second for additional clusters. Configured per WLM queue
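The difference between KEY and EVEN distribution is easiest to see with a small simulation. The sketch below is a toy model only: it uses `zlib.crc32` as a stand-in hash (Redshift's real hash function is not shown here) and made-up rows, to illustrate how a low-cardinality distribution key concentrates rows on a few slices.

```python
import zlib
from collections import Counter


def distribute_key(rows, key, slices):
    """Toy KEY distribution: hash the key column to pick a slice."""
    counts = Counter()
    for row in rows:
        counts[zlib.crc32(str(row[key]).encode()) % slices] += 1
    return [counts[i] for i in range(slices)]


def distribute_even(rows, slices):
    """Toy EVEN distribution: assign rows round-robin."""
    counts = Counter(i % slices for i in range(len(rows)))
    return [counts[i] for i in range(slices)]


# 1,000 made-up orders: 90% from one country, 10% from another.
rows = [{"order_id": i, "country": "US" if i % 10 else "CA"} for i in range(1000)]

print("EVEN:          ", distribute_even(rows, 4))
print("KEY(order_id): ", distribute_key(rows, "order_id", 4))
print("KEY(country):  ", distribute_key(rows, "country", 4))
```

Distributing on `country` leaves at least two of the four slices empty and puts 900 or more rows on one slice, which is the skew that a high-cardinality join column such as `order_id` avoids.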
Amazon DynamoDB
Fully managed serverless NoSQL database for key-value and document workloads. Single-digit millisecond performance at any scale.
- Primary Keys:
  - Partition Key (PK): Simple primary key. Must be unique per item. Determines data distribution across partitions
  - Partition Key + Sort Key (PK+SK): Composite primary key. PK determines partition; SK enables range queries within a partition
- Capacity Modes:
  - Provisioned: You specify RCU (Read Capacity Units) and WCU (Write Capacity Units). Use Auto Scaling to adjust. More cost-effective for predictable workloads
  - On-Demand: Automatic scaling with pay-per-request pricing. Best for unpredictable traffic patterns. Typically costs more per request than well-utilized provisioned capacity
- Secondary Indexes:
  - GSI (Global Secondary Index): Different partition key and optional sort key. Has its own provisioned throughput. Eventually consistent reads only. Can be added anytime
  - LSI (Local Secondary Index): Same partition key, different sort key. Shares base table throughput. Supports strongly consistent reads. Must be created at table creation
- DAX (DynamoDB Accelerator): In-memory caching layer for DynamoDB. Microsecond read latency. Fully managed, compatible with existing DynamoDB API calls
- DynamoDB Streams: Captures item-level changes in a time-ordered sequence. Integrates with Lambda for event-driven architectures. Enables cross-region replication via Global Tables
- Global Tables: Multi-region, multi-active replication. Requires DynamoDB Streams enabled. Provides low-latency reads and writes in all regions
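Provisioned capacity questions usually come down to arithmetic on item size and request rate. The sketch below assumes the standard unit definitions (one RCU covers a strongly consistent read of up to 4 KB per second, eventually consistent reads cost half, transactional reads cost double; one WCU covers a write of up to 1 KB per second). The function names are hypothetical.

```python
import math


def read_capacity_units(item_kb: float, reads_per_sec: int,
                        consistency: str = "strong") -> int:
    """RCUs needed: item size rounds up to the next 4 KB block."""
    units_per_read = math.ceil(item_kb / 4)
    factor = {"eventual": 0.5, "strong": 1, "transactional": 2}[consistency]
    return math.ceil(units_per_read * factor * reads_per_sec)


def write_capacity_units(item_kb: float, writes_per_sec: int,
                         transactional: bool = False) -> int:
    """WCUs needed: item size rounds up to the next 1 KB block."""
    units_per_write = math.ceil(item_kb)
    return units_per_write * writes_per_sec * (2 if transactional else 1)


# 6 KB items read 100 times per second.
print(read_capacity_units(6, 100, "strong"))
print(read_capacity_units(6, 100, "eventual"))
# 2.5 KB items written 50 times per second.
print(write_capacity_units(2.5, 50))
```

The rounding step is the usual trap: a 6 KB item costs two read units, not 1.5, because each read is billed in whole 4 KB blocks.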
Other Data Stores
| Service | Type | Best For |
|---|---|---|
| Amazon OpenSearch | Search & analytics engine | Full-text search, log analytics, SIEM, dashboards |
| Amazon ElastiCache | In-memory cache (Redis/Memcached) | Session store, caching, leaderboards, real-time analytics |
| Amazon Neptune | Graph database | Social networks, recommendation engines, fraud detection, knowledge graphs |
| Amazon Keyspaces | Managed Cassandra | Migrating Cassandra workloads to AWS, wide-column store needs |
| Amazon Timestream | Time-series database | IoT sensor data, application metrics, DevOps monitoring |
| Amazon DocumentDB | MongoDB-compatible document DB | MongoDB migrations, content management, JSON document storage |
Data Modeling Patterns
- Star Schema: Central fact table surrounded by dimension tables. Best for Redshift data warehousing. Simple, fast queries for BI reporting
- Snowflake Schema: Normalized dimension tables with sub-dimensions. Reduces redundancy but requires more joins. Better for complex hierarchical data
- Data Lake Pattern: Raw zone (landing), cleaned zone (processed), curated zone (analytics-ready). Use S3 with partitioned Parquet files. Catalog with Glue Data Catalog
- Lake House: Combines data lake flexibility with data warehouse performance. S3 data lake + Redshift warehouse, unified via Redshift Spectrum and Glue
- Single Table Design (DynamoDB): Store multiple entity types in one table. Use composite sort keys for flexible querying. Reduces cost and latency by eliminating joins
- Partitioning Strategy: Partition by date (year/month/day) for time-series data. Use Hive-style partitioning (key=value/) in S3 for Athena compatibility. Avoid too many small partitions (partition overhead)
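Hive-style partitioning is just a naming convention for S3 prefixes. As a minimal sketch, the helpers below build a `key=value/` prefix for a date and the matching `ALTER TABLE ... ADD PARTITION` statement; the bucket and table names are placeholders and the helper names are hypothetical.

```python
from datetime import date


def partition_prefix(bucket: str, table: str, day: date) -> str:
    """Return the Hive-style S3 prefix for one day's partition."""
    return f"s3://{bucket}/{table}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def add_partition_sql(table: str, bucket: str, day: date) -> str:
    """Return a statement that registers the partition in the catalog."""
    location = partition_prefix(bucket, table, day)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{location}'"
    )


# Placeholder bucket and table names.
print(partition_prefix("example-data-lake", "orders", date(2024, 1, 5)))
print(add_partition_sql("orders", "example-data-lake", date(2024, 1, 5)))
```

Zero-padding the month and day keeps prefixes sorted lexicographically and consistent across writers.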
Domain 3: Data Operations & Support (22%)
This domain focuses on automating data pipelines, monitoring data quality, optimizing query performance, and troubleshooting common data engineering issues. Understanding orchestration services and operational best practices is critical.
Orchestration Services Comparison
| Feature | Step Functions | MWAA (Airflow) | Glue Workflows |
|---|---|---|---|
| Best For | AWS service orchestration | Complex DAGs, multi-system | Glue-centric ETL |
| Definition | JSON/YAML (ASL) | Python DAGs | Console / API |
| Integrations | 220+ AWS services natively | Any system via operators | Glue jobs, crawlers, triggers |
| Error Handling | Built-in retry, catch, fallback | Python-based retry & alerting | Basic trigger conditions |
| Pricing | Per state transition | Per environment-hour | No extra cost (Glue pricing) |
| Complexity | Low to medium | High (Airflow expertise) | Low |
| Visual UI | Yes (Workflow Studio) | Yes (Airflow UI) | Yes (Console) |
- Step Functions Standard vs Express: Standard for long-running (up to 1 year, exactly-once), Express for high-volume short tasks (up to 5 min, at-least-once)
- EventBridge: Use with Step Functions for event-driven pipeline triggers. Schedule rules can replace cron-based scheduling
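The built-in retry and catch behaviour of Step Functions is declared directly in the Amazon States Language definition. The sketch below builds a two-state definition as a Python dictionary and checks that every transition target exists. The Glue job name, SNS topic ARN, and account ID are placeholders, and the retry settings are arbitrary example values.

```python
import json

# Hypothetical names: the Glue job and SNS topic below are placeholders.
definition = {
    "Comment": "Run a Glue job, retry on failure, notify if it still fails",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # The .sync suffix makes the state wait for the job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-etl-job"},
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:example-pipeline-alerts",
                "Message": "Glue job failed after retries",
            },
            "End": True,
        },
    },
}


def transition_targets(states: dict) -> set:
    """Collect every state name referenced by Next or Catch."""
    targets = set()
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets


referenced = {definition["StartAt"]} | transition_targets(definition["States"])
missing = referenced - set(definition["States"])
print("undefined states:", sorted(missing))
print(json.dumps(definition, indent=2)[:60], "...")
```

The retry block runs first; the catch block only routes to `NotifyFailure` once the retries are exhausted or a non-retried error occurs.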
Amazon Athena Optimization
- Use Columnar Formats: Convert data to Parquet or ORC for up to 90% cost reduction and 30x performance improvement over CSV/JSON
- Partition Data: Use Hive-style partitioning (s3://bucket/table/year=2024/month=01/) to limit scanned data. Add partitions via MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION
- Compress Data: Use Snappy for Parquet (default), GZIP for JSON/CSV. Reduces storage and scan costs
- Bucketing: Hash-distribute data into fixed number of files per partition. Improves join performance on bucketed columns
- Optimize File Sizes: Target 128 MB - 512 MB per file. Too many small files cause overhead; too few large files reduce parallelism
- Use CTAS: CREATE TABLE AS SELECT to create optimized tables with partitioning and columnar format in one query
- Workgroups: Separate workloads, set per-query and per-workgroup data scan limits, and track costs independently
- Athena ACID Transactions: Support for Apache Iceberg tables enabling INSERT, UPDATE, DELETE, and time-travel queries
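Because Athena bills by data scanned, most of these optimizations reduce cost by reducing bytes read. The sketch below is a rough estimate under stated assumptions: 5 USD per TB scanned as an example rate, a TB treated as 1024^4 bytes, a 10 MB minimum per query, and an assumed 10x reduction in scanned bytes after converting to a partitioned columnar layout.

```python
PRICE_PER_TB_USD = 5.00             # assumed example rate, varies by region
BYTES_PER_TB = 1024 ** 4            # assumption: binary terabyte
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # queries are billed for at least 10 MB


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD


csv_scan = 1024 ** 4           # full scan of a 1 TB CSV table
columnar_scan = csv_scan // 10  # assumed: only 10% of bytes are read

print(f"CSV scan:      ${athena_query_cost(csv_scan):.2f}")
print(f"Columnar scan: ${athena_query_cost(columnar_scan):.2f}")
```

The 10x figure is only a placeholder; the real saving depends on how many columns and partitions a query touches.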
AWS Glue Data Quality
- DQDL (Data Quality Definition Language): Define rules like Completeness, Uniqueness, RowCount, ColumnValues, and custom SQL expressions
- Evaluation: Run data quality checks as part of Glue ETL jobs or standalone evaluations. Results include pass/fail status and detailed metrics
- Actions on Failure: Stop the job, continue with warnings, route bad records to a quarantine location, or trigger SNS notifications
- Recommendations: Glue can auto-generate quality rules based on data profiling. Review and customize before production use
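To make the rule types concrete, the sketch below computes simplified analogues of Completeness and Uniqueness in plain Python. It is not DQDL and does not call Glue: the function names, sample rows, and the exact definitions (non-null fraction, and fraction of rows whose value occurs exactly once) are assumptions chosen for illustration.

```python
from collections import Counter


def completeness(rows: list, column: str) -> float:
    """Fraction of rows where the column is present and not None."""
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def uniqueness(rows: list, column: str) -> float:
    """Fraction of rows whose column value occurs exactly once."""
    counts = Counter(row.get(column) for row in rows)
    return sum(1 for row in rows if counts[row.get(column)] == 1) / len(rows)


# Made-up sample with one missing email and one duplicated order_id.
rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": "c@example.com"},
    {"order_id": 3, "email": "d@example.com"},
]

results = {
    "email completeness": completeness(rows, "email"),
    "order_id uniqueness": uniqueness(rows, "order_id"),
}
failed = [name for name, score in results.items() if score < 0.9]
print(results)
print("failed checks:", failed)
```

A pipeline would then pick one of the failure actions listed above, for example stopping the job when `failed` is not empty.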
Monitoring & Observability Best Practices
- CloudWatch Metrics: Monitor Glue job duration, DPU utilization, and failure rates. Set alarms for job failures and performance degradation
- CloudWatch Logs: Enable continuous logging for Glue jobs. Use Log Insights to query and analyze job logs for debugging
- CloudTrail: Audit all API calls across data services. Essential for security investigations and compliance auditing
- SNS Notifications: Configure alerts for pipeline failures, data quality issues, and SLA breaches. Integrate with PagerDuty or Slack via Lambda
- Glue Job Metrics: Enable Spark UI for visual debugging of Glue Spark jobs. Monitor memory usage, data skew, and executor performance
- Redshift Query Monitoring: Use STL_QUERY and SVL_QLOG for query history. WLM queues manage concurrency and prioritize workloads
Amazon QuickSight
- SPICE Engine: Super-fast, Parallel, In-memory Calculation Engine. Import data into SPICE for fast dashboard performance. 10 GB per user in Standard, 500 GB in Enterprise
- Data Sources: Connects to Athena, Redshift, RDS, S3, OpenSearch, and third-party sources (Salesforce, Jira). Use Athena as the primary data source for data lake analytics
- Row-Level Security (RLS): Restrict data visibility per user or group. Define rules using dataset columns to filter rows
- Column-Level Security (CLS): Enterprise Edition only. Restrict specific columns from certain users or groups
- QuickSight Q: Natural language querying using ML. Users type questions in plain English to generate visualizations
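- Embedded Dashboards: Embed QuickSight dashboards in web applications. Use the GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser API for secure embedding (GetDashboardEmbedUrl is the older equivalent)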
Troubleshooting Common Issues
| Issue | Cause | Solution |
|---|---|---|
| Glue job OOM | Data skew or insufficient DPUs | Increase worker type (G.2X), repartition data, filter early |
| Athena query slow | Scanning too much data | Partition data, use columnar format, optimize file sizes |
| Redshift query queue | WLM concurrency limit reached | Enable concurrency scaling, adjust WLM queues |
| DynamoDB throttling | Hot partition or insufficient capacity | Review key design, enable auto scaling, switch to on-demand |
| Kinesis iterator age | Consumer falling behind producer | Add shards, use enhanced fan-out, optimize consumer |
| S3 403 Access Denied | IAM or bucket policy misconfiguration | Check IAM role, bucket policy, S3 Block Public Access, KMS key policy |
| Glue crawler schema mismatch | Mixed file formats or schema evolution | Configure crawler grouping, use schema change policy, separate prefixes |
Domain 4: Data Security & Governance (18%)
This domain covers securing data at rest and in transit, managing fine-grained access controls, ensuring compliance, and implementing data governance strategies. AWS Lake Formation and KMS are the most critical services in this domain.
AWS Lake Formation
Centralized governance and security for data lakes. Lake Formation simplifies permissions management that would otherwise require complex IAM policies, S3 bucket policies, and Glue Data Catalog policies.
- LF-Tags (Tag-Based Access Control): Assign key-value tags to databases, tables, and columns. Grant permissions based on tag expressions rather than individual resources. Scales much better than named-resource grants for large catalogs
- Named Resource Grants: Grant permissions directly on specific databases, tables, or columns. Fine-grained but harder to manage at scale. Use for specific exceptions or small environments
- Data Filters: Row-level and cell-level security for Data Catalog tables. Define filter expressions that restrict which rows or cells a principal can access. Applied at query time by integrated services (Athena, Redshift Spectrum)
- Cross-Account Sharing: Share databases and tables across AWS accounts without copying data. Grantor account registers data locations; recipient accounts get fine-grained permissions. Supports AWS Organizations for simplified multi-account sharing
- Data Location Registration: Register S3 paths with Lake Formation to manage access. IAM role associated with location handles S3 access. Replaces direct S3 bucket policies for data lake access
- Governed Tables: Support ACID transactions on S3 data lake tables. Enable automatic compaction and time-travel queries. Use for workloads requiring update/delete on lake data
AWS KMS (Key Management Service) Encryption
- SSE-S3 (AES-256): Amazon-managed keys. Simplest option with no additional cost. Cannot audit key usage. Default encryption for S3
- SSE-KMS: AWS KMS-managed keys with envelope encryption. Provides audit trail via CloudTrail. Supports key rotation and key policies. Has API call limits (5,500-30,000 requests/sec per region)
- SSE-KMS with CMK: Customer-managed KMS keys. Full control over key policies, rotation, and deletion. Can share keys across accounts for cross-account data access
- CSE (Client-Side Encryption): Data encrypted before sending to AWS. You manage encryption/decryption logic. Maximum security but more operational complexity
- In-Transit Encryption: TLS/SSL for all data in transit. S3 enforces HTTPS via bucket policies (aws:SecureTransport). Redshift, DynamoDB, and Kinesis use TLS by default
| Service | At-Rest Encryption | In-Transit Encryption |
|---|---|---|
| S3 | SSE-S3, SSE-KMS, SSE-C, CSE | TLS (enforce via bucket policy) |
| Redshift | KMS or HSM (cluster-level) | SSL connections (force via parameter group) |
| DynamoDB | AWS owned key, AWS managed key, or customer managed key | TLS by default |
| Kinesis | KMS per-stream encryption | TLS by default |
| Glue | KMS for Data Catalog & job bookmarks | TLS for JDBC, security configs for Spark |
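Enforcing HTTPS on S3 is done with a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below builds that policy as a Python dictionary; the bucket name and helper name are placeholders, and nothing is sent to AWS.

```python
import json


def require_tls_policy(bucket: str) -> dict:
    """Build a bucket policy that denies all non-HTTPS requests."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself and every object in it.
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


# Placeholder bucket name.
policy = require_tls_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```

Because an explicit Deny overrides any Allow, this statement blocks plain HTTP access even for principals that otherwise have full permissions on the bucket.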
IAM for Data Services
- Service Roles: Glue jobs, Redshift clusters, and Kinesis Firehose use IAM roles to access other AWS services. Always follow least-privilege principle
- Resource-Based Policies: S3 bucket policies, KMS key policies, and Glue Data Catalog resource policies control who can access specific resources
- Identity-Based Policies: Attached to IAM users, groups, or roles. Define what actions a principal can perform on which resources
- Permission Boundaries: Set maximum permissions an IAM entity can have. Useful for delegating IAM administration while limiting scope
- SCP (Service Control Policies): Organization-level guardrails. Restrict which services or actions are available across all accounts in an OU
- Cross-Account Access: Use IAM roles with trust policies for cross-account access. Resource-based policies (S3, KMS) can also grant cross-account access directly
Amazon Macie
- Purpose: ML-powered service that discovers, classifies, and protects sensitive data (PII, PHI, financial data) stored in Amazon S3
- Data Discovery: Automated scanning of S3 buckets for sensitive data patterns. Uses managed data identifiers (100+ types) and custom data identifiers (regex-based)
- Findings: Generates findings for policy violations (unencrypted buckets, public access) and sensitive data detections. Integrates with Security Hub and EventBridge
- Integration: Use EventBridge rules to trigger automated remediation (Lambda) when sensitive data is found. Critical for GDPR and HIPAA compliance
Network Security
- VPC Endpoints (Gateway): For S3 and DynamoDB. Free to use. Route traffic through AWS network without internet. Add endpoint policies for additional access control
- VPC Endpoints (Interface): For Kinesis, Glue, KMS, STS, and most other services. Powered by PrivateLink. Incur hourly and data processing charges. Create ENI in your VPC subnet
- Glue VPC Connections: Connect Glue jobs to resources inside a VPC (RDS, Redshift, ElastiCache). Requires NAT Gateway for internet access from within VPC
- Redshift Enhanced VPC Routing: Forces all COPY and UNLOAD traffic through VPC. Required for compliance. Use with VPC endpoints for S3 to keep traffic private
- Security Groups: Stateful firewalls at the ENI level. Control inbound and outbound traffic for Redshift clusters, Glue connections, and EMR clusters
Compliance & Data Protection
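- GDPR (General Data Protection Regulation): Right to erasure requires the ability to delete personal data, so apply S3 Object Lock only to data that must be retained, because objects locked in Compliance mode cannot be deleted before the retention period ends. Implement data classification with Macie. Use Lake Formation for access control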
- HIPAA (Health Insurance Portability & Accountability Act): Requires BAA with AWS. Encrypt all PHI at rest and in transit. Use CloudTrail for audit logging. Restrict access with fine-grained IAM policies
- Data Retention: Implement S3 lifecycle policies for automated retention management. Use Glacier Vault Lock for compliance archives with WORM protection
- Data Masking: Use Glue DataBrew for PII masking and anonymization. Lake Formation data filters for runtime masking. DynamoDB client-side encryption for field-level protection
- Audit & Logging: CloudTrail for API auditing across all data services. S3 server access logging for bucket-level operations. Redshift audit logging for query and connection history
Key Services Quick Reference
A comprehensive comparison of all major AWS services covered on the DEA-C01 exam. Use this table as a quick lookup for service capabilities, primary use cases, and key features to remember.
| Service | Category | Primary Use Case | Key Features |
|---|---|---|---|
| AWS Glue | ETL | Serverless data integration & transformation | Crawlers, Data Catalog, Spark/Python jobs, bookmarks, DataBrew, Schema Registry, Data Quality |
| Kinesis Data Streams | Streaming | Custom real-time data processing | Shard-based, 1-365 day retention, KCL consumers, enhanced fan-out, on-demand mode |
| Kinesis Firehose | Streaming | Near real-time delivery to destinations | Auto-scaling, Lambda transforms, format conversion, buffering, S3/Redshift/OpenSearch targets |
| Amazon MSK | Streaming | Managed Apache Kafka | Provisioned & serverless, MSK Connect, topic compaction, Kafka APIs compatible |
| AWS DMS | Migration | Database migration & replication | Full load, CDC, SCT, heterogeneous migrations, supports 20+ source/target engines |
| Amazon S3 | Storage | Object storage & data lake foundation | 7 storage classes, lifecycle policies, versioning, Object Lock, S3 Select, event notifications |
| Amazon Redshift | Data Warehouse | Petabyte-scale OLAP analytics | Columnar, MPP, Spectrum, COPY/UNLOAD, distribution styles, sort keys, concurrency scaling, serverless |
| Amazon DynamoDB | NoSQL Database | Key-value & document store at scale | Single-digit ms latency, GSI/LSI, DAX caching, Streams, Global Tables, on-demand/provisioned |
| Amazon Athena | Query Engine | Serverless SQL queries on S3 data | Presto/Trino engine, pay per scan, Glue Catalog integration, CTAS, Iceberg tables, workgroups |
| Step Functions | Orchestration | AWS service workflow orchestration | Standard & Express, 220+ integrations, ASL definition, Workflow Studio, retry/catch |
| Amazon MWAA | Orchestration | Managed Apache Airflow | Python DAGs, complex multi-system workflows, Airflow UI, custom operators, scheduling |
| Lake Formation | Governance | Centralized data lake security | LF-tags, data filters, cross-account sharing, governed tables, fine-grained permissions |
| Amazon Macie | Security | Sensitive data discovery in S3 | ML-powered PII detection, managed & custom identifiers, Security Hub integration, automated scanning |
| Amazon QuickSight | Visualization | BI dashboards & reporting | SPICE engine, RLS/CLS, QuickSight Q (NLP), embedded dashboards, ML insights |
| Amazon EMR | Big Data | Managed Hadoop/Spark clusters | Spark, Hive, Presto, HBase, EMR on EKS, EMR Serverless, instance fleets, spot instances |
| AWS DataBrew | Data Prep | Visual no-code data preparation | 250+ transforms, data profiling, PII masking, recipe-based, S3/Glue/Redshift sources |
| Amazon OpenSearch | Search & Analytics | Full-text search & log analytics | OpenSearch Dashboards, UltraWarm/cold storage, cross-cluster search, anomaly detection, serverless |
| Amazon ElastiCache | Caching | In-memory data store | Redis (persistence, pub/sub, sorted sets) vs Memcached (simple caching, multi-threaded) |
Service Selection Decision Guide
| Requirement | Best Service | Why |
|---|---|---|
| Serverless ETL | AWS Glue | No infrastructure, pay per DPU, built-in catalog |
| Real-time custom processing | Kinesis Data Streams | Sub-second latency, custom consumers, replay |
| Deliver streams to S3 | Kinesis Firehose | Zero admin, auto-batching, format conversion |
| Migrate existing Kafka | Amazon MSK | Kafka API compatibility, minimal code changes |
| Ad-hoc SQL on S3 | Amazon Athena | Serverless, pay per query, no loading needed |
| Complex BI queries (PB) | Amazon Redshift | MPP, columnar, optimized for complex joins |
| Sub-ms key-value lookups | DynamoDB + DAX | Single-digit ms base, microsecond with DAX |
| Database migration | AWS DMS | Supports CDC, minimal downtime, 20+ engines |
| Data lake governance | Lake Formation | Centralized permissions, LF-tags, cross-account |
| PII detection in S3 | Amazon Macie | ML-powered, automated scanning, 100+ patterns |
| Log analytics & search | Amazon OpenSearch | Full-text search, dashboards, anomaly detection |
| Visual data prep (no-code) | AWS DataBrew | 250+ transforms, visual UI, data profiling |
Key Acronyms & Terms
| Term | Full Name | Context |
|---|---|---|
| DPU | Data Processing Unit | Glue job compute capacity (4 vCPU + 16 GB) |
| RPU | Redshift Processing Unit | Redshift Serverless compute |
| KPU | Kinesis Processing Unit | Kinesis Data Analytics compute |
| RCU / WCU | Read / Write Capacity Unit | DynamoDB provisioned throughput |
| CDC | Change Data Capture | DMS continuous replication |
| DQDL | Data Quality Definition Language | Glue Data Quality rules syntax |
| ASL | Amazon States Language | Step Functions workflow definition |
| MPP | Massively Parallel Processing | Redshift query execution architecture |
| OLAP | Online Analytical Processing | Data warehouse workload type (Redshift) |
| WORM | Write Once Read Many | S3 Object Lock / Glacier Vault Lock |