## Notes
This is from an AWS Skill Builder course.
## Tasks
1. Ingest and store data (from different AWS storage services)
2. Transform and engineer features
3. Ensure data integrity and prepare for training
## Overview
1. Collect a lot of *quality* data
2. Ensure high quality data
1. Nothing is missing
2. No erroneous data
3. Manipulate the data to be as efficient as possible
## Task 1.1 - Lesson 1
Data lifecycle:
1. Generation (sources such as databases, IoT devices, sensors, etc.)
2. Storage
1. Which AWS storage is the best?
2. How will the data be accessed?
3. Is data streaming?
4. Does data need to be merged from multiple sources?
5. Understand trade-offs and cost
3. Ingestion
4. Transformation
5. Serving
### Data Storage Example 1
> You need to store unprocessed data from IoT devices for a new machine learning pipeline. The storage solution should be a centralized and highly available repository. What AWS storage service do you choose to store the unprocessed data?
Keywords from the problem statement:
- Store unprocessed data
- Pipeline
- Centralized
- Highly available
S3 is the best choice here. EFS demands more operational overhead for *pipeline* integration.
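A minimal boto3 sketch of landing raw IoT payloads in the centralized S3 raw zone (bucket name and key layout are made up):

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def store_raw_reading(reading: dict, bucket: str = "iot-raw-data-example") -> str:
    """Write one unprocessed IoT reading to the S3 raw zone."""
    now = datetime.now(timezone.utc)
    # A date-based prefix keeps raw objects organized for later pipeline stages.
    key = f"raw/iot/{now:%Y/%m/%d}/{reading['device_id']}-{now:%H%M%S%f}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(reading).encode("utf-8"))
    return key


# e.g. store_raw_reading({"device_id": "sensor-42", "temperature": 21.7})
```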
### Data Storage Example 2
> You are designing a highly scalable data repository for your machine learning pipeline. You need immediate access to the processed data from your pipeline for 6 months. Your unprocessed data must be accessible within 12 hours and stored for 6 years. What is your cost-effective storage solution?
- Immediate access to processed data for 6 months
- Unprocessed data accessible within 12 hours, retained for 6 years
- Cost-effective
Again, S3 is the best choice because of its storage classes.
- Immediate access works with S3 Standard
- Unprocessed data accessible within 12 hours works with S3 Glacier Deep Archive (standard retrievals complete within 12 hours)
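A minimal boto3 sketch of how this tiering could be wired up with a lifecycle configuration (bucket name, prefixes, and exact day counts are assumptions):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ml-pipeline-data-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "processed-standard-6-months",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                # Processed data stays in S3 Standard for immediate access,
                # then expires once it is no longer needed (~6 months).
                "Expiration": {"Days": 180},
            },
            {
                "ID": "unprocessed-deep-archive-6-years",
                "Filter": {"Prefix": "unprocessed/"},
                "Status": "Enabled",
                # Deep Archive standard retrievals complete within 12 hours.
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
                # Keep for roughly 6 years, then expire.
                "Expiration": {"Days": 2190},
            },
        ]
    },
)
```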
#### SQL Querying Capability
Let's say there is an added requirement to query data using SQL. What will change?
- Use [[Amazon Athena]] with S3 for additional query capabilities
- Cannot use [[DynamoDB]] unless you use [[PartiQL]]
- [[RedShift]] is also a potentially valid answer, but not the best choice
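A rough boto3 sketch of running an Athena SQL query over the S3 data (database, table, and result location are made-up names):

```python
import time

import boto3

athena = boto3.client("athena")

# Athena runs asynchronously: start the query, poll for completion, then read rows.
query_id = athena.start_query_execution(
    QueryString=(
        "SELECT device_id, AVG(temperature) AS avg_temp "
        "FROM iot_readings GROUP BY device_id"
    ),
    QueryExecutionContext={"Database": "ml_pipeline_db"},
    ResultConfiguration={"OutputLocation": "s3://ml-pipeline-data-example/athena-results/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```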
---
### Task 1.1 - Lesson 1 Continued
- Know which AWS services work with **structured data**
- Know which AWS services work with **semi-structured data**
- Know which AWS services work with **unstructured data**
- Know how to model **structured data** in AWS services
- Know how to model **semi-structured data** in AWS services
- Know how to model **unstructured data** in AWS services
- When to choose **compression**
- When to choose **splittable formats**
- Write **Lambdas** for handling data ingestion tasks (see the sketch after this list)
- Parsing
- Transforming
- Loading (into warehouse/database)
- **Data processing**
- **EMR** and **AWS Glue**
- **Spark** for distributed processing across a cluster of **EC2** instances
- [[Amazon Kinesis]] for real-time data processing
- [[AWS Glue]] and Python for more complex operations
- **Workflow orchestration** using [[AWS Step Functions]]
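A sketch of the Lambda ingestion idea from the list above (S3-triggered, parse → transform → load); the destination table and column names are made up:

```python
import csv
import io
from decimal import Decimal
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("transactions-example")  # made-up destination table


def lambda_handler(event, context):
    """S3-triggered ingestion: parse a CSV object, lightly transform it, load it into DynamoDB."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        # Parse: read the uploaded CSV object from S3.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = csv.DictReader(io.StringIO(body))

        # Transform + load: cast numeric fields (DynamoDB needs Decimal) and batch-write.
        with table.batch_writer() as batch:
            for row in rows:
                row["amount"] = Decimal(row["amount"])  # "amount" is a made-up column
                batch.put_item(Item=row)

    return {"statusCode": 200}
```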
#### Example 1 - Streaming
> You use an [[Amazon Data Firehose]] delivery stream to ingest **GZIP-compressed data**. You need to configure a solution for your data scientist to perform SQL queries against the data stream for real-time insights. What is your solution?
Possible solutions:
- [[Apache Flink]] and [[AWS Lambda]] to transform data before the SQL queries.
- Store data in S3 buckets and then use *Athena* (not a great option because Athena is not real-time and Athena cannot consume directly from Amazon Data Firehose)
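A sketch of the Lambda piece of the first option: a Firehose data-transformation function that decompresses the GZIP records in-flight so the downstream Flink SQL application sees plain records (record handling follows the standard Firehose transformation event shape):

```python
import base64
import gzip


def lambda_handler(event, context):
    """Firehose data-transformation function: decompress GZIP records in-flight."""
    output = []
    for record in event["records"]:
        try:
            payload = gzip.decompress(base64.b64decode(record["data"]))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                # Firehose expects the transformed payload re-encoded as base64.
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except (OSError, EOFError):
            # Not valid GZIP: flag the record instead of dropping it silently.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```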
#### Example 2 - Streaming
> Can you migrate data to Amazon S3 using AWS Database Migration Service from an on-premises or other supported database sources?
- Yes, S3 can be used as a *DMS* target; DMS writes CSV output by default
- Can use [[Parquet]] format for compact storage and faster queries
- If using [[AWS SageMaker]], use *pipe mode* for faster ingestion
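A rough boto3 sketch of a DMS S3 target endpoint configured for Parquet instead of the CSV default (identifiers, bucket, and role ARN are made up):

```python
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="ml-s3-target-example",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "ml-pipeline-data-example",
        "BucketFolder": "dms/",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-example",
        "DataFormat": "parquet",  # compact storage and faster downstream queries
    },
)
```

With Parquet sitting in S3, a SageMaker training job can then read it with pipe mode for faster ingestion.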
#### Example 3
> Let's say that you are a machine learning engineer and you need to process a large amount of customer data, analyze the data, and get insights so that analysts can make further decisions. To accomplish this task, you need to store the data in a data structure that can handle large volumes of data and efficiently retrieve it as fast as possible. What is your solution?
Options
- Use EMR for data processing, with the Hadoop Distributed File System (HDFS) as the storage layer
##### Addition - SageMaker Canvas and Feature Store
The data from above needs to be available in SageMaker Canvas and SageMaker Feature Store
- Use Feature Groups on ingestion
- Use Data Wrangler to engineer features on ingest
- Could use EMR with Spark connector to Feature Store
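A sketch of creating and populating a feature group with the SageMaker Python SDK (feature group name, columns, bucket, and role are all assumptions); the same group could also be fed from EMR through the Feature Store Spark connector:

```python
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Made-up feature data; string columns must use the pandas "string" dtype so the
# SDK can infer feature types.
df = pd.DataFrame({
    "customer_id": pd.Series(["c-001", "c-002"], dtype="string"),
    "avg_order_value": [52.3, 17.9],
    "event_time": [1700000000.0, 1700000000.0],
})

fg = FeatureGroup(name="customer-features-example", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)
fg.create(
    s3_uri="s3://ml-pipeline-data-example/feature-store/",  # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::123456789012:role/feature-store-example",
    enable_online_store=True,  # online store serves low-latency reads
)

# Creation is asynchronous; wait until the group is ready before ingesting.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=df, max_workers=2, wait=True)
```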
## Task 1.1 - Lesson 3
How to extract data from S3?
- S3 Select with SQL statements (see the sketch after this list)
- Works on CSV, JSON, and Parquet objects, including GZIP- or BZIP2-compressed and server-side encrypted objects
- Transfer Acceleration
- [[AWS Elastic Block Storage]]
- Understand maximum performance of EBS volumes
- Depends on the size and number of input/output operations (IOPS)
- Also bounded by the volume's throughput
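A boto3 sketch of the S3 Select extraction mentioned above, pulling only matching rows out of a GZIP-compressed CSV object instead of downloading the whole file (bucket, key, and columns are made up):

```python
import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="ml-pipeline-data-example",
    Key="raw/transactions/2024-01.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.transaction_id, s.amount FROM s3object s WHERE s.country = 'DE'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; 'Records' events carry the selected bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```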
### Example 1
> You were asked to redesign and reduce the operational management and the cost of the transformation pipeline that ingests training data from multiple custom sources. Currently, the pipeline uses Amazon EMR and AWS Data Pipeline. What is your solution?
Options
- Use an AWS Glue ETL job and store the output in S3
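A rough sketch of what the Glue job script could look like (catalog database, table, output path, and the "label" column are made-up names):

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw training data registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="ml_pipeline_db", table_name="raw_training_data"
)

# Light cleanup on the underlying Spark DataFrame before writing Parquet to S3.
df = dyf.toDF().dropna(subset=["label"])
cleaned = DynamicFrame.fromDF(df, glue_context, "cleaned")

glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://ml-pipeline-data-example/curated/"},
    format="parquet",
)
job.commit()
```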
### Example 2
> You have been asked to redesign and reduce the operational overhead and use AWS services to detect anomalies in transaction data and assign anomaly scores to malicious records. The records are streamed in real-time and stored in an Amazon S3 data lake for processing and analysis. What is your solution?
- Use Firehose to stream data and Flink `RANDOM_CUT_FOREST` to find anomalies
### Example 3
> You've been asked to reduce the time it takes to ingest and store data in Amazon Redshift so that you can conduct near real-time analytics.
- Kinesis Data Stream
- [x] Firehose only works for certain data sources
- [[RedShift]] is not a valid destination
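One common way to wire this up (an assumption here, not spelled out in the course notes) is Redshift streaming ingestion: an external schema over the Kinesis data stream plus an auto-refreshing materialized view. A rough sketch using the Redshift Data API, with made-up names:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Streaming ingestion: external schema over the Kinesis data stream, then an
# auto-refreshing materialized view that Redshift keeps near real-time.
statements = [
    "CREATE EXTERNAL SCHEMA kinesis_schema FROM KINESIS "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-kinesis-example';",
    "CREATE MATERIALIZED VIEW orders_stream_mv AUTO REFRESH YES AS "
    "SELECT approximate_arrival_timestamp, partition_key, kinesis_data "
    'FROM kinesis_schema."orders-stream-example";',
]

for sql in statements:
    redshift_data.execute_statement(
        WorkgroupName="ml-analytics-example",  # or ClusterIdentifier + DbUser for a provisioned cluster
        Database="dev",
        Sql=sql,
    )
```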
### Example 4
> You are migrating a data analysis solution to AWS. The application produces the data as CSV files in near real time. You need a solution to convert the data format to Apache Parquet before saving it to an S3 bucket. What is your solution?
- Kinesis Data Stream with Glue to convert data to Parquet
- Glue also has access to Spark Streaming
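A minimal PySpark sketch of just the CSV-to-Parquet conversion step in batch form (paths are made up); in the actual pipeline, Glue streaming / Spark Structured Streaming would apply the same read/write logic to micro-batches from the stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet-example").getOrCreate()

# Read the incoming CSV drop zone and rewrite it as Parquet for faster queries.
df = spark.read.csv(
    "s3://ml-pipeline-data-example/incoming-csv/",
    header=True,
    inferSchema=True,
)

df.write.mode("append").parquet("s3://ml-pipeline-data-example/parquet/")
```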
### Example 5
> You are using Firehose to ingest data records from on-premises sources. The records are compressed using GZIP compression. How can you perform SQL queries against the data stream to gain real-time insights and reduce the latency for queries?
- Flink and Lambda to transform the data (same pattern as the sketch under Example 1 - Streaming above)
## References
- [[Learning AWS Certified Machine Learning Engineer - Associate]]