How Machine Learning Developers Use AWS SageMaker to Scale AI-based Solutions
A staggering 72% of organizations that began artificial intelligence (AI) pilots before 2019 have not been able to deploy even a single application in production. And yet, enterprises of all sizes confirmed that they accelerated their AI strategy because of COVID-19. Many companies struggle with integrating Machine Learning into their ecosystem and only 13% of organizations have scaled AI across multiple teams as illustrated in the diagram below.
One of the main problems that contributes to this statistic is the complexity involved in deploying a machine learning model to production. This problem, often called undifferentiated heavy lifting, burdens machine learning developers with a prohibitive volume of complexity and prevents them from focusing on business outcomes. The following are some of the common challenges that organizations grapple with before deploying a model to production.
- Explosion of data – Data growth has been phenomenal over the last decade, with the volume of data almost doubling every few years. In addition, most organizations looked only at structured data until recent years. However, with advancements in Natural Language Processing, Machine Learning can now be leveraged to glean actionable insights from unstructured data residing in emails, chat, voice, and documents. This is significant considering 90% of the data in an organization is unstructured data.
- Transition from batch to real-time prediction – With the advent of IoT and streaming data, real-time prediction has become a high priority use case. Organizations have started to realize that getting actionable insights in real-time is table stakes. Business leaders don’t want to wait hours or days to get the insights they need for making decisions. Real-time prediction becomes even more relevant for use cases such as fraud detection and when a customer is live on the phone.
- Data engineering challenges – Machine learning developers deal with a plethora of data engineering issues, ranging from data quality and data governance to data lineage. It’s not just about putting a model into production and forgetting it: the data feeding the model has to be managed for as long as the model is in use.
- Machine Learning Ops – In the past, developers used to deploy a model to production and monitor it manually. However, as the number of models increases, manual monitoring becomes unmanageable. Developers have to look for an automated mechanism for ongoing monitoring and measuring model performance.
- Need for model explainability, reproducibility, and auditability – As business users start to understand the implications of decision-making using AI, they start asking questions about model explainability, reproducibility, and auditability. Users want to ensure that the decisions they make will stand the test of time. More sophisticated end users are using A/B testing to determine the efficacy of model prediction.
- Relevance of ethical and responsible AI – As ML becomes pervasive in all aspects of the business, more questions start popping up around ethical and responsible AI to ensure that there is no bias in model predictions. There is a lot of debate around using AI for surveillance and hiring. These questions add to the bucket of complex things that a machine learning developer has to consider before rolling out a model: developers need to ensure complete transparency and accountability of the algorithms developed and underlying libraries used.
- Ever-evolving Infrastructure – As data explodes and new libraries evolve, Machine Learning infrastructure has to be upgraded to deliver real-time insights in milliseconds. The evolution of the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU) gives machine learning engineers a better and faster infrastructure to run algorithms. That same infrastructure has to be equipped to run modern ML libraries such as TensorFlow, MXNet, and PyTorch. Companies at the forefront of AI, such as Tesla, have even unveiled their own chips to run AI models.
- Cloud Security & Privacy – As organizations move towards the cloud, data security and privacy become a huge focus area. Data security breaches have been reported at Yahoo, LinkedIn, Facebook, and many other Silicon Valley companies. Machine learning developers have to ensure that the data and the inferences are securely stored in a machine learning infrastructure. The infrastructure also has to comply with industry regulations such as HIPAA and Dodd-Frank.
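As an illustration of the automated monitoring mentioned under Machine Learning Ops above, here is a minimal sketch in plain Python. It is not a SageMaker feature or API; the class name `AccuracyMonitor`, the window size, and the accuracy threshold are all assumptions made for the example. It tracks accuracy over the most recent predictions and flags the model when accuracy drops below a threshold.

```python
from collections import deque
from typing import Deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """

    def __init__(self, window_size: int = 100, min_accuracy: float = 0.9) -> None:
        self.min_accuracy = min_accuracy
        # Keep only the latest `window_size` outcomes (True means correct).
        self.outcomes: Deque[bool] = deque(maxlen=window_size)

    def record(self, predicted: object, actual: object) -> None:
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        """Return accuracy over the current window (1.0 if nothing recorded)."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self) -> bool:
        """Return True when windowed accuracy falls below the threshold."""
        return self.accuracy() < self.min_accuracy
```

In practice a check like `needs_attention()` would run on a schedule for every deployed model and raise an alert, which is what replaces manual monitoring as the number of models grows.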
As organizations grapple with these problems, they start looking for a comprehensive machine learning infrastructure that would alleviate some of these concerns. They don’t want to reinvent the wheel and are looking for a secure and scalable machine learning infrastructure that other companies have used successfully. The big cloud providers such as AWS, Azure, and Google have a strong machine learning infrastructure in place. Beyond those, Databricks, Alteryx, SAS, IBM SPSS, and H2O.ai are a few of the vendors that specialize in machine learning platforms, as per the Gartner Magic Quadrant.
Key Components of SageMaker
AWS SageMaker is the machine learning infrastructure created by AWS. As ML use cases explode and the velocity, variety, and veracity of data change, organizations need a comprehensive framework for end-to-end machine learning.
AWS created SageMaker as a fully managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale. This way developers don’t have to worry about the mechanics of machine learning and can focus on delivering business value and generating ROI.
AWS SageMaker has 4 major components: 1) Prepare, 2) Build, 3) Train and Tune, and 4) Deploy and Manage.
- Prepare – Data preparation is an important step that readies raw data before it is moved through a machine learning algorithm to uncover insights and make predictions. It’s critical to feed the right data to algorithms to solve the problem. Most machine learning algorithms require data to be formatted in a very specific way. Developers spend a lot of time preparing the data before it can yield useful insights. As the adage goes, ‘garbage in, garbage out’. Good data preparation produces clean and well-curated data, which leads to more practical, accurate model outcomes.
- Build – Once the data is collected and prepared, the data engineering process begins. It involves a sequence of steps, such as feature engineering, data validation, model evaluation, and model interpretation. Developers can use Jupyter Notebooks to build the model. This stage usually ends with the developers creating an artifact: a production-ready package for deployment.
- Train & Tune – Once the building process is completed, the model needs to be trained and fine-tuned. Fine-tuning the machine learning model is a crucial step and involves making a prediction based on the current state of the model and determining if there are any incorrect predictions. During the training process, the weights and parameters are adjusted to minimize errors and to increase the accuracy of prediction. The process is repeated until the model has converged and can no longer learn.
- Deploy & Manage – After the model is trained, it is deployed using Amazon SageMaker to get predictions in any of the following ways:
- To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services.
- To get predictions for an entire dataset, use SageMaker batch transform.
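The Train & Tune step above (predict with the current model, measure the errors, adjust the weights, repeat until convergence) can be illustrated with a minimal gradient-descent sketch in plain Python. This is not SageMaker's training API; the function name `train_linear`, the learning rate, and the convergence tolerance are assumptions chosen for the example, and a one-feature linear model stands in for a real model.

```python
from typing import List, Tuple


def train_linear(
    xs: List[float],
    ys: List[float],
    learning_rate: float = 0.05,
    tolerance: float = 1e-10,
    max_epochs: int = 10_000,
) -> Tuple[float, float, int]:
    """Fit y ~ w * x + b by batch gradient descent on mean squared error.

    Returns the learned weight, the bias, and the number of epochs run.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    previous_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        # Predict with the current state of the model and measure the errors.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Stop once the loss no longer improves meaningfully (convergence).
        if abs(previous_loss - loss) < tolerance:
            return w, b, epoch
        previous_loss = loss
        # Adjust the parameters in the direction that reduces the error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, max_epochs
```

Calling `train_linear([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])` should recover a weight close to 2 and a bias close to 1, since the data follows y = 2x + 1. Tuning corresponds to choosing values such as `learning_rate` that let the loop converge quickly and reliably.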
Key New Features of AWS SageMaker
1. SageMaker Data Wrangler
Data preparation is a crucial step of the ML process. Data Wrangler has a fully managed integrated development environment (IDE) that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data for ML. The access management is governed by AWS Identity and Access Management (IAM), based on the permissions attached to the SageMaker Studio instance.
Data Wrangler provides five core functionalities.
- Discover – Developers need to understand the data thoroughly and should be able to select data from different sources. SageMaker Data Wrangler lets the developer easily and quickly connect to AWS components like Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and AWS Lake Formation. CSV files, Parquet files, and database tables can also be used as data sources.
- Transform – SageMaker Data Wrangler includes 300+ built-in transformations for finding and replacing data, splitting/renaming/dropping columns, scaling numerical values, encoding categorical values, and so on. The developer selects the transformation from a drop-down list and fills in the required parameters. Developers can then preview the change and decide whether to add it for this dataset.
- Visualize – The easy visualization functionality allows ML developers to identify extreme values and outliers. This allows the developers to visualize the features without having to write code.
- Diagnose and fix – Models can be evaluated quickly for any data inconsistencies before deploying the model to production. If model performance is not up to the mark, developers can do additional feature engineering to improve the model accuracy.
- Deploy – Once model development is complete, developers can deploy the model to production with a single click. Data Wrangler integrates with SageMaker Pipelines to automate model deployment and management.
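Data Wrangler applies its transformations through a visual interface, so no code is required. To show what two of the transformations named under Transform above (scaling numerical values and encoding categorical values) actually do to the data, here is a plain-Python sketch. The function names and the sample values are hypothetical and are not part of Data Wrangler.

```python
from typing import Dict, List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale numeric values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot_encode(values: List[str]) -> List[Dict[str, int]]:
    """Turn each categorical value into a row of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Hypothetical sample columns for illustration.
ages = [20.0, 30.0, 40.0]
channels = ["chat", "email", "chat"]

scaled_ages = min_max_scale(ages)            # [0.0, 0.5, 1.0]
encoded_channels = one_hot_encode(channels)  # first row: {"chat": 1, "email": 0}
```

Previewing the output of a transformation like this before applying it to the whole dataset mirrors the preview step that Data Wrangler offers in its interface.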
Here is a brief look at the technical architecture.
Bonus: Storing the features
The transformations defined in Data Wrangler can be stored in an offline feature store so that the features can be shared and reused consistently across an organization, enabling collaboration among data scientists. This standardization is often key to creating a normalized, reusable set of features that can be created, shared, and managed as input into training ML models.
2. SageMaker Pipelines and ML Ops
Amazon SageMaker Pipelines is a purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML).
Even though DevOps and CI/CD have long been established, ML Ops is a newer discipline. This discipline allows ML developers to work closely with IT to deploy models to production. ML Ops also requires a different approach than DevOps, where code and data are independent. The table below provides a comparison between DevOps and ML Ops. As illustrated below, DevOps for ML has to handle considerably more than standard DevOps.
The goal of applying ML Ops is to enable customers to accelerate the adoption of ML workloads and optimize operational aspects of building, deploying, and operating ML workloads.
Here are a few benefits of ML Ops:
- ML Ops ensures regulatory compliance and adherence to data security and privacy. Leveraging the techniques from CI/CD, ML Ops ensures that there is a defined process to deploy models with proper version controls, so that there is auditability. The models are monitored to ensure that they are sticking to the defined performance KPIs for measuring security and privacy.
- ML Ops ensures that there is a defined process for ML developers so that the developers can focus on delivering business outcomes and scale models based on data volumes.
- ML Ops accelerates the time to value as developers work closely with IT and DevOps to deploy models to production.
SageMaker Pipelines allows developers to create, automate, and manage end-to-end ML workflows at scale including:
- Automate different steps of the ML workflow, including data loading, data transformation, training, tuning, and deployment, using a Python interface.
- Build several models at a time on massive data volumes by running large-scale experiments.
- Share and re-use workflows across the developer community, which helps scale ML processes and best practices throughout the organization.
- Manage dependencies, build correct sequences, and automate steps without getting involved in too much coding.
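The idea of managing dependencies and building correct sequences can be sketched with Python's standard library. The example below does not use the SageMaker Python SDK; it relies on `graphlib.TopologicalSorter` (Python 3.9+) to derive an execution order from declared dependencies. The step names are hypothetical and simply mirror the workflow stages listed above.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each step maps to the set of steps that must finish before it starts.
# Step names are hypothetical placeholders for real pipeline steps.
workflow: Dict[str, Set[str]] = {
    "load_data": set(),
    "transform_data": {"load_data"},
    "train": {"transform_data"},
    "tune": {"train"},
    "deploy": {"tune"},
}


def execution_order(steps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every step runs after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(workflow)
```

Because the order is derived from the dependencies rather than written by hand, adding a new step only requires declaring what it depends on. A circular dependency makes `static_order()` raise `graphlib.CycleError`, so an impossible sequence is caught before anything runs.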
The below diagram illustrates the end-to-end flow of a model deployment and how different stakeholders are involved. The responsibilities move across different stakeholders – data engineer, ML developer, DevOps engineer and finally to a software engineer.
3. SageMaker Edge Manager
According to IDC, by 2023, 50% of new IT infrastructure will be deployed at the edge. Edge appliances can be used for a range of industry applications: manufacturing, construction, retail, energy, agriculture, healthcare, and more. Edge devices range from simple sensors to large industrial machines that produce enormous amounts of data. Business leaders are looking at ways to capture data, analyze it, and act on it. With advancements in the hardware designed for ML, it is now possible to run multiple complex neural network models on edge devices.
In the past, operating ML models on the edge was a challenging task due to limited compute, memory, and connectivity. The models need to be continuously monitored to avoid decay in quality over time. The models also had to be pushed back to the edge device every time they were recompiled. These challenges may disrupt the operation and prevent the smooth functioning of the application.
SageMaker Edge Manager is a solution that provides model management for edge devices, helping to optimize, secure, monitor, and maintain machine learning models on them. The below diagram illustrates the 3 key features of Edge Manager.
There are 5 components in the Edge Manager workflow:
- Once the model is trained or imported using Amazon SageMaker, SageMaker Edge Manager first compiles the model with Amazon SageMaker Neo to optimize it for the target hardware platform.
- SageMaker Edge Manager packages and signs the model and stores it in Amazon Simple Storage Service (Amazon S3).
- The models are deployed to the devices using IoT Greengrass or other deployment mechanisms.
- The models run on the SageMaker inference engine (the Edge Manager agent).
- The models are maintained on the devices.
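To illustrate why a model package is signed before it is shipped to a device, here is a minimal integrity-check sketch. It is a stand-in only: it uses HMAC-SHA256 from Python's standard library with a shared key, which is an assumption made for the example and not a description of how Edge Manager signs models. The function names are hypothetical.

```python
import hashlib
import hmac


def sign_package(package: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 signature for the packaged model bytes."""
    return hmac.new(key, package, hashlib.sha256).hexdigest()


def verify_package(package: bytes, key: bytes, signature: str) -> bool:
    """Check the signature in constant time before loading the model."""
    expected = sign_package(package, key)
    return hmac.compare_digest(expected, signature)
```

A device would call `verify_package` on the downloaded bytes and refuse to load the model if it returns `False`, which guards against a package that was corrupted or tampered with in transit.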
Here is a sample technical architecture for the edge, incorporating SageMaker Pipelines and Edge Manager.
Case Study – How HGS Operationalized the Voice of Customer solution using SageMaker
1. Business problem
Below are the business problems faced by CX executives, according to a Q1 2021 survey by McKinsey:
- CX executives are looking for answers to optimize customer journeys to create better, smarter, faster, and more frictionless customer experience.
- CX executives are unhappy with how CSAT is measured: small sample sizes, limited insights, and an inability to share results with the broader organization. Only 7% of the customer voice data is shared with CX leaders.
- CX leaders are struggling with aggregating data across systems to make real-time decisions. Only 13% of the CX leaders are confident they can take real-time action based on insights generated.
- Lack of consolidated data limits upsell/cross-sell opportunities.
HGS analytics leverages AI to monitor 100% of the interactions and combines structured and unstructured data (voice, email, chat) to generate meaningful business insights, leading to improved customer experience and reduced cost.
The solution leverages AWS components such as DynamoDB, Redshift, and SageMaker to generate business outcomes. The model was deployed as a SageMaker API so that agents are able to provide proactive recommendations to the customer in real-time.
Below are the tangible outcomes from deploying this solution:
- Improvement in NPS by 10%, gained from deeper insights into customer sentiment across different products and geographies.
- Improved retention by 20% and improved customer engagement.
- Reduction in repeat contacts by 10%.
- Improved agent efficiency by 10% due to automated call categorization and summarization.
- Reduction in manual effort by 50%.
AWS SageMaker pricing is purely consumption based. The consumption varies depending on the products used within the SageMaker ecosystem as shown below:
- The number of Jupyter notebooks that machine learning developers run on Notebook instances.
- The type of storage used for the infrastructure.
- The number of processing jobs and the number of hours used for inference.
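Consumption-based pricing means the bill is the sum of usage multiplied by a rate for each item. The sketch below works through that arithmetic. All rates in it are hypothetical placeholders and are not AWS prices; the class and function names are also assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Monthly consumption for the items listed above."""
    notebook_hours: float
    storage_gb_months: float
    processing_hours: float
    inference_hours: float


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD rates for illustration only; not AWS prices."""
    notebook_per_hour: float = 0.10
    storage_per_gb_month: float = 0.10
    processing_per_hour: float = 0.20
    inference_per_hour: float = 0.25


def estimate_monthly_cost(usage: Usage, rates: Rates = Rates()) -> float:
    """Sum usage multiplied by rate for each item, rounded to cents."""
    total = (
        usage.notebook_hours * rates.notebook_per_hour
        + usage.storage_gb_months * rates.storage_per_gb_month
        + usage.processing_hours * rates.processing_per_hour
        + usage.inference_hours * rates.inference_per_hour
    )
    return round(total, 2)
```

With these placeholder rates, 100 notebook hours, 50 GB-months of storage, 10 processing hours, and 720 inference hours come to 10 + 5 + 2 + 180 = 197 USD. Note how an always-on inference endpoint dominates the total, which is typical of the trade-off between persistent endpoints and batch jobs.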
The AWS pricing calculator will give an accurate overview of the cost depending on the consumption – https://calculator.aws/#/createCalculator/SageMaker
According to a study conducted by Appen in 2021, fewer than 50% of companies have achieved ROI after AI project deployments. This is an astonishingly low success rate considering the kind of investments that have gone into this sector. Several factors have contributed to this, including a lack of skilled resources, limited organizational maturity, a lack of quality data, and complex infrastructure requirements.
A comprehensive machine learning infrastructure like SageMaker will be very beneficial as it has purpose-built tools for every step of ML development, including labeling, data preparation, feature engineering, statistical bias detection, auto-ML, training, tuning, hosting, explainability, monitoring, and workflows. Leveraging automated machine learning infrastructure like AWS SageMaker, machine learning practitioners can focus on delivering business value and achieving ROI rather than worrying about the mechanics of machine learning implementation.
Yasim Kolathayil, VP of Data & Insights at HGS Digital
Swapnil Pawar, Cloud/Software Architect at HGS Digital