class: center, middle, inverse, title-slide

# From Prototyping to Deployment at Scale with R and sparklyr
### Kevin Kuo
### June 2018

---

# Menu today

- .Large[The deployment problem]

- .Large[ML pipelines]

- .Large[Model deployment and demo]

---

class: inverse, center, middle

# The deployment problem

---

# Deployment

.large[Or, putting ML models *"into production"*.]

--

<br />

.large[Basically, make it so that someone else can use your model (i.e. make some predictions with it).]

---

# Deployment - latency dimension

.pull-left[
.large[**Batch**

- Event-based/time-based
- E.g. nightly portfolio risk calculations, and
- Email campaigns
- It's OK to take a while
]
]

.pull-right[
.large[**"Real-time"**

- On demand
- E.g. instant loan approvals, and
- Fraud detection on credit card swipes
- Gotta be (relatively) fast, seconds to less than a second
]
]

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 1/n: Putting ML models into production involves different expertise]]]

--

.large[...so it involves more people]

--

.large[...mo ppl mo problems]

---

# Deployment - mo ppl mo problems

<img src="figs/infinite_loop.png" width="70%" />

Credit: [https://youtu.be/-K9SjrWpeys](https://youtu.be/-K9SjrWpeys) [@josh_wills](https://twitter.com/josh_wills)

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 2/n: Rapidly changing landscape in deployment options]]]

--

What do?

- Spark ML persistence?
- dbml-local?
- PMML?
- PFA/Aardpfark?
- MLeap?
- ONNX?
- Roll our own thing?
- Re-implement the model in C++, because performance?
- Throw it into a container and do orchestration cuz it's cool?

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 3/n: Too many ML frameworks and no standardization]]]

--

We're focusing on Spark in this session, but we'll acknowledge other technologies we need to deal with.

---

# Deployment - diversity of ML frameworks

Spark ML, xgboost, random CRAN packages, scikit-learn, H2O, ...

![](figs/ml_everywhere.jpg)

---

# Deployment - one of many scenarios

.large[
**Data scientist**: Hey this random forest loan decision model is ready to go!

**Engineer**: OK, we need to recode it in C#/Java, see you in 6 months!
]

--

.large[
**Data scientist**: Oh no that's too long, how about I give you this GLM with just a few parameters?

**Engineer**: 2 months.
]

--

.large[
**Data scientist**: 🤔
]

---

# Deployment - challenges for the R user

On *average*, R users tend to...

- Be math/stats types
- Have little CS/software engineering training

So it's slightly tougher for them to collaborate with the folks doing model implementation.

--

However,

- Data scientists (regardless of background) are becoming more comfortable moving up and down the stack
- There has been active development of technology to facilitate the data science-engineering handoff

---

# Deployment - technology

.full-width[.content-box-blue[
Technology won't solve your people/process/culture issues, but it can *make collaboration easier*!
]]

--

.large[Next up: we'll provide a quick review of Spark ML pipelines, and offer a couple of ways of "deploying" them using the **sparklyr** ecosystem.]

---

class: inverse, center, middle

# Spark ML pipelines

---

# ML pipelines

Pipelines are basically...

> A structure in which you can throw in data transformers and ML models.

--

Keep in mind that when you deploy a model, you also need to deploy the feature engineering steps in order to feed the right inputs to the model! (E.g. converting a numeric `age` variable into an age range bucket that the model requires.)

--

Now let's go through a (very quick) overview of pipeline concepts.

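---

# ML pipelines

To make the `age` bucketing point concrete, here is a minimal sketch (not the demo code) in which the bucketing step lives inside the pipeline and so ships with the model. It assumes a local Spark installation; the toy `loans` data, its column names, and the bucket boundaries are all made up for illustration.

```r
library(sparklyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Hypothetical toy data; columns and values are invented for this sketch.
loans <- data.frame(
  age      = c(22, 35, 47, 58, 63, 29, 41, 70),
  income   = c(30, 52, 75, 61, 48, 39, 88, 45),
  approved = c(0, 1, 1, 1, 0, 0, 1, 0)
)
loans_tbl <- sdf_copy_to(sc, loans, name = "loans", overwrite = TRUE)

# Feature engineering and model go into one pipeline.
pipeline <- ml_pipeline(sc) %>%
  ft_bucketizer(
    input_col  = "age",
    output_col = "age_bucket",
    splits     = c(0, 30, 50, Inf)  # assumed bucket boundaries
  ) %>%
  ft_r_formula(approved ~ age_bucket + income) %>%
  ml_logistic_regression()

# Fitting the pipeline returns a fitted pipeline model.
fitted <- ml_fit(pipeline, loans_tbl)

# Scoring takes the raw `age` column; the bucketing is applied for us.
scored <- ml_transform(fitted, loans_tbl)

spark_disconnect(sc)
```

Whoever scores new data only has to supply raw `age` and `income`; they never re-implement the bucketing by hand.
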
---

# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.

![](figs/transformer.png)

---

# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.
- An `Estimator` takes a data frame, via `ml_fit()`, and returns a `Transformer`.

![](figs/estimator.png)

---

# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.
- An `Estimator` takes a data frame, via `ml_fit()`, and returns a `Transformer`.
- A `Pipeline` consists of a sequence of stages (`PipelineStage`s) that act on some data in order.
- A `PipelineStage` can be either a `Transformer` or an `Estimator`.

![](figs/pipeline.png)

---

# ML pipelines

- A `Pipeline` consists of a sequence of stages (`PipelineStage`s) that act on some data in order.
- A `PipelineStage` can be either a `Transformer` or an `Estimator`.
- A `Pipeline` is always an `Estimator`, and its fitted form is called a `PipelineModel`, which is a `Transformer`.

![](figs/pipeline_model.png)

---

# ML pipelines

![](figs/pipeline_fitting2.png)

---

# ML pipelines

![](figs/pipeline_transform.png)

---

# Serving the model

Now, the trick is to persist this `PipelineModel` so we can use it to serve predictions later on.

--

We'll demo a couple of ways today:

- Native Spark ML persistence support
- MLeap (via the **mleap** R package)

---

class: inverse, center, middle

# Demo

---

# Model deployment paths

.pull-left[
.large[**Spark ML Persistence**

- Appropriate for batch jobs, scoring lots of records at once
- Requires a Spark session
]
]

.pull-right[
.large[**MLeap**

- Better for real-time prediction of a small number of records
- Doesn't require a Spark session; portable to apps/devices that support the JVM
]
]

---

# Towards a better deployment story

.large[
**Data scientist**: Hey this random forest loan decision model is ready to go! Here is the `.zip` bundle, and here is the documentation you need to use it.

**Engineer**: Awesome! We won't need to write a bazillion if-else statements to recreate the model!
]

--

.large[When the model needs updating...]

--

.large[
**Data scientist**: We decided to use a GBM instead for better accuracy, here's the updated bundle.

**Engineer**: Fantabulous! All we need to do is update the model directory!
]

---

# Wrap up

Slides and code will be available at [https://kevinykuo.com](https://kevinykuo.com).

Inspirations/other talks to check out

- "Productionizing Spark ML pipelines with the portable format for analytics" [https://youtu.be/h-B0VCkoRkE](https://youtu.be/h-B0VCkoRkE) [@MLnick](https://twitter.com/mlnick)
- "How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.x" [https://youtu.be/r740xbIpb54](https://youtu.be/r740xbIpb54) Richard Garris
- "MLeap and Combust ML" [https://youtu.be/MGZDF6E41r4](https://youtu.be/MGZDF6E41r4) Hollin Wilkins and Mikhail Semeniuk