
class: center, middle, inverse, title-slide

# From Prototyping to Deployment at Scale with R and sparklyr
### Kevin Kuo
### June 2018

---

# Menu today

- .Large[The deployment problem]
- .Large[ML pipelines]
- .Large[Model deployment and demo]

---
class: inverse, center, middle

# The deployment problem

---

# Deployment

.large[Or, putting ML models *"into production"*.]

<br />

.large[Basically, make it so that someone else can use your model (i.e. make some predictions with it).]

---

# Deployment - latency dimension

.pull-left[
.large[**Batch**

- Event-based/time-based
- E.g. nightly portfolio risk calculations, email campaigns
- It's OK to take a while
]
]

.pull-right[
.large[**"Real-time"**

- On demand
- E.g. instant loan approvals, fraud detection on credit card swipes
- Gotta be (relatively) fast: a few seconds down to sub-second
]
]

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 1/n: Putting ML models into production involves different expertise]]]

.large[...so it involves more people]

.large[...mo ppl mo problems]

---

# Deployment - mo ppl mo problems

Credit: [https://youtu.be/-K9SjrWpeys](https://youtu.be/-K9SjrWpeys) [@josh_wills](https://twitter.com/josh_wills)

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 2/n: Rapidly changing landscape in deployment options]]]

What do?

- Spark ML persistence?
- dbml-local?
- PMML?
- PFA/Aardpfark?
- MLeap?
- ONNX?
- Roll our own thing?
- Re-implement the model in C++, because performance?
- Throw it into a container and do orchestration cuz it's cool?

---

# Deployment - why is it hard?

.full-width[.content-box-red[.Large[Challenge 3/n: Too many ML frameworks and no standardization]]]

We're focusing on Spark in this session, but we acknowledge that there are other technologies we need to deal with.

---

# Deployment - diversity of ML frameworks

Spark ML, xgboost, random CRAN packages, scikit-learn, H2O, ...

![](figs/ml_everywhere.jpg)

---

# Deployment - one of many scenarios

.large[
**Data scientist**: Hey this random forest loan decision model is ready to go!

**Engineer**: OK, we need to recode it in C#/Java, see you in 6 months!
]
--

.large[
**Data scientist**: Oh no, that's too long. What if I give you this GLM with just a few parameters?

**Engineer**: 2 months.
]

.large[
**Data scientist**: 🤔
]

---

# Deployment - challenges for the R user

On *average*, R users tend to...

- Be math/stats types
- Have little CS/software engineering training

So it's slightly tougher for them to collaborate with the folks doing model implementation.

However,

- Data scientists (regardless of background) are becoming more comfortable moving up and down the stack
- There has been active development of technology to facilitate the data science-engineering handoff

---

# Deployment - technology

.full-width[.content-box-blue[
Technology won't solve your people/process/culture issues, but it can *make collaboration easier*!
]]

.large[Next up: we'll provide a quick review of Spark ML pipelines, and offer a couple of ways of "deploying" them using the **sparklyr** ecosystem.]

---
class: inverse, center, middle

# Spark ML pipelines

---
# ML pipelines

Pipelines are basically...

> A structure into which you can throw data transformers and ML models.

Keep in mind that when you deploy a model, you also need to deploy the feature engineering steps in order to feed the right inputs to the model! (E.g. converting a numeric `age` variable into an age range bucket that the model requires.)

Now let's go through a (very quick) overview of pipeline concepts.
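
---

# ML pipelines - feature engineering example

Before the overview, here is a minimal sketch of what the `age` bucketing step could look like with **sparklyr**. The `applicants` data and the split points are made up for illustration, and a local Spark installation is assumed.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via spark_install()).
sc <- spark_connect(master = "local")

# Hypothetical toy data with a numeric `age` column.
applicants <- copy_to(
  sc,
  data.frame(id = 1:4, age = c(19, 34, 52, 71)),
  name = "applicants",
  overwrite = TRUE
)

# Bucket `age` into [0, 25), [25, 45), [45, 65), [65, Inf).
# The output column holds a zero-based bucket index.
result <- applicants %>%
  ft_bucketizer(
    input_col = "age",
    output_col = "age_bucket",
    splits = c(0, 25, 45, 65, Inf)
  ) %>%
  arrange(id) %>%
  collect()

spark_disconnect(sc)
```

Whatever the model sees at training time (here, `age_bucket`) has to be reproduced at prediction time, which is why this step belongs in the pipeline.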

---

# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.

![](figs/transformer.png)

---

# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.
- An `Estimator` takes a data frame, via `ml_fit()`, and returns a `Transformer`.

![](figs/estimator.png)

---
# ML pipelines

- A `Transformer` takes a data frame, via `ml_transform()`, and returns a transformed data frame.
- An `Estimator` takes a data frame, via `ml_fit()`, and returns a `Transformer`.
- A `Pipeline` consists of a sequence of stages (`PipelineStage`s) that act on some data in order.
    - A `PipelineStage` can be either a `Transformer` or an `Estimator`.

![](figs/pipeline.png)

---

# ML pipelines

- A `Pipeline` consists of a sequence of stages (`PipelineStage`s) that act on some data in order.
    - A `PipelineStage` can be either a `Transformer` or an `Estimator`.
- A `Pipeline` is always an `Estimator`, and its fitted form is called a `PipelineModel`, which is a `Transformer`.

![](figs/pipeline_model.png)

---

# ML pipelines

![](figs/pipeline_fitting2.png)

---

# ML pipelines

![](figs/pipeline_transform.png)
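
---

# ML pipelines - example

The pipeline concepts can be sketched in **sparklyr** roughly as follows. This is an illustrative example rather than the demo code: `iris` stands in for real data, the formula and model choice are arbitrary, and a local Spark installation is assumed.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# copy_to() replaces "." in column names with "_" (e.g. Petal_Length).
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# A Pipeline (an Estimator) with two stages:
#   1. ft_r_formula(): feature engineering
#   2. ml_random_forest_classifier(): the model
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ml_random_forest_classifier()

# Fitting the Pipeline returns a PipelineModel, which is a Transformer.
pipeline_model <- ml_fit(pipeline, iris_tbl)

# Transforming a data frame appends prediction columns to it.
preds_local <- ml_transform(pipeline_model, iris_tbl) %>%
  select(Species, prediction) %>%
  collect()

spark_disconnect(sc)
```

Because the feature engineering and the model are fitted together, the single `pipeline_model` object is all that needs to be handed over.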

---

# Serving the model

Now, the trick is to persist this `PipelineModel` so we can use it to serve predictions later on.

We'll demo a couple of ways today:

- Native Spark ML persistence support
- MLeap (via the **mleap** R package)
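
---

# Serving the model - Spark ML persistence example

A minimal sketch of the native persistence route, assuming a local Spark installation. The data, the model, and the `model_path` location are placeholders chosen for illustration.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# Fit a small PipelineModel to have something to persist.
pipeline_model <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ Petal_Length + Petal_Width) %>%
  ml_logistic_regression() %>%
  ml_fit(iris_tbl)

# Hypothetical output location; any path Spark can write to works.
model_path <- file.path(tempdir(), "iris_pipeline_model")
ml_save(pipeline_model, model_path, overwrite = TRUE)

# Later, possibly in a different Spark session: reload and score.
reloaded <- ml_load(sc, model_path)
scored <- ml_transform(reloaded, iris_tbl) %>%
  select(Species, prediction) %>%
  collect()

spark_disconnect(sc)
```

Note that both saving and scoring go through a Spark connection here.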

---
class: inverse, center, middle

# Demo

---

# Model deployment paths

.pull-left[
.large[**Spark ML Persistence**
- Appropriate for batch jobs, scoring lots of records at once
- Requires a Spark session
]
]

.pull-right[
.large[**MLeap**
- Better for real-time prediction of a small number of records
- Doesn't require a Spark session; portable to apps/devices that support the JVM
]
]
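
---

# Model deployment paths - MLeap example

A rough sketch of the MLeap route with the **mleap** R package, closely following the package's own usage pattern. It assumes Maven and the MLeap runtime are already installed (e.g. via `mleap::install_maven()` and `mleap::install_mleap()`) and that the Spark version in use is one MLeap supports. The `mtcars` model and the `new_data` values are illustrative only.

```r
library(sparklyr)
library(mleap)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars", overwrite = TRUE)

pipeline_model <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg") %>%
  ml_fit(mtcars_tbl)

# ml_write_bundle() takes a sample of transformed data along with the model.
sample_input <- ml_transform(pipeline_model, mtcars_tbl)
bundle_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, sample_input, bundle_path, overwrite = TRUE)

spark_disconnect(sc)

# From here on no Spark session is needed, only the bundle and a JVM.
model <- mleap_load_bundle(bundle_path)
new_data <- data.frame(hp = 126, wt = 2.5, qsec = 17)
predictions <- mleap_transform(model, new_data)
```

The `.zip` bundle is the artifact that gets handed to whoever is serving the model.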

---

# Towards a better deployment story

.large[
**Data scientist**: Hey this random forest loan decision model is ready to go! Here is the `.zip` bundle, and here is the documentation you need to use it.

**Engineer**: Awesome! We won't need to write a bazillion if-else statements to recreate the model!
]
--

.large[When the model needs updating...]

.large[
**Data scientist**: We decided to use a GBM instead for better accuracy, here's the updated bundle.

**Engineer**: Fantabulous! All we need to do is update the model directory!
]

---

# Wrap up

Slides and code will be available at [https://kevinykuo.com](https://kevinykuo.com).

Inspirations/other talks to check out:

- "Productionizing Spark ML pipelines with the portable format for analytics" [https://youtu.be/h-B0VCkoRkE](https://youtu.be/h-B0VCkoRkE) [@MLnick](https://twitter.com/mlnick)
- "How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.x" [https://youtu.be/r740xbIpb54](https://youtu.be/r740xbIpb54) Richard Garris
- "MLeap and Combust ML" [https://youtu.be/MGZDF6E41r4](https://youtu.be/MGZDF6E41r4) Hollin Wilkins and Mikhail Semeniuk