Machine Learning Using

Hands-on Machine Learning


February 8, 2017

[Levi Thatcher]

The hands-on session, really practical learnings.  (00:10) from past Health Catalyst webinars that you might have joined.  So, let us just dive right into it.

So first off, I want to introduce my team at Health Catalyst that has been focusing on this project.

Health Catalyst Data Science Team [00:21]

So, I am leading the team and we have Mike Mastanduno who has joined us recently.  Taylor Larsen is on the team and Taylor Miller.  And so, different backgrounds from software engineering to algorithms to BI architecture.  The idea here is that this group of folks is building and using it to improve outcomes.  And we will try to get through these slides fairly quickly, so we can get to the code.

Purpose of today’s chat [00:47]

But we will just go through a little bit about what the package is and what it does.  And then we will get you started in RStudio after that.  So if you have not downloaded RStudio and if you want to follow along, there is a link in the chat window that Tyler has placed.  And then after that we will talk about the roadmap as to what we want to put into the package, and you guys may be interested in that piece and may want to ask some questions in the chat window or provide suggestions as to what you would like to see.  And then finally we will go through a question and answer period, which will be extended compared to a lot of the webinars, as this topic is ripe for discussion.  We are excited to see what is on your minds.

What does the package do? [01:29]

So, briefly, what does the package do?  It basically enables quick model creation and deployment, and it does that by giving you pre-processing functions.  So a lot of times when you are building models you need to pre-process the data: you need to drop columns, you need to transform columns, you need to get them ready for the algorithm to work on them.  That is in the package.  And we provide appropriate algorithms for healthcare.  So, machine learning is kind of a buzzword, and that does not mean it cannot do amazing things, but it is super popular right now and it is popular across many industries, whether it be retail, social media, or financial industries.  And what we did is we went out there and looked at machine learning in general and brought back the most typical and appropriate algorithms that would be beneficial for healthcare specifically and put those in the package.
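As a rough illustration of the pre-processing described here (dropping a column, transforming a column), the following is a minimal Python sketch using only the standard library.  It is not the package's own API; the function names, column names, and values are all invented for this example.

```python
# Hypothetical rows; the column names and values are invented for illustration.
rows = [
    {"patient_id": 1, "gender": "F", "a1c": 7.1, "notes": "n/a"},
    {"patient_id": 2, "gender": "M", "a1c": 5.4, "notes": "n/a"},
]


def drop_columns(rows, names):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def encode_binary(rows, column, positive):
    """Replace a two-valued text column with 1 (positive value) or 0."""
    return [{**row, column: 1 if row[column] == positive else 0} for row in rows]


# Drop a column the model should not see, then make the text column numeric.
prepared = encode_binary(drop_columns(rows, {"notes"}), "gender", "F")
print(prepared)
```

Each helper returns new rows instead of changing the input, which makes it easy to try a different set of columns without reloading the data.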

And also, whenever you are building a machine learning model and looking to predict something, to test how accurate your model is, you have to have appropriate metrics, and this could be different depending on questions.  So whether you are in finance or operations, you may have different metrics used to evaluate your model.  And so, we grabbed the ones that are most appropriate for typical healthcare questions that you would see.

And finally, this may be the most important and most unique part of our package.  So, we let you easily deploy models into your production environment, and that is pretty novel.  So across healthcare, you have seen research groups here and there that have created models and predicted sepsis or predicted readmissions, and your organization may have had data scientists that have done this already.  That is exciting and interesting, but what happens a lot of times is that the models do not get deployed.  So, they do not end up impacting outcomes most of the time.  What happens is the research gets done and there is a model that is developed, it is accurate, but in the end, the questions (03:26) benefiting because the model languishes in that development environment.  So one thing we are excited about here is the fact that the package makes it super easy for you and your organization to actually develop a model as well as place it into an ETL environment, such that nightly you are able to get refreshed predictions for the new patients you want to learn about.

Who is the package for? [03:50]

And who is this package for, and who is the audience today?  So we saw the poll question here a moment ago, and that is very relevant to who we designed the package for.  So, broadly, for technical people.  So imagine if you are interested in machine learning, you may have some technical bent and you may be interested in software in general.  And so, if you have run code before in your life, whether it be Java or C or MATLAB, or have been excited to run code, the package will be very appropriate for you, and for business intelligence folks broadly.

So what has happened in a lot of machine learning realms across the different industries is that it is typically the data scientists that are doing the model building, the algorithm work, and we wanted to democratize machine learning, to build a tool, a package that could help more than just the data scientists get their hands dirty and get involved in model creation.  And so, for those out there that have wondered, you know, hey, data science and machine learning are a little bit intimidating, what we are trying to do with this tool is to provide a general introduction not only to R and Python but to machine learning in general.


But of course data scientists will benefit as well, because you do not want to reinvent the wheel every time you try to create a model and put that into production.  It is helpful to have code that is under unit tests, that is under source control, that is just really checked very thoroughly, such that when you are creating a model, you will be able to focus on the things that matter, which is what columns or features you bring into the model and what their impact is on model performance.  So whether you are more of a business intelligence-type person, a software engineer, or a data scientist, this tool has been (05:33) audiences and really helps you focus on what is most important in model creation.

Poll question – Operating System [05:39]

And we will go to our first poll question.

[Tyler Morgan]

Alright.  We have our first poll question, which is, at work, what operating system would you use for R/Python?  Please select one of the following – Windows, Mac, or Linux?  We will leave this open for just a few moments to give everyone a chance to respond.

We are getting some questions and comments regarding where the link is to install R and Python.  We have included that information in a YouTube video, and we put the link in the chat.  We also recognize that we have had some audio issues.  We are working on those to make sure that our audio is as loud and clean as possible.

So let us go ahead and close this poll and let us share the results.

Poll Results [06:30]

So 79 percent Windows, 11 percent Mac, and 11 percent Linux.

[Levi Thatcher]

Thanks Tyler.  Very interesting.  Okay.  So we have an idea.  We were thinking that Windows is popular across healthcare but never had data as to how popular it is.  So it is very fascinating that Windows dominates so heavily.  And to be honest, the packages are primarily designed for Windows users at heart, but we are making sure that they are platform-agnostic so that the folks out there who are running Linux or Mac can use them as well.  So we are conscientious of that.

And why not do a follow-up poll question as well?

Poll question – R vs Python [07:09]

[Tyler Morgan]

Alright.  Our next poll question is, have you run R or Python code before?  And so please select one of the following: R, Python, neither, or both.  So this will give us a good opportunity to see the experience around these tools.  We will leave this open for a few moments to give everyone a chance to respond.

Alright.  Let us go ahead and close the poll and share the results.

Poll Results [07:40]

Results show that 22 percent have run R, 14 percent Python, 36 percent neither, and 28 percent have run both.

[Levi Thatcher]

A pretty broad range of experience there.  That is very interesting.

So a little more R-focused than Python-focused in healthcare, in part perhaps because we focused this webinar on that specific language, but it is good to gather this data about the industry so we can all learn from each other and see what the community is most excited about really.

Algorithm choices for the package [08:13]

So moving along with that out of the way, so mainly Windows-focused, and a little bit of a lean towards R.  So in this R package, how did we choose algorithms, and which algorithms are actually in here?  So the first one is Lasso.  So from your stats classes in maybe college or high school, you have probably heard about these linear models that you can create, fitting a line to data, and they get more sophisticated from there.  But Lasso is a linear model, and what it basically allows you to do is not only create a linear model but also see which columns or features are most important in your model.  So, you may be wondering, okay, well I have 20 columns that I brought in, which ones need to be kept around, and which ones might I want to get rid of to make my processes more efficient?  That is where Lasso shines: it allows you to simplify your data set to only those columns that are really impactful.
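To make the column-selection idea concrete, here is a minimal Python sketch of the soft-thresholding step that sits at the heart of Lasso.  In the special case of orthonormal features, the Lasso coefficients are exactly the ordinary least-squares coefficients shrunk this way.  The coefficient values and the penalty below are hypothetical, and this is not how the package itself is implemented.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero; small ones become exactly zero."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0


# Hypothetical unpenalized coefficients for three columns.
ols = {"a1c": 0.5, "ldl": 0.0625, "gender": -0.125}
penalty = 0.25  # assumed penalty strength; larger values drop more columns

lasso = {name: soft_threshold(value, penalty) for name, value in ols.items()}
kept = [name for name, value in lasso.items() if value != 0.0]

print(lasso)
print(kept)
```

Columns whose coefficient lands on exactly zero can be dropped from the data set, which is the simplification described above.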

Now, the Random Forest algorithm is also in the package, and it is another workhorse of the machine learning world.  And the Random Forest, you may have heard of, maybe not.  What it basically is, is an ensemble of decision trees.  So if you imagine that you have attributes about people in your data set, and whether they were admitted to the hospital or not, well what the Random Forest will do is it will create maybe a hundred decision trees that will map personal attributes to particular outcomes, and the Random Forest uses a voting procedure to say, okay, well amongst all of these decision trees, what is the most appropriate output for these inputs, mapping inputs to outputs like most machine learning algorithms do.  So, one linear algorithm, one ensemble algorithm.
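The voting step can be sketched in a few lines of Python.  This is only an illustration of majority voting: the three one-question "trees" are written by hand with made-up thresholds, whereas a real Random Forest learns many deeper trees from random samples of the data.

```python
from collections import Counter


def make_stump(feature, threshold):
    """A one-question 'tree': predict 1 when the feature exceeds the threshold."""
    return lambda person: 1 if person[feature] > threshold else 0


# Three hand-written stumps stand in for trees a real forest would learn.
# Feature names and thresholds are hypothetical.
forest = [
    make_stump("a1c", 6.5),
    make_stump("systolic_bp", 140),
    make_stump("age", 65),
]


def predict(forest, person):
    """Collect one vote per tree and return the most common answer."""
    votes = Counter(tree(person) for tree in forest)
    return votes.most_common(1)[0][0]


patient = {"a1c": 7.2, "systolic_bp": 150, "age": 50}
print(predict(forest, patient))
```

An odd number of trees is used here so that a two-class vote can never tie.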

I should note that in machine learning problems, we typically talk about algorithms being matched with data to lead to models.  So we would use these Lasso or Random Forest algorithms in the package to come up with models.

Difference between R and Python packages? [10:17]

Now, we have a package for R and Python both.  Both are huge in data science and machine learning across many industries today.  And so, we thought we could not make a decision at this point and focus on just one or the other, since as we could see from the poll question many of you have experience in one or the other.  So we are excited to offer both, and the differences between them are slight, and we will try to minimize them as much as possible, but right now the R package has a little more functionality.  We have been focusing on that a little bit more at Health Catalyst, both for our internal education and for the community as well.  One of the reasons for that is that it seems a little more newbie-friendly.  So if you are just getting into data analysis and machine learning, the R environment is a little bit more cohesive for a beginner to understand.

But Python of course is super important, and we will use that to leverage large data sets.  What you see happen is that as you get a data set that has maybe 500,000 or 700,000 rows, an algorithm like the Random Forest or Lasso only benefits so much from more and more data, and this is illustrated in this plot right here.  So this comes from Andrew Ng’s new book, Machine Learning Yearning.  I kind of love that title.  And what we basically see here is your traditional learning algorithms at the bottom in red, things like Lasso or Random Forest: as you feed them more and more data, the performance does not really increase that much.  There is a strong plateau there.  But as you start to use things like neural nets and even deep neural nets, you are able to take advantage of larger data sets in a way that you cannot with Random Forest.  And those types of deep learning packages are more suited for Python than R.  And so, that will be on our priority list in the next couple of months: making sure that the Python package has deep learning functionality, such that if you have a data set with a couple million rows, you are able to really take advantage of each one of those rows and you do not see this plateau after just creating a model with 500,000 rows, which, as you can see, is what happens in this graph right here.

Workflow of deploying a model using the package [12:29]

So, what is the workflow?  How do I actually use the tool and how do I think about model creation and deployment?  So the two steps are basically these.  So first off, you think of the development step where you are trying different features, and when I say features, you can think of those as columns; feature is sort of a machine learning term.  And so, you are trying different features, you are pairing an algorithm with these features to see how accurate your model is.  We mentioned Lasso and Random Forest.  And so, you will go and you will say, okay, well for these 10 features in Lasso, what is my performance for this model that is created?  And you compare that with Random Forest and you say, okay, well Random Forest did a lot better with these particular features.  And so, I want to deploy a Random Forest model into my production environment.  So that is the next step.  And like I mentioned before, that is really where the package shines, in that across many industries, it is this deployment step where many projects just sort of flounder and fail: actually putting the model into production such that each night you are getting a new prediction that helps you answer some business question that is important to you.

Poll question – I/O [13:42]

Now, going to our next poll question.  Tyler?

[Tyler Morgan]

Alright.  Thank you, Levi.  So our next poll question is what type of data connections would you primarily use for R?  CSV, SQL Server, MySQL, Postgres, or Oracle?  We will leave this open to give everyone the opportunity to respond.

Now, we definitely have some data scientists on the line.  We have got some questions about the number of participants in these polls.  We would like to say that when we distribute the slides out to everyone, we will make sure to have the poll questions populated as well as the responses and the number of participants as well.  So you will have all of that data for all of you.

Let us go ahead and close the poll and let us share our results.

Poll Results [14:30]

26 percent responded CSV, 51 percent SQL Server, 9 percent MySQL, 3 percent Postgres, 11 percent Oracle.

[Levi Thatcher]

Oh interesting.  Okay.  I did not know SQL Server dominated in terms of databases over Oracle.  Really in healthcare that is really interesting.  And CSV is also very common.  That is good to know.

So, we have of course had to use (14:54) word of mouth as to what to focus on, as this kind of data is hard to come by.  So in designing these packages, we mainly focused on SQL Server, since Microsoft is pretty popular.  We are also of course concerned with TXT and CSV files, and that is actually the topic of this slide.

I/O for the package [15:12]

So, databases.  The connections are set up such that you can easily interact with SQL Server from the R package, and then it is also very easy to pull in CSV files or TXT files or really any type of file that you use.  And we will use this information to sort of design the roadmap from here.  So Oracle is of course very important, it looks like, and we would love to have the more open source type connections as well, with MySQL and things like Postgres.  So we are going to try to become agnostic as to the type of connection so that we can make the interaction with the community as broad as possible.

Use cases [15:50]

So, you may be asking yourself, okay, well we have this tool and we can create a model, but so what?  So what am I supposed to do with this?  So, let us run down some of the things that we have been doing with the algorithms in our day to day.  So 30-day readmissions is big for a lot of different hospital systems.  There are penalties, and it causes a lot of vexation across all systems.  Okay, we have tried this and tried that and we really cannot get our 30-day readmissions down below a certain metric.  And so, we are getting penalized by CMS.  And so, that has come up a lot.  And that is a nice prediction in that you can get pretty quick feedback in terms of how accurate your model is.  If you think about predicting the onset of diabetes or heart failure, that is really a long-term prediction.  But with 30-day readmissions, after 30 days, you get a really clear picture as to whether your model is accurate or not, because you have those actionable results coming in.  The person is either readmitted or they are not.

And then other things like that that we have been working on are hospital-acquired infections like CLABSI.  CLABSI is another pretty quick feedback loop, in that people with lines in them usually do not have them in for a long period, and the CLABSI models have turned out to be quite accurate among the ones we have been working on.  So that is a great use case in healthcare.  CLABSI, Central Line-Associated Blood Stream Infection, is quite a serious illness.  Actually 1 in 12 people with those actually die.  So, there are a lot of high-impact things you can do in your health system pretty quickly with the package.

There are more operational and financial type use cases as well.  So a lot of people are interested in no-shows: how do I predict who is not going to show up to their appointments, so that we can have an intervention and give them a call, a reminder ahead of time, or on the other hand, know when to double-book that slot, such that the clinician can have a more efficient day.  Things like propensity to pay, if you want to improve your financial outlook.  That has been a big one as well.  Then census, so predicting the number of beds that will be used and occupied in particular units of the hospital, such that you can plan on bed utilization as well as the staffing for the hospital in general, which is really critical to get right.

Examples [18:10]

So, some examples, and this is the most exciting part of the webinar.  So thanks for sticking through those slides.  So if you want to follow along, you can pull out RStudio.  And if not, this will be recorded so you can watch it later.  I am going to jump out of PowerPoint here and hop over to RStudio.

[RStudio Demo starts at 18:29]


And what we will do is show the basic RStudio view, just kind of give you a little walkthrough, since some of you may be new to this realm.  But what you see on the left-hand side is what is called the Console, and this may be familiar to you or not, but what you can do with the Console is type commands and it will do computations for you.  So you can start really basic and do things like 5 plus 1 is 6 and work up from there.  And the Console is really good for interactive use.  So quick calculations, maybe running a t-test or something like that, you would do in the Console.

Now, the right-hand side has those other tabs, and this may look slightly different from your RStudio, but you have things like a Package list.  You will have a Help tab where you can read documentation.  Up in the Environment tab here you will see some variables loaded into your environment, and you have a History tab as well, which shows you the past commands you have entered.

And so, what we want to do today is play with the package.  So to do that, type library(healthcareai), and that actually loads the package into the environment, so you can start playing with it.  And to read the documentation at any point, if you ever get stuck or if you have a question, something does not make sense, type ?healthcareai in the Console, and you will notice in this Help tab over here on the right, the documentation comes up.  And the great thing about R is that we can very easily check with each change in the code if the associated examples and documentation are up to date.


So you will notice that there is indeed this two-step process.  So, developing a model, and deploying a model.  Let us just say I have some data that I am excited to play with, and I want to create a model on this awesome data set I put together.


So you click in the Random Forest development here and you will notice that you have some parameter description – so what can I do with this Random Forest function.  And scroll down, you will see some example code and that is where it gets fun.


So let us see this Iris example first off.  And Iris is a built-in data set in R, and so it is a super simple way to get started.  And if you scroll down further, you will see a CSV example, so that is helpful of course for the CSV folks that we saw in the poll (20:50).


But if you keep going, we are going to actually play with a SQL example, which looks like it will be relevant to a lot of you.  It is right here.  And so, when you get started playing with healthcareai in RStudio, what you can do is just simply grab the code, just copy it right here, scroll down a little bit further.  Control C.  And instead of putting this in the Console, what we want to do is open up a new script.  So if you click in this top left corner here, you open up a script such that you can run a bunch of commands sequentially, and we will just paste the example code right there.


So we will focus on that top left corner here for a minute and kind of describe what exactly is going on, how I actually create a model on this awesome data that I have.  So first off, to do that, what you will focus on primarily is getting the data into RStudio.  That is a pretty simple process with the tools we have created.  So you will notice that we have a connection string, and along with that, we actually load in the package.  So this gets the functionality of the package working for you.  And let us say we have some data in SQL Server and we have it on a local instance of SQL Server.  So we just use localhost as the server name.  Say the table of interest is in the SAM database, and this is maybe a little bit trivial for some of you, but for others it may be of interest.  Then this query here is the main driver of what we are pulling into R.  And of course you can put into this query whatever you would put into a query when you are looking at a database typically.  And what you do is you run this code down to line 25, using the select data function, and we put the result into this data frame called df.  I will just kind of pause and describe the data frame.
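For readers following along in Python, the same pattern (connect, run a query, keep the result as a table in memory) can be sketched with the standard library alone.  An in-memory SQLite database stands in for the local SQL Server instance from the demo, and the table name, column names, and values are invented for this example; this is not the package's own connection code.

```python
import sqlite3

# In-memory SQLite stands in for the SQL Server instance used in the demo.
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # lets each row be read by column name

# Hypothetical table and values, created here so the example is self-contained.
connection.execute(
    "CREATE TABLE diabetes (patient_id INTEGER, a1c REAL, readmitted TEXT)"
)
connection.executemany(
    "INSERT INTO diabetes VALUES (?, ?, ?)",
    [(1, 7.1, "Y"), (2, 5.4, "N"), (3, 8.3, "Y")],
)

# The query is the main driver of what gets pulled in.
query = "SELECT patient_id, a1c, readmitted FROM diabetes ORDER BY patient_id"
df = [dict(row) for row in connection.execute(query)]
connection.close()

print(df[:2])
```

Here `df` is a plain list of dictionaries, one per row, which plays the role of the data frame in this sketch.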

So in R, data frames are first-class citizens, and they are a tabular way to work with data.  So for those of you that have worked with CSVs or databases, it is a pretty familiar way to work with data, in that you have columns and rows.  And so, let us go ahead and just run this to line 25.  So if you want to do that, you do Control-Alt-B, or you can simply select the code of interest and click Run.


So, the same either way.  And awesome.  Okay.  Here is our data that we read in.  Just the top 6 rows are showing for simplicity’s sake, but we have some fake data to play with, and it is fairly typical of healthcare: A1c, GenderFLQ, LDL, blood pressure.  And okay.  So that is great.  Let us do something with it now.

So if we keep scrolling, we can see what else happens in this model prediction process.  And so, if you look on line 27, for some reason, maybe you want to take a column out of the data frame.  Of course, you could take it out of the query, or you can just use this NULL assignment.  Notice that you use a dollar sign operator to access a particular column in the data frame.  And so, right here, what we are doing is assigning this column to NULL using this operator of a less-than sign and a hyphen, and that is the R way of saying equals, or assigning something to something.  Now, the set.seed is here to make the example reproducible, so it is the same each time.
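The two ideas in this step, removing a column and fixing a random seed so results repeat, look roughly like this in plain Python.  The rows and the helper name are invented for illustration, and the seed value 42 is an arbitrary choice.

```python
import random

rows = [{"patient_id": 1, "a1c": 7.1}, {"patient_id": 2, "a1c": 5.4}]

# Remove one column from every row, similar in spirit to assigning it NULL in R.
for row in rows:
    row.pop("patient_id", None)


def shuffled_indices(n, seed):
    """Return a shuffled list of 0..n-1 that is identical for a given seed."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# The same seed gives the same shuffle, so a train/test split would repeat.
first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)
```

Using a dedicated `random.Random(seed)` object, as opposed to seeding the global generator, keeps the reproducibility local to this one step.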


And you will see right here a bunch of arguments, and this is kind of the meat of where we should focus right now.  So, you will have different use cases pop up.  Whether you are predicting length of stay or predicting CLABSI or readmissions, here is where you will adjust to make each of those predictions possible.  And so, classification is typical for binary-type predictions, you know, readmission or not, CLABSI or not, yes or no for a no-show.  And so, that is why that is noted right there.  Of course, regression would be more appropriate for things like length of stay.  And so, we specify classification.  We go down and we specify our grain column, and you typically have those in healthcare.  So this is going to be MRN, patient ID, patient encounter ID.  And if you do not have that, there is no problem.  You can just leave that parameter blank or just delete it entirely.  Then on line 36 is the meat of things.  So if you are wanting to predict readmission, you simply specify that there.  And then there are a couple of other minor things like, okay, well for imputation, yeah, we have some null cells in this data.  So we will turn imputation on and have it fill in the null cells.  We do not need to debug because this is a pretty simple example (25:25) processor of this computer, just one core.  So that is what this cores setting is about.
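As a small illustration of what turning imputation on can mean, here is a Python sketch that fills missing numeric cells with the column mean.  Mean imputation is one common strategy and is assumed here for the example; the transcript does not say which method the package uses, and the LDL values are made up.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical LDL column with two missing cells.
ldl = [100.0, None, 140.0, 120.0, None]
print(impute_mean(ldl))
```

Observed values are left untouched; only the missing cells are filled.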


So if we run down to below the Lasso there, then you can see right here we run our Lasso.  (25:36) Control-Alt-B on line 42, and the computer chugs for a second and we get our results, output to the Console.  And this is the main where (25:46) with these models if you created these development steps.  So we are going to see, okay, well our AU_ROC is 0.74, which is decent.  So this is a common way to evaluate classification algorithms: using this AU_ROC, or AUC, metric.  Now that means there is decent predictive power.  Not great, nothing to write (26:08) about, but it is okay.  It is better than a lot of the heuristics out there in healthcare.  And there are of course other ways to measure accuracy, and there is AU_PR as well.
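For anyone curious what the AU_ROC number measures, it equals the fraction of (positive, negative) pairs in which the model gives the positive case the higher score, with ties counted as half.  The Python sketch below computes it that way on a handful of made-up labels and scores; it is a teaching version, not the routine the package calls.

```python
def au_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = readmitted) and predicted probabilities.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(au_roc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why 0.74 reads as decent but not great.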

Now, one interesting thing that a lot of data scientist workflows do not consider is the variable importance.  So, Lasso right here offers an indication as to which columns you should keep in your data if you want to have a very accurate prediction.  So, you can get rid of LDL if you want.  That did not really impact the prediction.  If you want to get rid of the gender flag, that did not really help.  But A1C number, that was very important.  So if you are tight on space or tight on processing time, you can just keep around the variables that are absolutely necessary.

But it is not great if you do not compare these two algorithms.  So let us scroll down and actually use the Random Forest algorithm as well and we will run right below Random Forest and we will see a comparison.  So there is Lasso above and Random Forest down below.


And the idea here is that you have AU_ROC for both.  And oh look, Random Forest did pretty well on this data set, with a 0.91 AUC, which is a pretty great prediction.  Of course 1 is the optimum, but that is rarely achieved in reality.  So, what this would tell you is that, okay, on this data, the Random Forest algorithm creates a better model.  And so, let us go ahead and deploy that.  That is the one that we want to take into production: the algorithm that performs best, obviously.  And the other part of the process here is that you will likely be changing up your query as you think of more variables to pull in.  So in a modern health system, there are thousands of columns out there, and it takes some SQL work to get certain of them into a data set.  But as you are able to do that and pull more in and iterate, trying one algorithm versus another, you can often improve this accuracy over time as you find the variables that are pertinent to your business question.  And of course, as you do that, you compare Lasso and Random Forest again.  And when you get to the point where you are in a range that is appropriate for your use case and your time constraints, you say, okay, well let us deploy, and how do we do that?
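The selection rule at the end of the development step is simply "keep the algorithm with the higher score."  A minimal Python sketch, using the two AU_ROC values reported in the demo (the dictionary keys and function name are made up for illustration):

```python
# AU_ROC values reported in the demo for the two algorithms.
development_results = {"lasso": 0.74, "random_forest": 0.91}


def best_model(results):
    """Return the name of the model with the highest score."""
    return max(results, key=results.get)


print(best_model(development_results))
```

Each time the query changes and the models are re-run, the dictionary would be refreshed and the comparison repeated.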


Well, it is a fairly simple two-step process like we mentioned.  So when you are done with this development step and you have your algorithm and you have your feature list, you go over into the deployment section here and you say, okay, well I am going to go to Random Forest deploy, and then of course you have your description of the parameters.  And if you scroll down, of course, there is awesome example code that you can run with very minor modifications.

And so, that is kind of an overview as to how you connect to SQL Server and create some models in RStudio.  And we are going to turn it over to Tyler here for a quick poll question.

[End of demo at 29:08]

[Tyler Morgan]

Alright.  Before we go into that poll question, we would like to acknowledge that there are some folks that have been having some trouble loading the library.  We tried to submit some instructions in the chat: the package must be installed first.  The installation instructions can be found on the website, and the instructions to install it on Windows, Mac, and Linux are all on that particular page.  So the package will need to be installed first, and then it can be loaded.

And as we transition over to Mike, we do have a poll question.  And this is kind of exciting.  We are developing a community.  And Levi, do you want to talk a bit more about the community and what will be involved there?

[Levi Thatcher]

Yeah.  So the idea is that we are all better working together than separately.  So we want people to reach out and to say hey, let us improve the tools together, let us contribute because we have had these learnings in our health system.  And so, we want to have frequent interactions with everybody.  And we will be starting a series of these interactions with a weekly broadcast.  We are going to start off weekly and go from there, and also interacting with the blog as well.  And so, trying to keep these frequent interactions where we can learn from each other we think will be really helpful to everybody.

Poll Question

Would you like to join the community? (newsletter and live broadcasts)


[Tyler Morgan]

And so, I am going to launch a poll to ask if you would like to join the community.  This includes a newsletter, with some of the blog posts and things that Levi provides on a regular basis, as well as our weekly live broadcast.  And currently we have a proposed date for starting those live broadcasts.  On February 23rd, we hope to do the first one.  If you join the community, we will make sure that you get all the information about this.  We are very, very excited about this and about the openness, and in this community we really want to be able to get a lot of participation.

Thank you for responding.  Let us go ahead and share the results.

Poll Results [31:08]

It looks like about 89 percent of those who responded are very favorable towards the community.  We are excited about that and about working together to make these packages work right for everyone.

So, I am now going to turn the time over to Mike Mastanduno.  Mike.


[Mike Mastanduno]

Thanks Tyler.  So it is really great to see that everyone is excited about being part of the community.  That is actually one of the main reasons that we decided to make this package open source: if we have more people working on the package, we are going to get more functionality, and we really want to democratize machine learning for everyone.  So, the more enthusiasm we can get behind the package and behind development, the better it is going to be for everybody.

So, I wanted to go through another example and I just cleared my Console and started a new script, but we have a lot of CSV people in the audience and I wanted to just show how easy it can be to load a CSV example, build the model and then kind of evaluate which ones you might want to use.

So I have a CSV file just on the desktop.  So I am just going to navigate over to the desktop, which is here.  And I am going to set that as my working directory.  And then I am going to, just like Levi did, I am going to load up one of the examples.


And we had some people having trouble loading the library and I think that is because you have to install the package first.  Again, you can find the link on how to install that for all three operating systems on the website.  And we are familiar with some issues, and the installation process can take some time.  So if you cannot get it installed, do not worry about it.  Maybe you just kind of watch and after the session you can follow along with the documentation or the video and you will get there as well.  Again, we can just type ?healthcareai and then it will get us to the Help pages over here.


We can go into the Random Forest development to copy and paste our example.  And this time, I will actually grab the CSV data.  So I will just highlight that, scroll down a little bit.  And just as Levi did, I will copy that into a script so we can run multiple functions at once.


And so, this is just the example code that comes with the package.  If we want to use our own CSV file as the data set, all we have to do is change the name of that file.  So let me change that to the data set that I have prepared.  And I will just talk a little bit about what that data set is.


So my data set is called Wisconsin Pathology Data and it is actually from an openly available data source where different cell slides were examined by a pathologist to see if the cell structure was either malignant or benign, and then features were recorded about the slide data.


[Levi Thatcher]

We can just type the full path.

[Mike Mastanduno]

So if we do C…Oh was that it?

[Levi Thatcher]

Yeah, I think that was it.


[Mike Mastanduno]

There we go.  Sorry about that.  With all downloads, we are going to have some hiccups.  So these are characteristics of different cells on pathology slides where a pathologist has gone through and kind of labeled the mean radius and the texture, the perimeter, the area, all features that a pathologist would use to evaluate whether a particular cell looked like it was part of a malignancy or a benign lesion.

So now that we have our data loaded, all we have to do is – we do not want this line.  So we can get rid of that.  All we have to do is change the model parameters to fit our data.  So we have a column called patient ID which we are still going to use as the grain column.  We are doing a classification.  So let us write that in.


And then the column we actually want to predict is the diagnosis, which is either M or B for malignant or benign.  So then that should be all we need to do to make a model.  So we can try running the Lasso now.  I will just run everything up to there.  It is chugging away on the Lasso.  Give that a second.  It is a larger data set than the other one because it has 30 features on the cellular characteristics.  It takes a really long time there.
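The steps described here (point at a CSV file, name the grain column, name the column to predict, and treat everything else as a feature) can be sketched in a few lines.  This is a minimal Python illustration using only the standard library, not the package's own R functions; the column names `patient_id`, `radius_mean`, `texture_mean`, and `diagnosis` and the four inline rows are made-up stand-ins for the real file.

```python
import csv
import io

# Made-up stand-in for the CSV on the desktop; real column names may differ.
SAMPLE_CSV = """patient_id,radius_mean,texture_mean,diagnosis
101,17.99,10.38,M
102,13.54,14.36,B
103,20.57,17.77,M
104,12.45,15.70,B
"""


def split_roles(csv_text, grain_col, predicted_col):
    """Separate the grain (row ID), the label, and the numeric features."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [c for c in reader.fieldnames
                     if c not in (grain_col, predicted_col)]
    ids, labels, features = [], [], []
    for row in reader:
        ids.append(row[grain_col])
        # Classification target: 1 for malignant (M), 0 for benign (B).
        labels.append(1 if row[predicted_col] == "M" else 0)
        features.append([float(row[name]) for name in feature_names])
    return ids, labels, features, feature_names


ids, labels, features, feature_names = split_roles(
    SAMPLE_CSV, grain_col="patient_id", predicted_col="diagnosis")
print(feature_names)
print(labels)
```

The grain column is kept aside rather than fed to the model, since a row identifier carries no predictive signal.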


There it goes.  Okay.  So if we scroll up, we can see kind of what the algorithm found, and I think that is typical for how long that took.  Maybe it could be a little faster if we trimmed down some of the features, but the Lasso algorithm is great for telling us which ones we might want.  So for instance, the radius mean was not used but the radius standard deviation was quite important, judging just by the number being bigger.  The concavity mean ended up being used, and the concave points mean was quite important.
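Why the Lasso reports some features as "not used" can be shown with a small sketch.  The example below is an assumption-laden illustration in plain Python: it fits a linear Lasso by coordinate descent on made-up, centered toy data, whereas the demo ran a classification model inside the package.  The point is only the mechanism: the L1 penalty pushes weak coefficients to exactly zero.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_passes=100):
    """Minimize (1/2n)*||y - X b||^2 + penalty*sum(|b_j|), one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_passes):
        for j in range(p):
            rho = 0.0    # correlation of feature j with the residual that ignores j
            scale = 0.0  # sum of squares of feature j
            for i in range(n):
                prediction_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - prediction_without_j)
                scale += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return beta


# Made-up, centered toy data (no intercept needed): column 0 drives y,
# column 1 barely matters.
X = [[-1.5, 0.5], [-0.5, -0.5], [0.5, -0.5], [1.5, 0.5]]
y = [2.0 * a + 0.1 * b for a, b in X]
beta = lasso_coordinate_descent(X, y, penalty=0.1)
print(beta)  # the second coefficient is shrunk to exactly 0.0
```

A feature with a zero coefficient drops out of the prediction entirely, which is what makes the Lasso output readable as a list of features worth keeping.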


And then the accuracy of this model overall is a 1.  So that is perfect.  And we use a couple of different metrics to evaluate models by default.  We use the area under an ROC curve or the area under a precision-recall curve.  And you can read about that on our blog, which we are trying to post to at least twice a week to kind of help with questions that range from what the package is to more technical things like why you should use an area under a PR curve versus an area under an ROC curve and what the benefits of each are.  So, we have really tried to appeal to multiple audiences.  You can find that blog on our website, and a link for that should show up in the chat fairly soon if you are interested.  We would love to have you subscribe.  Again, it is all part of the community.  We want to see involvement and improvement.  So we can go and run this Random Forest model as well.


And again, the Random Forest model gets an area under the curve of 1 but the AU-PR is a little lower.  So perhaps the Lasso is a better algorithm for this data set.  But really with such great accuracy, what that really tells us is that this data set is kind of an easy problem for machine learning to solve, which is why pathologists are so great at their jobs – because, you know, with all the features laid out, the algorithms can do it too.
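The two metrics compared here can be computed by hand on a small example.  The sketch below is a plain-Python illustration, not the package's implementation: the area under the ROC curve is computed as the fraction of positive/negative pairs that the model ranks correctly, and the area under the PR curve is summarized as average precision, which is one common way to do it.  The labels and scores are made up, and tied scores are not given special treatment in the PR calculation.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Made-up predictions: one benign case (0.7) is ranked above one malignant case (0.6).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(labels, scores))            # 8 of 9 pairs ranked correctly
print(average_precision(labels, scores))  # (1/1 + 2/2 + 3/4) / 3
```

Both metrics reach 1 only when every positive case is scored above every negative case, so two models can tie on one and still differ on the other.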

And similarly to Lasso, Random Forest is great because it will help you decide which features were most important in the evaluation of all the data.  And this can be a really great part of adoption into clinical use, because I know this is just a toy data set, but if you were to show this to a clinician, they might say, okay, well, why does this prediction go one way or the other?  And here you can say, well, it is based on the perimeter, the worst perimeter, the worst radius, the worst concave points in the slide, and those are the kinds of things that the pathologist is going to be looking for.  And so, just like that, you have started to build a little more trust.


So the last thing I want to do is kind of just look at how these different models compare to each other.  So I will just go down a little further in the example code and copy the part that gets us to the ROC.  So we can grab that, put it in the script and then I will run that.  I will just highlight and evaluate that section.  And there we go.  We have got an ROC curve for both models with an area under the curve that is nearly perfect.  But I hope this example was able to kind of show you just how easy it is to take a CSV file and change the example code to cater to what you need to do.

So I guess with that, we are going to go back to the slides.  So Levi can jump back into the slides and I will leave it with him.

[Levi Thatcher]

Thanks Mike.  Fantastic example.  So as you can see, we have tried to make it fairly streamlined to handle multiple types of data connections and we will keep adding those as time goes on.  As we hear from you all, being able to learn what problems you are working on, what points you are getting stuck on, things will get better with the interaction and the community involvement.

Roadmap [40:22]

So, we want to talk a little about that roadmap actually.  So where are we going with this package, what is next for the team, what can you expect?  And you can see right there.  So, as I pull this up, we have CRAN on the docket here this week actually.  So Mike has just been focused on working on that, and actually just submitted to CRAN this morning.  So we will be working with the CRAN folks to get our package up on their service, which makes it really easy to download and makes sure it passes proper checks and sort of has a stamp of approval on it from the R community.

And then submitting to PyPI.  So that is the Python Package Index.  And so that will get our Python package a similar stamp of approval and similarly increase the ability for users to download it and streamline that process.  We are going to switch to.

Well, considering the input today, we will have to talk about this.  But we want to make our connections sort of agnostic.  So we want people to be able to use MySQL or Oracle or Postgres.  And so, perhaps we will have to discuss this third bullet here but we want to make sure that we have connections for the typical databases and (41:34) files that are out there.  And so, definitely that includes things like MySQL.

And then deep learning, which we talked about a little bit in the context of Python, will be coming up in the next few months, both with the types of predictions that Mike and I showed, but then also we are excited to get into computer vision, with things like radiology image classification.  That is a little bit further out.

Want to contribute? [41:56]

Now, in terms of the contributions, for those of you who are excited to actually improve the code and contribute, we can jump out real quick.  Let me show you the website.  I do not think we have shown that quite yet.


So notice the front page here.

Website [42:15]

And then if you go to get started and click on the R package, you will see the instructions there for how to install the package, and you will see all the different features here on the left, the thank you and all sorts of intros.  But if you scroll down, we have a link to our GitHub repo.  So if you want to help, check out the repo and I will just click through here to show you.


And of course we are welcoming the extra stars.  If anybody is excited about the project, those are always welcome.  And here are the instructions for contributing.  And of course, we are excited to interact and learn from you, as the tool will get better as more and more people are looking at it, using it, and giving us feedback as to what to improve.

Poll question – What’s impeding you? [43:01]

And I just want to turn it back over to Tyler for one of the last poll questions.  We appreciate you guys sticking with us.

[Tyler Morgan]

Alright.  The poll question is what is impeding you from using the package?  Select one of the following – loading data into R, installing the package, do not know how to integrate into database infrastructure, adoption – clinical team is not interested, or not sure what to predict.  We will leave this open.  We know from some of the comments we got today that folks are working on getting the package loaded.

So we will leave this open for just a few more moments and then let us share the results.

Poll Results [43:44]

Alright.  It shows not sure what to predict at 38 percent as the highest right now, 31 percent do not know how to integrate into database infrastructure, with 15 percent on installing the package, then loading data into R, and adoption – the clinical team is not interested.

[Levi Thatcher]

Wow.  That is fascinating.  Okay.  So, that will be very helpful for the upcoming blog posts and broadcasts we are going to do.  So, integrating into the database infrastructure is something that has been extremely difficult for data science in healthcare up to this point and that is something that we will focus on in terms of the blog posts and in our chats.  We will lay that out such that not only can you develop models but also put them up into production so they are helping improve outcomes.  And we will get through the rest of them as well, but that was really a good learning to run with.  Thanks so much guys.

Before we end… [44:38]

So before we end, this package is our public offering.  It is what we are offering out to the community.  It is open source.  We want contributors, we want feedback, interactions, and of course it is free.  So that is what we have been discussing today.  Now, we are currently working to integrate this package and this machine learning functionality into all of Health Catalyst products and we are excited to (44:59) next week we are ready to do that.  So stay tuned.

And before we end, we just want to encourage you to check out the blog and send us emails; the contact information is on the website.  We want to hear from you, and we will be going through these questions here in just one moment, but really we do not improve unless you tell us what you need and how we can improve.

And if you do see technical issues, something wrong with the code specifically, you could file issues on Stack Overflow.  So if you guys go to Stack Overflow and write a post, you will notice that we have that “healthcare-ai” tag now, which you can use, and which will just pop up automatically if you start typing healthcare.  In that way, we will be alerted and be able to monitor those posts and quickly respond and make it sort of a community-type archive for how to interact with the package and how to improve outcomes.

Questions? [45:53]

Thanks guys.  We will move into the question and answer portion.

[Tyler Morgan]

Alright.  We are about ready to get into the questions and answers.  I would like to say, again, this package is our Open Source offering.  And then we are working towards adding this machine learning into all of our products.  As a matter of fact, we have a webinar next week, next Thursday, about this specifically.  You can go to our website, and right on our homepage, on the slider there, you will be able to register for that webinar if you are interested in how exactly we are incorporating this into our products.

So also in that same vein, the question that we would like to ask, you know, these webinars that we put on are educational and we would like to share them on various topics.  But occasionally we do get questions from folks about who we are and what we do.  So we would like to make sure we give everyone the opportunity.

Are you interested in having someone from Health Catalyst reach out to schedule a demonstration of our solutions? [46:46]

If you are interested in having someone from Health Catalyst reach out to you to schedule a demonstration of our solutions, including about how machine learning can be used in your organization, please answer the following poll question.  And while we have that up, let us go right into the first question we have here.

Here is a general question.  Why is Health Catalyst making this available?  What does Health Catalyst want to get out of this webinar?  Is this general education, potential partnerships, sales, etc.?

[Levi Thatcher]

Again, we are really trying to (47:16) the webinar because we are interested in building a community around our open source package, and the goal of the package is really to bring machine learning into healthcare, because we strongly believe that predictive analytics and machine learning are the future of healthcare.  They are the future of online advertising and online technology.  And so, there is no reason to think that the healthcare industry cannot benefit as well.  And we are really hoping that by making the tools accessible and the community vocal and easy to get help from, we accelerate the process of getting predictive analytics into healthcare and improving outcomes through that.  So that is kind of why we have done that and why we have had the webinar, to try and increase awareness and get everybody up and running.

[Tyler Morgan]

What is the content of the data set against which (48:08) runs?  Is it the health organization's own data or is it a generic data set set up by Health Catalyst?

[Levi Thatcher]

The data that is shipped with the package is a CSV file with dummy data that we have created so that it can be distributed without issue, and it tries to model what you typically see in a healthcare data set – so typical columns, somewhat typical values for those columns, so you know, LDL and blood pressure, reasonable ranges, that sort of thing.  But it is fake data since healthcare data is very sensitive.  So, feel free to distribute it and use it as you like and ask me any questions that you have about it.

[Tyler Morgan]

We have a lot of different questions around support, support for Julia, NoSQL, XML, and the like.  Would you like to respond to that in terms of support within the package?

[Levi Thatcher]

We have kind of started with a Windows and SQL Server environment just because we take a lot of care to make sure that we can deploy the models easily.  And from the poll questions, I think we got that part right.  That is a hard part of machine learning in healthcare.  And I think we also got the selection of SQL Server as the default right, with most clients being on either Windows or SQL Server or SSMS-type environments.  That said, we definitely do want to kind of democratize the package further and go towards Open Source databases like MySQL or Postgres.  So we need to figure out how we want to prioritize adding databases or languages like Julia to figure out which direction we want to take it.

[Tyler Morgan]

How do we deal with missing values in healthcare data?

[Levi Thatcher]

Great question.  I do not know if there is time to go to the code but since we have it right here.


If you are watching and seeing this, on line 21 we have an imputation argument, and what imputation does is help you handle null cells or null values.  So what this basically will do is fill in the null cells in numeric columns with the column mean, and for categorical columns we use the mode, the most common value in that column.  And so that will quickly get rid of the null values.  And if you have columns with lots of nulls, perhaps you should not put that column in the data set at all or into the model at all, but this imputation flag helps out with those use cases.
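The rule described here (column mean for numeric columns, most common value for categorical columns) is simple enough to sketch directly.  The following is a plain-Python illustration of that rule, not the package's code; the function name `impute_column` and the example values are made up, and it assumes each column has at least one non-missing value.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Fill None entries: column mean if numeric, most common value otherwise."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)  # numeric column: use the mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # categorical: use the mode
    return [fill if v is None else v for v in values]


ldl = [100, None, 140, 120]      # made-up numeric column
gender = ["F", "M", None, "F"]   # made-up categorical column
print(impute_column(ldl))        # the gap is filled with the mean, 120
print(impute_column(gender))     # the gap is filled with the mode, "F"
```

In practice the fill values should be learned from the training data and reused at prediction time, so that new rows are imputed the same way the model saw during development.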

[Tyler Morgan]

We have several questions around other use cases within healthcare.  What in particular if you have a recommendation on NLP-based use cases with respect to healthcare?

[Levi Thatcher]

Yeah.  Yeah.  So definitely NLP is something that we are working on.  We are working on a solution now and it is not ready for primetime, but obviously there are tons of text data in EMRs across the country and the holy grail is leveraging that to improve outcomes.  So as we get more and more into the machine learning, we will have announcements coming up later in the year about that.

[Tyler Morgan]

Alright.  It is not a question but a comment.  They would love to see us always addressing the community in the future.  It would be nice to have recommendations for processing power and memory based on the number of columns and rows.  It really helps to have recommendations when setting up servers and desktops for users and working with IT.

[Levi Thatcher]

Fantastic.  That is a great idea.  And I think I would like to just comment on that.  For right now, for all of the data sets that come with the package, or for up to tens of thousands of rows, a higher-end laptop should usually be just fine to build and evaluate the models.  I mean, as you get towards hundreds of thousands or millions of rows, you might need to start looking at a server instance, like an Azure or an Amazon Web Services-type environment.
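A rough feel for these sizes comes from simple arithmetic.  The sketch below assumes every value is stored as an 8-byte double, which is how R stores numeric vectors, and reports only the raw size of the table; model training typically needs some multiple of that for working copies, and that multiple is not something the webinar specifies.

```python
def estimate_table_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Raw size of an all-numeric table stored as 8-byte doubles, in MB."""
    return n_rows * n_cols * bytes_per_value / 1_000_000


# Tens of thousands of rows with 70 columns: comfortably laptop-sized.
print(estimate_table_megabytes(50_000, 70))     # 28.0 MB
# Millions of rows with the same columns: gigabytes before any model is built.
print(estimate_table_megabytes(5_000_000, 70))  # 2800.0 MB
```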

[Tyler Morgan]

Our next question is how much data can this handle?  Is there a number of features limitation?

[Levi Thatcher]

That is a great question.  I do not know that we have pushed it to the max, but let us know.  Try to break it and let us know the results so we can fix it.  But really, try a couple hundred columns.  We have used it for 50, 70 columns, and if you have a computer with that much processing power, then you should be okay, but we are curious to see where the limit is indeed.  So let us know.

[Tyler Morgan]

We have another couple of comments.  One of the comments is that they would love to be able to have a Hadoop example in the future.  And also, some folks are asking for an email address.  We will make sure that all the links that we have talked about in this webinar and all the contact information are in the follow-up email that we send out after the webinar.

Here is a question.  How does this work with unstructured data like in a data lake?

[Levi Thatcher]

So one of the things about machine learning is that the data needs to have some structure, unless it is free text, in which case, you know, there is this whole class of algorithms that is going to process that text and essentially turn it into structured data of some sort.  So as far as a data lake goes, it is not really applicable to this package.  So I think it would be kind of on you to do the pre-processing to get your unstructured data into a tabular structure that we can help you with.
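One of the simplest ways to turn free text into a tabular structure is a bag-of-words table: one row per note, one column per word, each cell a count.  The sketch below is a plain-Python illustration of that idea with two made-up clinical notes; it is only one member of the class of text-processing algorithms mentioned here and is not part of the package.

```python
import re
from collections import Counter


def bag_of_words(notes):
    """Turn free-text notes into a table: one row per note, one column per word."""
    tokenized = [re.findall(r"[a-z]+", note.lower()) for note in notes]
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


notes = ["Patient reports chest pain", "No chest pain, mild cough"]  # made-up text
vocabulary, rows = bag_of_words(notes)
print(vocabulary)  # column names of the resulting table
print(rows)        # one row of word counts per note
```

Once the text is in this shape, each word column can be treated like any other numeric feature alongside the structured columns.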

[Tyler Morgan]

Yeah.  We have had several questions on validating the models before deployment.  What are your recommendations on how to validate the models before deploying them or making them actual live predictors?

[Levi Thatcher]

Yeah.  So Mike talked about this ROC plot.  If you look at the example code, there is also a PR curve that you can look at as well.  So we are looking at the ROC and PR curves, with an area under each of those, and we have tried to choose algorithms that do not overfit.  So you do not have to worry a lot about that.  So Lasso is really great about that in that it tries to get rid of those columns that are not as necessary, which is basically a way to help you not overfit.  So using these metrics for classification, and then of course the corresponding ones for regression, gets you a long way towards knowing, okay, let us go with algorithm B versus algorithm A when you deploy.
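The comparison described here only means something if the metrics are measured on rows the model did not train on.  The sketch below shows one common way to set that up, a seeded hold-out split followed by picking the model with the better held-out score.  It is an illustration in plain Python rather than a description of what the package does internally, and the two metric values passed to `pick_model` are made-up numbers.

```python
import random


def train_test_split(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices once and hold out a fraction for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def pick_model(held_out_metrics):
    """Return the model name with the highest held-out metric (e.g. area under the ROC curve)."""
    return max(held_out_metrics, key=held_out_metrics.get)


train_idx, test_idx = train_test_split(n_rows=10)
print(len(train_idx), len(test_idx))  # 8 rows to train on, 2 held out

# Made-up held-out scores for two candidate models.
best = pick_model({"lasso": 0.91, "random_forest": 0.88})
print(best)
```

Fixing the seed keeps the split reproducible, so both algorithms are judged on exactly the same held-out rows.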

[Tyler Morgan]

Okay.  We are getting in a lot of questions and this is great.  We only have time for a couple more but we would like to let everyone know that if we do not get to your question, we are going to look at every single question, and we will reach out to you and respond to you appropriately with answers to your questions.  These are all very important to us.  Here is the question I talked about, how do we interpret the model results?

[Mike Mastanduno]

So how do we interpret the model results?  That is a great question and I think it is definitely a sticky area.  It really depends a lot on what you want to get out of using the model.  In some cases, your accuracy is going to be maybe 0.75, as in Levi’s example, which, you know, he said was okay, it is pretty good.  But I think you really have to look at the alternatives.  If you have absolutely no way of knowing something and you build a model that gives you an area under the curve of 0.75, that is really good because it is giving you so much more information than you used to have.  Whereas, if the standard of care is kind of an area under the curve of 0.9, you need to build a nearly perfect model before you are going to improve on that.  And so, you are going to get incremental gains.

As far as evaluating the model itself, you might want to think about how many patients it is going to impact if you were to roll it out into production, or kind of where it is going to bring the new bar to, how much improvement you are going to see in your organization.  And that kind of analysis, you know, you have got to do it at the end, but it helps to do it at the beginning as well, just to make sure that you are answering a relevant question with predictive modeling.

[Tyler Morgan]

Alright.  We have time for this last question.  What about small data sets, how do you know when you have enough to be able to make trust-worthy extrapolations?  Do you do any statistical bootstrapping?

[Mike Mastanduno]

Yeah.  Great question.  So the idea here is that we have tried to abstract away some of those details.  You know, as much as possible; you can never hide all the detail, but with these algorithms, you will get a low AUC if you have a poor predictor.  And so, we have tried to pull a lot of those details under the hood and we would love to go into those details in an email, or if you want a (56:58) start working on it, we would love to chat about that, but really if you have a high AUC, we have tried to make sure that that model is going to be relevant when you deploy it as well.

[Tyler Morgan]

Alright.  We have reached our time for today.  We would like to thank everyone for their participation.  And I would like to remind everyone that shortly after this webinar, you will receive an email with links to the recording of the webinar, the presentation slides, as well as all the links shared in today’s session.  Also, please look forward to the transcript notification we will send you once it is ready.

On behalf of Levi Thatcher, Mike Mastanduno, as well as the rest of us here at Health Catalyst, thank you for joining us today.  This webinar is now concluded.