A data science-based solution needs to address problems at multiple levels. While it addresses a business problem, computationally it comprises a pipeline of algorithms which, in turn, operate on relevant data presented in a proper format. Thus, to understand such a solution we need to focus on at least the following levels:
- Business level;
- Algorithm level; and
- Data level.
Contrary to popular belief, almost all non-trivial data science solutions need to be built from the ground up, with minute and interrelated attention to the details of the problem at all three levels. In the following we shall try to understand this with the help of a running example: aspects of a churn analysis solution.
It is vital to understand that in most real-world cases we are re-purposing data to build the solution. In other words, the data used were not collected for the kind of analysis we want to perform. They were collected as part of the transactional and operational activities of the organization, and the strategies for their collection, formatting and storage are optimized for those purposes. Therefore, locating the relevant data and processing them to enable the application of data science techniques can be a quite non-trivial, often herculean, exercise.
From a user perspective, a solution life-cycle can be understood as follows:
- Solution development: using historical data; involves extensive experimentation, testing and validation;
- Solution deployment: using the solution to obtain insight and/or decision support;
- Solution assimilation: enabling actions in the workflow based on the insight and/or predictions made by the solution;
- Solution maintenance and update: periodic checking and validation of the solution's performance, and updates to improve performance if required.
It is the job of the data scientist(s) to deliver on the above, and for that she has to understand the problem at different levels.
The Business/Domain Level
At this level, the broad business context and the desired outcome of the solution are defined, along with the various parameters and constraints the solution should/must adhere to. Also, at this level, desired/acceptable performance parameters can be set, commensurate with business policies, especially risk management. Let us try to understand this with respect to the popular churn analysis problem.
Business level context for churn analysis
The churn analysis problem has a quite straightforward context. Every business strives to retain its good/valuable customers. Thus, if it is possible to identify the customers likely to stop doing business with the organization, it may be able to take proactive steps to retain them. However, we cannot simply demand that a system spout lists of customers whenever asked. We need to specify at least the following:
- What should be the prediction lead time? Naturally, we don’t want to learn that a customer is going to churn within the next five minutes; we cannot possibly do anything with that knowledge. We need enough lead time so that preventive actions can be taken.
- What is the acceptable probability/likelihood that a prediction is correct? Predictions are always uncertain to some degree (otherwise they would be called facts). Usually, most of the customers run through the system will have a non-zero likelihood of churning. This issue is often revisited at a later stage and calibrated against the system's output characteristics.
- What is the acceptable level of accuracy of the predictions? No solution is going to be 100% accurate (we humans are not either; after all, ”to err is human...”), and taking action based on the predictions involves cost. Thus a cost/benefit analysis is required in order to determine an acceptable level of solution performance (see the sketch after this list).
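As a minimal illustration of such a cost/benefit trade-off, the sketch below uses entirely hypothetical figures for the retention-offer cost, the retained customer value and the offer success rate to compute the minimum precision at which acting on the predictions breaks even.

```python
# Hypothetical figures -- to be replaced with values from business policy.
offer_cost = 50.0        # cost of making a retention offer to one flagged customer
customer_value = 600.0   # value retained if a flagged true churner is actually saved
save_rate = 0.30         # fraction of correctly flagged churners the offer retains

# If we flag N customers with precision p, the expected benefit is
# N * p * save_rate * customer_value and the expected cost is N * offer_cost,
# so acting on the predictions breaks even when p equals the value below.
break_even_precision = offer_cost / (save_rate * customer_value)
print(f"Acting on predictions pays off only if precision exceeds {break_even_precision:.1%}")
```

Numbers like these give the business level a concrete handle on what "acceptable accuracy" should mean for this solution.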
Note:
The meaning of a probability or likelihood measure is not straightforward or very intuitive.
One individual customer will either churn or not. Thus the post-facto or actual probability is either 1 or 0! So what does this likelihood value, which is somewhere in between, mean? It will always ultimately be proved wrong, won't it?
Well, actually I am misleading you slightly: probability is not about individuals, but about large collections. So the interpretation goes like this:
If there were a large, very large, number of customers with the very same characteristics or attribute values used by the solution, then that likelihood fraction of them would churn. Satisfactory?
Wait, there are many attributes used in the solution, and they have a varied range of values. How likely is it that many customers will have exactly the same attribute values? Actually, not very likely at all. What happens is this: the solution maintains knowledge, explicit or implicit, of an overall probability distribution over the attribute space.
This knowledge, together with the knowledge of how to use it to make predictions, constitutes the underlying prediction model, which is usually learned from the training data using a machine-learning algorithm. This is used to compute the individual likelihoods. So, in simplistic terms, we can interpret this as:
If we had a lot of customers like this one, the likelihood fraction of them would churn. If the likelihood value is high, so is the chance of this particular customer being one of those churners. So let us see if that is so, and whether we can prevent him from doing so.
While it may not be immediately apparent, understanding the above subtlety is often useful for an overall understanding of system performance.
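To make the frequency interpretation concrete, here is a small sketch (using made-up predicted likelihoods and simulated outcomes) that bins customers by their predicted churn likelihood and compares each bin's average prediction with the fraction that actually churned. For a well-calibrated system the two numbers should roughly agree.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up predicted churn likelihoods for 10,000 customers ...
predicted = rng.uniform(0, 1, size=10_000)
# ... and simulated outcomes drawn so that the true churn rate matches the prediction.
churned = rng.uniform(0, 1, size=10_000) < predicted

df = pd.DataFrame({"predicted": predicted, "churned": churned})
df["bin"] = pd.cut(df["predicted"], bins=np.linspace(0, 1, 11))

# Within each likelihood bin, the observed churn fraction should be close to
# the average predicted likelihood -- that is what the likelihood "means".
summary = df.groupby("bin", observed=True).agg(
    mean_predicted=("predicted", "mean"),
    observed_churn_rate=("churned", "mean"),
    customers=("churned", "size"),
)
print(summary)
```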
The Algorithmic Level
The algorithmic level delivers what the business level asks for. In the data science approach, the algorithm level creates, maintains and applies a model of the real-world process involving the objects of interest, hereafter referred to simply as ”objects”. This process leads to the events or outcomes of interest. The following are the main characteristics of data science algorithms:
- An algorithm works with the available data footprint of the process of interest;
- It discovers the relationships between the process characteristics and the outcomes;
- These relationships are, more often than not, in the form of complex patterns;
- Discovering these patterns requires the application of powerful learning algorithms to the historical data;
- The discovered patterns lead to learning the required model parameters;
- An analysis/model-application algorithm uses these parameters to create the model and apply it to new data in order to compute the output.
The algorithm level of a data science application is comprised of one or (often) more algorithms from the following basic types:
- Regression: predicts the value of a continuous-valued variable from the values of a set of numerical attributes;
- Classification: predicts one discrete class/category out of a set of classes using numerical and/or categorical attributes;
- Clustering: discovers natural groupings of the objects;
- Association discovery: discovers the propensity of similar behavior among two or more objects described using numerical and/or categorical attributes.
Again, there are many actual algorithms for each type, differing in their approaches, complexity, interpretability, acceptable data types and, above all, their efficacy in a particular problem scenario; a few representative examples are sketched below.
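For orientation only, here is a toy sketch pairing the first three types with one representative scikit-learn implementation each (association discovery is usually handled by dedicated implementations of algorithms such as Apriori or FP-growth, which live outside scikit-learn's core API); the data here are random placeholders, not a recommendation of these particular algorithms.

```python
import numpy as np
from sklearn.linear_model import LinearRegression    # regression
from sklearn.ensemble import RandomForestClassifier  # classification
from sklearn.cluster import KMeans                   # clustering

X = np.random.rand(200, 4)                         # toy numerical attributes
y_cont = X @ np.array([1.0, -2.0, 0.5, 3.0])       # toy continuous target
y_class = (y_cont > y_cont.mean()).astype(int)     # toy discrete target

LinearRegression().fit(X, y_cont)                        # predicts a continuous value
RandomForestClassifier(n_estimators=50).fit(X, y_class)  # predicts a class
KMeans(n_clusters=3, n_init=10).fit(X)                   # discovers groupings
```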
Churn analysis example:
Again, with respect to the churn analysis problem, we can easily discern that the objects here are the customers, and the process of interest is their reaching a decision about whether to churn or not.
The data about a customer available to the organization contain the data footprint of his/her decision process, albeit hidden among a lot of dust and garbage. How to isolate/extract and possibly enhance that footprint is a matter we shall touch upon in the next section. It can also be seen that here the task demands that the system predict one of two discrete outcomes, churn or not-churn.
Hence, the heart of the system is likely to be a classification model, along with algorithms for learning and applying the model.
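A minimal sketch of that heart follows, assuming a hypothetical prepared feature table `features.csv` with one row per customer and a `churned_within_90_days` label that encodes the chosen prediction lead time (the file, column names and choice of learner are illustrative only).

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical prepared feature table: one row per customer, with a label
# saying whether the customer churned within the chosen lead time.
data = pd.read_csv("features.csv")
X = data.drop(columns=["customer_id", "churned_within_90_days"])
y = data["churned_within_90_days"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Likelihood of churn for each held-out customer, to be thresholded against
# the business-level cost/benefit analysis discussed earlier.
churn_likelihood = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, (churn_likelihood > 0.5).astype(int)))
```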
Note:
We should not straight-jacket a one-to-one correspondence between the business level and the algorithmic level of a problem. For example, at the algorithm level we can pose the churn analysis problem as a regression problem, trying to predict, for each customer, after how much time she is going to churn.
Actually, what you do at the algorithm level may depend on a lot of factors. We should keep an open mind while exploring the best possible solution for a given problem.
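As a sketch of that alternative framing, again with hypothetical files and column names, one could fit a regressor on customers who have already left and apply it to the current customer base:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical table of already-churned customers with the observed time
# (in months) between the feature snapshot and the churn event.
history = pd.read_csv("churned_customers.csv")
X = history.drop(columns=["customer_id", "months_until_churn"])
y = history["months_until_churn"]

regressor = GradientBoostingRegressor().fit(X, y)

# Predicted time-to-churn for current customers; small values flag
# customers who may need attention soon.
current = pd.read_csv("current_customers.csv")
predicted_months = regressor.predict(current.drop(columns=["customer_id"]))
```

A more careful treatment of this framing would use survival analysis, which properly handles customers who have not churned yet, but the sketch conveys the idea.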
The Data Level
Data science algorithms work with object data in the form of feature/attribute vectors describing the objects of analysis/interest. In a real-life problem scenario, we seldom have the data readily available in such a form. Usually a lot of effort, often the majority of the total, goes into transforming the available raw data into a usable form comprised of vectors of useful features.
This is the stage where we identify, isolate and enhance the data footprint of the process of interest and encode it in the form of feature vectors. Clearly, this is a momentous process that needs to take into account the format and content of the actual data sources available and their quality, as well as a deep understanding of the problem domain. This exercise is most often solution-specific as well as organization-specific (because data collection and stewardship policies vary across organizations). The whole process is known as feature engineering.
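To fix the idea, here is a toy sketch (with made-up tables and column names) of what the end product of this stage looks like: one flat feature vector per customer, assembled by joining and aggregating raw sources that each have their own shape and granularity.

```python
import pandas as pd

# Made-up raw sources with different granularities.
demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "age": [34, 51],
    "segment": ["retail", "premium"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 300.0],
})

# The data level turns these into one feature vector per customer.
txn_features = transactions.groupby("customer_id")["amount"].agg(
    txn_count="count", txn_total="sum"
)
feature_vectors = demographics.join(txn_features, on="customer_id")
print(feature_vectors)
```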
Problem-specific feature engineering
Coming up with features is difficult,
time-consuming, requires expert knowledge.
"Applied machine learning" is basically feature engineering. —Andrew Ng, Machine
Learning and AI via Brain simulations
"Applied machine learning" is basically feature engineering. —Andrew Ng, Machine
Learning and AI via Brain simulations
Let us again go back to the churn analysis/prediction problem. In order to enable the algorithm level to perform adequately, it has to work with quality data, i.e., data with information-rich features.
Technically, such features are called discriminative features: features which can significantly contribute towards the algorithm discriminating/distinguishing between churners and non-churners.
There is often a lot of data available internally (customer demography, product data, transaction history, competitor intelligence, call center records, etc.) as well as accessible from outside (credit rating data, etc.). Not all of these data are useful, nor are all of them readily usable as features (for example, the transaction history is event data over time). Clearly, the naive approach of somehow putting together all the data and wishing that the algorithms work it out can meet with disastrous results.
Thus feature selection and preparation can turn out to be a very complex issue to tackle. Most of the time we have to use our understanding of the problem domain in conjunction with the application of feature selection and transformation methods. For example, in our problem we might consider the following:
- Why would a customer decide to leave?
  - Dissatisfied with the product/service
    - Product/service is not what is actually needed/expected
    - Has trouble enjoying the facilities provided
    - Has trouble accessing the delivery channels
    - ...
  - Got a better deal from a competitor
    - A similar product for a lower price
    - A better product for a similar price
    - ...
The above issues, individually or together, may influence various aspects of customer behavior vis-a-vis the organization, which, in turn, are reflected in complex patterns hidden in the data. It is the aim of feature engineering to
- identify the parts of the data in which such patterns are likely to be hidden, and
- design suitable processing or transformation of the data in order to enhance their information content (a small illustration follows below).
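As a small, hypothetical illustration of such a transformation, the sketch below derives trend-style features from raw transaction and call-center logs: a drop in activity or a rise in complaint calls is exactly the kind of enhanced footprint a dissatisfied customer might leave. File names, column names and window lengths are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical raw event logs.
txns = pd.read_csv("transactions.csv", parse_dates=["date"])   # customer_id, date, amount
calls = pd.read_csv("call_center.csv", parse_dates=["date"])   # customer_id, date, reason

snapshot = pd.Timestamp("2024-01-01")
recent = txns[txns["date"] >= snapshot - pd.DateOffset(months=3)]
earlier = txns[(txns["date"] < snapshot - pd.DateOffset(months=3))
               & (txns["date"] >= snapshot - pd.DateOffset(months=6))]

features = pd.DataFrame({
    # Declining engagement: recent activity versus the preceding quarter.
    "txn_count_recent": recent.groupby("customer_id").size(),
    "txn_count_earlier": earlier.groupby("customer_id").size(),
    # Friction signal: complaint calls in the recent window.
    "complaint_calls_recent": calls[
        (calls["date"] >= snapshot - pd.DateOffset(months=3))
        & (calls["reason"] == "complaint")
    ].groupby("customer_id").size(),
}).fillna(0)

# A ratio well below 1 marks customers whose activity is drying up.
features["activity_ratio"] = (
    (features["txn_count_recent"] + 1) / (features["txn_count_earlier"] + 1)
)
```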
Did you notice that earlier I used the term “data footprint”? That fits nicely with my favorite analogy for this work: finding the footprints of a rare animal in a jungle.
Blind feature selection
This is what I feel is the (charitably speaking) lazy approach or (less charity, more truth) incompetent approach: get the whole data dump and run some off-the-shelf feature selection/ranking algorithm without trying to understand the data semantics. Unfortunately, I am observing too much of this. While this approach may help build a solution quickly, the result is essentially a black-box kind of solution. It definitely harms the interpretability/understandability/transparency of the solution, as well as making solution maintenance and update a nightmare.
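For contrast, this is roughly what that blind approach looks like in practice: a generic univariate ranking applied to whatever columns happen to be in the dump, with no thought about what they mean. The dump file and label column here are hypothetical; the point is what is missing, not the particular ranking statistic.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# The whole dump, whatever it contains (hypothetical file and label column).
dump = pd.read_csv("everything.csv")
X = dump.select_dtypes("number").drop(columns=["churned"])
y = dump["churned"]

# Rank columns purely by a generic statistic and keep the "best" 20,
# with no domain reasoning about what the columns actually represent.
selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)
print(list(X.columns[selector.get_support()]))
```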
Incidentally, practitioners of this approach usually find Deep Learning extremely attractive, almost forming a "celebrity fan-base". The topmost reason for that, as far as I can discern from multiple sources, is that DL networks (supposedly) eliminate the need for feature engineering, both selection and transformation, in one fell swoop!
Unfortunately, the reality is slightly more complicated. But that will be the subject of another post.