Tuesday, 4 October 2016

Problem levels analysis for data science solutioning

A data science-based solution needs to address problems at multiple levels. While it addresses a business problem, computationally it comprises a pipeline of algorithms which, in turn, operate on relevant data presented in a proper format. Thus, to understand such solutions we need to focus at least on the
  • Business level;
  • Algorithm level; and
  • Data level.
Contrary to popular belief, almost all non-trivial data science solutions need to be built from the ground up, with minute and interrelated attention to the details of the problem at all three levels. In the following we shall try to understand this with the help of a running example: aspects of a churn analysis solution.

It is vital to understand that in most real-world cases we are re-purposing the data for building the solution. In other words, the data used was not collected for the purpose of the kind of analysis we want to perform. It was collected as part of the transactional and operational activities of the organization. Thus the strategies for collection, formatting and storage of the data are optimized for those purposes. Therefore, locating the relevant data and processing it to enable the application of data science techniques can be a quite non-trivial, often herculean, exercise.

From a user perspective, a solution life-cycle can be understood as follows:
  • Solution development: Using historical data, involves extensive experimentation, testing and validation;
  • Solution deployment: Using the solution to get the insight and/or decision support;
  • Solution assimilation: Integrating the solution into the workflow, enabling actions based on the insights and/or predictions made by the solution;
  • Solution maintenance and update: Periodic checking and validation of the solution performance and update to improve performance if required.
It is the job of the data scientist(s) to deliver on the above, and for that she has to understand the problem at different levels.

The Business/Domain Level

At this level the broad business context and the desired outcome of the solution are defined, along with the various parameters and constraints the solution should or must adhere to. Also, at this level the desired/acceptable performance parameters can be set commensurate with business policies, especially risk management. Let us try to understand this with respect to the popular churn analysis problem.
Business level context for churn analysis
The churn analysis problem has a quite straightforward context. Every business strives to retain its good/valuable customers. Thus, if it is possible to identify the customers likely to stop doing business with the organization, it may be able to take proactive steps in order to retain them. However, we cannot merely demand that a system should spout lists of customers whenever asked. We need to specify the following:
  • What should be the prediction lead time? Naturally, we don’t want to learn that a customer is going to churn within the next five minutes. We cannot possibly do anything with that knowledge. We need enough lead time so that preventive actions can be taken.
  • What is the acceptable probability/likelihood measure of a prediction being correct? Predictions are always uncertain to some degree (otherwise they would be called facts). Usually, most of the customers run through the system will have a non-zero likelihood of churning. This issue is often revisited at a later stage and calibrated against the system output characteristics.
  • What is the acceptable level of accuracy of the predictions? No solution is going to be 100% accurate (neither are we humans; after all, "to err is human...") and taking action based on the predictions involves cost. Thus a cost/benefit analysis is required in order to determine an acceptable level of solution performance.
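The cost/benefit reasoning in the last point can be sketched in a few lines. This is only a minimal illustration under simple assumptions, not a prescribed method: the function names, the customer value, the offer cost and the offer success rate are all made-up figures for the example, and a real analysis would have many more terms.

```python
def expected_gain(p_churn, customer_value, offer_cost, save_rate):
    """Expected net gain of making a retention offer to one customer.

    p_churn        -- predicted likelihood that the customer churns
    customer_value -- value kept if a would-be churner stays (assumed figure)
    offer_cost     -- cost of the offer, paid whether or not it was needed (assumed figure)
    save_rate      -- fraction of would-be churners the offer retains (assumed figure)
    """
    return p_churn * save_rate * customer_value - offer_cost


def break_even_likelihood(customer_value, offer_cost, save_rate):
    """Smallest churn likelihood at which the offer pays for itself."""
    return offer_cost / (save_rate * customer_value)


# Hypothetical figures, for illustration only.
value, cost, rate = 1000.0, 50.0, 0.25
threshold = break_even_likelihood(value, cost, rate)
print(f"act only when the likelihood is at least {threshold:.2f}")
for p in (0.1, 0.2, 0.5):
    print(p, expected_gain(p, value, cost, rate))
```

Under these assumed figures, acting on customers below the break-even likelihood loses money on average, which is one way the business level can turn "acceptable performance" into a number.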
Note:
The meaning of a probability or likelihood measure is neither straightforward nor very intuitive.
An individual customer will either churn or not. Thus the post-facto, or actual, probability is either 1 or 0! So what does this likelihood value, which is somewhere in between, mean? It will always ultimately be proved wrong, won’t it?
Well, actually I am misleading you slightly: probability is not about individuals, but about large collections. So the interpretation goes like this:
If there is a large, very large, number of customers with the very same characteristics, i.e., the same values of the attributes used by the solution, then a fraction of them equal to the likelihood will churn. Satisfactory?
Wait, there are many attributes used in the solution and they have varied ranges of values. How likely is it that many customers will have just the same values of the attributes? Actually, not very likely at all. What happens is this: the solution maintains knowledge, explicit or implicit, of an overall probability distribution over the attribute space.
This knowledge, together with the knowledge of how to use it to make predictions, constitutes the underlying prediction model, which is usually learned from the training data using a machine-learning algorithm. The model is used to compute the individual likelihoods. So, in simplistic terms, we can interpret this as:
If we had a lot of customers like this one, a fraction of them equal to the likelihood would churn. If the likelihood value is high, so is the chance of this customer being one of those churners. So let us see whether that is so, and whether we can prevent it.
While it may not be immediately apparent, understanding the above subtlety is often useful in the overall understanding of system performance.
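The "large collection" reading of a likelihood can be checked with a small simulation. The sketch below assumes, purely for illustration, that look-alike customers churn independently with the same likelihood; the function name and the value 0.3 are hypothetical.

```python
import random


def simulate_churn_fraction(likelihood, n_customers, seed=0):
    """Simulate n_customers look-alike customers, each churning independently
    with the given likelihood, and return the fraction that churned."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    churned = sum(1 for _ in range(n_customers) if rng.random() < likelihood)
    return churned / n_customers


# One customer always gives 0.0 or 1.0; a large group approaches the likelihood.
for n in (1, 10, 1000, 100000):
    print(n, simulate_churn_fraction(0.3, n))
```

For a single customer the result is always 0 or 1, exactly as argued above; only as the group grows does the churned fraction settle near the likelihood.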

The Algorithmic Level
The algorithmic level delivers what the business level asks for. In the data science approach, the algorithm level creates, maintains and applies a model of the real-world process involving the objects of interest, hereafter referred to simply as "objects". This process leads to the events or outcomes of interest. The following are the main characteristics of data science algorithms:
  • An algorithm works with the available data footprint of the process of interest;
  • It discovers the relationships between the process characteristics and the outcomes;
  • The above relationships are, more often than not, in the form of complex patterns;
  • Discovering these patterns requires the application of powerful learning algorithms to the historical data;
  • The discovered patterns lead to learning the required model parameters;
  • An analysis/model application algorithm uses these parameters to create the model and apply it to new data in order to compute the output.
The algorithm level of a data science application comprises one or (often) more algorithms of the following basic types:
  • Regression: Predicts the value of a continuous-valued variable from the values of a set of numerical attributes;
  • Classification: Predicts one discrete class/category out of a set of classes using numerical and/or categorical attributes;
  • Clustering: Discovers natural groupings of the objects;
  • Association discovery: Discovers the propensity for similar behavior among two or more objects described using numerical and/or categorical attributes.
Again, there are many actual algorithms of each type, differing in their approaches, complexity, interpretability, acceptable data types and, above all, their efficacy in a particular problem scenario.
Churn analysis example:
Again, with respect to the churn analysis problem, we can easily discern that the objects here are the customers and the process of interest is how they reach a decision about whether or not to churn.
The data about a customer available to the organization contains the data footprint of his/her decision process, albeit hidden among a lot of dust and garbage. How to isolate/extract and possibly enhance the footprint is a matter we shall touch upon in the next section. It can also be seen that here the task demands that the system predict one of two discrete outcomes, churn or not-churn.
Hence, the heart of the system is likely to be a classification model and algorithms for learning and applying the model.
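To make "learning and applying the model" concrete, here is a minimal sketch of one possible classifier, a logistic regression trained by plain gradient descent. The choice of algorithm, the two features and the toy data are all assumptions made for illustration; a real solution would choose its algorithm by experimentation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def churn_likelihood(weights, bias, x):
    """Apply the model: likelihood of churn for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(rows, labels, lr=0.1, epochs=5000):
    """Learn the model: fit weights and bias by batch gradient descent on log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = churn_likelihood(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy, made-up history: [complaints last quarter, months since last purchase]
customers = [[0, 1], [1, 0], [0, 2], [1, 1], [4, 6], [3, 5], [5, 4], [4, 7]]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(customers, churned)
for x in ([0, 1], [4, 5]):
    p = churn_likelihood(weights, bias, x)
    print(x, round(p, 3), "churn" if p >= 0.5 else "not-churn")
```

Note that the model outputs a likelihood; the discrete churn/not-churn answer appears only after a threshold (0.5 here, chosen arbitrarily) is applied.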
Note:
We should not straitjacket the problem into a one-to-one correspondence between the business level and the algorithmic level. For example, at the algorithm level we can pose the churn analysis problem as a regression problem, trying to predict, for each customer, after what amount of time she is going to churn.
Actually, what you do at the algorithm level may depend on a lot of factors. We should keep an open mind while exploring for the best possible solution to a given problem.
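As a rough sketch of the regression formulation, the snippet below fits a straight line predicting months until churn from a single attribute. The attribute, the data and the use of ordinary least squares are assumptions for illustration only; the numbers are invented and deliberately lie on a perfect line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up history of customers who have already churned:
# complaints in their first quarter -> months until they left.
complaints = [0, 1, 2, 3, 4]
months_to_churn = [24, 20, 16, 12, 8]

slope, intercept = fit_line(complaints, months_to_churn)
predicted = slope * 2.5 + intercept  # a hypothetical customer with 2.5 complaints
print(slope, intercept, predicted)
```

The same business question is being asked; only the output changes, from a class label to a continuous value.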

The Data Level
Data science algorithms work with object data in the form of feature/attribute vectors describing the objects of analysis/interest. In a real-life problem scenario, we seldom have the data readily available in such a form. Usually a lot of effort, often the majority of the total, goes into transforming the available raw data into a usable form consisting of vectors of useful features.
This is the stage where we identify, isolate and enhance the data footprint of the process of interest and encode it in the form of feature vectors. Clearly, this is a momentous process that needs to take into account the format and content of the actual data sources available and their quality, and that requires a deep understanding of the problem domain. This exercise is most often solution-specific as well as organization-specific (because data collection and stewardship policies vary across organizations). The whole process is known as feature engineering.
Problem-specific feature engineering
Coming up with features is difficult, time-consuming, requires expert knowledge.
"Applied machine learning" is basically feature engineering. —Andrew Ng, Machine
Learning and AI via Brain simulations

Let us again go back to the churn analysis/prediction problem. In order to enable the algorithm level to perform adequately, it has to work with quality data, i.e., data with information-rich features.
Technically, such features are called discriminative features: features which can significantly contribute towards the algorithm discriminating/distinguishing between churners and non-churners.
There is often a lot of data available internally (customer demography, product data, transaction history, competitor intelligence, call center records, etc.) as well as accessible from outside (credit rating data, etc.). Not all of this data is useful, nor is all of it readily usable as features (for example, the transaction history is event data over time). Clearly, a naive approach of somehow putting together all the data and wishing that the algorithms work it out can meet with disastrous results.
Thus feature selection and preparation can turn out to be a very complex issue to tackle. Most of the time we have to use our understanding of the problem domain in conjunction with the application of feature selection and transformation methods. For example, in our problem we might consider the following:
  • Why would a customer decide to leave?
    • Dissatisfied with the product/service
      • Product/service is not what is actually needed/expected
      • Has trouble enjoying the facilities provided
      • Has trouble accessing the delivery channels
      • ...
    • Got a better deal from a competitor
      • A similar product at a lower price
      • A better product at a similar price
      • ...
The above issues, individually or together, may influence various aspects of customer behavior vis-a-vis the bank, which, in turn, is reflected in complex patterns hidden in the data. It is the aim of feature engineering to
  • identify the part of the data in which such patterns are likely to be hidden, and
  • design suitable processing or transformation of the data in order to enhance the information content.
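As one small illustration of such a transformation, the sketch below collapses a transaction history (event data over time) into one fixed-length feature vector per customer. The three features chosen (days since the last transaction, number of transactions, total amount), the function name and the event log are assumptions for the example; which features actually carry the footprint is exactly the domain question discussed above.

```python
from collections import defaultdict
from datetime import date


def build_features(transactions, as_of):
    """Collapse per-event rows into one feature vector per customer.

    transactions -- iterable of (customer_id, transaction_date, amount)
    as_of        -- the date on which the prediction is made
    Returns {customer_id: [days_since_last, n_transactions, total_amount]}.
    """
    grouped = defaultdict(list)
    for customer_id, when, amount in transactions:
        if when <= as_of:  # ignore events dated after the prediction date
            grouped[customer_id].append((when, amount))

    features = {}
    for customer_id, events in grouped.items():
        last_date = max(when for when, _ in events)
        features[customer_id] = [
            (as_of - last_date).days,
            len(events),
            sum(amount for _, amount in events),
        ]
    return features


# Hypothetical event log, for illustration only.
log = [
    ("c1", date(2016, 9, 1), 120.0),
    ("c1", date(2016, 9, 28), 80.0),
    ("c2", date(2016, 5, 3), 40.0),
    ("c2", date(2016, 10, 20), 15.0),  # dated after as_of, so it is ignored
]
features = build_features(log, as_of=date(2016, 10, 4))
print(features)
```

Each customer now has a vector of the same length regardless of how many transactions were recorded, which is the form the algorithm level expects.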
Did you notice that I used the term “data footprint” earlier? That fits nicely with my favorite analogy for this work: finding the footprints of a rare animal in a jungle.

Blind feature selection

This is what I feel is the (charitably speaking) lazy approach or (with less charity and more truth) incompetent approach. Get the whole data dump and run some out-of-the-box feature selection/ranking algorithm without trying to understand the data semantics. Unfortunately, I am observing too much of this. While this approach may help to build a solution quickly, the result will essentially be a black-box kind of solution. It definitely harms the interpretability/understandability/transparency of the solution and makes solution maintenance and update a nightmare.

Incidentally, practitioners of this approach usually find Deep Learning extremely attractive, almost forming a "celebrity fan-base". The topmost reason for that, as far as I can discern from multiple sources, is that DL networks (supposedly) eliminate the need for feature engineering, both selection and transformation, in one fell swoop!

Unfortunately, the reality is slightly more complicated. But that will be the subject of another post.