Data Science Practitioner: Problem levels analysis for data science solutioning

A data science-based solution needs to address problems at multiple levels. While it addresses a business problem, computationally it consists of a pipeline of algorithms which, in turn, operate on relevant data presented in a proper format. Thus, to understand such solutions we need to focus on at least the

Business level;
Algorithm level; and
Data level.

Contrary to popular belief, almost all non-trivial data science solutions need to be built from the ground up, with minute and interrelated attention to the details of the problem at all three levels. In the following we shall try to understand that with the help of a running example: aspects of a churn analysis solution.

It is vital to understand that in most real-world cases we are re-purposing data to build the solution. In other words, the data used was not collected for the kind of analysis we want to perform. It was collected as part of the transactional and operational activities of the organization. Thus the strategies for collecting, formatting and storing the data are optimized for those purposes. Therefore, locating the relevant data and processing it to enable the application of data science techniques can be a quite non-trivial, often herculean, exercise.

From a user perspective, the solution life-cycle can be understood as follows:

Solution development: Using historical data, involves extensive experimentation, testing and validation;
Solution deployment: Using the solution to get the insight and/or decision support;
Solution assimilation: Integrating the solution into the workflow, enabling actions based on the insights and/or predictions made by the solution;
Solution maintenance and update: Periodic checking and validation of the solution performance and update to improve performance if required.

It is the job of the data scientist(s) to deliver on the above, and for that they have to understand the problem at different levels.

The Business/Domain level

At this level the broad business context and the desired outcome of the solution are defined, along with the various parameters and constraints the solution should or must adhere to. Also at this level, desired or acceptable performance parameters can be set commensurate with business policies, especially those on risk management. Let us try to understand this with respect to the popular churn analysis problem.

Business level context for churn analysis

The churn analysis problem has quite a straightforward context. Every business strives to retain its good/valuable customers. Thus, if it is possible to identify the customers likely to stop doing business with the organization, it may be able to take proactive steps to retain them. However, we cannot merely demand that a system spout lists of customers whenever asked. We need to specify the following:

What should be the prediction lead time? Naturally, we don't want to learn that a customer is going to churn within the next five minutes; we cannot possibly do anything with that knowledge. We need enough lead time for preventive actions to be taken.
What is the acceptable probability/likelihood of a prediction being correct? Predictions are always uncertain to some degree (otherwise they would be called facts). Usually, most of the customers run through the system will have a non-zero likelihood of churning. This issue is often revisited at a later stage and calibrated against the system output characteristics.
What is the acceptable level of accuracy of the predictions? No solution is going to be 100% accurate (neither are we humans; after all, "to err is human...") and taking action based on the predictions involves cost. Thus a cost/benefit analysis is required in order to determine an acceptable level of solution performance.
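
The cost/benefit reasoning in the last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed method: the function names, the monetary figures, and the idea that a retention offer succeeds with a fixed probability (`save_rate`) are all hypothetical.

```python
def expected_gain(p_churn, customer_value, offer_cost, save_rate):
    """Expected net gain of making a retention offer to one customer.

    p_churn        : predicted likelihood that the customer churns
    customer_value : value retained if the customer stays (assumed figure)
    offer_cost     : cost of the retention action (assumed figure)
    save_rate      : assumed probability that the offer retains a would-be churner
    """
    return p_churn * save_rate * customer_value - offer_cost


def break_even_likelihood(customer_value, offer_cost, save_rate):
    """Churn likelihood at which the expected gain of acting is exactly zero."""
    return offer_cost / (save_rate * customer_value)


# Illustrative numbers only.
threshold = break_even_likelihood(customer_value=1000.0, offer_cost=50.0, save_rate=0.25)
print(threshold)
```

Under these assumed figures, acting on customers whose predicted likelihood is below the break-even value would cost more than it is expected to save, which is one way of turning a business policy into an acceptable likelihood threshold.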

Note:

The meaning of a probability or likelihood measure is neither straightforward nor very intuitive.

An individual customer will either churn or not. Thus the post-facto, or actual, probability is either 1 or 0! So what does this likelihood value, which is somewhere in between, mean? It will always ultimately be proved wrong, won't it?

Well, actually I am misleading you slightly - probability is not about individuals, but about large collections. So the interpretation goes like this:

If there is a very large number of customers with the very same characteristics, i.e., the same values of the attributes used by the solution, then a fraction of them equal to the likelihood will churn. Satisfactory?

Wait, there are many attributes used in the solution and they have varied ranges of values. How likely is it that many customers will have exactly the same attribute values? Actually, not very likely at all. What happens is this: the solution maintains knowledge, explicit or implicit, of an overall probability distribution over the attribute space.

This knowledge, together with the knowledge of how to use it to make predictions, constitutes the underlying prediction model, which is usually learned from the training data using a machine-learning algorithm. The model is used to compute the individual likelihoods. So, in simplistic terms, we can interpret a likelihood as follows:

If we had a lot of customers like this one, a fraction of them equal to the likelihood would churn. If the likelihood value is high, so is the chance of this customer being one of those churners. So let us see whether that is so and whether we can prevent the churn.

While it may not be immediately apparent, understanding the above subtlety is often useful for an overall understanding of system performance.
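
The "fraction of a large collection" interpretation can be checked against outcomes once they are known. The sketch below is a hypothetical helper, not part of any particular library: it groups customers by their predicted likelihood and compares the average prediction in each group with the fraction that actually churned. The predictions and outcomes are made up for illustration.

```python
from collections import defaultdict


def calibration_table(likelihoods, churned, n_bins=5):
    """Group customers by predicted likelihood and compare with what happened.

    likelihoods : predicted churn likelihoods in [0, 1]
    churned     : matching outcomes, 1 if the customer churned, else 0
    Returns {bin_index: (customer_count, mean_prediction, observed_fraction)}.
    """
    bins = defaultdict(list)
    for p, y in zip(likelihoods, churned):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    table = {}
    for index in sorted(bins):
        rows = bins[index]
        mean_prediction = sum(p for p, _ in rows) / len(rows)
        observed_fraction = sum(y for _, y in rows) / len(rows)
        table[index] = (len(rows), mean_prediction, observed_fraction)
    return table


# Made-up predictions and outcomes, for illustration only.
predicted = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
actual = [0, 0, 0, 1, 1, 1, 1, 1, 0]
for index, (count, mean_p, observed) in calibration_table(predicted, actual).items():
    print(index, count, round(mean_p, 2), round(observed, 2))
```

No single prediction is "right" or "wrong" here; what can be judged is whether, within each group, the observed fraction of churners is close to the predicted likelihood.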

The Algorithmic Level

The algorithmic level delivers what the business level asks for. In the data science approach, the algorithm level creates, maintains and applies a model of the real-world process involving the objects of interest, hereafter referred to simply as "objects". This process leads to the events or outcomes of interest. The following are the main characteristics of data science algorithms.

An algorithm works with the available data footprint of the process of interest;
It discovers the relationships between the process characteristics and the outcomes;
These relationships are, more often than not, in the form of complex patterns;
Discovering these patterns requires the application of powerful learning algorithms to the historical data;
The discovered patterns lead to learning the required model parameters;
An analysis/model application algorithm uses these parameters to create the model and applies it to new data in order to compute the output.

The algorithm level of a data science application comprises one or (often) more algorithms of the following basic types:

Regression: Predicts the value of a continuous-valued variable from the values of a set of numerical attributes;
Classification: Predicts one discrete class/category out of a set of classes using numerical and/or categorical attributes;
Clustering: Discovers natural groupings of the objects;
Association discovery: Discovers the propensity for similar behavior among two or more objects described using numerical and/or categorical attributes.

Again, there are many actual algorithms of each type, differing in their approaches, complexity, interpretability, acceptable data types and, above all, their efficacy in a particular problem scenario.

Churn analysis example:

Again, with respect to the churn analysis problem, we can easily discern that the objects here are the customers and the process of interest is their reaching a decision about whether to churn or not.

The data about a customer available to the organization contains the data footprint of his/her decision process, albeit hidden among a lot of dust and garbage. How to isolate/extract and possibly enhance the footprint is a matter we shall touch upon in the next section. It can also be seen that here the task demands that the system predict one of two discrete outcomes, churn or not-churn.

Hence, the heart of the system is likely to be a classification model and algorithms for learning and applying the model.
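
To make "learning and applying the model" concrete, here is a minimal sketch of one possible classifier: logistic regression trained by batch gradient descent, written in plain Python. The post does not name a specific algorithm, so this choice is an assumption made purely for illustration, and the two features (complaints in the last quarter, months since last purchase) and the tiny data set are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Learn weights and bias by batch gradient descent on the log-loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def churn_likelihood(weights, bias, x):
    """Apply the learned model to one customer's feature vector."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Invented historical data: [complaints last quarter, months since last purchase]
history = [[0, 1], [1, 0], [0, 0], [1, 1], [4, 5], [5, 4], [3, 6], [6, 3]]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(history, churned)
print(churn_likelihood(weights, bias, [5, 5]))  # a customer resembling past churners
print(churn_likelihood(weights, bias, [0, 0]))  # a customer resembling past stayers
```

The split between `train_logistic` (learning the parameters from historical data) and `churn_likelihood` (applying them to new data) mirrors the two roles described above; a real solution would use a tested library and far more data.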

Note:

We should not straitjacket the problem into a one-to-one correspondence between the business level and the algorithmic level. For example, at the algorithm level we can pose the churn analysis problem as a regression problem, trying to predict, for each customer, after what amount of time she is going to churn.

Actually, what you do at the algorithm level may depend on a lot of factors. We should keep an open mind while exploring for the best possible solution to a given problem.
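
The two framings differ mainly in the target that is built from the same historical records. The following sketch shows both, under assumptions of my own: a "snapshot" date on which the prediction is made, a known churn date (or `None`), and hypothetical `lead_days` and `horizon_days` parameters standing in for the lead time agreed at the business level.

```python
from datetime import date


def churn_label(snapshot, churn_date, lead_days, horizon_days):
    """Classification target: 1 if the customer churns inside the window that
    starts `lead_days` after the snapshot and lasts `horizon_days`, else 0."""
    if churn_date is None:
        return 0
    days_ahead = (churn_date - snapshot).days
    return int(lead_days <= days_ahead < lead_days + horizon_days)


def days_to_churn(snapshot, churn_date):
    """Regression target: days from the snapshot until churn, or None when
    no churn has been observed yet (such customers need separate handling)."""
    if churn_date is None:
        return None
    return (churn_date - snapshot).days


snapshot = date(2016, 1, 1)
print(churn_label(snapshot, date(2016, 3, 1), lead_days=30, horizon_days=60))
print(churn_label(snapshot, date(2016, 1, 10), lead_days=30, horizon_days=60))
print(days_to_churn(snapshot, date(2016, 3, 1)))
```

In this sketch a customer who churns only nine days after the snapshot gets label 0, because the prediction would arrive too late to act on; whether that is the right convention is a business-level decision.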

The Data Level

Data science algorithms work with object data in the form of feature/attribute vectors describing the objects of analysis/interest. In a real-life problem scenario, we seldom have the data readily available in such a form. Usually a lot of effort, often the majority of the total, goes into transforming the available raw data into a usable form made up of vectors of useful features.

This is the stage where we identify, isolate and enhance the data footprint of the process of interest and encode it in the form of feature vectors. Clearly, this is a momentous process that needs to take into account the format, content and quality of the actual data sources available, and that requires a deep understanding of the problem domain. This exercise is most often solution-specific as well as organization-specific (because data collection and stewardship policies vary across organizations). The whole process is known as feature engineering.

Problem-specific feature engineering

Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
- Andrew Ng, Machine Learning and AI via Brain simulations

Let us again go back to the churn analysis/prediction problem. In order to enable the algorithm level to perform adequately, it has to work with quality data, i.e., data with information-rich features.

Technically, such features are called discriminative features: features which can contribute significantly towards the algorithm discriminating/distinguishing between churners and non-churners.

There is often a lot of data available internally (customer demography, product data, transaction history, competitor intelligence, call center records, etc.) as well as accessible from outside (credit rating data, etc.). Not all of this data is useful, nor is all of it readily usable as features (for example, the transaction history is event data over time). Clearly, a naive approach of somehow putting all the data together and wishing that the algorithms work it out can meet with disastrous results.
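
As an illustration of why event data is not directly usable, the sketch below collapses a transaction history into one fixed-length feature row per customer. The record layout `(customer_id, date, amount)`, the feature names and the sample records are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def transaction_features(transactions, snapshot):
    """Collapse (customer_id, date, amount) events into one feature row per customer.

    Only events on or before `snapshot` are used, so the features never
    look into the period being predicted.
    """
    grouped = defaultdict(list)
    for customer_id, day, amount in transactions:
        if day <= snapshot:
            grouped[customer_id].append((day, amount))

    features = {}
    for customer_id, events in grouped.items():
        last_day = max(day for day, _ in events)
        features[customer_id] = {
            "days_since_last": (snapshot - last_day).days,
            "transaction_count": len(events),
            "total_amount": sum(amount for _, amount in events),
        }
    return features


# Invented sample records.
transactions = [
    ("c1", date(2016, 1, 5), 120.0),
    ("c1", date(2016, 2, 20), 80.0),
    ("c2", date(2016, 1, 15), 40.0),
    ("c2", date(2016, 4, 2), 10.0),  # after the snapshot, so it is ignored
]
rows = transaction_features(transactions, snapshot=date(2016, 3, 1))
print(rows["c1"])
print(rows["c2"])
```

Which aggregates are worth computing is exactly the domain question discussed here; recency, frequency and total amount are merely common starting points.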

Thus feature selection and preparation can turn out to be a very complex issue to tackle. Most of the time we have to use our understanding of the problem domain in conjunction with feature selection and transformation methods. For example, in our problem we might consider the following:

Why would a customer decide to leave?

Dissatisfied with the product/service
Product/service is not what is actually needed/expected
Has trouble enjoying facilities provided
Trouble accessing the delivery channels
...

Got a better deal from a competitor

A similar product at a lower price
A better product at a similar price
...

The above issues, individually or together, may influence various aspects of customer behavior vis-a-vis the organization, which, in turn, are reflected in complex patterns hidden in the data. It is the aim of feature engineering to

identify the part of the data in which such patterns are likely to be hidden, and
design suitable processing or transformation of the data in order to enhance its information content.
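
A small example of such a transformation: a dissatisfied customer may simply use the service less and less, which no single raw count shows but a ratio of recent to earlier activity does. The function below is a hypothetical sketch; the monthly counts and the three-month window are invented.

```python
def activity_trend(monthly_counts, recent_months=3):
    """Ratio of average recent activity to average earlier activity.

    Values well below 1.0 suggest that engagement is falling.
    Returns None when there is no earlier activity to compare against.
    """
    recent = monthly_counts[-recent_months:]
    earlier = monthly_counts[:-recent_months]
    if not earlier or sum(earlier) == 0:
        return None  # not enough history to form a baseline
    recent_avg = sum(recent) / len(recent)
    earlier_avg = sum(earlier) / len(earlier)
    return recent_avg / earlier_avg


# Invented monthly transaction counts, oldest first.
print(activity_trend([10, 12, 11, 9, 4, 3, 2]))  # usage is dropping
print(activity_trend([5, 5, 5, 5, 5, 5]))        # usage is steady
```

The point is not this particular ratio but that the transformation encodes a hypothesis about customer behavior, which is what distinguishes it from running a selection algorithm over raw columns.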

Did you notice that earlier I used the term "data footprint"? It fits nicely with my favorite analogy for this work - finding the footprints of a rare animal in a jungle.

Blind feature selection

This is what I feel is the (charitably speaking) lazy approach or (less charitably, but closer to the truth) incompetent approach: get the whole data dump and run some off-the-shelf feature selection/ranking algorithm without trying to understand the data semantics. Unfortunately, I am observing too much of this. While this approach may help build a solution quickly, the result will essentially be a black-box solution. It definitely harms the interpretability/understandability/transparency of the solution and makes solution maintenance and update a nightmare.

Incidentally, practitioners of this approach usually find Deep Learning extremely attractive, almost forming a "celebrity fan-base". The topmost reason for that, as far as I can discern from multiple sources, is that DL networks (supposedly) eliminate the need for feature engineering, both selection and transformation, in one fell swoop!

Unfortunately, the reality is slightly more complicated. But that will be the subject of another post.

Tuesday, 4 October 2016
