A data science-based solution needs to address problems at multiple levels. While it addresses a business problem, computationally it comprises a pipeline of algorithms which, in turn, operate on relevant data presented in a proper format. Thus, to understand such solutions we need to focus at least on the
- Business level;
- Algorithm level; and
- Data level.
Contrary to popular belief, almost all non-trivial data science solutions need to be built from the ground up, with minute and interrelated attention to the details of the problem at all three levels. In the following we shall try to understand this with the help of a running example: aspects of a churn analysis solution.
It is vital to understand that in most real-world cases we are re-purposing the data for building the solution. In other words, the data used was not collected for the kind of analysis we want to perform. It was collected as part of the transactional and operational activities of the organization. Thus the strategies for collection, formatting and storage of the data are optimized for those purposes. Therefore, locating the relevant data and processing it to enable the application of data science techniques can be a quite non-trivial, often herculean, exercise.
From a user perspective, a solution life-cycle can be understood as follows:
- Solution development: Using historical data; involves extensive experimentation, testing and validation;
- Solution deployment: Using the solution to get the insight and/or decision support;
- Solution assimilation: Integrating the solution into the workflow, enabling actions based on the insights and/or predictions it makes;
- Solution maintenance and update: Periodic checking and validation of the solution performance, and updates to improve performance if required.
It is the job of the data scientist(s) to deliver on the above, and for that they have to understand the problem at different levels.
The Business/Domain level
At this level the broad business context and the desired outcome of the solution are defined, along with various parameters and constraints the solution should/must adhere to. Also, at this level desired/acceptable performance parameters can be set commensurate with business policies, especially risk management. Let us try to understand this with respect to the popular churn analysis problem.
Business level context for churn analysis
The churn analysis problem has a quite straightforward context. Every business strives to retain its good/valuable customers. Thus, if it is possible to identify the customers likely to stop doing business with the organization, it may be able to take proactive steps to retain them. However, we cannot simply demand that a system should spout lists of customers whenever asked. We need to specify at least the following:
- What should be the prediction lead time? Naturally, we don’t want to learn that a customer is going to churn within the next five minutes; we cannot possibly do anything with that knowledge. We need enough lead time so that preventive actions can be taken.
- What is the acceptable probability/likelihood measure of a prediction being correct? Predictions are always uncertain to some degree (otherwise they would be called facts). Usually, most of the customers run through the system will have a non-zero likelihood of churning. This issue is often revisited at a later stage and calibrated against the system output characteristics.
- What is the acceptable level of accuracy of the predictions? No solution is going to be 100% accurate (neither are we humans; after all, “to err is human...”) and taking action based on the predictions involves cost. Thus a cost/benefit analysis is required in order to determine an acceptable level of solution performance.
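The cost/benefit reasoning in the last point can be made concrete with a small calculation. The sketch below is only an illustration: the function name, the cost model and all the numbers are hypothetical, and a real analysis would need the organization's own figures.

```python
def campaign_net_benefit(true_pos, false_pos, offer_cost, customer_value, save_rate):
    """Expected net benefit of acting on a list of predicted churners.

    true_pos       -- flagged customers who really would have churned
    false_pos      -- flagged customers who would have stayed anyway
    offer_cost     -- cost of the retention action per contacted customer
    customer_value -- value retained when a churner is saved
    save_rate      -- fraction of contacted churners who accept and stay
    """
    contacted = true_pos + false_pos
    cost = contacted * offer_cost
    benefit = true_pos * save_rate * customer_value
    return benefit - cost


# Hypothetical numbers: two candidate operating points of the same model.
# "strict" flags fewer customers, "loose" flags many more to catch extra churners.
strict = campaign_net_benefit(true_pos=60, false_pos=40, offer_cost=20.0,
                              customer_value=500.0, save_rate=0.3)
loose = campaign_net_benefit(true_pos=90, false_pos=410, offer_cost=20.0,
                             customer_value=500.0, save_rate=0.3)
print(strict, loose)
```

With these made-up figures the stricter list comes out ahead even though it catches fewer churners, because every wrongly flagged customer still costs an offer. Which level of accuracy is acceptable therefore depends on the costs, not on the accuracy figure alone.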
Note:
The meaning of a probability or likelihood measure is not straightforward or very intuitive.
An individual customer will either churn or not. Thus the post-facto or actual probability is either 1 or 0! So what does this likelihood value, which lies somewhere in between, mean? Will it not always be proved wrong in the end?
Well, actually I am misleading you slightly: probability is not about individuals, but about large collections. So the interpretation goes like this:
If there is a very large number of customers with the very same characteristics, i.e., the same values of the attributes used by the solution, then a fraction of them equal to the likelihood value will churn. Satisfactory?
Wait, there are many attributes used in the solution and they have varied ranges of values. How likely is it that many customers will have exactly the same attribute values? Actually, not very likely at all. What happens is this: the solution maintains knowledge, explicit or implicit, of an overall probability distribution over the attribute space.
This knowledge, together with the knowledge of how to use it to make predictions, constitutes the underlying prediction model, which is usually learned from the training data using a machine-learning algorithm. This is used to compute the individual likelihoods. So, in simplistic terms, we can interpret a likelihood value as follows:
If we had a lot of customers like this one, a fraction of them equal to the likelihood value would churn. If the likelihood value is high, so is the chance of this customer being one of those churners. So let us see whether that is so and whether we can prevent him or her from doing so.
While it may not be immediately apparent, understanding the above subtlety is often useful in the overall understanding of system performance.
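One way to check the "fraction of similar customers" reading against data is to group customers by their predicted likelihood and compare each group's average prediction with the fraction that actually churned. The following is a minimal sketch of that idea; the function name, the binning scheme and the scores and outcomes are all made up for illustration.

```python
from collections import defaultdict


def observed_churn_by_bin(likelihoods, churned, n_bins=5):
    """Group customers by predicted likelihood and report, per bin,
    (number of customers, mean predicted likelihood, observed churn fraction)."""
    bins = defaultdict(list)
    for p, y in zip(likelihoods, churned):
        # Bin index 0..n_bins-1; a likelihood of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    report = {}
    for idx in sorted(bins):
        rows = bins[idx]
        mean_p = sum(p for p, _ in rows) / len(rows)
        frac = sum(y for _, y in rows) / len(rows)
        report[idx] = (len(rows), mean_p, frac)
    return report


# Hypothetical scores and outcomes (1 = churned, 0 = stayed).
scores = [0.1, 0.15, 0.12, 0.18, 0.85, 0.9, 0.82, 0.88, 0.95]
outcomes = [0, 0, 0, 1, 1, 1, 0, 1, 1]
report = observed_churn_by_bin(scores, outcomes)
for idx, (n, mean_p, frac) in report.items():
    print(f"bin {idx}: n={n}, mean predicted={mean_p:.2f}, observed={frac:.2f}")
```

If the mean predicted likelihood and the observed fraction are close in each bin, the likelihood values can be read the way the note describes. With a handful of customers per bin, as here, the comparison is of course very noisy.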
The Algorithmic Level
The algorithmic level delivers what the business level asks for. In the data science approach, the algorithm level creates, maintains and applies a model of the real-world process involving the objects of interest, hereafter referred to simply as "objects". This process leads to the events or outcomes of interest. The following are the main characteristics of data science algorithms.
- An algorithm works with the available data footprint of the process of interest;
- It discovers the relationships between the process characteristics and the outcomes;
- These relationships are, more often than not, in the form of complex patterns;
- Discovering these patterns requires applying powerful learning algorithms to the historical data;
- The discovered patterns lead to learning the required model parameters;
- An analysis/model application algorithm uses these parameters to create the model and apply it to new data in order to compute the output.
The algorithm level of a data science application comprises one or (often) more algorithms of the following basic types:
- Regression: Predicts the value of a continuous-valued variable from the values of a set of numerical attributes;
- Classification: Predicts one discrete class/category out of a set of classes using numerical and/or categorical attributes;
- Clustering: Discovers natural groupings of the objects;
- Association discovery: Discovers the propensity for similar behavior among two or more objects described using numerical and/or categorical attributes.
Again, there are many actual algorithms of each type, differing in their approaches, complexity, interpretability, acceptable data types and, above all, their efficacy in a particular problem scenario.
Churn analysis example:
Again, with respect to the churn analysis problem, we can easily discern that the objects here are the customers and the process of interest is their reaching a decision about whether to churn or not.
The data about a customer available to the organization contains the data footprint of his/her decision process, albeit hidden among a lot of dust and garbage. How to isolate/extract and possibly enhance the footprint is a matter we shall touch upon in the next section. It can also be seen that here the task demands that the system predict one of two discrete outcomes, churn or not-churn.
Hence, the heart of the system is likely to be a classification model, together with algorithms for learning and applying the model.
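To make "learning and applying a model" concrete, here is a minimal sketch of a two-class churn classifier. Logistic regression is used purely as an example of a classification algorithm, not as the recommended choice for churn; the two features (complaints in the last quarter, months since the last purchase), the training rows and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Learning step: fit weights (bias first) by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def churn_likelihood(weights, x):
    """Application step: use the learned parameters to score one customer."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


# Hypothetical training data: [complaints, months since last purchase].
rows = [[0, 1], [1, 0], [0, 2], [1, 1], [3, 5], [4, 6], [2, 6], [5, 4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = churned, 0 = stayed

weights = train_logistic(rows, labels)
low = churn_likelihood(weights, [0, 1])
high = churn_likelihood(weights, [4, 5])
print(f"low-risk customer: {low:.2f}, high-risk customer: {high:.2f}")
```

The split into `train_logistic` and `churn_likelihood` mirrors the separation between the learning algorithm, which produces the model parameters from historical data, and the model application algorithm, which uses those parameters on new data.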
Note:
We should not straitjacket the problem into a one-to-one correspondence between the business level and the algorithmic level. For example, at the algorithm level we can pose the churn analysis problem as a regression problem, trying to predict, for each customer, after what amount of time she is going to churn.
Actually, what you do at the algorithm level may depend on a lot of factors. We should keep an open mind while exploring for the best possible solution to a given problem.
The Data Level
Data science algorithms work with object data in the form of feature/attribute vectors describing the objects of analysis/interest. In a real-life problem scenario, we seldom have the data readily available in such a form. Usually a lot of effort, often the majority of the total, goes into transforming the available raw data into a usable form comprising vectors of useful features.
This is the stage at which we identify, isolate and enhance the data footprint of the process of interest and encode it in the form of feature vectors. Clearly, this is a momentous process that needs to take into account the format and content of the actual data sources available and their quality, and that requires a deep understanding of the problem domain. This exercise is most often solution-specific as well as organization-specific (because data collection and stewardship policies vary across organizations). The whole process is known as feature engineering.
Problem-specific feature engineering
Coming up with features is difficult,
time-consuming, requires expert knowledge.
"Applied machine learning" is basically feature engineering. - Andrew Ng, Machine
Learning and AI via Brain simulations
Let us again go back to the churn analysis/prediction problem. In order to enable the algorithm level to perform adequately, it has to work with quality data, i.e., data with information-rich features.
Technically such features are called discriminative features: features which can contribute significantly towards the algorithm discriminating/distinguishing between churners and non-churners.
There is often a lot of data available internally (customer demography, product data, transaction history, competitor intelligence, call center records, etc.) as well as accessible from outside (credit rating data, etc.). Not all of this data is useful, nor is all of it readily usable as features (for example, the transaction history is event data over time). Clearly, a naive approach of somehow putting together all the data and hoping the algorithms work it out can meet with disastrous results.
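As an illustration of why event data is not directly usable, the sketch below turns a transaction log into one feature row per customer. The three features chosen (recency, frequency and total amount) are just one common starting point, and the function name, record layout and sample data are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def transaction_features(transactions, as_of):
    """Turn (customer_id, date, amount) events into one feature row per customer:
    recency in days, number of transactions, and total amount."""
    per_customer = defaultdict(list)
    for customer_id, when, amount in transactions:
        if when <= as_of:  # ignore anything after the observation date
            per_customer[customer_id].append((when, amount))
    features = {}
    for customer_id, events in per_customer.items():
        last_seen = max(when for when, _ in events)
        features[customer_id] = {
            "recency_days": (as_of - last_seen).days,
            "n_transactions": len(events),
            "total_amount": sum(amount for _, amount in events),
        }
    return features


# Hypothetical event log.
log = [
    ("c1", date(2023, 1, 5), 120.0),
    ("c1", date(2023, 3, 20), 80.0),
    ("c2", date(2023, 1, 10), 40.0),
]
features = transaction_features(log, as_of=date(2023, 3, 31))
print(features)
```

The `as_of` cut-off matters: features must be computed only from data available before the prediction is made, otherwise the required lead time is silently violated.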
Thus feature selection and preparation can turn out to be a very complex issue to tackle. Most of the time we have to use our understanding of the problem domain in conjunction with feature selection and transformation methods. For example, in our problem we might consider the following:
- Why would a customer decide to leave?
  - Dissatisfied with the product/service
    - Product/service is not what is actually needed/expected
    - Has trouble enjoying the facilities provided
    - Has trouble accessing the delivery channels
    - ...
  - Got a better deal from a competitor
    - A similar product at a lower price
    - A better product at a similar price
    - ...
The above issues, individually or together, may influence various aspects of customer behavior vis-a-vis the bank, which, in turn, is reflected in complex patterns hidden in the data. It is the aim of feature engineering to
- identify the part of the data in which such patterns are likely to be hidden and
- design suitable processing or transformation of the data in order to enhance the information content.
Did you notice that earlier I used the term “data footprint”? That fits nicely with my favorite analogy for this work: finding the footprints of a rare animal in a jungle.
Blind feature selection
This is what I feel is the (charitably speaking) lazy approach or (less charitably, but closer to the truth) incompetent approach: get the whole data dump and run some out-of-the-box feature selection/ranking algorithm without trying to understand the data semantics. Unfortunately, I am observing too much of this. While this approach may help build a solution quickly, the result will essentially be a black-box solution. It definitely harms the interpretability/understandability/transparency of the solution, and makes solution maintenance and update a nightmare.
Incidentally, practitioners of this approach usually find Deep Learning extremely attractive, almost forming a "celebrity fan-base". The topmost reason for that, as far as I can discern from multiple sources, is that DL networks (supposedly) eliminate the need for feature engineering, both selection and transformation, in one fell swoop!
Unfortunately, the reality is slightly more complicated. But that will be the subject of another post.