Data Science Practitioner: (Dis)Similarity Measures

This post is basically my white-paper on the topic already available in pdf at my home site. Since I am an old Latex hand, most of my writings, especially those with mathematical notations, starts there and end up in pdf. Here essentially I am trying to figure out an economic way to convert Latex content into web-friendly content 😏, already tried a few automated conversion/export methods - did not work properly. So, trying to develop an economic copy-paste method. However, hope those who of you have not seen the pdf version, will enjoy this.

Motivation

The ability to recognize objects and their relationships is at the core of intelligent behavior. This, in turn, depend on one's ability of perceiving similarity or dissimilarity between objects, be physical or abstract ones. Hence, if we are interested to make computers behave with any degree of intelligence, we have to write programs that can work with relevant representation of objects and means to compute their similarities or lack thereof, i.e., dissimilarity (obviously, they are two faces of the same coin).

The motivation for writing this white paper (hopefully, the first of a series, depends on its reception and utilization ) is to introduce our young data scientists to some subtleties of the art (and of course,

science) they have taken up to practice. First, please look at two emphasized phrases of the paragraph above. They are most crucial and fundamental issues. Hope you do not represent a 'chair', to your computer program for identifying objects to seat on, as just something that has four legs - then the computer is very likely to advise you to sit on your house-puppy. Similarly, you don't want to compare prices of two objects by comparing their binary string representations.

In short,

we need to work with a representation of objects of interest containing adequate information relevant for the problem at hand. For example, information needed about a chair for distinguishing it from a table is quite different if we want to distinguish between an easy

chair and a work chair;
once we have a proper representation of the objects, we need to incorporate into a relevant mathematical framework which will enable us to compute a measure of similarity or dissimilarity between objects.

There are quite a number of available measures for each type of commonly used representations. We can find a good listing here. Unfortunately, the measures within each group has many, some obvious and some subtle, differences. Thus, we need to choose and evaluate them in context of our problems. A nice survey of many of these measures is available on internet. Advanced readers can and urged to) directly go there and read it and its alikes.

In this white paper we shall discuss the objects represented by a set of their attributes and computing distances between pairs of them as points in feature space. The computed distance is interpreted as a measure of dissimilarity, and thus, inverse of similarity - in the sense, the less is distance the more is similarity and vice-versa. We shall also find a measure, the cosine similarity that can be directly interpreted as similarity. Don’t worry my friends, I shall take you there very gently.

Working with Object data

As far as organization of the data is concerned, there are three major categories:

Object data: Objects are represented by an ordered set of their attributes/ characteristics/ features. We shall use the term “feature” henceforth - these are the data which we store and perceive as one record/instance per row in a file/table;
Relational/relationship data: The data captures various relationships among
the objects - we often call them “graph data”; and
Sequence data: includes time-series data, natural language text data, ...

Remember, there can be many situations while solving real-life problem, when more than one types of data may be used as well as may be converted to one another. Nevertheless, here we concentrate on object data only.

Attributes

Features in object data can be numerical, Boolean as well as categorical. Again, let us consider some features of a chair:

Height of the chair: numerical;
Area of the seat: numerical;
Has armrest: Boolean - either has or not (1/0)
Reclinable: Boolean - yes/no
Color of the chair: categorical
Number of legs: ? - can be 4, 3, 1 (not 2, I guess), is it meaningful to consider it numeric, so that we can do all kind of mathematical and/or statistical jugglery or should we consider it categorical - 4-legged etc.?

Example, my chair is 22 inch in height, with seat area 400 sq. inch. and armrests, is fixed-back, red with 4 legs. Hence, $MyChair = (22; 400; 1; 0; 4L)$, a 5-tuple is the representation of my chair. In statistical literature such data are often called multivariate data.

Numerical features allow us to apply a vast array of mathematical and statistical tools in order to work with them. Thus, in most cases we try to frame the problems in terms of numerical features (even when the raw data is something different such as text). In majority of cases, Boolean features also can be treated as numerical with values 0 and 1.

Categorical features are slightly difficult to deal with within same framework, since we cannot work out a concept of similarity among their values, e.g., it is somewhat absurd to say that the color green is more similar to black than blue or vice-versa. Nevertheless, we shall see later that the categorical features also can be transformed. Thus, without restricting the applicability, hereafter we shall consider all the features of an object. Now, let us set up some basic nomenclatures here.

Let $X$ denote a data set with $n$ instances of object data;
Let $\mathbf{x}_{i}=\left(x_{i1},x_{i2},\cdots,x_{ip}\right)\in X$ be the $i^{th}$object with $p$ attributes/features with values $x_{i1},x_{i2},\cdots,x_{ip}$respectively; and
To keep the things simple, let us posit that $\forall\mathbf{x}_{i}\in X$, the types of features available are same. In other words, each of the object instance is represented by a $p-tuple$ consisted of the values of the same set of attributes in the same order.

The Iris data, partial listing, sorted by sepal width (Courtesy: Wikipedia)

To understand this properly, let us consider one of the most well known (and classic!) data sets, the Iris data set. The data set is comprised of 50 sample observations from three species of Iris flowers. The measured features are length and widths of the sepals and petals of the flowers. Part of the data is shown above. Each row in the data represents measurements of one flower with the species of the flower noted in fifth column. Thus, this data has four numerical features, i.e., $p=4$ and $3\times50=150$ instances (rows), i.e., $n=150$. The inclusion of 'species' in the fifth column, makes it a labeled data, where each instance is labeled with a category or class. For the time being we shall ignore these labels in our discussion.

It is very important to remember that an object can potentially have many (often extremely large number of) attributes/features. For example, consider the Iris flowers. Other than the four collected in the Iris data set, there are many other possible attributes, including simple ones like color to very complex ones like genome sequence (think of others in between ). It is clear that we cannot use all possible attributes.

So, which ones do we use? This is one of the big questions almost always faced by data scientists in real world problems (in research labs one has the luxury of using well-known data sets). The problem is essentially that of determining relevance and accessibility/availability of the features with respect to the problem-in-hand.

Addressing this issue constitutes big and vital parts of data science known as feature selection and feature extraction, known collectively as feature engineering. While feature engineering is very much problem-specific (and thus out of current scope), let us look into some of the guiding principles here.

Relevance: Usually we would like to use discerning features or discriminating features. They are the features which can be useful for computationally identifying similar and dissimilar objects in the sense relevant for the problem. For example, consider the species identification problem of Iris flowers. The attribute ‘color’ is not very discriminating (all Irises are violet), but genome data can be very useful. On the other hand, if the problem is to identifying between Irises and Roses, color can be one of the very useful features.

Let us try to understand the above in simpler terms. Again, consider the problem of Iris species identification. If the objects, i.e., Iris flowers are represented with a set of discriminating feature, we can expect that, in general, the similarity computed using the features is greater between two flowers from same species than otherwise. Let us add another feature to the representation. This will effect the similarities. Now, if the similarity between flowers from same species is increased and/or similarity between flowers from different species is decreased after introduction of the new feature, then the new feature is discriminating one.

Existence of non-discriminating/irrelevant/redundant features in data evidently do not contribute to the solutions, but can often worsen the situation. Their existence can be thought as noise in the data which essentially decreases signal-to-noise ratio in the information content of the data. This makes the task of extracting knowledge/information (signal) from the data more difficult.

Pragmatics: Even after we have some idea on which features are useful, we have to consider the cost as well as feasibility of collecting and/or processing them. Let us again consider the problem of Iris species identification. While the genome data can be extremely good in discriminating species, it is costly to collect as well as to process computationally. So, if we have the freedom to dictate the data collection strategy, we shall have to consider the cost-benefit tradeoff for selecting the features to be collected/measured.

In a lot of real-world works we don’t have the above freedom. Instead we have to work with already available data, which is mostly collected for some other purpose (transactional, operational, etc.), with no plan of subjecting them to application of data science techniques. Such scenarios are known as data repurposing.

Here our challenge is to identify the relevant features from the data for use in solving the data science problem. In worst case, we might face a situation where there is not enough useful feature in the data for solving the problem to a satisfactory level .

Objects in feature space

The Euclidean space

Till now we have been discussing about a single object and its representation. Now, let us think about working with a number of objects (at the least, similarity refers to a pair of objects!). In this the concepts drawn from the study of vector spaces becomes most invaluable tool. To understand

how to use them let us start with a recalling of our school days when we have learnt as Euclidean geometry or plane geometry.

There we studied properties of various geometric objects, like straight line, parallel lines, triangles, circles, etc., based on a number of axioms or postulates. But, think carefully, at that time we did

not have any business with coordinates of the points. That came latter under the heading co-ordinate geometry (bit of nasty stuff for most of us ). Well, the person responsible for complicating our innocent(?) school days was René Descartes, inventor of, among a lot of other stuff, the Cartesian coordinate system shown below

A Cartesian coordinate plane (Wikipedia)

Cartesian coordinate system is a rectilinear coordinate system, whose axes are mutually perpendicular (in formal mathematical terms, orthogonal)and linear (i.e., straight lines). For Cartesian coordinate system in 3-dimension see below.

3-D Cartesian space (Wikipedia)

Introduction of Cartesian coordinate system endows each point in the Euclidean plane (2-D) with a unique address, called its coordinate, a pair of numeric values. This created a bridge between geometry and algebra, the marriage producing the offspring called the Algebraic Geometry. One of minor consequences of this is we can plot and thus visualize algebraic equations, e.g., $x^{2}+y^{2}=1$, a circle centered at origin with radius 1!.

A Euclidean plane equipped with a Cartesian coordinate system becomes a Euclidean space.

Notice that I have used 'a', not 'the' in the previous sentence. }Further, here the term ``space'' is not a common sense or colloquial term (like in 'a specious home'), rather it refers to an algebraic

geometric entity . Yes, we are venturing in a slightly dangerous territory, but have courage, we need not go too deep into it. For our purpose it will suffice to understand that a ``space'' is a

set of distinctly identifiable (by their coordinates) points with certain properties holding among them. These properties are the ones which essentially distinguish one type of space from another. The

points to remember:

There are many types of spaces other than Euclidean Spaces (yes, there are also quite a few different Euclidean spaces, even with non-Cartesian coordinates) called (obviously!) non-Euclidean spaces; and
There are many coordinate systems other than Cartesians, (again!)the non-Cartesians.

Thankfully, here we shall work only with real-valued (i.e., the coordinate

values are all real numbers) Euclidean space and its \emph{generalizations}

equipped with Cartesian coordinate systems. Let us have an intuitive

understanding how such a space relates to our object data. Let us,

for the time being, consider that the Iris data has only two features,

namely, \emph{petal length} and \emph{petal width. }If we conceive

the petal length as $x$ value and petal width as $y$ value, then

we can associate an iris flower with a point $p=(x,y)$ in the two

dimensional space.

1 comment:

Công ty dịch thuật chuyên nghiệp MIDtrans3 April 2019 at 02:52
Mình đã tìm thấy các thông tin cần thiết ở đây, cảm ơn bạn. Mình cũng muốn giới thiệu về một Công ty dịch thuật uy tín - Công ty dịch thuật miền trung - MIDtrans có trụ sở chính chính tại địa chỉ 02 Hoàng Diệu, TP Đồng Hới, tỉnh Quảng Bình có Giấy phép kinh doanh số 3101023866 cấp ngày 9/12/2016 là đơn vị chuyên cung cấp dịch vụ dịch thuật, phiên dịch dành các cá nhân. Hệ thống thương hiệu và các Công ty dịch thuật con trực thuộc: công ty dịch thuật hcm - dịch thuật sài gòn 247 , địa chỉ 47 Điện Biên Phủ, Phường Đakao, Quận 1 TP HCM, dịch thuật buôn mê thuột, đăk lăk : địa chỉ 100 , Lê lợi, TP Buôn Mê Thuột là nhà cung ứng dịch vụ dịch thuật uy tín hàng đầu tại Đăk Lăk; vietnam translate : dịch vụ dịch thuật cho người nước ngoài có nhu cầu, giao diện tiếng Anh dễ sử dụng; dịch thuật công chứng gò vấp : nhà cung ứng dịch vụ dịch vụ dịch thuật phiên dịch hàng đầu tại Quận Bình Thạnh, TP HCM; dịch thuật đà nẵng midtrans : Địa chỉ 54 Đinh Tiên Hoàng, Quận Hải Châu, TP Đà Nẵng chuyên cung cấp dịch vụ dịch thuật công chứng, dịch thuật chuyên ngành tại Đà Nẵng; dịch thuật hà nội midtrans : địa chỉ 101 Láng Hạ, Đống Đa, Hà Nội là nhà cung ứng dịch vụ biên dịch, phiên dịch chuyên nghiệp tại địa bàn Hà Nội. Chúng tôi chuyên cung cấp các dịch vụ biên dịch và phiên dịch, dịch thuật công chứng chất lượng cao hơn 50 ngôn ngữ khác nhau như tiếng Anh, Nhật, Hàn, Trung, Pháp, Đức, Nga, Tây Ban Nha, Bồ Đào Nha, Ý, Ba Lan, Phần Lan, Thái Lan, Hà Lan, Rumani, Lào, Campuchia, Philippin, Indonesia, La Tinh, Thụy Điển, Malaysia, Thổ Nhĩ Kỳ..vv... Dịch thuật MIDtrans tự hào với đội ngũ lãnh đạo với niềm đam mê, khát khao vươn tầm cao trong lĩnh vực dịch thuật, đội ngũ nhân sự cống hiến và luôn sẵn sàng cháy hết mình. Chúng tôi phục vụ từ sự tậm tâm và cố gắng từ trái tim những người dịch giả.Tự hào là công ty cung cấp dịch thuật chuyên ngành hàng đầu với các đối tác lớn tại Việt nam trong các chuyên ngành hẹp như: y dược (bao gồm bệnh lý), xây dựng (kiến trúc), hóa chất, thủy nhiệt điện, ngân hàng, tài chính, kế toán. Các dự án đã triển khai của Công ty dịch thuật chuyên nghiệp MIDtrans đều được Khách hàng đánh giá cao và đạt được sự tín nhiệm về chất lượng biên phiên dịch đặc biệt đối với dịch hồ sơ thầu , dịch thuật tài liệu tài chính ngân hàng, dịch thuật tài liệu y khoa đa ngữ chuyên sâu. Đó là kết quả của một hệ thống quản lý chất lượng dịch thuật chuyên nghiệp, những tâm huyết và kinh nghiệm biên phiên dịch nhiều năm của đội ngũ dịch giả của chúng tôi. Hotline: 0947688883. email: info@dichthuatmientrung.com.vn . Các bạn ghé thăm site ủng hộ nhé. Cám ơn nhiều

Sunday, 9 July 2017

(Dis)Similarity Measures