Data Science Practitioner

Sunday, 9 July 2017

(Dis)Similarity Measures

This post is basically my white-paper on the topic already available in pdf at my home site. Since I am an old Latex hand, most of my writings, especially those with mathematical notations, starts there and end up in pdf. Here essentially I am trying to figure out an economic way to convert Latex content into web-friendly content 😏, already tried a few automated conversion/export methods - did not work properly. So, trying to develop an economic copy-paste method. However, hope those who of you have not seen the pdf version, will enjoy this.

Motivation

The ability to recognize objects and their relationships is at the core of intelligent behavior. This, in turn, depend on one's ability of perceiving similarity or dissimilarity between objects, be physical or abstract ones. Hence, if we are interested to make computers behave with any degree of intelligence, we have to write programs that can work with relevant representation of objects and means to compute their similarities or lack thereof, i.e., dissimilarity (obviously, they are two faces of the same coin).

The motivation for writing this white paper (hopefully, the first of a series, depends on its reception and utilization ) is to introduce our young data scientists to some subtleties of the art (and of course,

science) they have taken up to practice. First, please look at two emphasized phrases of the paragraph above. They are most crucial and fundamental issues. Hope you do not represent a 'chair', to your computer program for identifying objects to seat on, as just something that has four legs - then the computer is very likely to advise you to sit on your house-puppy. Similarly, you don't want to compare prices of two objects by comparing their binary string representations.

In short,

we need to work with a representation of objects of interest containing adequate information relevant for the problem at hand. For example, information needed about a chair for distinguishing it from a table is quite different if we want to distinguish between an easy

chair and a work chair;
once we have a proper representation of the objects, we need to incorporate into a relevant mathematical framework which will enable us to compute a measure of similarity or dissimilarity between objects.

There are quite a number of available measures for each type of commonly used representations. We can find a good listing here. Unfortunately, the measures within each group has many, some obvious and some subtle, differences. Thus, we need to choose and evaluate them in context of our problems. A nice survey of many of these measures is available on internet. Advanced readers can and urged to) directly go there and read it and its alikes.

In this white paper we shall discuss the objects represented by a set of their attributes and computing distances between pairs of them as points in feature space. The computed distance is interpreted as a measure of dissimilarity, and thus, inverse of similarity - in the sense, the less is distance the more is similarity and vice-versa. We shall also find a measure, the cosine similarity that can be directly interpreted as similarity. Don’t worry my friends, I shall take you there very gently.

Working with Object data

As far as organization of the data is concerned, there are three major categories:

Object data: Objects are represented by an ordered set of their attributes/ characteristics/ features. We shall use the term “feature” henceforth - these are the data which we store and perceive as one record/instance per row in a file/table;
Relational/relationship data: The data captures various relationships among
the objects - we often call them “graph data”; and
Sequence data: includes time-series data, natural language text data, ...

Remember, there can be many situations while solving real-life problem, when more than one types of data may be used as well as may be converted to one another. Nevertheless, here we concentrate on object data only.

Attributes

Features in object data can be numerical, Boolean as well as categorical. Again, let us consider some features of a chair:

Height of the chair: numerical;
Area of the seat: numerical;
Has armrest: Boolean - either has or not (1/0)
Reclinable: Boolean - yes/no
Color of the chair: categorical
Number of legs: ? - can be 4, 3, 1 (not 2, I guess), is it meaningful to consider it numeric, so that we can do all kind of mathematical and/or statistical jugglery or should we consider it categorical - 4-legged etc.?

Example, my chair is 22 inch in height, with seat area 400 sq. inch. and armrests, is fixed-back, red with 4 legs. Hence,

$MyChair = (22; 400; 1; 0; 4L)$ , a 5-tuple is the representation of my chair. In statistical literature such data are often called multivariate data.

Numerical features allow us to apply a vast array of mathematical and statistical tools in order to work with them. Thus, in most cases we try to frame the problems in terms of numerical features (even when the raw data is something different such as text). In majority of cases, Boolean features also can be treated as numerical with values 0 and 1.

Categorical features are slightly difficult to deal with within same framework, since we cannot work out a concept of similarity among their values, e.g., it is somewhat absurd to say that the color green is more similar to black than blue or vice-versa. Nevertheless, we shall see later that the categorical features also can be transformed. Thus, without restricting the applicability, hereafter we shall consider all the features of an object. Now, let us set up some basic nomenclatures here.

Let $X$ denote a data set with $n$ instances of object data;
Let $\mathbf{x}_{i}=\left(x_{i1},x_{i2},\cdots,x_{ip}\right)\in X$ be the $i^{th}$ object with $p$ attributes/features with values $x_{i1},x_{i2},\cdots,x_{ip}$ respectively; and
To keep the things simple, let us posit that $\forall\mathbf{x}_{i}\in X$ , the types of features available are same. In other words, each of the object instance is represented by a $p-tuple$ consisted of the values of the same set of attributes in the same order.

The Iris data, partial listing, sorted by sepal width (Courtesy: Wikipedia)

To understand this properly, let us consider one of the most well known (and classic!) data sets, the Iris data set. The data set is comprised of 50 sample observations from three species of Iris flowers. The measured features are length and widths of the sepals and petals of the flowers. Part of the data is shown above. Each row in the data represents measurements of one flower with the species of the flower noted in fifth column. Thus, this data has four numerical features, i.e.,

$p=4$ and

$3\times50=150$ instances (rows), i.e.,

$n=150$ . The inclusion of 'species' in the fifth column, makes it a labeled data, where each instance is labeled with a category or class. For the time being we shall ignore these labels in our discussion.

It is very important to remember that an object can potentially have many (often extremely large number of) attributes/features. For example, consider the Iris flowers. Other than the four collected in the Iris data set, there are many other possible attributes, including simple ones like color to very complex ones like genome sequence (think of others in between ). It is clear that we cannot use all possible attributes.

So, which ones do we use? This is one of the big questions almost always faced by data scientists in real world problems (in research labs one has the luxury of using well-known data sets). The problem is essentially that of determining relevance and accessibility/availability of the features with respect to the problem-in-hand.

Addressing this issue constitutes big and vital parts of data science known as feature selection and feature extraction, known collectively as feature engineering. While feature engineering is very much problem-specific (and thus out of current scope), let us look into some of the guiding principles here.

Relevance: Usually we would like to use discerning features or discriminating features. They are the features which can be useful for computationally identifying similar and dissimilar objects in the sense relevant for the problem. For example, consider the species identification problem of Iris flowers. The attribute ‘color’ is not very discriminating (all Irises are violet), but genome data can be very useful. On the other hand, if the problem is to identifying between Irises and Roses, color can be one of the very useful features.

Let us try to understand the above in simpler terms. Again, consider the problem of Iris species identification. If the objects, i.e., Iris flowers are represented with a set of discriminating feature, we can expect that, in general, the similarity computed using the features is greater between two flowers from same species than otherwise. Let us add another feature to the representation. This will effect the similarities. Now, if the similarity between flowers from same species is increased and/or similarity between flowers from different species is decreased after introduction of the new feature, then the new feature is discriminating one.

Existence of non-discriminating/irrelevant/redundant features in data evidently do not contribute to the solutions, but can often worsen the situation. Their existence can be thought as noise in the data which essentially decreases signal-to-noise ratio in the information content of the data. This makes the task of extracting knowledge/information (signal) from the data more difficult.

Pragmatics: Even after we have some idea on which features are useful, we have to consider the cost as well as feasibility of collecting and/or processing them. Let us again consider the problem of Iris species identification. While the genome data can be extremely good in discriminating species, it is costly to collect as well as to process computationally. So, if we have the freedom to dictate the data collection strategy, we shall have to consider the cost-benefit tradeoff for selecting the features to be collected/measured.

In a lot of real-world works we don’t have the above freedom. Instead we have to work with already available data, which is mostly collected for some other purpose (transactional, operational, etc.), with no plan of subjecting them to application of data science techniques. Such scenarios are known as data repurposing.

Here our challenge is to identify the relevant features from the data for use in solving the data science problem. In worst case, we might face a situation where there is not enough useful feature in the data for solving the problem to a satisfactory level .

Objects in feature space

The Euclidean space

Till now we have been discussing about a single object and its representation. Now, let us think about working with a number of objects (at the least, similarity refers to a pair of objects!). In this the concepts drawn from the study of vector spaces becomes most invaluable tool. To understand

how to use them let us start with a recalling of our school days when we have learnt as Euclidean geometry or plane geometry.

There we studied properties of various geometric objects, like straight line, parallel lines, triangles, circles, etc., based on a number of axioms or postulates. But, think carefully, at that time we did

not have any business with coordinates of the points. That came latter under the heading co-ordinate geometry (bit of nasty stuff for most of us ). Well, the person responsible for complicating our innocent(?) school days was René Descartes, inventor of, among a lot of other stuff, the Cartesian coordinate system shown below

A Cartesian coordinate plane (Wikipedia)

Cartesian coordinate system is a rectilinear coordinate system, whose axes are mutually perpendicular (in formal mathematical terms, orthogonal)and linear (i.e., straight lines). For Cartesian coordinate system in 3-dimension see below.

3-D Cartesian space (Wikipedia)

Introduction of Cartesian coordinate system endows each point in the Euclidean plane (2-D) with a unique address, called its coordinate, a pair of numeric values. This created a bridge between geometry and algebra, the marriage producing the offspring called the Algebraic Geometry. One of minor consequences of this is we can plot and thus visualize algebraic equations, e.g.,

$x^{2}+y^{2}=1$ , a circle centered at origin with radius 1!.

A Euclidean plane equipped with a Cartesian coordinate system becomes a Euclidean space.

Notice that I have used 'a', not 'the' in the previous sentence. }Further, here the term ``space'' is not a common sense or colloquial term (like in 'a specious home'), rather it refers to an algebraic

geometric entity . Yes, we are venturing in a slightly dangerous territory, but have courage, we need not go too deep into it. For our purpose it will suffice to understand that a ``space'' is a

set of distinctly identifiable (by their coordinates) points with certain properties holding among them. These properties are the ones which essentially distinguish one type of space from another. The

points to remember:

There are many types of spaces other than Euclidean Spaces (yes, there are also quite a few different Euclidean spaces, even with non-Cartesian coordinates) called (obviously!) non-Euclidean spaces; and
There are many coordinate systems other than Cartesians, (again!)the non-Cartesians.

Thankfully, here we shall work only with real-valued (i.e., the coordinate

values are all real numbers) Euclidean space and its \emph{generalizations}

equipped with Cartesian coordinate systems. Let us have an intuitive

understanding how such a space relates to our object data. Let us,

for the time being, consider that the Iris data has only two features,

namely, \emph{petal length} and \emph{petal width. }If we conceive

the petal length as

$x$ value and petal width as

$y$ value, then

we can associate an iris flower with a point

$p=(x,y)$ in the two

dimensional space.

Saturday, 24 June 2017

Building a Deep-Learning (and of course, general Data Science ) Workstation

A personal DIY project

Disclaimer:

I am a data scientist, researcher

♮

as well as practitioner

♯

(these days more of later than former ). This document is result of my personal curiosity and passion. The intended readership of this is experienced data scientists and data science enthusiasts interested in exploring deep learning – the techniques and tricks for training large Neural Networks using high volume of high-dimensional data. While I have tried my best for getting the facts correct, there are very likely some misses, which I shall be grateful if the readers send me feedbacks about. Finally, these machines are also likely to be excellent for gaming. So those of us interested in that pastime, can enjoy their spare time likewise .

♮

read: inventor of new algorithms, frameworks and/or tweaker of existing ones.

♯

read: designer, architect and (sometimes) developer of end-to-end data science solutions for my employer's clients' problems.

Note: I am working in India and shall procure all components from authorized Indian retailers. Therefore, all monetary references, unless marked otherwise, should be understood to be in Indian rupees.

1 Is it a worthwhile exercise?

Data scientists working outside well-funded labs find a hard time to gain access to a GPU accelerated computing system. One of the (allegedly) popular option is to use AWS EC2 p2 instances. Let us have a look.

Powerful Performance

P2 instances provide up to 16 NVIDIA K80 GPUs, 64 vCPUs and 732 GiB of host memory, with a combined 192 GB of GPU memory, 40 thousand parallel processing cores, 70 teraflops of single precision floating point performance, and over 23 teraflops of double precision floating point performance. P2 instances also offer GPUDirect™ (peer-to-peer GPU communication) capabilities for up to 16 GPUs, so that multiple GPUs can work together within a single host.

Scalable

Cluster P2 instances in a scale-out fashion with Amazon EC2 ENA-based Enhanced Networking, so you can run high-performance, low-latency compute grid. P2 is well-suited for distributed deep learning frameworks, such as MXNet, that scale out with near perfect efficiency.

These NVIDIA K80 GPUs are also known as (according to Google) are Tesla K80 GPUs from NVIDIA.

These are monsters!

Still, there are some whispers around the web forums about P2 instances being somewhat slow and overall cost is pretty high. I cannot comment abort the slow part. but price can creep up quite high if you intend to learn deep learning and/or a lot of experiments on large data sets.

However, for me personally the deal-breaker is the ease-of-use issue. These are server instances, thus do bot have any GUI (yes, I am aware that this ask firmly banish me from the nerd community. But then, I was never one, as a data/computer scientist most of my time is spent thinking about algorithms and solution design/architecture. So, I am firmly into whatever allows me to work while not remembering finer nitty-gritties of command-line languages.) Yes, there is some workaround by launching vncserver in the EC2 instance – but, but I found that also quite cumbersome.

Therefore, as a data scientist, I (and I guess, many like me) want a system that:

Capable of GPU accelerated computation;

Also very good for CPU-intensive computations (I am not going to get another system for that)

And easy-to-use, i.e., with a nice desktop GUI – allowing me to use visual tools as well as generate complex visualization.

Yes, you guessed correct, I am talking about Workstations.

Now the question pops up, whether buy or build. There are some niche manufactures, you can check the followings:

http://www.compecta.com/nova.html,

https://exxactcorp.com/deep-learning-workstations-servers.php,

and the original big daddy going for around $15K

https://developer.nvidia.com/devbox.

The prices start at around USD 8000 and can go up to ..., for example, check NVIDIA® DGX-1™ (well, it is actually a server) that costs $125K upward!

Yes, you got it right, pretty obscene. However, mind that they usually have 4 or more very high-end GPUs (NVIDIA TITAN X/Xp, NVIDIA GeForce GTX 1080/ti, NVIDIA QUADRO GP100. etc.)

So, what does an otherwise accomplished and experienced data scientist, interested in exploring deep-learning, but having no organizational support for the same, do?

Answer is obvious:

build your own GPU (for deep learning) accelerated “data science workstation” within an affordable budget.

And in problems your machine turns inadequate, most likely those are paid projects and you can shift to AWS (or any other available facility) and bill it to your client/customer. Even in those cases you can study the downscaled problem in your machine, so that you avoid bulk of the trial and error experimentation in the larger system.

Clearly, there is a good case of building your own, small data science and deep learning workstaion.

2 How does GPUs help in deep learning

Note: While both NVIDIA and AMD are leading manufacturers of high-end graphics cards, at this point of time, only NVIDIA provide comprehensive support for General-Purpose Graphics Processing Units (GPGPU), through its CUDA architecture (most probably since GeForce GTX 7xx series upwards) and development toolkit. Hence, unless mentioned otherwise, in the rest of the article the term “GPU” will refer to CUDA-enabled NVIDIA GPUs.

Original use of GPUs was designed to facilitate fast rendering of graphics, mainly for games. As the games gained in complexity and visual outputs, GPUs also evolves into collection of a large number of processors, called stream processors (SP), also called cores, capable of operating in parallel. In simplistic terms, stream processing refers to performing similar (usually fairly simple) computation on a input steam of data objects/records. In parallel mode, the stream can easily be divided across the available processors and all of them perform the same computation on its portion of the stream. For generating graphics the SPs can work with steams of primitives like vertex, normal, color information of a graphical object, say a triangle or the 2D texture matrix of a frame, etc.

However, for our purpose of understanding, it will suffice to know that a bunch of SPs together can perform some simple operations on a stream of data objects in parallel and very very fast. The limitation is that their repertoire of operation is much limited compared to general-purpose CPUs and at a point of time a group of SPs perform the same operations on their input, e.g., we cannot make half of them to addition and other half multiplication at the same time. The GPU computation model is like this:

Nevertheless, despite their individual simplicity, a large number of SPs can be extremely suitable certain types of computations, namely large matrix multiplications. Fundamentally, this involves a lot of multiplications of pairs of numbers and then adding the results in certain order. In a CPU, even for multi-core/threading, multiplication of each pair of number in each thread will involve (1) fetching the numbers from memory (if lucky, from cache), (2) multiplying them and (3) writing the result back to memory ... this will go on and on. Even with say 20 cores/threads it will be a helluva lot of time to multiply two really large, say 1Mx1M matrices. Unfortunately we face such problems only too frequently in the field of scientific computations. They can be accomplished in GPUs much more efficiently in the manners depicted in CUDA toolkit documentation:

They give us a glimpse (simplistically) how we can supply at one go a column and row (or a large chunk of them depending on availability of number of SMs) to the GPU and it will compute the product in a single execution cycle. This is the acceleration we have by using GPUs for matrix operations.

Now, this capability was there in the GPUs since long back. So, why we could not exploit them in scientific computations? The answer lies in the fact that GPUs are of primary interest of graphics programmers. Now, every processor, at their core execute their own machine-language (binary) instruction set. Using that level of language (or its mnemonic equivalent assembly language) to develop a substantial software is an onerous task. To be able to program them effectively, there has to be a higher-level language with API set to access the processor functionality and corresponding compilers. Now, for graphics programmers languages and associated toolkits, such as DirectX, OpenGL, Vulkan, etc. are developed.

However, these are intended for graphics programmers and designed to male their life simpler. Thus the APIs are designed to work with graphics objects and operations. They allows (just for illustration) stuff like: create a triangle T, rotate T by ..., add texture TX to T , ..., etc. Clearly that is not going to work for others. We need a high-level language interface for general-purpose GPU (parallel) programming. NVIDIA, for its CUDA-based GPUs have just that, a development toolkit, consisted of (1)an extension of C language that can use a set of (b) general-purpose APIs and (3) a good compiler.

So, how does deep learning benefit from all these? Remember that deep learning is essentially techniques and tricks of efficiently training large multi-layer (hence, deep) neural networks. Now, the computational aspect of an NN can (possibly, the only reasonably efficient way) be described/formulated/implemented in terms of operations between arrays of numbers of various dimensions. In traditional vocabulary, array of numbers with dimension 0 are called scalars, dimension 1 are vectors, 2 are matrices and beyond that they are called tensors of respective orders. Often they are all called tensors of respective orders starting from 0 for scalars. Google's TensorFlow deep learning library gets its name because they define deep leaning as computational graph, whose nodes represents operation between tensors.

So, it is easy to see how GPGPU capability can support deep learning. Fortunate for us, most of the time we need not develop our code using CUDA toolkit directly. There are libraries availables in almost all of the popular languages which help us develop our deep learning code in our favorite language and takes up the burden of converting it to GPU executable.

3 Design issues

An interesting build case-study is available at Analytics Vidya blog. Here I intend to go little deep and somewhat pedantic . Here our discussion will attempt to help resolving the following issues:

Adequacy of computational power – both at CPU and GPU levels;

Compatibility of the components and

Scope of expansion/evolution of the system.

We shall look into main components one at a time and also consider cross-component compatibility issues.

3.1 CPU

At this moment, there are 3 families of good multi-core processors in the market capable of running Windows and Linux.

Intel
AMD

Xeons are great server processors, but they cost bombs. So, we drop them outright from consideration.

Ryzen 7s are new and also great (according to reports, haven't yet worked on any), but has only 24 PCIe lanes and capable of supporting up 64 GB RAM.

PCI Express (Peripheral Component Interconnect Express)

Each PCIe lane is a high-speed, independent, point-to-point serial connection between the CPU and a peripheral device (e.g., GPU). They can be grouped for increasing bandwidth of a connection. Such groupings are denoted as ×1 (single lane), ×4, ×8, ×12, ×16 and ×32. Due to independence of the lanes the bandwidth simply multiplies. In a motherboard contain several PCIe slots of sizes. However, actual allocation of lanes to these slots is controlled by the motherboard design depending on the available lanes and the slots in use.

To understand this, let us consider the case of GPUs. A GPU can use up to 16 PCIe connection. Now consider a CPU with 16 lane on a motherboard with two x16 PCIe slots. When one GPU is put in one slot while the other is empty, (most likely) the GPU will be allotted all of the 16 lanes. However, if we put another GPU in the other slot, the motherboard will allot 8 lanes to each slot, thus though the physical slot sizes are both x16, their connection bandwidth is x8.

Like everything in computing, PCIe has gone through several versioning. Here is the chart from Wikipedia page

Now, in the i7 family the stuff is little convoluted. At this moment, the consumer processors are into 7th generation – launched i7-7700 and i7-7700K only till now. These have (among other specs) 4 Cores, 8 MB Cache, 16 PCIe lanes and supports up to 64 GB memory over dual-channel. For our purpose, they are not very suitable.

Let us turn our attention to some unusual processors in i7 family. They are called prosumer processor, the highest end is often termed as “Extreme Edition” processor. These are targeted for high-end desktop (HEDT) market and popular with extreme performance enthusiasts (read gamers). Currently available processors in this category are shown here (Source: http://www.pcgamer.com/the-broadwell-e-review/)

Why are we looking into these pricey beasts? There are at least 3 reasons. Two of them are is explicit in the chart above, while is third is not.

Even humblest of them has 15MB L3 Cache, almost double of that in consumer processors. Processor performance, much more than raw clock speed, improves with the amount of the cache.

What is not mentioned in the chart, these processors support up to 128 GB RAM and capable of quad-channel memory access. Quad-channel access allow the CPU to access 64 bit chunks simultaneously from 4 memory banks. Theoretically (I have not yet seen conclusive test results to this effect) this will cause doubly faster memory access compared to consumer processors, which have only dual-channel capability.

The third reason is there in the chart, their number of PCI Express (PCIe) lanes. Even with the lowest (28) of them, we can (theoretically) support up to 3 GPUs with x8 connections, leaving rest (4) for other devices, if needed.

So, I shortlist these for a potential DS-DL build. Note that none of these have integrated graphics, so we cannot build with them without dedicated graphics card. But that is not a problem for us!

3.2 GPU

GPU is the single most important and (possibly) costliest component in the system. You may dream about Tesla K80, but do not speak aloud unless your deep-pocket organization is strongly backing the build. We have to settle for less, but which of them? Tim Dettmers provides a nice overview of GPU usage for deep learning. His choice analysis shown in Table 1. Table 2 shows prices of 3 next-level, but still cutting-edge GPUs.

GPUs use PCIe connection. Almost all modern GPUs are equipped with PCIe 3.0 x16 connectivity, which means that, when mounted on a x16 slot of the motherboard, they can use up to 16 PCIe lanes, if available/allocated to the slot. However, the beauty of PCIe is that you can mount a GPU (though very unlikely) on a x8, x4 or even x1 slot in a motherboard – and it will still work, albeit slowly.

User Category

Possible GPU choice

Best GPU overall (by a small margin)

Titan Xp

Cost efficient but expensive

GTX 1080 Ti, GTX 1070, GTX 1080

Cost efficient and cheap

GTX 1060 (6GB)

I work with data sets > 250GB

GTX Titan X (Maxwell)

NVIDIA Titan X Pascal

NVIDIA Titan Xp

I have little money

GTX 1060 (6GB)

I have almost no money

GTX 1050 Ti (4GB)

I do Kaggle

GTX 1060 (6GB) for any “normal” competition

GTX 1080 Ti for “deep learning competitions”

I am a competitive CV researcher

NVIDIA Titan Xp

I am a researcher

GTX 1080 Ti

I want to build a GPU cluster

This is really complicated

♯

I started deep learning

♮

Start with a GTX 1060 (6GB)

♯ ♯

I want to try deep learning,

♮ ♮

GTX 1050 Ti (4 or 2GB)

♮

and I am serious about it.

♮ ♮

but I am not serious about it.

♯

You can get some ideas here.

♯ ♯

Depending of what area you choose next (startup, Kaggle, research, applied deep learning) sell your GTX 1060 and buy something more appropriate.

Table 1: GPU choice guide. (Courtesy: Tim Dettmers)

GPU	Price (INR)
NVIDIA GeForce Titan X Pascal 12GB	1,89,999.00
GeForce 12GB GTX TITAN X	1,54,990.00
GeForce GTX 1080 Ti Founders Edition 11GB	67,500.00

Table 2: GPU prices (Amazon India on 19-Jun-2017)

Multiple GPUs in parallel:

What will happen if I throw in more than one GPUs? NVIDIA offers a multi-GPU scaling/parallelizing technology called Scalable Link Interface (SLI), which can support 2/3/4-way parallelism. It seems to work well for some games, especially those designed to take advantage of it. At this time, we do not know clearly about SLI scaling of deep-learning tasks. Tim Dettmers provides some useful pointers – from which it seems that only a few of all deep-learning libraries can take advantage of this. Nevertheless, he also argues that using multiple GPUs even independently, will allow to run multiple DL tasks simultaneously.

3.3 Motherboard (or Mobo for brevity)

The mobo is what hold the system together. So it supposed to be a bit complicated to understand. Nevertheless, we have to brave it. The important (not exhaustive) attributes to look for in a mobo are the following:

Form factor – The system real estate

CPU Socket type – Determines which processor(s) can it take

Chipset – The surrounding electronics compatible to the acceptable processor(s). Sometimes there can be multiple possible chipsets, varying in their cost and capabilities.

Memory capacity – How much (and what kind of) RAM the board can support (If the board supports more than the CPU does, then CPU wins)

Expansion slots – How many and how well devices like GPU can be supported

Secondary storage support – HDD, SSD (SATA/NVMe), etc.

Connectivity support – USB 2.0/3.0, LAN, WiFi, Bluetooth, etc.

System cooling support – For powering cooling attachments.

Common motherboard form factors (Source:http://www.corsair.com/en-us/blog/2013/october/how-to-build-a-pc---motherboard-selection):

The form factor, among other things determine potentially viable system configuration parameters. For a given processor family and chipset, the motherboard supporting them has rest of the area available to arrange for other stuff, especially PCIe slots. It may so happen that a mobo has (say) 4 x16 slots. That does not automatically translate into its ability to always accommodate up to 4 GPUs. Since a typical high-end GPU is “2 slots” wide, there must be enough spacing between the slots. Both EATX and ATX boards have 4 x16 slots (read) and one x4/8 (black, though of same physical size). If you have one GPU, you can plug it into any of them (though not likely in a non-x16). Now, since GPUs are 2 lanes wide, if you decide to have 4 of them, then the black one will be blocked (physically). On the other hand, if you must have some non-GPU device, you can have only up to 3 GPUs, ven if you use the black one for a non-GPU device.

Also, you need space to accommodate memory modules. Thus form factor can influence that too. The EATX board can have up to 8 modules (alternate red and black slots on left and/or right sides of the CPU sockets), while the others can have only 4.

Other mobo attributes mentioned above, though vital, not apparent to visual inspection. One needs to study specification details to know them and resolve compatibility issues and judge suitability of choice.

3.4 Others

Apart from the major ones above, there are a number of other components, such as RAM modules, HDD/SDD, Power supply, Cabinet, etc. Their characteristics attributes and compatibility issues are relatively simple and straight-forward. Hence, we shall not discuss them in detail here, rather while discussing my choice of components.

4 Budget Issues

Like in all sensible endeavors, it is very much recommended that building a DSDL workstation also should be taken up with a predefined budget. If we are building for our employer and/or a client, try to get as good one as possible. I am sure one can figure out how to spend it fruitfully.

5 My build

5.1 My budget

This is my personal project. So the budget depends solely on how much I (and only I) can afford to spend without disrupting my regular life-style and/or inconveniencing myself and my dependents financially. Even so, this is a tricky business, a compromise between wallet and passion. Nevertheless, one has to fix a budget and must stick to it, otherwise, give the lure of plastics, there could too much temptation for wrecking financial havoc.

I am based in India and that has some cost implications. Almost none of the components are manufactured in-country, thus has to be imported and therefore import duty is levied, thus I shall get somewhat less bang for my bucks compared to someone in, say, USA. My strategy is to get a somewhat minimal (but sufficient for moderate-size problems) system up and running and then expanding its capability as required. So let us get down to the initial system budget and configuration.

Budgeting, especially for a close-to-heart project like this can be a little uncertain. So, to arrive at my budget, I take a precedence-based approach. I considered one of my other passions, motorbike. One and half years back I bought a Royal Enfield Tunderbird 350cc bike, costing INR 150K. At that time, I had around INR 50K available in cash, so I paid the rest using a credit card and converted it to 6 EMI payments at 13% pa (not sure this conversion to EMI facility is available in other countries). I tried to remember the problems I faced paying the EMIs, can't remember any. So I decide to plastic for INR 100K as before. Also, sometime back I have received some (paltry) performance bonus from my employer. Thus, all in all, I should not have problem in getting my hand to 100K or so.

So, my budget is INR 200K (approx. USD 3100 on 19-Jun-2017).

Incidentally, 200K was the price of the 500cc variant of the bike I bought, initially I had my eye on that but somehow (with difficulty) curbed my impulse and settled on the lesser one .

5.2 CPU

I intend my system to evolve over time. While I start with one, eventually, ultimately I want to have 3 GPUs. If the CPU has 40 lanes, when I have 3 GPUs, the lane distribution could be x16/x16/x8 or (more likely, if some other device use a few lanes) x16/x8/x8. On the other hand, with a CPU having 28 lanes (saving some money in order to stick to the budget) 3GPUs will end up x8/x8/x8 connections, with 4 lanes still left for other devices.

Now, PCIe3.0:x8 has half bandwidth of x16, but it is still nearly 8 GB/s. The question is to ask is whether the x8 connection is enough for deep-learning workloads. Unfortunately, I could not find an exact answer. However, in context of gaming, some test results are available, which found inconsequential (within margin of error) difference in performance between x16 and x8 for PCIe 3.0. Now, whether graphics processing or deep-learning, computationally, more or less, both are multiplication of large matrices. So, we can expect the similar behavior of the system for DL tasks.

Therefore, I can confidently choose i7-6800K as long as I choose a motherboard that can accommodate, both physically and logically 3 GPUs (and of course, 128 GB DDR4 RAM).

Budget impact:

5.3 Motherboard

i7-6800K needs LGA2011-v3 socket and X99 chipset. There are a number of motherboards for them. These motherboards are typically favorite of serious gamers. After some digging around, I zeroed on some ASUS motherboards – to me (in my present situation) ASUS boards seemed to give better value for money. I looked into comparative description of 4 of them.

While, X99-A II is cheapest, I chose ROG Strix X99 Gaming for its inbuilt WiFi and Bluetooth capability. Notice that all these boards have a feature called “Aura Lighting” – it seems to be lot of LEDs in the board which can be configured to various lighting schemes, making the system illuminated like a Christmas tree. Allegedly, gamers love it, but I am not sure what I shall do, possibly keep the system below the bed!

Memory
Expansion Slots
Storage
- New Intel® Core™ i7 X-Series Processors :
- Intel® X99 chipset :
Multi-GPU Support

Looks all good so far.

Budget impact:

5.4 GPU

I am a data scientist, basically a practitioner, with strong research background. While at this moment I am not doing publishable research (for various reasons), I tweak things a lot and some of them can qualify as publishable, but for my lethargy in writing them up. Anyway, I would like the system to have capability to support research-level works. Looking around the Indian market, I got the following prices (on 19-Jun-2017):

Given my budget, the choice is obvious, the 1080 Ti – however, this is also what recommended by Tim for researchers.

Above specs are extremely important from deep-learning perspective. Other important specs include:

Bus Support
OS Certification
Height
Length
Width
Maximum GPU Temperature (in C)
Graphics Card Power (W)
Recommended System Power (W)

Budget impact:

5.5 Memory modules

Since, sometime in distant future, I might equip the system with full 128 GB RAM. I need to choose 16 GB DDR4 modules. I chose “Corsair Vengeance LPX 16 GB 3000 MHz - CL15”, specs here:

Since this type of modules are often used by gamers, these are also loaded with LEDs, contributing to Christmas spirit. Other than cost, one of the reason of my choice is that this particular module does not seem to have LEDs!

For the initial build I plan to use two modules only (again budget ...), giving 32 GB. Note that, with only two module, the quad-channel access will not be active, instead dual-channel mode will be in operation till I put in two more modules later.

Budget impact:

5.6 Secondary storage

Here I get little bit inventive. I go for two of them with different technologies.

Samsung 960 EVO Series - 500GB PCIe NVMe - M.2 Internal SSD

Seagate 2TB Firecuda (Solid State Hybrid) SATA 6GB/s 64MB Cache 3.5" – significantly better (claimed to be 5x faster) than traditional HDD.

Budget impact:

5.7 PSU

It is imperative that the PSU must comfortably meet the system's overall power demand. I have used an online tool for computing the system power requirement. I have calculated the power requirement for the system with 2 GPUs, 8 RAMs and 2 SATA drives. The details of the calculation is shown 7 in the Appendix. Based on that calculation, I selected CORSAIR CX Series CX850M 850W ATX12V / EPS12V 80 PLUS BRONZE Certified Modular Power Supply. Hopefully, this will be good enough till I try to add the third GPU.

Budget impact:

6 Building

Built it, but there was a slight hiccup. Got delivery of a Barracuda HDD instead of the Firecuda I ordered. It was a third-party seller in Amazon India and had a no-return policy. However, Amazon agreed to refund the price. Ordered Firecuda again from a different seller and awiting delivery. In the meantime assemble the rig for testing using the Barracuda. Using Ubuntu 16. Shall re-install it once put in the Firecuda – shall install the data science and deep learning tools and libraries after that.

Nothing really to share about the building experience/exercise itself – there are numerous YouTube clips of building systems, especially by gamers. Only mine was somewhat more messy and lengthier due to lack of practice.

7 Before signing off

Despite being interested in building stuffs since very young age, I have, till now, built only one PC and that too long (say, 18 years) back. I built that for personal use on cheap, same as now. I guess, this build will be something that can be called an entry-level machine in context of data science and deep-learning (DSDL) rigs. Anyway, this exercise is undertaken in order to find whether we can get personal DSDL machine within affordable (relatively speaking, since almost all IT industry greenhorns these days seem to easily afford to buy a good motorbike within months of staring their job, a DSDL machine like this should not be beyond their reach if they postpone the bike by few months) budget.

Looking forward to get the rig up an running and try out some interesting stuff. Since I am more into NLP than images, hoping to have even this minimal configuration doing some wonders.

Appendix:

The build at a glance:

Intel Core i7 6800K CPU

Asus Strix X99 Socket LGA2011-v3 Gaming Motherboard

Corsair Vengeance LPX 16 GB (1 x 16 GB) 3000 MHz RAM (2 nos)

Immo3D GeForce GTX 1080 Ti Founders Edition GPU

Samsung 960 EVO Series - 500GB PCIe NVMe - M.2 Internal SSD

Seagate 2TB Firecuda Hybrid Hard disk

Corsair Hydro SeriesTM H45 High Performance 120MM Liquid CPU Cooler

CORSAIR CX Series CX850M 850W ATX12V / EPS12V 80 PLUS Modular Power Supply

Corsair CC-9011030-WW Carbide Series Air 540 ATX Cube Case

The power supply requirement calculation

Tuesday, 4 October 2016

Problem levels analysis for data science solutioning

A Data science-based solution needs to address problems at multiple levels. While it addresses a business problem, computationally it is comprised of a pipeline of algorithm which, in turn, operates on relevant data presented in proper format. Thus to understand the them we need to focus at least at the

Business level;
Algorithm level; and
Data level.

Contrary to the popular belief, almost all non-trivial data science solutions are needed to be built ground up with minute and interrelated attention to the details of the problem at all three levels. In the following we shall try to understand that with the help of an running example of aspects of a churn analysis solution.

It is vital to understand that in most real-world cases we are re-purposing the data for building the solution. In other words the data used is not collected for the purpose of kind of analysis we want to perform. They are collected as part of transactional and operational activities of the organization. Thus the strategies for collection, formatting and storage of the data is optimized for those purpose. Therefore, locating the relevant data and processing them to enable application of data science technique can be quite non-trivial, often herculean exercise.

From a user perspective a solution life-cycle can be understood as following:

Solution development: Using historical data, involves extensive experimentation, testing and validation;
Solution deployment: Using the solution to get the insight and/or decision support;
Solution assimilation: In the workflow enabling actions based on insight and/or prediction made by the solution;
Solution maintenance and update: Periodic checking and validation of the solution performance and update to improve performance if required.

It is the job of the data scientist(s) deliver on the above and for that she have to understand the problem at different levels.

The Business/Domain level

At this level the broad business context, desired outcome of the solution are defined along with various parameters and constraints the solution should/must adhere to. Also, at this level desired/ acceptable performance parameters can be set in commensurate with business policies, especially the risk management. Let us try to understand this with respect to the popular churn analysis problem

Business level context for churn analysis

The churn analysis problem has a quite straight-forward context. Every business strives to retain
its good/valuable customers. Thus, if it is possible to identify the customers likely to stop doing business with the organization, it may be able to take proactive steps in order to retain the customer. However, we cannot only demand that a system should spout lists of customers whenever asked. We need to specify at the followings:

What should be the prediction lead time? Naturally, we don’t want to know a customer is going to churn within next five minutes. We cannot possibly do anything about it with the knowledge. We need enough lead time so that preventive actions can be taken.
What is the acceptable probability/likelihood measure of a prediction being correct? Predictions are always uncertain to some degree (otherwise they would be called facts). Usually, most of the customers run through the system will have non-zero likelihood of churning. This issue is often revisited in later stage and calibrated against the system output characteristics.
What is the acceptable level of accuracy of the predictions? No solution is going to be 100% accurate (As we humans are neither, after all ”to err is human...”) and taking action based on the predictions involve cost. Thus a cost/benefit analysis is required in order to determine an acceptable level of solution performance.

Note:

The meaning of probability or likelihood measure is not straightforward or very intuitive.

One individual customer will either churn or not. Thus post-facto or actual probability is either 1 or 0! So what does this likelihood value, which is somewhere in between, mean? It will always be ultimately proved wrong, isn’t it?

Well, actually I am misleading you slightly - probability is not about individuals, but large collections. So, the interpretation goes like this,

If there are large, very large number of customers with very same characteristics or attribute values used by the solution, the likelihood fraction of them will churn. Satisfactory?

Wait, there are many attributes used to in the solutions and they have varied range of values. How likely is it that many customers will have just the same values of the attributes? Actually, not very likely at all. What happened is this, the solution maintains knowledge, explicit or implicit, of an overall probability distribution over the attribute space.

This knowledge and knowledge of how to use it to make prediction constitute the underlying prediction model, which is usually learned from the training data using the machine-learning algorithm. This is used to compute the individual likelihoods. So, in simplistic terms we can interpret this as

If we had a lot of customers like this, likelihood fraction of them will churn. If the likelihood value is high, so is the chance of this guy being one of those churners. So let us see if that is so and we can prevent him to do so.

While it may not be immediately apparent, understanding the above subtlety is often useful in overall understanding of system performance.

The Algorithmic Level

Algorithmic level delivers the asks of the business level. In data science approach, the algorithm level creates, maintains and applies a model of the process in reality involving the objects of interest, hereafter referred as simply ”objects”. This process leads to the events or outcomes of interests. The following are the main characteristic of the data science algorithms.

An algorithm works with available data footprint of the process of interest;
It discovers the relationships between the process characteristics and the outcomes;
The above relationships are, more often than not, in form of complex patterns;
Discovering these patterns require application of powerful learning algorithms on the historical data;
Discovered patterns lead to learning the required model parameters;
An analysis/model application algorithm use these parameters to create the model and apply it on the new data in order to compute the output.

The algorithm level of a data science application is comprised of one or (often) more of algorithms from the basic types of algorithms:

Regression: Predicts the value of a continuous valued variable from the values of a set of numerical attributes;
Classification: Predict one discrete class/category out of a set of classes using numerical and/or categorical attributes;
Clustering: Discovers natural grouping of the objects;
Association discovery: Discovers propensity of similar behavior among two or more objects. described using numerical and/or categorical attributes;

Again, there are many actual algorithms for each types, differing in their approaches, complexity, interpretability, acceptable data types and above all their efficacy in a particular problem scenario.

Churn analysis example:

Again, with respect to the churn analysis problem, we can easily discern that the objects here are the customers and the process of interest is their reaching a decision about whether to churn or not.

The data about a customer available to the organization contains the data footprint of his/her decision process, albeit hidden among a lot of dust and garbage. How to isolate/extract and possibly enhance the footprint is a matter we shall touch upon in next section. It can also be seen that here the task demands the system predict one of two discrete outcomes, churn or not-churn.

Hence, the heart of the system is likely to be a classification model and algorithms for learning and applying the model.

Note:

We should not straight-jacket an one-to-one correspondence between the business level and the algorithmic level of a problem. For example, at the algorithm level we can pose the churn analysis problem as a regression problem, trying to predict for each customer, after what time amount of time she is going to churn.

Actually, what you do at algorithm level may depend on a lot of factors. We should keep an open mind while exploring for best possible solution for a given problem.

The Data Level

Data Science algorithms work with object data in form of feature/attribute vectors describing the
objects of analysis/interest. In a real life problem scenario, we seldom have the data readily available in the such form. Usually a lot of effort, often majority of the total, goes into transforming the available raw data into the usable form comprised of vectors of useful features.

This is the stage we identify, isolate and enhance the data footprint of the process of interest and encode it in form of feature vectors. Clearly, this is a momentous process that needs to take into account the format and content of actual data sources available, their quality as well as deep understanding of the problem domain. This exercise is most often solution-specific as well as organization-specific (because data collection and stewardship policies vary across organizations). The whole process is known as feature engineering.

Problem-specific feature engineering

Coming up with features is difficult, time-consuming, requires expert knowledge.
"Applied machine learning" is basically feature engineering. —Andrew Ng, Machine
Learning and AI via Brain simulations

Let us again go back to the churn analysis/prediction problem. In order to enable algorithm level to perform adequately, it has to work with quality data, i.e., data with information-rich features.

Technically such features are called discriminative features, the features which can significantly contribute towards the algorithm discriminating/distinguishing between churners and non-churners.

There is often a lot of data available internally (customer demography, product data, transaction history, competitor intelligence, call center records, etc.) as well as accessible from outside (credit rating data, etc.). Not all of these data are useful nor all of them are readily usable (for example, the transaction history is event data over time) as features. Clearly, a naive approach of somehow putting together all the data and wish the algorithms work out can
meet with disastrous result.

Thus the feature selection and preparation can turn out to be a very complex issue to tackle. Most of the time we have to use understanding of the problem domain in conjunction with application of feature selection and transformation methods. For example, in our problem we might consider the following:

Why a customer would decide to leave?

Dissatisfied with the product/service
Product/service is not what is actually needed/expected
Has trouble enjoying facilities provided
Trouble accessing the delivery channels
...

Got a better deal from a competitor

A similar product for less price
A better product for similar price
...

The above issues, individually or together, may influence various aspects of customer behavior vis-a-vis the bank, which, in turn reflects in complex patterns hidden in the data. It is the aim of feature engineering to

identify the part of data in which such patterns likely to be hidden and
designing suitable processing or transformation of the data in order to enhance the information content.

Did you notice that earlier I have used the term “data footprint”? That fits nicely with my favorite analogy of this work - finding the footprints of a rare animal in a jungle.

Blind feature selection

This is what I feel is the (charitably speaking) lazy approach or (less of charity, more to truth) incompetent approach. Get the whole data dump and run some in-box feature selection/ranking algorithm without trying to understand the data semantics. Unfortunately, I am observing too much of this. While this approach may help quickly build a solution, it will essentially become a black-box kind of solution. It definitely harms the interpretability/understandability/transparency of the solution as well as makes the solution maintenance and update a nightmare.

Incidentally, practitioners of this approach usually finds Deep Learning extremely attractive, almost forming a "celebrity fan-base". Topmost reason for that, as far as I can discern from multiple sources, is that DL networks (supposedly) eliminate the need of feature engineering, both selection as well as transformation, in one fell swoop!

Unfortunately, the reality is a slightly more complicated. But that will be subject of another post.