Data Science Practitioner: Building a Deep-Learning (and of course, general Data Science ) Workstation

An interesting build case-study is available at Analytics Vidya blog. Here I intend to go little deep and somewhat pedantic . Here our discussion will attempt to help resolving the following issues:

Adequacy of computational power – both at CPU and GPU levels;

Compatibility of the components and

Scope of expansion/evolution of the system.

We shall look into main components one at a time and also consider cross-component compatibility issues.

3.1 CPU

At this moment, there are 3 families of good multi-core processors in the market capable of running Windows and Linux.

Intel
AMD

Xeons are great server processors, but they cost bombs. So, we drop them outright from consideration.

Ryzen 7s are new and also great (according to reports, haven't yet worked on any), but has only 24 PCIe lanes and capable of supporting up 64 GB RAM.

PCI Express (Peripheral Component Interconnect Express)

Each PCIe lane is a high-speed, independent, point-to-point serial connection between the CPU and a peripheral device (e.g., GPU). They can be grouped for increasing bandwidth of a connection. Such groupings are denoted as ×1 (single lane), ×4, ×8, ×12, ×16 and ×32. Due to independence of the lanes the bandwidth simply multiplies. In a motherboard contain several PCIe slots of sizes. However, actual allocation of lanes to these slots is controlled by the motherboard design depending on the available lanes and the slots in use.

To understand this, let us consider the case of GPUs. A GPU can use up to 16 PCIe connection. Now consider a CPU with 16 lane on a motherboard with two x16 PCIe slots. When one GPU is put in one slot while the other is empty, (most likely) the GPU will be allotted all of the 16 lanes. However, if we put another GPU in the other slot, the motherboard will allot 8 lanes to each slot, thus though the physical slot sizes are both x16, their connection bandwidth is x8.

Like everything in computing, PCIe has gone through several versioning. Here is the chart from Wikipedia page

Now, in the i7 family the stuff is little convoluted. At this moment, the consumer processors are into 7th generation – launched i7-7700 and i7-7700K only till now. These have (among other specs) 4 Cores, 8 MB Cache, 16 PCIe lanes and supports up to 64 GB memory over dual-channel. For our purpose, they are not very suitable.

Let us turn our attention to some unusual processors in i7 family. They are called prosumer processor, the highest end is often termed as “Extreme Edition” processor. These are targeted for high-end desktop (HEDT) market and popular with extreme performance enthusiasts (read gamers). Currently available processors in this category are shown here (Source: http://www.pcgamer.com/the-broadwell-e-review/)

Why are we looking into these pricey beasts? There are at least 3 reasons. Two of them are is explicit in the chart above, while is third is not.

Even humblest of them has 15MB L3 Cache, almost double of that in consumer processors. Processor performance, much more than raw clock speed, improves with the amount of the cache.

What is not mentioned in the chart, these processors support up to 128 GB RAM and capable of quad-channel memory access. Quad-channel access allow the CPU to access 64 bit chunks simultaneously from 4 memory banks. Theoretically (I have not yet seen conclusive test results to this effect) this will cause doubly faster memory access compared to consumer processors, which have only dual-channel capability.

The third reason is there in the chart, their number of PCI Express (PCIe) lanes. Even with the lowest (28) of them, we can (theoretically) support up to 3 GPUs with x8 connections, leaving rest (4) for other devices, if needed.

So, I shortlist these for a potential DS-DL build. Note that none of these have integrated graphics, so we cannot build with them without dedicated graphics card. But that is not a problem for us!

3.2 GPU

GPU is the single most important and (possibly) costliest component in the system. You may dream about Tesla K80, but do not speak aloud unless your deep-pocket organization is strongly backing the build. We have to settle for less, but which of them? Tim Dettmers provides a nice overview of GPU usage for deep learning. His choice analysis shown in Table 1. Table 2 shows prices of 3 next-level, but still cutting-edge GPUs.

GPUs use PCIe connection. Almost all modern GPUs are equipped with PCIe 3.0 x16 connectivity, which means that, when mounted on a x16 slot of the motherboard, they can use up to 16 PCIe lanes, if available/allocated to the slot. However, the beauty of PCIe is that you can mount a GPU (though very unlikely) on a x8, x4 or even x1 slot in a motherboard – and it will still work, albeit slowly.

User Category

Possible GPU choice

Best GPU overall (by a small margin)

Titan Xp

Cost efficient but expensive

GTX 1080 Ti, GTX 1070, GTX 1080

Cost efficient and cheap

GTX 1060 (6GB)

I work with data sets > 250GB

GTX Titan X (Maxwell)

NVIDIA Titan X Pascal

NVIDIA Titan Xp

I have little money

GTX 1060 (6GB)

I have almost no money

GTX 1050 Ti (4GB)

I do Kaggle

GTX 1060 (6GB) for any “normal” competition

GTX 1080 Ti for “deep learning competitions”

I am a competitive CV researcher

NVIDIA Titan Xp

I am a researcher

GTX 1080 Ti

I want to build a GPU cluster

This is really complicated

♯

I started deep learning

♮

Start with a GTX 1060 (6GB)

♯ ♯

I want to try deep learning,

♮ ♮

GTX 1050 Ti (4 or 2GB)

♮

and I am serious about it.

♮ ♮

but I am not serious about it.

♯

You can get some ideas here.

♯ ♯

Depending of what area you choose next (startup, Kaggle, research, applied deep learning) sell your GTX 1060 and buy something more appropriate.

Table 1: GPU choice guide. (Courtesy: Tim Dettmers)

GPU	Price (INR)
NVIDIA GeForce Titan X Pascal 12GB	1,89,999.00
GeForce 12GB GTX TITAN X	1,54,990.00
GeForce GTX 1080 Ti Founders Edition 11GB	67,500.00

Table 2: GPU prices (Amazon India on 19-Jun-2017)

What will happen if I throw in more than one GPUs? NVIDIA offers a multi-GPU scaling/parallelizing technology called Scalable Link Interface (SLI), which can support 2/3/4-way parallelism. It seems to work well for some games, especially those designed to take advantage of it. At this time, we do not know clearly about SLI scaling of deep-learning tasks. Tim Dettmers provides some useful pointers – from which it seems that only a few of all deep-learning libraries can take advantage of this. Nevertheless, he also argues that using multiple GPUs even independently, will allow to run multiple DL tasks simultaneously.

3.3 Motherboard (or Mobo for brevity)

The mobo is what hold the system together. So it supposed to be a bit complicated to understand. Nevertheless, we have to brave it. The important (not exhaustive) attributes to look for in a mobo are the following:

Form factor – The system real estate

CPU Socket type – Determines which processor(s) can it take

Chipset – The surrounding electronics compatible to the acceptable processor(s). Sometimes there can be multiple possible chipsets, varying in their cost and capabilities.

Memory capacity – How much (and what kind of) RAM the board can support (If the board supports more than the CPU does, then CPU wins)

Expansion slots – How many and how well devices like GPU can be supported

Secondary storage support – HDD, SSD (SATA/NVMe), etc.

Connectivity support – USB 2.0/3.0, LAN, WiFi, Bluetooth, etc.

System cooling support – For powering cooling attachments.

Common motherboard form factors (Source:http://www.corsair.com/en-us/blog/2013/october/how-to-build-a-pc---motherboard-selection):

The form factor, among other things determine potentially viable system configuration parameters. For a given processor family and chipset, the motherboard supporting them has rest of the area available to arrange for other stuff, especially PCIe slots. It may so happen that a mobo has (say) 4 x16 slots. That does not automatically translate into its ability to always accommodate up to 4 GPUs. Since a typical high-end GPU is “2 slots” wide, there must be enough spacing between the slots. Both EATX and ATX boards have 4 x16 slots (read) and one x4/8 (black, though of same physical size). If you have one GPU, you can plug it into any of them (though not likely in a non-x16). Now, since GPUs are 2 lanes wide, if you decide to have 4 of them, then the black one will be blocked (physically). On the other hand, if you must have some non-GPU device, you can have only up to 3 GPUs, ven if you use the black one for a non-GPU device.

Also, you need space to accommodate memory modules. Thus form factor can influence that too. The EATX board can have up to 8 modules (alternate red and black slots on left and/or right sides of the CPU sockets), while the others can have only 4.

Other mobo attributes mentioned above, though vital, not apparent to visual inspection. One needs to study specification details to know them and resolve compatibility issues and judge suitability of choice.

3.4 Others

Apart from the major ones above, there are a number of other components, such as RAM modules, HDD/SDD, Power supply, Cabinet, etc. Their characteristics attributes and compatibility issues are relatively simple and straight-forward. Hence, we shall not discuss them in detail here, rather while discussing my choice of components.

4 Budget Issues

Like in all sensible endeavors, it is very much recommended that building a DSDL workstation also should be taken up with a predefined budget. If we are building for our employer and/or a client, try to get as good one as possible. I am sure one can figure out how to spend it fruitfully.

5 My build

5.1 My budget

This is my personal project. So the budget depends solely on how much I (and only I) can afford to spend without disrupting my regular life-style and/or inconveniencing myself and my dependents financially. Even so, this is a tricky business, a compromise between wallet and passion. Nevertheless, one has to fix a budget and must stick to it, otherwise, give the lure of plastics, there could too much temptation for wrecking financial havoc.

I am based in India and that has some cost implications. Almost none of the components are manufactured in-country, thus has to be imported and therefore import duty is levied, thus I shall get somewhat less bang for my bucks compared to someone in, say, USA. My strategy is to get a somewhat minimal (but sufficient for moderate-size problems) system up and running and then expanding its capability as required. So let us get down to the initial system budget and configuration.

Budgeting, especially for a close-to-heart project like this can be a little uncertain. So, to arrive at my budget, I take a precedence-based approach. I considered one of my other passions, motorbike. One and half years back I bought a Royal Enfield Tunderbird 350cc bike, costing INR 150K. At that time, I had around INR 50K available in cash, so I paid the rest using a credit card and converted it to 6 EMI payments at 13% pa (not sure this conversion to EMI facility is available in other countries). I tried to remember the problems I faced paying the EMIs, can't remember any. So I decide to plastic for INR 100K as before. Also, sometime back I have received some (paltry) performance bonus from my employer. Thus, all in all, I should not have problem in getting my hand to 100K or so.

So, my budget is INR 200K (approx. USD 3100 on 19-Jun-2017).

Incidentally, 200K was the price of the 500cc variant of the bike I bought, initially I had my eye on that but somehow (with difficulty) curbed my impulse and settled on the lesser one .

5.2 CPU

I intend my system to evolve over time. While I start with one, eventually, ultimately I want to have 3 GPUs. If the CPU has 40 lanes, when I have 3 GPUs, the lane distribution could be x16/x16/x8 or (more likely, if some other device use a few lanes) x16/x8/x8. On the other hand, with a CPU having 28 lanes (saving some money in order to stick to the budget) 3GPUs will end up x8/x8/x8 connections, with 4 lanes still left for other devices.

Now, PCIe3.0:x8 has half bandwidth of x16, but it is still nearly 8 GB/s. The question is to ask is whether the x8 connection is enough for deep-learning workloads. Unfortunately, I could not find an exact answer. However, in context of gaming, some test results are available, which found inconsequential (within margin of error) difference in performance between x16 and x8 for PCIe 3.0. Now, whether graphics processing or deep-learning, computationally, more or less, both are multiplication of large matrices. So, we can expect the similar behavior of the system for DL tasks.

Therefore, I can confidently choose i7-6800K as long as I choose a motherboard that can accommodate, both physically and logically 3 GPUs (and of course, 128 GB DDR4 RAM).

Budget impact:

5.3 Motherboard

i7-6800K needs LGA2011-v3 socket and X99 chipset. There are a number of motherboards for them. These motherboards are typically favorite of serious gamers. After some digging around, I zeroed on some ASUS motherboards – to me (in my present situation) ASUS boards seemed to give better value for money. I looked into comparative description of 4 of them.

While, X99-A II is cheapest, I chose ROG Strix X99 Gaming for its inbuilt WiFi and Bluetooth capability. Notice that all these boards have a feature called “Aura Lighting” – it seems to be lot of LEDs in the board which can be configured to various lighting schemes, making the system illuminated like a Christmas tree. Allegedly, gamers love it, but I am not sure what I shall do, possibly keep the system below the bed!

Memory
Expansion Slots
Storage
- New Intel® Core™ i7 X-Series Processors :
- Intel® X99 chipset :
Multi-GPU Support

Looks all good so far.

Budget impact:

5.4 GPU

I am a data scientist, basically a practitioner, with strong research background. While at this moment I am not doing publishable research (for various reasons), I tweak things a lot and some of them can qualify as publishable, but for my lethargy in writing them up. Anyway, I would like the system to have capability to support research-level works. Looking around the Indian market, I got the following prices (on 19-Jun-2017):

Given my budget, the choice is obvious, the 1080 Ti – however, this is also what recommended by Tim for researchers.

Above specs are extremely important from deep-learning perspective. Other important specs include:

Bus Support
OS Certification
Height
Length
Width
Maximum GPU Temperature (in C)
Graphics Card Power (W)
Recommended System Power (W)

Budget impact:

5.5 Memory modules

Since, sometime in distant future, I might equip the system with full 128 GB RAM. I need to choose 16 GB DDR4 modules. I chose “Corsair Vengeance LPX 16 GB 3000 MHz - CL15”, specs here:

Since this type of modules are often used by gamers, these are also loaded with LEDs, contributing to Christmas spirit. Other than cost, one of the reason of my choice is that this particular module does not seem to have LEDs!

For the initial build I plan to use two modules only (again budget ...), giving 32 GB. Note that, with only two module, the quad-channel access will not be active, instead dual-channel mode will be in operation till I put in two more modules later.

Budget impact:

5.6 Secondary storage

Here I get little bit inventive. I go for two of them with different technologies.

Samsung 960 EVO Series - 500GB PCIe NVMe - M.2 Internal SSD

Seagate 2TB Firecuda (Solid State Hybrid) SATA 6GB/s 64MB Cache 3.5" – significantly better (claimed to be 5x faster) than traditional HDD.

Budget impact:

5.7 PSU

It is imperative that the PSU must comfortably meet the system's overall power demand. I have used an online tool for computing the system power requirement. I have calculated the power requirement for the system with 2 GPUs, 8 RAMs and 2 SATA drives. The details of the calculation is shown 7 in the Appendix. Based on that calculation, I selected CORSAIR CX Series CX850M 850W ATX12V / EPS12V 80 PLUS BRONZE Certified Modular Power Supply. Hopefully, this will be good enough till I try to add the third GPU.