Setting Up Infiniband

So we got some new IB cards, and we needed to set them up on our servers. Our servers run Ubuntu 14.04 for this post, but I believe 16.04 should be similar.

First, physically install the cards.

To check whether the system found your cards, enter the following:

lspci -v | grep Mellanox
02:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

You should get something like the above.

Install Infiniband Driver

Refer to the release notes of version v4_2-1_2_0_0, which list the packages required before installation. I found out afterwards that the installer seems to check these dependencies and install them itself, but it doesn't hurt to prepare your system beforehand.

$ apt-get install perl dpkg autotools-dev autoconf libtool automake1.10 automake m4 dkms debhelper tcl tcl8.4 chrpath swig graphviz tcl-dev tcl8.4-dev tk-dev tk8.4-dev bison flex dpatch zlib1g-dev curl libcurl4-gnutls-dev python-libxml2 libvirt-bin libvirt0 libnl-3-dev libglib2.0-dev libgfortran3 automake m4 pkg-config libnuma-dev logrotate ethtool lsof

Note that for the libnuma and libnl-dev packages, the corresponding Ubuntu package names are libnuma-dev and libnl-3-dev.

Afterwards, check out the ConnectX-3 Pro VPI Single and Dual QSFP+ Port Adapter Card User Manual for more help with installation.


Now, go ahead and install the Mellanox OFED. Download the installer from the Mellanox website under Products->Software->Infiniband VPI drivers. Go for Mellanox OFED Linux and click the Download button at the bottom. If nothing shows up and you are using Chrome, make sure to allow unsafe scripts.

Download the tgz file (or the iso, if you prefer) for your distribution, then untar it.
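For reference, the untar step looks something like this (the archive name below is just an example; substitute the version and distribution of the file you actually downloaded):

```shell
# Example archive name - substitute the file you downloaded
tar xzf MLNX_OFED_LINUX-4.2-1.2.0.0-ubuntu14.04-x86_64.tgz
cd MLNX_OFED_LINUX-4.2-1.2.0.0-ubuntu14.04-x86_64
```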

Install the Mellanox OFED by executing the following script:

./mlnxofedinstall [OPTIONS if applicable. I didn't need any]

Afterwards, I rebooted the system.

Assigning IP addresses to each IB

Now, InfiniBand supports IPoIB, which allows InfiniBand interfaces to be addressed with IP addresses. For this part I referred to the following post. Just to make sure IPoIB is installed, run the following command:

lsmod | grep ipoib

There should be an ib_ipoib module loaded.

Now check your IB interface names via the ifconfig -a command. Then set your IB IP addresses in the /etc/network/interfaces file. The address below is just an example; use your own subnet:

auto ib0
iface ib0 inet static
    address 10.0.0.1
    netmask 255.255.255.0

And bring your network device (ib) up via

ifup ib0

Setting up the Subnet Manager (if you're not using an IB switch)

Now, if you check the status of your IB cards via ibstat, you may find that your card states say State: Initializing. The Intel developer zone has a guide, Troubleshooting InfiniBand connection issues using OFED tools. Under the state section, I found that the INIT state corresponds to hardware initialized but no subnet manager available.

If you are in a situation like mine, where you do not have an InfiniBand switch and are just connecting nodes directly, you need to start a software subnet manager. Another Intel guide showed me how to start it.

/etc/init.d/opensmd start

Afterwards, ibstat showed State: Active.

I tried a few tests with ib_send_bw to check the performance between two nodes, and found that my system was working as expected.
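ib_send_bw runs as a server/client pair. A minimal sketch, assuming the server node's IPoIB address is 10.0.0.1 (an example address):

```shell
# On the first node, start the server side (it waits for a client):
ib_send_bw

# On the second node, connect to the server and run the bandwidth test:
ib_send_bw 10.0.0.1
```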

Also, to set up the subnet manager to start at boot, execute the following command:

update-rc.d opensmd defaults

Setting up a Cluster NIS Client

There are some good manuals around, but the key things are:

  1. When installing via apt-get, make sure to specify the domain name of the NIS master.
    1. If things don’t work well, use apt-get purge nis to remove nis, then reinstall it to set up your NIS domain again.
  2. Set up the ypserver in /etc/yp.conf.
    1. ypserver [full address]
  3. Add nis to the appropriate lines in /etc/nsswitch.conf.
    1. passwd, group, shadow, hosts
  4. Finally, use yptest to check if things are working.
  5. Xenial has an issue where the rpcbind service does not start up properly. I used the following command to set rpcbind to start at bootup.
      1. # systemctl add-wants multi-user.target rpcbind.service
      2. This solution was found on askubuntu.
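Putting the steps above together, a rough sketch (the domain name and server address are placeholders, not real values):

```shell
# 1. Install nis; the installer prompts for the NIS domain of the master
sudo apt-get install nis

# 2. Point the client at the master in /etc/yp.conf
echo "ypserver nismaster.example.local" | sudo tee -a /etc/yp.conf

# 3. Add "nis" to the passwd, group, shadow, and hosts lines
#    of /etc/nsswitch.conf, e.g.:  passwd: compat nis

# 4. Check that the client can reach the NIS maps
yptest
```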

Zedboard Linux Hello world execution failure

No such file or directory!

On trying to execute hello_world.elf, I ran into a ./hello_world.elf: No such file or directory error.

The problem was a missing interpreter, which is specified in the ELF file.

readelf -a hello_world.elf

results in a large output.

If you check the Program Headers section, you see the following:

Program Headers:
  Type      Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  EXIDX     0x0004b4 0x000104b4 0x000104b4 0x00008 0x00008 R   0x4
  PHDR      0x000034 0x00010034 0x00010034 0x00100 0x00100 R E 0x4
  INTERP    0x000134 0x00010134 0x00010134 0x00019 0x00019 R   0x1
      [Requesting program interpreter: /lib/]
  LOAD      0x000000 0x00010000 0x00010000 0x004c0 0x004c0 R E 0x10000
  LOAD      0x0004c0 0x000204c0 0x000204c0 0x0011c 0x00120 RW  0x10000
  DYNAMIC   0x0004cc 0x000204cc 0x000204cc 0x000e8 0x000e8 RW  0x4
  NOTE      0x000150 0x00010150 0x00010150 0x00044 0x00044 R   0x4
  GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10

The Requesting program interpreter line is our problem here.

That file is not available in the Zedboard’s /lib directory.

Thus, simply create a symbolic link to the interpreter on the Zedboard:

ln -s /lib/ /lib/


Synology Gitlab: Setting up SSL with Let’s Encrypt

With Let’s Encrypt on Synology, we need to take an extra step to set up certificates in the Gitlab persistent data.

First off, the Synology certificates seem to be at /usr/syno/etc/certificate/system/default

I’m referring to the instructions on github and by P. Behnke.

Also, the gitlab certs directory should be placed at /volume1/docker/gitlab/certs

Three files are required in all: gitlab.key, gitlab.crt, and dhparam.pem. If any of them is missing, it won’t work.

Generate dhparam.pem by referring to the github info.

openssl dhparam -dsaparam -out dhparam.pem 2048

Also, concatenate fullchain.crt and cert.pem into gitlab.crt, and rename privkey.pem to gitlab.key.

Thus in the end you’ll have the following three files:

  • dhparam.pem
  • gitlab.crt
  • gitlab.key

Restart the gitlab container, and check whether you get a certificate not found error. If you don’t, you should be set to go 🙂

Note: Synology also has dhparam files ready. (It takes very long to generate the file; I generated it on my workstation and ferried it over.) Anyway, you can find the Synology-generated files in the following path: /usr/syno/etc/ssl

Update: When you get an SSL certificate verification failed error

When using the HTTPS protocol, SSL verification sometimes seems to fail. The reason seems to be that gnuTLS is picky about the order of the certificates.

fatal: unable to access 'https://hostname:port/username/repo.git': server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none

Thus, instead of the order shown in P. Behnke‘s blog, reverse the order of fullchain and cert as follows:

$ cat fullchain.crt cert.pem > gitlab.crt

After restarting the docker containers, all seems to work.

Setting up Gitlab on Synology (via Docker)

Synology provides Gitlab in their package manager. However, when you log in, you see that the build version is too old, and Gitlab asks you to upgrade ASAP.

Thus, I checked the docker registry and downloaded sameersbn/gitlab.

Synology seems to use Docker API v1 instead of v2, so you can’t see all the tags. Thus, log on to the Synology device over ssh and execute the following command:

docker pull sameersbn/gitlab:8.12.3

8.12.3 was the latest tag at the time. There is also a latest tag, so you may just pull from that.

Then you get a totally new docker image in the docker explorer in Synology DSM. From here we need to copy some environment variables from the Synology gitlab docker.

From the provided 8.6.2 to 8.12.3 there were obviously a lot of changes, and we need to add two more environment variables before gitlab boots properly.

This github issue helps with the problem. Basically, find the .secret file; its value needs to be set as GITLAB_SECRETS_OTP_KEY_BASE and GITLAB_SECRETS_SECRET_KEY_BASE. After you add these values, gitlab should boot up.

It’s a shame that the Synology version of docker does not support pulling the latest version, and we have to do this manually. But still, Synology did provide quite a lot of useful variables to refer to.

Finally set GITLAB_HTTPS to true to forward the GITLAB_PORT to HTTPS.
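As a sketch, the container launch ends up looking something like the following (the ports, volume path, and secret values are placeholders; consult the sameersbn/gitlab README for the full set of variables):

```shell
docker run -d --name gitlab \
  -p 10022:22 -p 10080:80 -p 10443:443 \
  -v /volume1/docker/gitlab:/home/git/data \
  -e GITLAB_SECRETS_OTP_KEY_BASE=<value-from-.secret-file> \
  -e GITLAB_SECRETS_SECRET_KEY_BASE=<value-from-.secret-file> \
  -e GITLAB_HTTPS=true \
  sameersbn/gitlab:8.12.3
```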

You should be set to go!

Prof. Steve Keckler Talk

Architectures for Deep Neural Networks

DNNs are here to stay because of the big data that is available: lots of data to analyze and learn from. Also the computational ability of current systems.

Accuracy has increased drastically due to DNNs.

Now is a time for computer architects to use their creativity to improve performance, instead of process technology improving performance.

How do DNNs work?

Training & inference

  • Training -> Large amounts of data (back-propagate to learn)
  • Inference -> Smaller, varied amounts of incoming data

A lot of linear algebra! Output = Weight * Input

Learn the weights using back propagation

Locality: Batching! Multiple images at a time. Input matrices & output matrices (single images are vectors)
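A rough numpy sketch (my own toy code, not from the talk) of why batching turns vectors into matrices:

```python
import numpy as np

# Fully connected layer: Output = Weight * Input
weights = np.ones((4, 8)) * 0.5      # 4 outputs, 8 inputs (toy values)

# A single image is a vector -> matrix-vector product
single = np.arange(8.0)
out_single = weights @ single        # shape (4,)

# Batching: stack 32 images as columns -> one matrix-matrix product
batch = np.tile(single[:, None], (1, 32))   # shape (8, 32)
out_batch = weights @ batch                 # shape (4, 32)

# Each column of the batched output matches the single-image result
assert np.allclose(out_batch[:, 0], out_single)
```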

Convolutional stage and fully connected stage

Convolutional stages act as trained feature detection stage

6D loop! Dot product accumulator within the image.

  • Run a larger network (deeper, 100s~1000s of levels)
  • Run network faster
  • Run network more efficiently

Limit of memory capacity (Bandwidth & capacity tradeoff!)

vDNN: Virtualizing training!

Trend is larger and deeper neural networks

Recurrent neural networks! (Temporal factor??) Audio recognition

Training needs to keep each layer’s output activations in memory so that they can be used for back propagation!

Volume of that data is very large! GPU memory usage is proportional to the number of layers

Computing set of gradients!

We don’t need the data of the first layers until training reaches the end and back-propagates! These data take up a lot of space! Thus offload them to DRAM (more capacity)

(forward and backward propagation)

Pipeline Writeback and prefetching

Allows training much larger networks! Incurs little overhead relative to hypothetical large-memory GPUs; the overhead will drop with faster CPU-GPU links.

Optimizing DNNs for Inference

Accessing DRAM is far more expensive than computing. But then to have more accuracy we still need large networks!


  • Reduce numerical precision: FP32->FP16
  • Prune networks: Lots of zeros as weights! Thus remove these unnecessary weights. Also share weights among other elements in the weight matrix
  • Share weights
  • Compress network
  • Re-architect network

Importance of staying local! LPDDR 640 pJ/word, SRAM (MB) 50 -> SRAM (KB) 5

Cost of operations

Accumulate at greater precision than your multiplications!



Lots of the weights are zeros! Prune unnecessary weights!

This allows reducing the network dramatically. You can also prune weights that are close to 0. The other weights can be used to recover from pruning! Retraining is important

Prune up to 90% of weights, but iterative retraining keeps the accuracy pretty high

Pruning can be done aggressively

Factor of 3x performance improvement from pruning
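The pruning idea can be sketched as simple magnitude thresholding (a toy version of my own, not the exact method from the talk; in practice the surviving weights are retrained afterwards to recover accuracy):

```python
import numpy as np

def prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights."""
    threshold = np.quantile(np.abs(weights), fraction)
    mask = np.abs(weights) >= threshold   # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned, mask = prune(w, 0.9)          # prune ~90% of the weights
sparsity = 1.0 - mask.mean()          # fraction of weights now zero
```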


Weight Sharing

Similar weights can be clustered; using simple bit indexes, point to fine-tuned centroids

Retrain to account for the change as well!

We can reduce the weight table significantly!

Huffman encoding to improve further! Up to 35x~49x reduction!
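A minimal sketch of the clustering idea (my own toy code, not the talk's actual scheme): cluster the weights into k=16 centroids, so each weight is stored as a 4-bit index into a small centroid table:

```python
import numpy as np

def share_weights(weights, k=16, iters=10):
    """Cluster weights into k centroids; store small indexes + a table."""
    flat = weights.ravel()
    # Initialize centroids evenly over the weight range
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):   # a few Lloyd (k-means) iterations
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = flat[idx == j].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape), centroids

rng = np.random.default_rng(1)
w = rng.normal(size=(32, 32))
idx, table = share_weights(w, k=16)   # 4-bit indexes + 16-entry table
approx = table[idx]                   # reconstructed (shared) weights
```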

Re-architect networks!

SqueezeNet -> Reduce filters! A combination of 1×1 and 3×3 filters? This allows really shrinking down the networks!

DNN Inference Accelerator Architectures

Eyeriss: ISSCC 2016 & ISCA 2016

Implications of data flow in DNNs

Do we keep filters stationary or input stationary, etc.

Maximize reuse of data within PE

Data compression! Nonlinear functions applied between activations. Used to be sigmoid; now it is ReLU! (Negative values become 0)

Lots of negative values are calculated!

Reduce data footprints… Compress them before moving to off-chip memory.

Saves volume of transferred data

If there are multiplications by 0, then just disable the rest of the calculation, reduce the power!

Up to 25x more energy-efficient compared to Jetson TK

EIE: Efficient Inference Engine (compressed activations)

All weights are compressed and pruned, weight sharing, zero skipping!

Sparse matrix engine, and skip zeros.


Special purpose processors can maximize energy efficiencies


ISCA Day 3


DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

Low-power, high-density reconfigurable RAM.

A low-power & high-capacity FPGA is desirable; use DRAM instead of SRAM lookup tables to provide low power & high capacity

Challenges of building DRAM-based FPGA

LUT is slower than SRAM; destructive access (data lost after access)

Narrower MAT (DRAM array): from 1K -> 8~16 bits,

Destructive DRAM read is solved by PRE(charge), ACT(ivation), RST(restore)
followed by a wire transfer. These are sequential, but can be overlapped:
RST can be overlapped with the wire transfer, etc.

Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs

Resistive Memory Write:

Slower writes have endurance benefits (knife sharpening example)

Adaptively choose the write speed.

Make use of idle memory(?) to slowly write back

Bank-aware mellow writes: Choose banks with fewer blocks to write back in the WB queue of the memory controller. For those relatively free banks, issue slow writes.

Performance degradation is not noticeable, endurance improved by 87%

Eager Mellow Writes:

Predict that LLC dirty lines will not be dirtied again, and so write them back slowly to ReRAM

Does some epoch counting? to find cache lines…

Add an eager mellow write queue: lowest priority, but uses idle memory bandwidth to write back. Eagerly.

Eager writeback also improves performance as it reduces write queue congestion!!

Also employs a lifetime quota, where a lifetime target is enforced.

More energy is used for slower writes!

MITTS: Memory Inter-arrival Time Traffic Shaping

CPU initiated memory bandwidth provisioning.

IaaS providers can charge users for memory bandwidth usage (and arrival time)

HW mechanism to provision memory BW (Bulky vs. Bursty bandwidth)

Relative memory inter-arrival times are made into a histogram.

Credits per inter-arrival-time bin. Thus if you use all your credits, you need to wait and use the next inter-arrival-time credit bin

An array of registers represents the credits in each bin. Also, a replenisher refills the bins
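As a toy model (entirely my own construction; the bin edges, borrow policy, and replenishing are illustrative, not the paper's actual mechanism), the credit bins could look like:

```python
class MittsShaper:
    """Toy MITTS-style shaper: credits per inter-arrival-time bin.

    bins[i] holds credits for requests whose gap since the previous
    request falls in bin i; shorter gaps mean burstier traffic.
    """
    def __init__(self, bin_edges, credits):
        self.bin_edges = bin_edges      # ascending gap thresholds (cycles)
        self.credits = list(credits)    # credits remaining per bin
        self.last_time = 0

    def _bin(self, gap):
        for i, edge in enumerate(self.bin_edges):
            if gap <= edge:
                return i
        return len(self.bin_edges)      # last (bulky, long-gap) bin

    def request(self, now):
        """Return True if the request may issue, else it must stall."""
        b = self._bin(now - self.last_time)
        # Out of credits in this bin -> try a longer-gap bin instead,
        # i.e. the request pays as if it were less bursty traffic.
        for i in range(b, len(self.credits)):
            if self.credits[i] > 0:
                self.credits[i] -= 1
                self.last_time = now
                return True
        return False                    # stall until credits replenish

# Two burst credits in the short-gap bin, none elsewhere:
shaper = MittsShaper(bin_edges=[10, 100], credits=[2, 0, 0])
issued = [shaper.request(t) for t in (1, 2, 3)]  # back-to-back requests
# The first two drain the bursty bin; the third must stall
```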

Reliability 2

All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory

Command/address transfer failed to speed up with newer DRAM specs.

CA is more prone to errors than data! Due to DDR DIMM technology

CA-parity was introduced in DDR4.

Read address errors (the wrong codeword is read! Data & data ECC are valid within themselves, but it’s the wrong codeword!!)
Extended data ECC -> the address is also encoded into the ECC.

Write address errors are even more severe!

Lots of problems possible with command/address


GPUs 2

ActivePointers: Software Address Translation Layer on GPUs

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit


Agile Paging: Exceeding…

Virtualization has two problems:

Nested paging-> slower walks

Shadow paging-> fast walks, slow page table updates

Use shadow paging most of the time, and switch to a nested page walk during the walk.


Nested paging is better if the address space is being frequently switched! However, only a fraction of the address space is dynamic.

Only a few entries of the page table are modified frequently!

Start page walk in shadow mode, optionally switch to nested mode.

One bit of the page table entry signifies that the lower levels require nested paging mode. (Do in nested mode!)

CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture

Sharing architecture -> for each 10-million-cycle execution, the maxima were different when the number of cache banks and # of slices (cores) were different.

Learning optimizer has a feedback mechanism that goes through a Kalman filter(?)

ISCA Day 2

Neural Network

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks


  • Novel energy-efficient CNN dataflow
  • Fabricated, verified chip
  • Taxonomy of CNN dataflows

Convolution accounts for more than 90% of overall computation (runtime & energy consumption)

Simple arithmetic: Multiply and accumulate (MAC)

Add local memory hierarchy: Lots of data is reused (due to nature of window sliding)

  • Image reuse
  • filter reuse,
  • Convolution reuse

Reduce read of filter/image by up to 500x


  1. Input Stationary
    1. Filter is stationary
  2. Output Stationary
    1. Image is stationary
  3. No Local Reuse
    1. No local reuse….??
  4. Row Stationary <– Contribution
    1. Maximize local reuse and accumulation at RF
    2. Row-based work dispatching?

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory

Memory capacity & memory bandwidth issue:
Use PIM in HMC for the issue

We want programmability (general purpose) & energy efficiency

Programmable, scalable,

PIM in the HMC logic die. (Thermal and energy restrictions in the standard?)


Data will be pushed into the PE, and this event will trigger data processing


Design an ISA for Neural Network


Decoupling Loads for Nano-Instruction Set Computers

McFarlin ’13: 88% of OoO performance comes from scheduling, 12% from dynamism (reacting to dynamic events)

  • Ld: Compute address, access memory & write register, (Ordering)
  • Bne: Specify target, compute condition, transfer control

Multiple-action loads make it difficult for the compiler to reorder loads (due to aliasing concerns)

Decoupled loads -> execute the load to the cache ASAP, write back to the register in order (ordering)

Basically prefetch the data from the caches into the register early?

These loads are tracked using load tags. Thus loads are prefetched into load tags, and any stores to the same physical address as a load tag will update the load tag! Thus consistency is guaranteed

I don’t understand the justification that this is not just prefetching!

But still awesome!

Future Vector Microprocessor Extensions for Data Aggregations

Data aggregation: SQL Group By

Reduction of key-value pairs: sum, min, max, average, etc.

(slept ㅡㅡ;;)

Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading

Increase instruction windows in SMT

SMT degrades OoO benefits, as more threads are added.

True data dependencies take longer, thus in-order execution

Add a ‘Shelf’ that holds true dependencies(?)???

Shelf instructions are executed in order, forming a FIFO.

Baseline: OoO-only ROB (64); compared against ROB+Shelf (64+64) and ROB (128).
Performance is better with the 128-entry ROB, but EDP is better with the 64+64 ROB & Shelf

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Add a little bit of compute to memory controller

Dependent cache miss:

The following instructions are delayed due to loads of previous instructions that miss. But the dependents are also misses; thus, a dependent cache miss.

Pointer dependent workloads (pointer chasing, graph processing, etc.)

On-chip delays are pretty significant (shared interconnect and cache??)

The Enhanced Memory Controller (EMC) is used to execute the dependent instructions.

Dependence chain generation generates the instructions to be executed in the EMC.

Miss predictor from Qureshi and Loh (MICRO 2012)

Great work!