Setting Up Infiniband

We got some new InfiniBand (IB) cards and needed to set them up on our servers. The servers in this post run Ubuntu 14.04, but I believe 16.04 should be similar.

Install the cards physically.

To check whether the system has detected your cards, enter the following:

lspci -v | grep Mellanox
02:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

You should get something like the above.

Install Infiniband Driver

Refer to the release notes of version v4_2-1_2_0_0, which list the packages that are required before installation. I found out afterwards that the installer seems to check these dependencies and install them itself, but it does no harm to prepare your system beforehand.

$ apt-get install perl dpkg autotools-dev autoconf libtool automake1.10 automake m4 dkms debhelper tcl tcl8.4 chrpath swig graphviz tcl-dev tcl8.4-dev tk-dev tk8.4-dev bison flex dpatch zlib1g-dev curl libcurl4-gnutls-dev python-libxml2 libvirt-bin libvirt0 libnl-3-dev libglib2.0-dev libgfortran3 pkg-config libnuma-dev logrotate ethtool lsof

For the libnuma package and the libnl-dev package, the corresponding package names are libnuma-dev and libnl-3-dev.

Afterwards, check out the ConnectX-3 Pro VPI Single and Dual QSFP+ Port Adapter Card User Manual for more help with installing.


Now, go ahead and install the Mellanox OFED. Download the installer from the Mellanox website under Products->Software->Infiniband VPI drivers. Go for Mellanox OFED Linux and at the bottom click the Download button. If nothing shows up and you are using Chrome, make sure to enable unsafe scripts.

Download the tgz file (or iso if you prefer iso) for your distribution. Untar the file.

Install the Mellanox OFED by executing the following script:

./mlnxofedinstall [OPTIONS if applicable. I didn't need any]

Afterwards, I rebooted the system.

Assigning IP addresses to each IB

InfiniBand supports IPoIB (IP over InfiniBand), which allows InfiniBand interfaces to be addressed with IP addresses. For this part I referred to the following post. To make sure IPoIB is installed, run the following command:

lsmod | grep ipoib

There should be an ib_ipoib module loaded.

Now check your IB interface names with the ifconfig -a command, then set their IP addresses in the /etc/network/interfaces file.

auto ib0
iface ib0 inet static

Then bring the IB network device up via

ifup ib0
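The stanza above only shows the interface header lines; a static ifupdown stanza normally also carries address and netmask lines. As a minimal sketch, the helper below renders such a stanza from a CIDR string. The function name ib_stanza and the address 10.0.0.1/24 are hypothetical placeholders, not values from my setup.

```python
import ipaddress

def ib_stanza(ifname, cidr):
    """Render a static /etc/network/interfaces stanza for one interface."""
    iface = ipaddress.ip_interface(cidr)  # parses forms like "10.0.0.1/24"
    return "\n".join([
        f"auto {ifname}",
        f"iface {ifname} inet static",
        f"    address {iface.ip}",
        f"    netmask {iface.netmask}",
    ])

# Hypothetical address, for illustration only.
print(ib_stanza("ib0", "10.0.0.1/24"))
```

Paste the printed text into /etc/network/interfaces after replacing the placeholder address with the one you actually want for that node.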

Setting up the Subnet Manager (If you're not using an IB Switch)

Now if you check the status of your IB cards via ibstat, you may find that the ports report State: Initializing. The Intel Developer Zone has an article, Troubleshooting InfiniBand connection issues using OFED tools. In its section on port states, I found that the INIT state means the hardware is initialized but no subnet manager is available.

If you are in a situation like mine, where you do not have an InfiniBand switch and are just connecting nodes directly, you need to start up a software subnet manager. Another Intel guide showed me how to start the subnet manager.

/etc/init.d/opensmd start

Afterwards, ibstat showed State: Active for my cards.
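When checking more than one node, it can help to pull the port states out of the ibstat output programmatically. The following is a minimal sketch, assuming ibstat prints "CA '<name>'", "Port <n>:" and "State: <value>" lines; the function name port_states and the sample text are illustrative and not part of OFED.

```python
import re

def port_states(ibstat_text):
    """Return {(ca_name, port_number): state} parsed from ibstat-style text."""
    states = {}
    ca = None
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            ca, port = m.group(1), None
            continue
        m = re.match(r"Port (\d+):", line)
        if m:
            port = int(m.group(1))
            continue
        # "Physical state: ..." does not match because match() anchors at the start.
        m = re.match(r"State:\s*(\S+)", line)
        if m and ca is not None and port is not None:
            states[(ca, port)] = m.group(1)
    return states

# Abbreviated, illustrative sample (assumed format).
SAMPLE = """CA 'mlx4_0'
    CA type: MT4099
    Port 1:
        State: Initializing
        Physical state: LinkUp
    Port 2:
        State: Active
        Physical state: LinkUp
"""

for key, state in port_states(SAMPLE).items():
    print(key, state)
```

On a real node the text could come from subprocess.run(["ibstat"], capture_output=True, text=True).stdout. Any port still reporting Initializing suggests that no subnet manager is reachable from it.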

I ran a few tests, such as ib_send_bw to check the performance between two nodes, and found that my system was working as expected.

Also, to set up the subnet manager to start at boot, execute the following command:

update-rc.d opensmd defaults

Setup cluster NIS Client

There are some good manuals around, but the key points are:

  1. When installing via apt-get, make sure to specify the domain name of the NIS master.
    1. If things don’t work well, use apt-get purge nis to remove nis, then reinstall it to set up your NIS domain.
  2. Set up the ypserver in /etc/yp.conf.
    1. ypserver [full address]
  3. Add nis to the appropriate lines in /etc/nsswitch.conf.
    1. passwd, group, shadow, hosts
  4. Finally, use yptest to check if things are working.
  5. Xenial has an issue where the rpcbind service does not start up properly. I used the following command to set rpcbind to start at boot.
      1. # systemctl add-wants rpcbind.service
      2. This solution was found on askubuntu
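The nsswitch.conf step in the list above is easy to get half right, so a quick check can help. This is a minimal sketch that reports which of the four databases do not list nis as a source; the function name nis_missing and the sample file contents are my own illustration.

```python
def nis_missing(nsswitch_text, databases=("passwd", "group", "shadow", "hosts")):
    """Return the databases whose nsswitch.conf line does not list 'nis'."""
    sources = {}
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        sources[name.strip()] = rest.split()
    return [db for db in databases if "nis" not in sources.get(db, [])]

# Illustrative contents, not copied from a real machine.
SAMPLE = """# /etc/nsswitch.conf
passwd: compat nis
group:  compat nis
shadow: compat
hosts:  files dns nis
"""

print(nis_missing(SAMPLE))
```

For a real check, read the text from /etc/nsswitch.conf. An empty list means all four databases include nis.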

RISC-V Notes


src/main/scala/uncore/tilelink/Definitions.scala describes each message type as follows:

  1. Acquire: used to initiate coherence protocol transactions in order to gain access to a cache block’s data with certain permissions enabled. … Acquires may contain data for Put or PutAtomic… After sending an Acquire, clients must wait for a manager to send them an uncore Grant message in response.
  2. Probe: used to force clients to release data or cede permissions on a cache block. Clients respond to Probes with Release messages.
  3. Release: used to release data or permissions back to the manager in response to a Probe message. Can also be used to voluntarily write back data (e.g. in the event that dirty data must be evicted on a cache miss).
  4. Grant: used to refill data or grant permissions requested of the manager agent via an Acquire message. Also used to acknowledge the receipt of voluntary writebacks from clients in the form of Release messages.
  5. Finish: used to provide global ordering of transactions. Sent as an acknowledgement for receipt of a Grant message. When a Finish message is received, a manager knows it is safe to begin processing other transactions that touch the same cache block.
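To keep the ordering of the five message types straight, here is a toy trace of a single Acquire transaction. It is only a sketch of the descriptions above, not of the Chisel code: it assumes every other client holding the block is probed, and it ignores coherence states, data beats, and voluntary Releases. The function name acquire_trace and the client names are made up.

```python
def acquire_trace(requester, other_clients):
    """List (message, sender, receiver) tuples for one simplified transaction."""
    trace = [("Acquire", requester, "manager")]
    for client in other_clients:
        # The manager forces each other client to give up the block...
        trace.append(("Probe", "manager", client))
        # ...and each client answers its Probe with a Release.
        trace.append(("Release", client, "manager"))
    trace.append(("Grant", "manager", requester))
    # The Finish tells the manager that the Grant has been received.
    trace.append(("Finish", requester, "manager"))
    return trace

for message in acquire_trace("client0", ["client1"]):
    print(message)
```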

Cache Miss

On a miss, the following block of code adds the miss to the MSHR:

// replacement policy
val replacer = p(Replacer)()
val s1_replaced_way_en = UIntToOH(replacer.way)
val s2_replaced_way_en = UIntToOH(RegEnable(replacer.way, s1_clk_en))
val s2_repl_meta = Mux1H(s2_replaced_way_en, wayMap((w: Int) => RegEnable(, s1_clk_en && s1_replaced_way_en(w))).toSeq)

// miss handling
 := s2_valid_masked && !s2_hit && (isPrefetch(s2_req.cmd) || isRead(s2_req.cmd) || isWrite(s2_req.cmd))
 := s2_req
 := s2_tag_match
 := Mux(s2_tag_match, L1Metadata(s2_repl_meta.tag, s2_hit_state), s2_repl_meta)
 := Mux(s2_tag_match, s2_tag_match_way, s2_replaced_way_en)
 :=
when ( { replacer.miss }
io.mem.acquire <>

The miss should be processed by the MSHR by issuing an Acquire to the TileLink, and waiting for a Grant that’ll be filled into the way.

In the MSHRFile class, there is a line of code as follows:

val sdq_enq = io.req.valid && io.req.ready && cacheable && isWrite(io.req.bits.cmd)

Thus I’m assuming that sdq stands for Store Data Queue. Also, since we’re trying to prefetch misses, I think we can ignore this part for now.

MSHR Issues Acquire Requests

 io.mem_req.valid := state === s_refill_req &&
 io.mem_req.bits := req.old_meta.coh.makeAcquire(
 addr_block = Cat(io.tag, req_idx),
 client_xact_id = Bits(id),
 op_code = req.cmd)

This is a code snippet from the MSHR class. The individual MSHRs are connected via an arbiter in the MSHRFile class. The mem_req_arb’s out is connected to io.mem_req, which is in turn connected to io.mem.acquire in the HellaCache class.

The snippet above sends out an Acquire request if the current MSHR is in the refill-request state and is ready to be enqueued into the Finish Queue. The address block is generated from the tag and index, and the op_code carries the command of the request. An id is also passed in as the client transaction ID; however, this id is simply the index of the MSHR in the MSHRFile.

State change: s_refill_req->s_refill_resp

The MSHR state changes from s_refill_req to s_refill_resp on an Acquire fire(). A fire() occurs when the object’s valid and ready bits are both on at the same time.

object ReadyValidIO {
def fire(): Bool = target.ready && target.valid

Thus the acquire has been fed valid data, and the acquire has been switched to ready. The acquire has been successfully fired, and we’re waiting for a grant response from the uncore.
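The handshake and the state change can be modelled in a few lines. This is a software sketch of the behaviour described above, not the Chisel hardware; it covers only the s_refill_req to s_refill_resp transition, and next_state is a name I made up.

```python
def fire(valid, ready):
    """A ready/valid transfer happens only when both signals are high together."""
    return valid and ready

def next_state(state, acquire_valid, acquire_ready):
    """Advance the MSHR state for one cycle (only the refill request step)."""
    if state == "s_refill_req" and fire(acquire_valid, acquire_ready):
        return "s_refill_resp"  # Acquire sent, now waiting for the Grant
    return state

state = "s_refill_req"
for valid, ready in [(True, False), (True, True)]:
    state = next_state(state, valid, ready)
    print(valid, ready, state)
```

In the first cycle the Acquire is valid but the channel is not ready, so the MSHR stays in s_refill_req; in the second cycle both are high and it moves on.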

Check the code from line:304

Linux NUMA physical memory layout

Just looked through the pseudo files in the procfs.

/proc/zoneinfo has some interesting information.

There are three zone types on my current system: DMA, DMA32, and Normal. We also have 2 nodes, so the file is laid out as:

  • Node0 DMA
  • Node0 DMA32
  • Node0 normal
  • Node1 normal

Each section lists how many free pages and present pages the zone has. At the end of each section there is start_pfn, the page frame number at which the zone begins.

Thus we can approximate the physical address space of our system by computing start_pfn * PAGE_SIZE for the start of each zone and using the number of present pages for its size.
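The calculation can be scripted. This is a minimal sketch, assuming 4 KiB pages and that /proc/zoneinfo contains "Node N, zone NAME" headers followed by "present" and "start_pfn" lines; the sample text is abbreviated and its numbers are made up. Because present pages exclude holes, start plus size is only an approximation of where a zone ends.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

def zone_ranges(zoneinfo_text, page_size=PAGE_SIZE):
    """Return (node, zone, start_address, present_bytes) for each zone."""
    zones = []
    current = None
    for line in zoneinfo_text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if m:
            current = {"node": int(m.group(1)), "zone": m.group(2)}
            zones.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"\s*present\s+(\d+)", line)
        if m:
            current["present"] = int(m.group(1))
        m = re.match(r"\s*start_pfn:?\s+(\d+)", line)
        if m:
            current["start_pfn"] = int(m.group(1))
    return [
        (z["node"], z["zone"], z["start_pfn"] * page_size, z["present"] * page_size)
        for z in zones
        if "start_pfn" in z and "present" in z
    ]

# Abbreviated, made-up sample in the assumed format.
SAMPLE = """Node 0, zone      DMA
  pages free     3975
        present  3994
  start_pfn:         1
Node 0, zone    DMA32
  pages free     1000
        present  2000
  start_pfn:         4096
"""

for node, zone, start, size in zone_ranges(SAMPLE):
    print(f"Node{node} {zone}: starts at {hex(start)}, about {size} bytes present")
```

On a live system, pass the contents of /proc/zoneinfo instead of SAMPLE.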

NUMA Linux Memory Allocation

Scope of Memory Policies

System Default Policy

Hard-coded into the kernel. During system boot it uses interleaved allocation; after bootup it uses local allocation.

Task/Process Policy

Per-task policy. Task policies are inherited by child processes, so applications like numactl use this property to propagate the task policy to the child process.

In a multi-threaded situation where other threads exist, only the thread that calls the memory policy APIs will set its memory policy. All other existing threads will retain the prior policy.

The policy only affects memory allocation after the time the policy is set. All allocations before the change are not affected.

VMA Policy

Only applies to anonymous pages. File-mapped VMAs will ignore the VMA policy if the mapping is MAP_SHARED. If it is MAP_PRIVATE, the VMA policy will only be enforced on a write to the mapping (CoW).

VMA policies are shared by threads of the same address space. VMA policies do not persist across exec() calls (as the address space is wiped).

Shared Policy

Similar to VMA policy but it is shared among processes. Some more details, but skipped due to irrelevance to my work.

Memory Policies

Default Mode

Specifies that the current scope does not follow a policy of its own and falls back to the larger scope’s policy. At the root it’ll follow the system policy.

Bind Mode

Memory is allocated from the nodes specified by the policy. Proximity is considered first: if the memory node closest to the allocation requester has enough free space, the allocation is granted from that node.

Preferred Mode

Allocation will be attempted from the preferred (single) node. If it fails, the nodes will be searched for free space in nearest first fashion.
Local allocation is a preferred mode where the node that initiates a page fault is the preferred node.
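The nearest-first fallback can be pictured with a small model. This is a sketch of the behaviour described above and not the kernel's allocator: the free page counts and the distance table are made-up inputs, and preferred_alloc is a hypothetical name.

```python
def preferred_alloc(preferred, free_pages, distance):
    """Pick the node to allocate from: the preferred node, else the nearest with space."""
    # Sort candidate nodes by distance from the preferred node (itself comes first).
    candidates = sorted(free_pages, key=lambda node: distance[preferred][node])
    for node in candidates:
        if free_pages[node] > 0:
            return node
    return None  # no node has a free page

# Made-up two-node distance table in the style of numactl --hardware output.
DISTANCE = {0: {0: 10, 1: 21}, 1: {0: 21, 1: 10}}

print(preferred_alloc(0, {0: 3, 1: 5}, DISTANCE))
print(preferred_alloc(0, {0: 0, 1: 5}, DISTANCE))
```

Local allocation corresponds to calling this with preferred set to the node on which the page fault happened.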

Interleaved Mode

For anonymous/shared pages: the node set is indexed using the page offset into the VMA (Address % node_nums). A page is requested from the indexed node as in Preferred mode, and if that fails, the fallback follows Preferred mode as well.

Page cache pages: A node counter is used to wrap around and try to spread out pages among the nodes (that are specified).
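Both interleave variants come down to simple index arithmetic. This is a minimal sketch of the two selection rules described above; the function names are mine, and the Preferred-style fallback when the chosen node is full is left out.

```python
import itertools

def interleave_node(page_offset, nodes):
    """Anonymous/shared pages: index the node set by page offset modulo its size."""
    return nodes[page_offset % len(nodes)]

def page_cache_nodes(nodes):
    """Page cache pages: a wrapping counter that cycles through the node set."""
    return itertools.cycle(nodes)

nodes = [0, 1]
print([interleave_node(offset, nodes) for offset in range(4)])
print(list(itertools.islice(page_cache_nodes(nodes), 3)))
```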


Reference: What is Linux Memory Policy


Zedboard Linux Hello world execution failure

No such file or directory!

On trying to execute the hello_world.elf file I ran into a ./hello_world.elf: No such file or directory error.

The problem was that the program interpreter requested in the ELF file was missing on the board.

readelf -a hello_world.elf

results in a big output.

If you check the Program Headers section, you see the following output:

Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
EXIDX 0x0004b4 0x000104b4 0x000104b4 0x00008 0x00008 R 0x4
PHDR 0x000034 0x00010034 0x00010034 0x00100 0x00100 R E 0x4
INTERP 0x000134 0x00010134 0x00010134 0x00019 0x00019 R 0x1
[Requesting program interpreter: /lib/]
LOAD 0x000000 0x00010000 0x00010000 0x004c0 0x004c0 R E 0x10000
LOAD 0x0004c0 0x000204c0 0x000204c0 0x0011c 0x00120 RW 0x10000
DYNAMIC 0x0004cc 0x000204cc 0x000204cc 0x000e8 0x000e8 RW 0x4
NOTE 0x000150 0x00010150 0x00010150 0x00044 0x00044 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10

The Requesting program interpreter line is our problem here.

That file is not available in the Zedboard’s /lib directory.
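This check is easy to script: pull the interpreter path out of the readelf output and test whether it exists. The following is a minimal sketch; the function names are mine, and the path in the sample, /lib/ld-example.so.1, is a hypothetical placeholder rather than the interpreter my binary actually requested.

```python
import os
import re

def requested_interpreter(readelf_text):
    """Return the interpreter path from readelf output, or None if absent."""
    m = re.search(r"\[Requesting program interpreter:\s*(.+?)\]", readelf_text)
    return m.group(1) if m else None

def interpreter_missing(readelf_text, exists=os.path.exists):
    """True if the ELF file asks for an interpreter that is not on this system."""
    path = requested_interpreter(readelf_text)
    return path is not None and not exists(path)

# Hypothetical path, for illustration only.
SAMPLE = """INTERP 0x000134 0x00010134 0x00010134 0x00019 0x00019 R 0x1
[Requesting program interpreter: /lib/ld-example.so.1]
"""

print(requested_interpreter(SAMPLE))
```

Run on the board with the text from readelf -a hello_world.elf, interpreter_missing returning True would point to the same problem I hit.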

So simply create a symbolic link to the interpreter on the Zedboard:

ln -s /lib/ /lib/


Xilinx SDK Hello World Build Errors

Cannot find

I got an error in the Xilinx SDK saying that the ARM GCC compiler was missing.

Not sure if this will help, but I tried installing the 32-bit libz library:

apt-get install zlib1g:i386

Cannot find -lxil

Seems like somebody else had a problem here

And the third message had the answer:

I wasn’t able to solve it. It was a simple project, so I just remade it and it worked.

So I cleaned the build and remade it.