Vivado HLS (2016.2) co-simulation build error on Ubuntu 14.04 (crti.o not found)

Vivado HLS 2016.2 fails to link the co-simulation binaries on Ubuntu 14.04. (I don’t know whether other distributions or other versions of Ubuntu are affected.)

I got my help from this blog post.

The problem I got was as follows:

INFO: [HLS 200-10] In directory '/home/username/Vivado_HLS_projects/XorGenerator/XorGenerator/sim/wrapc'
clang: warning: argument unused during compilation: '-fno-builtin-isinf'
clang: warning: argument unused during compilation: '-fno-builtin-isnan'
INFO: [APCC 202-3] Tmp directory is /tmp/apcc_db_username/160961476017075859177
INFO: [APCC 202-1] APCC is done.
In file included from apatb_do_xor.cpp:18:0:
/opt/Xilinx/Vivado_HLS/2016.2/include/ap_stream.h:70:2: warning: #warning AP_STREAM macros are deprecated. Please use hls::stream<> from "hls_stream.h" instead. [-Wcpp]
/usr/bin/ld: cannot find crt1.o: No such file or directory
/usr/bin/ld: cannot find crti.o: No such file or directory

So the problem is that crt1.o and crti.o are not found. As a hint, these files live under /usr/lib/x86_64-linux-gnu

The intuitive (and, I think, the right) fix would be to add that directory to the build path or library path. The workaround in the blog post instead copies crti.o, crt1.o, and crtn.o into Vivado HLS’s bundled gcc directory.

The target path has changed since the Vivado HLS 2012.02 release covered in the post.

It is now /opt/Xilinx/Vivado_HLS/2016.2/lnx64/tools/gcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.3/

So the command to run is:

cp /usr/lib/x86_64-linux-gnu/crt?.o /opt/Xilinx/Vivado_HLS/2016.2/lnx64/tools/gcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.3/
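
A slightly more defensive version of the same copy, as a sketch only. The function name copy_crt_objects is my own (it is not part of Vivado HLS or the blog post). It checks that the target directory and all three files exist before copying anything.

```shell
# copy_crt_objects SRC DST
# Hypothetical helper (not part of Vivado HLS): copy crt1.o, crti.o and crtn.o
# from SRC into DST, refusing to copy anything if DST or one of the files is missing.
copy_crt_objects() {
    src="$1"
    dst="$2"
    if [ ! -d "$dst" ]; then
        echo "target directory not found: $dst" >&2
        return 1
    fi
    # First pass: verify all three startup objects exist.
    for name in crt1.o crti.o crtn.o; do
        if [ ! -f "$src/$name" ]; then
            echo "missing $src/$name" >&2
            return 1
        fi
    done
    # Second pass: copy them.
    for name in crt1.o crti.o crtn.o; do
        cp "$src/$name" "$dst/" || return 1
    done
}
```

With the paths from this post the call would be copy_crt_objects /usr/lib/x86_64-linux-gnu /opt/Xilinx/Vivado_HLS/2016.2/lnx64/tools/gcc/lib/gcc/x86_64-unknown-linux-gnu/4.6.3 (probably as root, assuming the Xilinx install is not writable by your user).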

Cheers 🙂

Zedboard Block RAM Size

The Zedboard’s device is the Z-7020. According to the Zynq-7000 data sheet, its Block RAM totals 4.9 Mb, organized as 140 blocks of 36 Kb each.
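
The two figures are the same amount in different units, assuming 1 Mb = 1024 Kb. A quick check in the shell (bram_total is just an illustrative name of mine):

```shell
# bram_total BLOCKS
# Total block RAM for BLOCKS blocks of 36 Kb each, printed in Kb and in Mb.
# Assumes 1 Mb = 1024 Kb.
bram_total() {
    blocks="$1"
    awk -v n="$blocks" 'BEGIN { kb = n * 36; printf "%d Kb = %.1f Mb\n", kb, kb / 1024 }'
}

bram_total 140   # prints: 5040 Kb = 4.9 Mb
```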

A FIFO can be used as a bridge between the PS and the PL, since the two sides run at different frequencies. The FIFO acts as a shared buffer that synchronizes the two sides and prevents data loss. (Ref)

Synology Gitlab Setup SSL over Let’s Encrypt

With Let’s Encrypt and Synology, we need to take an extra step to set up the certificates in the Gitlab persistent data.

First off, the Synology certificates seem to be at /usr/syno/etc/certificate/system/default

I’m referring to the instructions on github and by P. Behnke.

Also, the gitlab certs directory should be placed at /volume1/docker/gitlab/certs

Three files are required in all, gitlab.key, gitlab.crt, and dhparam.pem. If any of them does not exist, it won’t work.

Generate dhparam.pem by referring to the github info.

openssl dhparam -dsaparam -out dhparam.pem 2048

Also concatenate fullchain.crt and cert.pem into gitlab.crt, and rename privkey.pem to gitlab.key

Thus in the end you’ll have the following three files:

  • dhparam.pem
  • gitlab.crt
  • gitlab.key

Restart the Gitlab container and check whether you get a “certificate not found” error. If you don’t, you should be set to go 🙂

Note: Synology also has dh-param files ready. (Generating the file takes a very long time; I generated it on my workstation and ferried it over.) Anyway, you can find the Synology-generated files in the following path: /usr/syno/etc/ssl

Update: When you get an SSL certificate verification failed error

When using the HTTPS protocol, SSL verification sometimes seems to fail. The reason seems to be GnuTLS being picky about the order of the certificates.

fatal: unable to access 'https://hostname:port/username/repo.git' server certificate verification failed. CAfile: /etc/ssl/certs/ca-certificates.crt CRLfile: none

So instead of the order shown in P. Behnke’s blog, reverse the order of the fullchain and cert as follows:

$ cat fullchain.crt cert.pem > gitlab.crt

After restarting the docker containers, all seems to work.
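
The file shuffling above can be wrapped in a small function. This is only a sketch: build_gitlab_certs is a name I made up, the source and target directories are passed in as arguments, the certificate order follows the update above (fullchain first), and the chmod 600 on the key is my own addition rather than something the referenced instructions require.

```shell
# build_gitlab_certs SRC DST
# Hypothetical helper: assemble gitlab.crt and gitlab.key in DST from the
# Let's Encrypt files in SRC (file names as used in this post).
build_gitlab_certs() {
    src="$1"
    dst="$2"
    for f in fullchain.crt cert.pem privkey.pem; do
        if [ ! -f "$src/$f" ]; then
            echo "missing $src/$f" >&2
            return 1
        fi
    done
    mkdir -p "$dst" || return 1
    # Order matters for GnuTLS: fullchain first, then cert.
    cat "$src/fullchain.crt" "$src/cert.pem" > "$dst/gitlab.crt" || return 1
    cp "$src/privkey.pem" "$dst/gitlab.key" || return 1
    chmod 600 "$dst/gitlab.key" || return 1   # assumption: keep the key private
    # dhparam.pem is generated separately; just warn if it is not there yet.
    if [ ! -f "$dst/dhparam.pem" ]; then
        echo "note: $dst/dhparam.pem does not exist yet" >&2
    fi
    return 0
}
```

With the paths from this post the call would be build_gitlab_certs /usr/syno/etc/certificate/system/default /volume1/docker/gitlab/certs, followed by a restart of the container.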

Setting up Gitlab on Synology (via Docker)

Synology provides Gitlab through its package manager. However, when you log on you see that the bundled build is too old, and Gitlab asks you to upgrade ASAP.

Thus, I checked the docker registry, and downloaded the sameersbn/gitlab.

Synology seems to use Docker API v1 instead of v2, so you can’t see all the tags. Instead, log on to the Synology device over ssh and execute the following command:

docker pull sameersbn/gitlab:8.12.3

8.12.3 was the newest version tag at the time. There is also a tag named latest, so you may just pull that one instead.

You then get a brand new docker image in the docker explorer on Synology DSM. From here we need to copy some environment variables from the Synology Gitlab docker container.

Between the provided 8.6.2 and 8.12.3 there were obviously a lot of changes, and we need to add two more environment variables before Gitlab boots properly.

This GitHub issue helps here. Basically, find the .secret file; its value needs to be set as both GITLAB_SECRETS_OTP_KEY_BASE and GITLAB_SECRETS_SECRET_KEY_BASE. After you add these values, Gitlab should boot up.

It’s a shame that the Synology version of docker does not support pulling the latest version, so we have to do this manually. Still, Synology did provide quite a lot of useful variables to refer to.

Finally set GITLAB_HTTPS to true to forward the GITLAB_PORT to HTTPS.
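
To avoid typos when entering these variables, something like the following can print them in KEY=value form. It is a sketch: gitlab_env_lines is a made-up name, and the path to the .secret file is passed in as an argument because it depends on your setup.

```shell
# gitlab_env_lines SECRET_FILE
# Hypothetical helper: print the extra environment variables described above,
# using the contents of SECRET_FILE for both secret settings.
gitlab_env_lines() {
    secret_file="$1"
    if [ ! -r "$secret_file" ]; then
        echo "cannot read $secret_file" >&2
        return 1
    fi
    # Strip the trailing newline so the value fits on one line.
    secret=$(tr -d '\n' < "$secret_file")
    printf 'GITLAB_SECRETS_OTP_KEY_BASE=%s\n' "$secret"
    printf 'GITLAB_SECRETS_SECRET_KEY_BASE=%s\n' "$secret"
    printf 'GITLAB_HTTPS=true\n'
}
```

The output can be pasted into the container’s environment settings in DSM.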

You should be set to go!

Scala Language

In our FPGA project we’ll most likely be using the UC Berkeley Architecture Research team’s Rocket core. The core is written in the Scala language and is compiled via Chisel into a C++ simulator or into Verilog for FPGA/ASIC.

I need to know Scala to fully understand the Rocket core, so I’m looking through the Scala Tutorial offered at the Scala Language site.

Interesting Features

Method call

Methods taking one argument can be used with an infix syntax.

df format now

is another way of writing

df.format(now)

The two forms mean the same thing; only the syntax differs. It looks a lot like the style of Smalltalk (and Objective-C).

Functions are Objects (Passing functions as arguments)

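object Timer {
  // Calls the given function once per second, forever.
  def oncePerSecond(callback: () => Unit): Unit = {
    while (true) { callback(); Thread sleep 1000 }
  }
  def timeFlies(): Unit = {
    println("time flies like an arrow...")
  }
  def main(args: Array[String]): Unit = {
    oncePerSecond(timeFlies) // the method itself is passed as the argument
  }
}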

Anonymous Functions

object TimerAnonymous {
  def oncePerSecond(callback: () => Unit): Unit = {
    while (true) { callback(); Thread sleep 1000 }
  }
  def main(args: Array[String]): Unit = {
    // The callback is defined inline instead of as a named method.
    oncePerSecond(() =>
      println("time flies like an arrow..."))
  }
}

Notice that the () => denotes an anonymous function. I think it would be a lambda function 😉

Vivado License Ubuntu 14.04 Zeroed NIC Interface problem

During activation (license management) of Vivado HLx 2016.2 I encountered a problem: I could copy the license (received via email), but when viewing the license information, I couldn’t find my license.

I checked the host information and found that the Network Interface Card ID was zeroed.

I think a possible reason is that at some point Ubuntu adopted Red Hat’s convention for NIC names. The NIC names used to be ethX, but now they are emX, because the NICs I use are embedded on the mainboard.

I found that the module that changes the names is biosdevname, and it can be disabled by setting biosdevname=0 in the Linux boot parameters.

A Stack Overflow Q&A discusses this problem and gives two ways to do it.

I decided to try the grub option.
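
For the record, a sketch of what the grub option looks like in practice. I am assuming the usual Ubuntu layout here, where kernel parameters live in the GRUB_CMDLINE_LINUX line of /etc/default/grub. The helper name add_kernel_param is my own, and it assumes the parameter contains no "/" characters.

```shell
# add_kernel_param FILE PARAM
# Hypothetical helper: append PARAM to the GRUB_CMDLINE_LINUX="..." line of FILE.
# A backup of the original is kept as FILE.bak. Does nothing if PARAM is already there.
add_kernel_param() {
    file="$1"
    param="$2"
    if ! grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        echo "no GRUB_CMDLINE_LINUX line in $file" >&2
        return 1
    fi
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # First expression handles an empty value, the second appends to an existing one;
    # "t" skips the second expression when the first one already matched.
    sed -i.bak \
        -e "s/^GRUB_CMDLINE_LINUX=\"\"/GRUB_CMDLINE_LINUX=\"$param\"/" \
        -e "t" \
        -e "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 $param\"/" \
        "$file"
}
```

On the real system this would be run as root as add_kernel_param /etc/default/grub biosdevname=0, followed by update-grub and a reboot.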

If you use a static IP setting in /etc/network/interfaces, don’t forget to update the interface names there before rebooting. (Otherwise you won’t be able to log on, or the boot will be delayed by 60 seconds while the system tries to get connectivity.)

This issue is also reported on the Xilinx Forums, but I’m not sure whether a fix is in progress.



So there are a few Wooribank branches in Beijing, and the closest one, I think, is the Sanyuanqiao sub-branch (삼원교지행). Damn Korean sites without direct links!

Anyway, Wooribank is written 友利銀行.

[Map: Wooribank branches in Beijing]

So we are going to go to Number 7.

We take subway Line 10 clockwise, get off at 三元桥 station (Exit C2), and walk.

I’ll open an account, deposit my money, and then probably withdraw it in Korea.

The good thing is that Sanyuanqiao station is connected to the airport via the express way, so I think I can go from the hotel to Sanyuanqiao and then straight on to the airport.


Prof. Steve Keckler Talk

Architectures for Deep Neural Networks

DNN is here to stay because of the big data that is available. Lots of data to analyze and run from. Also the computational ability of current systems.

Accuracy has increased drastically due to DNNs.

Now is the time for computer architects to use their creativity to improve performance, instead of relying on process technology to improve performance.

How do DNNs work?

Training & inference

  • Training -> Large amount of data. (Back-propagate to learn)
  • Inference -> Smaller varied number of incoming data

A lot of linear algebra! Output = Weight * Input

Learn the weights using back propagation

Locality: Batching! Multiple images at a time. Input matrices & output matrices (single images are vectors)

Convolutional stage and fully connected stage

Convolutional stages act as trained feature detection stage

6D loop! Dot product accumulator within the image.

  • Run a larger network (deeper: 100s~1000s of levels)
  • Run network faster
  • Run network more efficiently

Limit of memory capacity (Bandwidth & capacity tradeoff!)

vDNN: Virtualizing training!

Trend is larger and deeper neural networks

Recurrent neural networks! (Temporal factor??) Audio recognition

Training needs to keep each layer’s output activations in memory so that they can be used for backpropagation!

Volume of that data is very large! GPU memory usage is proportional to the number of layers

Computing set of gradients!

We don’t need data of the first layers until the training goes to the end, then back propagate! These data take up a lot of space! Thus offload to the DRAM (more capacity)

(forward and backward propagation)

Pipeline Writeback and prefetching

Allows training much larger networks! Incurs little overhead relative to hypothetical large-memory GPUs; the overhead will drop with faster CPU & GPU links.

Optimizing DNNs for Inference

Accessing DRAM is far more expensive than computing. But then to have more accuracy we still need large networks!


  • Reduce numerical precision: FP32->FP16
  • Prune networks: Lots of zeros as weights! Thus remove these unnecessary weights. Also share weights among other elements in the weight matrix
  • Share weights
  • Compress network
  • Re-architect network

Importance of staying local! LPDDR 640 pJ/word, SRAM (MB) 50 pJ/word -> SRAM (KB) 5 pJ/word

Cost of operations

Accumulate at greater precision than your multiplications!



Lots of the weights are zeros! Prune unnecessary weights!

This allows reducing the network dramatically. You can also prune weights that are close to 0. The other weights can be used to recover from pruning! Retraining is important

Prune up to 90% of weights, but iterative retraining keeps the accuracy pretty high

Pruning can be done aggressively

Pruning gives a factor of 3x performance improvement


Weight Sharing

Similar weights can be clustered, and simple bit indexes then point to fine-tuned centroids

Retrain to account for the change as well!

We can reduce weight table significantly!

Huffman encoding to further improve! up to 35x~49x reduction!

Rearchitect networks!

SqueezeNet -> Reduce filters! Combination of 1×1 and 3×3 filters? This really shrinks the networks down!

DNN Inference Accelerator Architectures

Eyeriss: ISSCC 2016 & ISCA 2016

Implications of data flow in DNNs

Do we keep filters stationary or input stationary, etc.

Maximize reuse of data within PE

Data compression! Non-linear functions are applied between activations. Used to be sigmoid. Now it is ReLU! (Negative values become 0)

Lots of negative values are calculated!

Reduce data footprints… Compress them before moving to off-chip memory.

Saves volume of transferred data

If there are multiplications by 0, then just disable the rest of the calculation, reduce the power!

Up to 25x more energy efficient than Jetson TK

EIE: Efficient Inference Engine (compressed activations)

All weights are compressed and pruned, weight sharing, zero skipping!

Sparse matrix engine, and skip zeros.


Special purpose processors can maximize energy efficiencies


ISCA Day 3


DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric

Low-power High density reconfigurable RAM.

Low power & high capacity FPGA is desirable, use DRAMs instead of SRAM lookup table to provide low power & high capacity

Challenges of building DRAM-based FPGA

LUT is slower than SRAM, destructive access (data lost after access)

Narrower MAT (DRAM array) from 1K -> 8~16 bits,

Destructive DRAM read is solved by PRE(charge), ACT(ivation), RST(restore),
followed by a wire transfer. These are sequential, but can be overlapped.
RST can be overlapped with the wire transfer, etc.

Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs

Resistive Memory Write:

Slower writes have endurance benefits (knife sharpening example)

Choose the write speed adaptively.

Make use of the idle memory(?) to slowly writeback

Bank-aware mellow writes: Choose banks with fewer blocks to write back in the wb queue of the memory controller. For those relatively free banks, issue slow writes.

Performance degradation is not noticeable, endurance improved by 87%

Eager Mellow Writes:

Predict that LLC dirty lines will not be dirtied again, and so writeback slowly to ReRAM

Does some epoch counting? to find cachelines…

Add an eager mellow write queue: lowest priority, but uses spare memory bandwidth to write back. Eagerly.

Eager writeback also improves performance as it reduces write queue congestion!!

Also employs a lifetime quota, where a lifetime is enforced.

Slower writes use more energy!

MITTS: Memory Inter-arrival Time Traffic Shaping

CPU initiated memory bandwidth provisioning.

IaaS can charge the users on memory bandwidth usage (and arrival time)

HW mechanism to provision memory BW (Bulky vs. Bursty bandwidth)

Relative memory inter-arrival time, make into a histogram.

Credits per interarrival time in bins. Thus if you use all your credit, you need to wait, and use the next inter-arrival time credit bin

Array of registers that represent credits in each bin. Also, replenisher to fill the bins

Reliability 2

All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory

Command/Address transfer failed to speed up with newer DRAM specs.

CA is more prone to errors than data! Due to DDR DIMM technology

CA-parity was introduced in DDR4.

Read address error! (Reads the wrong codeword! Data & data ECC are valid within themselves, but it is the wrong codeword!!)
Extended data ECC -> the address is also encoded into the ECC.

Write address errors are even more severe!

Lots of problems possible with command/address


GPUs 2

ActivePointers: Software Address Translation Layer on GPUs

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit


Agile Paging: Exceeding…

Virtualization has two problems:

Nested paging-> slower walks

Shadow paging-> fast walks, slow page table updates

Use shadow paging most of the time, and switch to a nested page walk partway through the walk when needed.


Nested paging is better if the address space is being frequently switched! However, only a fraction of the address space is dynamic.

Only a few entries of the page table are modified frequently!

Start page walk in shadow mode, optionally switch to nested mode.

One bit of the page table entry signifies that the lower levels require nested paging mode. (Do in nested mode!)

CASH: Supporting IaaS Customers with a Sub-core Configurable Architecture

Sharing architecture -> for each 10-million-cycle execution, the maxima were different when the cache banks and # of slices (cores) were different.

Learning optimizer, has a feedback mechanism that goes through a Kalman filter(?)