Instrumentation with jtx1inst C Library

Posted on 2017-09-13 |


The Jetson TX1 module and its carrier board are each equipped with an INA3221 current and voltage monitor. The Jetson TX1 also has various temperature sensors placed in a total of eight thermal zones across the module and the SoC. In order to leverage the information output by these sensors, we have developed a custom API for accessing the two INA3221 monitors, on-board and on-module, as well as reading the values of the eight thermal zones.

Using the sysfs files /sys/class/thermal/thermal_zone*/type, we are able to retrieve information about the following zones: AO-therm, CPU-therm, GPU-therm, PLL-therm, PMIC-Die, Tdiode_tegra, Tboard_tegra and thermal-fan-est. Each zone ending in -therm provides information from sensors located on the SoC, while those ending in _tegra are located on the module. Throughout the experiments we will concentrate only on the temperature values available from the debugfs files located at /sys/kernel/debug/tegra_soctherm/, namely the CPU, GPU, memory I/O, and PLL temperatures.

We also use sysfs files to take voltage, current or power measurements from the carrier board’s 3-channel INA3221 monitor. Through the I2C address 0x42, we access the rails of the main carrier board power input VDD_MUX, the main carrier board 5V supply VDD_5V_IO_SYS and the main carrier board 3.3V supply VDD_3V3_SYS. At address 0x43, there are rails for the carrier board 3.3V sleep supply VDD_3V3_IO, the main carrier board 1.8V supply VDD_1V8_IO and the 3.3V supply for the M.2 Key E connector, VDD_M2_IN.

At the time of writing, there are no sysfs files for the module’s INA3221 monitor, so we access the information located at the I2C address 0x40, namely the main module power input VDD_IN, the GPU power rail VDD_GPU and the CPU power rail VDD_CPU, through custom userspace functions.

The sysfs files can also be employed for system-level control of Jetson TX1 performance. To control CPU performance, we can manually change the utilisation pattern of the four cores, either by enabling or disabling individual cores or by changing their operating frequency. We can also control system-level GPU performance by changing either its operating frequency or the operating rate of its memory. We encapsulate the aforementioned operations in an open-source C API and provide on-line documentation.

The current version of the API supplies, among others, the following functions:

jtx1_get_temp for reading the on-chip and on-module temperatures. It takes two arguments: the first is one of the zones indexed by the jtx1_tzone enumeration (see table 1), and the second is a reference to a variable that will store the temperature value read from the sensor specified by the first argument. The temperature is output in millidegrees Celsius.

Table 1: thermal zones indexed by the jtx1_tzone enumeration

thermal zone  description
A0            on-chip thermal zone
CPU           on-chip thermal zone
GPU           on-chip thermal zone
PLL           on-chip thermal zone
PMIC          on-chip thermal zone
TDIODE        on-module thermal zone
TBOARD        on-module thermal zone
FAN           on-chip thermal zone

jtx1_get_ina3221 for reading the on-board and on-module INA3221 values. The function currently uses sysfs files to access the on-board INA3221 sensor and userspace I2C to access the on-module INA3221 sensor, reading power, current and voltage information. It takes three arguments: the first is a rail indexed by the jtx1_rail enumeration; the second specifies the type of measurement, which can be a VOLTAGE, POWER or CURRENT value from the jtx1_rail_type enumeration (see table 2); and the third is a reference to the output, which is given in millivolts, milliwatts or milliamps depending on the setting of the second argument.

Table 2: power rails indexed by the jtx1_rail enumeration

rail           description
VDD_IN         main module power input
VDD_GPU        GPU power rail
VDD_CPU        CPU power rail
VDD_MUX        main carrier board power input
VDD_5V_IO_SYS  main carrier board 5V supply
VDD_3V3_SYS    main carrier board 3.3V supply
VDD_3V3_IO     carrier board 3.3V sleep supply
VDD_1V8_IO     main carrier board 1.8V supply
VDD_M2_IN      3.3V supply for M.2 Key E connector

jtx1_get_rate and jtx1_set_rate for getting and setting the operating frequency of the external memory controller (EMC), the graphics processing unit (GPU) or one of the four available CPU cores. As their first argument, both functions take one of the choices specified in the jtx1_unit enumeration (see table 3).

Table 3: units indexed by the jtx1_unit enumeration

unit       definition
EMC_RATE   external memory controller (EMC)
GPU_RATE   graphics processing unit (GPU)
CPU0_RATE  first core of the central processing unit (CPU)
CPU1_RATE  second CPU core
CPU2_RATE  third CPU core
CPU3_RATE  fourth CPU core

The API is provided in the form of a C library whose sources are stored in the project’s repository. In order to use them, one follows the standard build and installation process shown in listing 1.

mkdir build
cd build
cmake ..
make
sudo make install
sudo ldconfig

The jtx1inst library is released under a public-domain-style license; it can thus be copied and modified to satisfy the specific requirements of a custom project.

Common Deep Neural Network Components and Techniques

Posted on 2017-09-13 |


There are three main components of a neural network: a score function, which maps individual features to classes; a loss function, which quantifies the agreement between the predicted class and the ground-truth label; and, thirdly, the formulation and solution of the learning problem through optimisation, i.e., minimising the loss function with respect to the parameters of the score function.

Score Function

We are given a set of features in the form of pixel colours and the images containing them, $\boldsymbol{x}_{i} \in \mathcal{R}^{D}$, where each $i$th image $\boldsymbol{x}_{i}$ with $i \in \{1, \ldots, N\}$ also has an associated class $y_{i}$ with $y_{i} \in \{1, \ldots, M\}$; i.e., we have $N$ labelled images, each of $D$ dimensions, associated with $M$ distinct classes. A score function mapping each image’s pixels to class predictions is defined as $\boldsymbol{f}: \mathcal{R}^{D} \to \mathcal{R}^{M}$. The elementary example of such a function is $\boldsymbol{f}(\boldsymbol{x}_{i}; \boldsymbol{W}, \boldsymbol{b}) = \boldsymbol{W} \boldsymbol{x}_{i} + \boldsymbol{b}$, where $\boldsymbol{W} \in \mathcal{R}^{M \times D}$ and $\boldsymbol{b} \in \mathcal{R}^{M}$. Thus, a single matrix product simultaneously evaluates $M$ classifiers, where each classifier is a row of the matrix $\boldsymbol{W}$. If we also attach an activation function to a single row’s product, we effectively get what is called a linear discriminant, or perceptron, $h(\boldsymbol{w};\boldsymbol{x}) = \sigma\left(\boldsymbol{w}^{T}\boldsymbol{x}\right) = \sigma\left( w_{0} + \sum_{i=1}^{m} w_{i}x_{i} \right)$.

Loss Function

Once we have a scoring function, we need some way of measuring the error of its output predictions. Assuming $\sigma(z) = z$, an exemplary instance of such a loss function takes the form $E(\boldsymbol{w}) = \sum_{i=1}^{m} (y_{i} - h(\boldsymbol{w};\boldsymbol{x}_{i}))^{2} = \sum_{i=1}^{m} (y_{i} - \boldsymbol{w}^{T}\boldsymbol{x}_{i})^{2}$. A more practical example is the multiclass Support Vector Machine (SVM). The main idea behind the SVM loss is to require that the correct class scores higher than each incorrect one by a fixed margin $\Delta$. Given the score for the $k$th class, $\boldsymbol{s}_{k} = \boldsymbol{f}(\boldsymbol{x}_{i}; \boldsymbol{W}, \boldsymbol{b})_{k}$, the multiclass SVM loss for a single image takes the form $L_{i} = \sum_{k \neq y_{i}} \max (0, \boldsymbol{s}_{k} - \boldsymbol{s}_{y_{i}} + \Delta)$, and if we consider only linear score functions, this becomes $L_{i} = \sum_{k \neq y_{i}} \max (0, \boldsymbol{w}_{k}^{T} \boldsymbol{x}_{i} - \boldsymbol{w}_{y_{i}}^{T} \boldsymbol{x}_{i} + \Delta)$. Thus we accumulate loss whenever the score for the correct class is not larger than that of an incorrect one by at least $\Delta$.

Gradient Descent and Backpropagation

If the parameters $\boldsymbol{W}$ are set so that most of the predictions for all examples $\boldsymbol{x}_{i}$ agree with the ground truths $y_{i}$, the accumulated loss will be very low. Thus our next task is to find the optimal parameters $\boldsymbol{W}$.

We can define gradient descent based on the observation that if the multi-variable loss function $E(\boldsymbol{w})$ is differentiable in a neighbourhood of a point $\boldsymbol{w}_{t}$, then its value decreases fastest if we move in the direction of the negative gradient of $E$ at $\boldsymbol{w}_{t}$, i.e., $-\nabla E\left(\boldsymbol{w}_{t}\right)$. Thus, in order to update the weights we can use the following rule,

$$
\boldsymbol{w}_{t+1} = \boldsymbol{w}_{t} - \eta\frac{\partial E(\boldsymbol{w})}{\partial \boldsymbol{w}}\bigg|_{\boldsymbol{w}_{t}}
$$

This update can also be motivated by the first-order approximation $f(x + h) \approx f(x) + h \frac{\partial f}{\partial x}$, with the step $h$ taken against the gradient. The step size $\eta$, also called the learning rate, is one of the most important hyperparameters in training; there are various tuning strategies devised for it, which we describe in further sections of this report. Note also that there are kinks in our loss function due to the max operation. These seemingly make it non-differentiable; however, we can still use subgradients.

The loss functions we concern ourselves with are defined over high-dimensional spaces, i.e., the weight matrix $\boldsymbol{W}$ corresponds to a single point in that space, so we need efficient strategies for computing derivatives in this large space. Since it is not feasible to evaluate the loss over the entire dataset at every step, we take only a subset of the training samples and form batches for its evaluation. Examples from a dataset are typically correlated, so computing the gradient against only a batch is not a significant obstacle. The extreme case of using one single sample per update is called stochastic gradient descent (SGD). Because computation can be parallelised, it is much less computationally efficient to use just one example as in SGD. Altogether, although the batch size is a hyperparameter, it is not common to cross-validate it; the main constraint here is the capacity of the GPU’s memory.

Deep Convolutional Neural Networks

Deep convolutional neural networks are formed from three types of layers: the convolutional layer, the pooling layer and the fully-connected layer, the last of which has been described in section 1. The main drawbacks of using fully-connected layers on their own are that they do not scale well and are prone to overfitting. Extending a neural network architecture with convolutional layers mitigates the scalability issue by extracting features from the input images, providing a basis for the fully-connected layers, which are trained on this set of discerned features. In the following two sections, we detail the two additional layer types.

Convolution

Convolutional layers are formed from a set of dimensionally small, parametrised filters. During the forward pass, each filter is moved across the width and height of the input image volume. Each such operation generates a two-dimensional activation map, representing the response of the filter, whose parameters are trained during the backward pass so that it activates upon recognising a specific visual feature. The idea is then to have multiple layers of filtering volumes, where each layer learns features of increasing complexity. In our demonstrative setting, the outermost layer learns generic features and patterns such as lines and edges; the next layer learns compositions of these, i.e., corners and curves; going deeper in the hierarchy, we learn simple shapes, and so forth. The deepest convolutional layers provide activation maps for whole instances of an object.

Pooling

In order to reduce the number of parameters involved, and thus also the computational cost, we insert pooling layers in between convolutional layers. Their task is to reduce the spatial size of the representations by downsampling the input volumes.
