Parallel Computing and I/O Blog (https://blog.parcio.de/): We conduct research and development on parallel systems.
HDF5: Self-describing data in modern storage architectures. Timm Erxleben, 2022-08-02. https://blog.parcio.de/posts/2022/08/hdf5/
<p>In today’s post, we will discuss the advantages of self-describing data formats.
As a case study, we will examine the popular self-describing data format HDF5.
After a description of HDF5’s basic features and its data model, we will trace how support for modern storage architectures has developed over time.</p>
<p>To understand the advantages of self-describing data formats, we first need to understand what self-describing data formats are.
Taking the <a
class="gblog-markdown__link"
href="https://ops.aps.anl.gov/manuals/SDDStoolkit/SDDStoolkitse1.html"
>definition by Argonne National Laboratory</a>, self-describing data formats have the following two properties:</p>
<ol>
<li>The data is accessed by name and by class. Instead of reading 20 bytes starting at offset 1337, one would request to read the dataset named XYZ.</li>
<li>Various data attributes that may be necessary for interpretation are available. For example, data types, units, and file contents can be discovered by a user without prior knowledge.</li>
</ol>
<p>The first point is only possible if the data format is paired with a programming library to access it.
Otherwise, users would need prior knowledge to parse the file’s structure.
Another advantage is that the file format can be updated without dropping support for older applications because the data model is abstracted from the actual file layout.</p>
<div class="gblog-post__anchorwrap">
<h2 id="why-do-we-need-self-describing-data-formats">
Why do we need self-describing data formats?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#why-do-we-need-self-describing-data-formats" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Why do we need self-describing data formats?" href="#why-do-we-need-self-describing-data-formats">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As explained above, abstracting the data model from files is beneficial for the maintainability of code.
Nevertheless, there is more to self-describing data.</p>
<p>The history of self-describing formats started in the 1980s<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> as the amount of scientific data produced by simulations increased.
For global exchangeability of datasets, standards were needed to abstract from architecture-dependent data types and software-dependent storage layouts.
Take the following image as an example:</p>
<p><img
src="motivation.png"
alt="Motivation"
/></p>
<p>Imagine you receive the file on the left without information on how to interpret it.
You would have to invest some time before realizing that it contains an ASCII-encoded string.
Understanding more complex data (especially some architecture-dependent float data types) would be practically impossible without further hints.</p>
<p>However, the file on the right side contains the data type of the file content and even a comment describing its content.
Using this information, it is easy to read the file’s actual content, independent of the complexity of the data.
The ability to annotate data with units and comments further supports exchangeability.</p>
<p>Examples of self-describing formats for scientific data are HDF5 and NetCDF.
In this post, we will look at HDF5 as it is one of the most popular formats and has meanwhile become the basis of NetCDF4.</p>
<div class="gblog-post__anchorwrap">
<h2 id="basics-of-hdf5">
Basics of HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#basics-of-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Basics of HDF5" href="#basics-of-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>HDF5 offers a complex and feature-rich data model.
Files can be understood as containers that can hold many different types of data.</p>
<p>Take a simulation experiment as an example.
During the experiment, a lot of data is created: a model describing the problem, a mesh discretizing the simulated space, initial and boundary conditions, the solver in use, parameters to the solver, the time series of the solution, and some visualizations of the result.
For reproducibility, you want to keep track of all the metadata describing how you obtained your results.
Those heterogeneous but logically related datasets may be stored in the same HDF5 file.
In doing so, the data and metadata are guaranteed not to be separated by accident.
Whoever receives a copy of this file will fully understand the details of your simulation and will be able to reproduce your results.</p>
<p><img
src="highlevel.png"
alt="Conceptional example of an HDF5 file"
/></p>
<div class="gblog-post__anchorwrap">
<h3 id="so-how-is-data-modeled-in-hdf5">
So, how is data modeled in HDF5?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#so-how-is-data-modeled-in-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor So, how is data modeled in HDF5?" href="#so-how-is-data-modeled-in-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The most common objects in HDF5 files are groups, data types, dataspaces, datasets, and attributes.</p>
<p><strong>Groups</strong> act like directories in file systems by mapping names to objects.
Nesting groups creates a hierarchical namespace in which objects are identified by a path.
The same object can be part of multiple groups, either via hard or soft links.
Care must be taken not to create loops, as HDF5 does not prevent them.
Every file has a root group denoted as <code>/</code>.
One could say that HDF5 creates a file system within a file.</p>
<p>In addition to built-in <strong>data types</strong>, e.g., floats and integers of different flavors, users may define their own complex data types.
Apart from creating arrays of a particular data type or packing different data types into a compound, it is also possible to create new atomic data types.
The definition of a user-defined data type needs to be saved to the file so that it can be reread without prior knowledge.
This happens automatically when the data type is used, resulting in an unnamed (i.e., <em>transient</em>) data type, but it is also possible to name the data type and save it to a group (i.e., store it as a <em>committed</em> type).
Conversion functions can be registered and saved to the HDF5 file for user-defined atomic types.</p>
<p>In contrast to the POSIX data model, where a file is understood as a stream of bytes, elements contained in a dataset are addressed according to an associated <strong>dataspace</strong>.
The dataspace describes the number of dimensions and each dimension’s size and maximum size.
If the size and maximum size are not equal, the dataset can grow in that dimension.
This is especially useful for time series, which might grow when more data is collected.
Growth may be unbounded when the maximum size is set to infinity.
Unlike data types, dataspaces are always saved implicitly, i.e., they do not have a name.</p>
<p><strong>Datasets</strong> hold the actual data in HDF5 files.
Their most important properties are the dataspace describing their shape and the data type of their elements.
Nevertheless, datasets have many more settings and properties.
For example, the fill value for elements can be modified.
Reading from a new dataset that has not yet been written to will return the fill value.</p>
<p>Another vital setting controls whether datasets are stored contiguously or in chunks.
If the dataspace contains a dimension that allows growth, the dataset must be stored in chunks.
When more data is added, chunks can be appended without moving existing data.
The chunk size is set at the creation of the dataset.
While writing or reading, chunked data can be passed to a filter pipeline, transforming the data stream.
The most popular (and probably most useful) filter class is compression.
However, the user may define their own filter functions.</p>
<p>There is no limit to the size of datasets.
Nevertheless, it is neither practical nor always possible to write or read several terabytes at once.
To solve this problem, HDF5 provides fine-grained partial access.
For every write and read, a selection of the dataset’s dataspace needs to be passed.
This selection can be the original dataspace to access the whole set.
Basic selection types are point and hyperslab selections.
Point selections are created by supplying a list of coordinates that should be included.
Hyperslabs are regular patterns of arbitrarily sized blocks.
When dealing with large and dense matrices, hyperslabs can reflect the distribution of matrix parts to different clients.
Combining multiple selections using set operators provides an intuitive way to construct complex selections.</p>
<p><strong>Attributes</strong> are metadata objects that may be attached to all named objects except other attributes.
They are similar to datasets as they are named objects (i.e., are referred to by a path) and have a dataspace and a data type.
However, there are some key differences:</p>
<ul>
<li>They do not support partial I/O, so they need to be written/read at once.</li>
<li>They do not support chunked storage and are therefore of fixed size.</li>
<li>They do not support compression.</li>
<li>They are stored as part of the header of other objects inside the HDF5 file.</li>
</ul>
<p>Attributes not only explain the file’s content to users but also enable visualization or search tools to interact with the data based on its meaning.
Several domain-specific conventions exist for this purpose.
One of the most popular sets of conventions are the <a
class="gblog-markdown__link"
href="http://cfconventions.org/cf-conventions/cf-conventions.html"
><em>Climate and Forecast (CF) Conventions</em></a><sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.
If a file uses a specific set of conventions, it is automatically compatible with tools using the same conventions.</p>
<p>All relations between the objects explained above are summarized in the following diagram.
Please note that this is a simplified version to highlight the core concepts.</p>
<p><img
src="data-model.png"
alt="HDF5 data model"
/></p>
<p>Only a short introduction to HDF5’s features and concepts can be given in this post.
The nitty-gritty details of the concepts explained above, as well as additional features such as maps and tables, are left for the curious reader to explore<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="programming-with-hdf5">
Programming with HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#programming-with-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Programming with HDF5" href="#programming-with-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now that we know the basics of the HDF5 data model, let us look at the practical usage of HDF5.
<p>The library is shipped with C, C++, Fortran, and Java interfaces.
Apart from those, there are bindings for most popular programming languages, including Rust, Go, Python, Julia, and Matlab.
As most scientific software is written in C or Fortran, examples will be given in C.</p>
<p>The interface is grouped into several modules for a better overview:</p>
<ul>
<li><code>H5A</code> - Attributes</li>
<li><code>H5D</code> - Datasets</li>
<li><code>H5S</code> - Dataspaces</li>
<li><code>H5T</code> - Data types</li>
<li><code>H5F</code> - Files</li>
<li><code>H5G</code> - Groups</li>
<li><code>H5P</code> - Property Lists</li>
<li>etc.</li>
</ul>
<p>The general workflow is similar for all objects in HDF5.
First, objects are created or opened, returning a unique handle for that object.
Using the handle, objects can be manipulated.
When everything is done, the object needs to be closed.
The handle will then be invalid.</p>
<p>Let us make a short example of how to write a dataset and some attributes:</p>
<p>First, we need to create a file and a group.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create a file and a group
</span><span class="c1"></span><span class="n">hid_t</span> <span class="n">file_id</span> <span class="o">=</span> <span class="n">H5Fcreate</span><span class="p">(</span><span class="s">"solution.h5"</span><span class="p">,</span> <span class="n">H5F_ACC_TRUNC</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">group_id</span> <span class="o">=</span> <span class="n">H5Gcreate</span><span class="p">(</span><span class="n">file_id</span><span class="p">,</span> <span class="s">"important_data"</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>Though the code is mostly self-explanatory, you may have noticed some mysterious <code>H5P_DEFAULT</code> constants.
Those are default property lists.
Property lists contain many parameters controlling the fine details of operations and are manipulated using the <code>H5P</code> module.
Most functions accept several property lists for different purposes.
<code>H5Fcreate</code>, for example, takes a file creation property list and a file access property list.
We will later see how the file access property list is used to access files via specific HDF5 plugins.
In most cases, however, the defaults are sufficient.</p>
<p>After creating the file and a group, we should write some data.
To do so, we create a dataspace for a 3x3 matrix, which will be used to store <code>important_numbers</code>.
Using our new dataspace we create the dataset named <code>my_cool_data</code> in the group created above.
The data type for the numbers on the disk will be the native float type of the machine.</p>
<p>Everything is set to actually write the matrix to the file.
As explained above, the dataspace is again given for partial I/O.
As we pass the original dataspace, the whole matrix will be written.</p>
<p>In addition, <code>H5Dwrite</code> takes the data type and the dataspace again.
You may wonder why type and space need to be passed twice.
The reason is that the data in memory may have a different data type and shape than the data on disk; HDF5 converts between the two.
For example, it is possible to take only the main diagonal of a double-precision matrix from memory and write it to disk as a single-precision vector.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create and write to a dataset
</span><span class="c1"></span><span class="kt">float</span> <span class="n">important_numbers</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
<span class="p">{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
<span class="p">{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mf">42.42</span><span class="p">}};</span>
<span class="n">hsize_t</span> <span class="n">dims</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="n">hsize_t</span><span class="o">*</span> <span class="n">max_dims</span> <span class="o">=</span> <span class="n">dims</span><span class="p">;</span>
<span class="n">hid_t</span> <span class="n">space_matrix_id</span> <span class="o">=</span> <span class="n">H5Screate_simple</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dims</span><span class="p">,</span> <span class="n">max_dims</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">set_id</span> <span class="o">=</span> <span class="n">H5Dcreate</span><span class="p">(</span><span class="n">group_id</span><span class="p">,</span> <span class="s">"my_cool_data"</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Dwrite</span><span class="p">(</span><span class="n">set_id</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="o">&</span><span class="n">important_numbers</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>The following code snippet shows how to add metadata in the form of attributes to the file.
Writing attributes is mostly similar to writing datasets.
Nevertheless, as no partial I/O is supported for attributes, the write function takes no selection of a dataspace.</p>
<p>It is also shown how to use strings in HDF5.
The built-in type <code>H5T_C_S1</code> is copied, and its size is modified, because the default size is only one character.
To get a variable-length string instead, you can pass <code>H5T_VARIABLE</code> as the size.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create some attributes
</span><span class="c1"></span><span class="n">hid_t</span> <span class="n">space_scalar_id</span> <span class="o">=</span> <span class="n">H5Screate</span><span class="p">(</span><span class="n">H5S_SCALAR</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">mean</span> <span class="o">=</span> <span class="mf">42.05</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">content_description</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Contains a dataset with the answer to everything!"</span><span class="p">;</span>
<span class="n">hid_t</span> <span class="n">string_type</span> <span class="o">=</span> <span class="n">H5Tcopy</span><span class="p">(</span><span class="n">H5T_C_S1</span><span class="p">);</span>
<span class="n">H5Tset_size</span><span class="p">(</span><span class="n">string_type</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">content_description</span><span class="p">));</span>
<span class="n">hid_t</span> <span class="n">attr_group</span> <span class="o">=</span> <span class="n">H5Acreate</span><span class="p">(</span><span class="n">group_id</span><span class="p">,</span> <span class="s">"content"</span><span class="p">,</span> <span class="n">string_type</span><span class="p">,</span> <span class="n">space_scalar_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Awrite</span><span class="p">(</span><span class="n">attr_group</span><span class="p">,</span> <span class="n">string_type</span><span class="p">,</span> <span class="n">content_description</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">attr_set</span> <span class="o">=</span> <span class="n">H5Acreate</span><span class="p">(</span><span class="n">set_id</span><span class="p">,</span> <span class="s">"mean"</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_scalar_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Awrite</span><span class="p">(</span><span class="n">attr_set</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="o">&</span><span class="n">mean</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>At last, every opened object can be closed.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// close all objects
</span><span class="c1"></span><span class="n">H5Tclose</span><span class="p">(</span><span class="n">string_type</span><span class="p">);</span>
<span class="n">H5Dclose</span><span class="p">(</span><span class="n">set_id</span><span class="p">);</span>
<span class="n">H5Aclose</span><span class="p">(</span><span class="n">attr_group</span><span class="p">);</span>
<span class="n">H5Aclose</span><span class="p">(</span><span class="n">attr_set</span><span class="p">);</span>
<span class="n">H5Sclose</span><span class="p">(</span><span class="n">space_scalar_id</span><span class="p">);</span>
<span class="n">H5Sclose</span><span class="p">(</span><span class="n">space_matrix_id</span><span class="p">);</span>
<span class="n">H5Gclose</span><span class="p">(</span><span class="n">group_id</span><span class="p">);</span>
<span class="n">H5Fclose</span><span class="p">(</span><span class="n">file_id</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>Putting all those snippets together into a valid C program<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> and executing it yields the file <code>solution.h5</code>.
Using <code>h5dump</code> we can verify that the file indeed contains our data and metadata:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="gp">$ </span>h5dump solution.h5
<span class="go">HDF5 "solution.h5" {
</span><span class="go">GROUP "/" {
</span><span class="go"> GROUP "important_data" {
</span><span class="go"> ATTRIBUTE "content" {
</span><span class="go"> DATATYPE H5T_STRING {
</span><span class="go"> STRSIZE 50;
</span><span class="go"> STRPAD H5T_STR_NULLTERM;
</span><span class="go"> CSET H5T_CSET_ASCII;
</span><span class="go"> CTYPE H5T_C_S1;
</span><span class="go"> }
</span><span class="go"> DATASPACE SCALAR
</span><span class="go"> DATA {
</span><span class="go"> (0): "Contains a dataset with the answer to everything!"
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> DATASET "my_cool_data" {
</span><span class="go"> DATATYPE H5T_IEEE_F32LE
</span><span class="go"> DATASPACE SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
</span><span class="go"> DATA {
</span><span class="go"> (0,0): 42, 42, 42,
</span><span class="go"> (1,0): 42, 42, 42,
</span><span class="go"> (2,0): 42, 42, 42.42
</span><span class="go"> }
</span><span class="go"> ATTRIBUTE "mean" {
</span><span class="go"> DATATYPE H5T_IEEE_F32LE
</span><span class="go"> DATASPACE SCALAR
</span><span class="go"> DATA {
</span><span class="go"> (0): 42.05
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go">}
</span><span class="go">}
</span></code></pre></div><p>Examples for reads are omitted as they are conceptually similar to writes.
More examples of short HDF5 programs can be found <a
class="gblog-markdown__link"
href="http://web.mit.edu/fwtools_v3.1.0/www/Intro/IntroExamples.html"
>here</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="parallelism-in-hdf5">
Parallelism in HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#parallelism-in-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Parallelism in HDF5" href="#parallelism-in-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>All we have seen so far is how to write data using a single process on a single client.
In the context of HPC, parallel access to HDF5 files is necessary.
Otherwise, the I/O performance would be limited by the throughput of a single process on a single client.
Multiple approaches exist for parallel access.</p>
<p>The most straightforward way is to write one HDF5 file per process and “stitch” them together using external links in a central file.
Even though this approach is older than HDF5 itself, it gained further support with <em>Virtual Datasets</em> (VDS), added in release 1.10.
A VDS is an object which behaves similarly to a single dataset.
In reality, however, it is a mapping to other datasets that may be part of another file.</p>
<p>Nevertheless, using multiple files contradicts the idea of a single container holding all necessary data.
For truly parallel access to a single file, <em>Parallel HDF5</em> (PHDF5) was added in version 1.0.1, based on MPI-IO.
Files are accessed with PHDF5 by passing a modified file access property list containing a reference to an MPI communicator at open or create time:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="n">hid_t</span> <span class="n">plist_id</span> <span class="o">=</span> <span class="n">H5Pcreate</span><span class="p">(</span><span class="n">H5P_FILE_ACCESS</span><span class="p">);</span>
<span class="n">H5Pset_fapl_mpio</span><span class="p">(</span><span class="n">plist_id</span><span class="p">,</span> <span class="n">comm</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
<span class="n">H5Fopen</span><span class="p">(</span><span class="s">"my_file.h5"</span><span class="p">,</span> <span class="n">H5F_ACC_RDWR</span><span class="p">,</span> <span class="n">plist_id</span><span class="p">);</span>
</code></pre></div><p>Reads and writes are performed using the regular functions and appropriate dataspace selections.
Care must be taken regarding which operations are <em>collective</em> (i.e., all processes must participate) and which are <em>independent</em>.
All modifications of the file’s structural metadata, such as creating or linking objects, are always collective.
Reads and writes can either be collective or independent, which is controlled by the data transfer property list.
In most cases, collective I/O leads to higher throughput.</p>
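<p>As a hedged sketch (a code fragment, not a complete program: <code>dset_id</code>, the dataspace selections, and <code>buf</code> are assumed to come from a running PHDF5 program), requesting collective transfer for a write could look like this:</p>

```c
// Fragment: request collective I/O via a data transfer property list.
// dset_id, mem_space, file_space, and buf are assumed to exist already.
hid_t xfer_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, xfer_id, buf);
H5Pclose(xfer_id);
```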
<p>Despite its easy usage, achieving good performance with PHDF5 is hard.
I/O on a parallel distributed file system alone is a complex task where throughput is influenced by many factors.
The introduction of additional I/O layers further complicates I/O tuning<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="virtual-file-layer">
Virtual File Layer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#virtual-file-layer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Virtual File Layer" href="#virtual-file-layer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>For PHDF5, MPI-IO was added as an additional storage interface next to POSIX.
This gave rise to the idea of a plugin system for different storage backends.
Consequently, the structure of the HDF5 library was changed, and the <em>Virtual File Layer</em> (VFL) was introduced in version 1.4.
Instead of using POSIX or MPI-IO directly, all I/O calls are abstracted and passed to a <em>Virtual File Driver</em> (VFD).
The VFD, in turn, will map the linear address space of an HDF5 file to the address space of a storage backend.
VFDs are used by manipulating the file access property list and setting the respective driver, which must be registered beforehand.
For details on registering a VFD with the HDF5 library, please refer to HDF5’s documentation.
HDF5 provides several pre-defined VFDs.
Some interesting examples are:</p>
<ul>
<li><code>H5FD_CORE</code>: perform I/O to RAM</li>
<li><code>H5FD_SEC2</code>: default VFD using POSIX</li>
<li><code>H5FD_MPIIO</code>: parallel access via MPI-IO</li>
<li><code>HDF5_HDFS</code>: direct access to files in Hadoop Distributed File System</li>
<li><code>H5FD_ROS3</code>: direct read-only access to files stored in Amazon S3</li>
<li><code>H5FD_MULTI</code>: call different underlying VFDs depending on the address range accessed</li>
</ul>
<p>In addition, users can implement their own VFD to support their specific storage needs.
Currently, work is being done to enable the dynamic loading of VFD plugins at runtime.</p>
<div class="gblog-post__anchorwrap">
<h3 id="limits-of-vfds">
Limits of VFDs
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#limits-of-vfds" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Limits of VFDs" href="#limits-of-vfds">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>VFDs only abstract I/O calls (i.e., they only handle byte streams) and are therefore unaware of the HDF5 data model.
Though decisions can be made based on address ranges (e.g., as in <code>H5FD_MULTI</code>), the file’s structure cannot be changed to leverage features of modern storage technologies.
In practice, this approach excludes storage types that could (more or less) directly map the data model like, for example, <a
class="gblog-markdown__link"
href="https://github.com/daos-stack/daos"
>DAOS</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="new-architecture-and-virtual-object-layer">
New architecture and Virtual Object Layer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#new-architecture-and-virtual-object-layer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor New architecture and Virtual Object Layer" href="#new-architecture-and-virtual-object-layer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>To address this limitation of the VFD, the <em>Virtual Object Layer</em> (VOL) was introduced in version 1.12.
It provides another interface through which plugins can interact with HDF5.
Unlike the VFL, the VOL operates at the level of the data model abstraction and defines callbacks corresponding to the public HDF5 interface functions.</p>
<p>For the VOL’s implementation, the library was yet again restructured.
The default VOL plugin implements the HDF5 file format specification and uses the VFL to interact with storage backends.
The following figure summarizes the layers used in the library, together with some example VOL plugins that are not included with HDF5.</p>
<p><img
src="architecture.png"
alt="HDF5 architecture"
/></p>
<p>There are multiple ways to use VOL plugins.
The easiest is to set environment variables so that a plugin is loaded dynamically at program start.
However, just like VFDs, they can also be set via the file access property list.</p>
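<p>The environment variables in question are <code>HDF5_VOL_CONNECTOR</code> and <code>HDF5_PLUGIN_PATH</code>. A minimal sketch; the plugin path and connector name below are hypothetical placeholders for whatever your plugin provides:</p>

```shell
# Tell HDF5 where to find loadable plugins and which VOL connector to use.
# Both values are hypothetical; substitute your plugin's path and name.
export HDF5_PLUGIN_PATH=/path/to/vol/plugins
export HDF5_VOL_CONNECTOR="my_connector"
# Then run your HDF5 application as usual, e.g.:
# ./my_hdf5_application
```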
<p>The VOL enables interesting new possibilities.
For example, plugins can be stacked into a VOL chain.
Such passthrough connectors make it easy to trace I/O behavior.
Another use case is transforming data as it passes through the chain.</p>
<p>Arguably the most interesting use case of the VOL, however, is mapping HDF5 files onto modern storage backends in a more natural way.
Metadata, for example, might be separated and stored in a key-value store or database, while datasets might be stored in an object store.
This is the case for the two VOL plugins currently under development in the <a
class="gblog-markdown__link"
href="https://github.com/parcio/julea"
>JULEA storage framework</a>.
The goal is to make use of the enhanced query capabilities of those backends to speed up the analysis of data.
Another example is given by the <a
class="gblog-markdown__link"
href="https://github.com/HDFGroup/vol-daos"
>DAOS VOL plugin</a>, where the data model is mapped to the modern object store DAOS, which is designed for use with persistent RAM and NVMe SSDs.</p>
<p>In the current version 1.13, the VOL interface was changed based on the experience gained so far.
It will remain unstable until version 1.14, which is yet to be released.</p>
<div class="gblog-post__anchorwrap">
<h2 id="summary-and-conclusion">
Summary and conclusion
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#summary-and-conclusion" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Summary and conclusion" href="#summary-and-conclusion">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Self-describing data formats are essential standards for exchanging scientific data as they abstract technical details from the user and enable the annotation of data with important metadata such as units.
HDF5 offers a feature-rich data model based on groups, datasets, data types, and dataspaces.
We have seen how HDF5 changed to fulfill growing requirements on storage systems.
Based on the idea of exchangeable backends, the VFL was created.
With the introduction of the Virtual Object Layer, the actual HDF5 file format has become only one implementation of the HDF5 data model among many variants.
Classical files and file systems are thus being challenged, and new ways to model and access scientific data have emerged.</p>
<p>Of course, only the basics of HDF5 could be covered in this post, and many details had to be left out.
Because the VFL and VOL APIs are currently in flux, only their high-level concepts were featured.
If you would like to gain further insight and hands-on experience with VOL plugins, the <a
class="gblog-markdown__link"
href="https://www.hdfgroup.org/category/webinar/"
>webinars</a> offered by the HDF Group might be something for you.</p>
<div class="gblog-post__anchorwrap">
<h2 id="sources">
Sources
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#sources" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Sources" href="#sources">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>All information is taken from the HDF5 documentation and the <a
class="gblog-markdown__link"
href="http://web.mit.edu/fwtools_v3.1.0/www/ADGuide/HISTORY.txt"
>HDF5 changelog</a> if not stated otherwise.</p>
<p>The graphics were made using <a
class="gblog-markdown__link"
href="https://app.diagrams.net/"
>draw.io</a> and the <a
class="gblog-markdown__link"
href="https://commons.wikimedia.org/wiki/GNOME_Desktop_icons"
>Gnome desktop icons</a> which are licensed under the <a
class="gblog-markdown__link"
href="https://opensource.org/licenses/gpl-2.0.php"
>GPLv2</a>.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Development of <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Common_Data_Format"
>CDF</a> started in 1985. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Technically speaking, those conventions apply to the NetCDF self-describing data format. However, the naming of attributes can be transferred to HDF5 as done in the <a
class="gblog-markdown__link"
href="https://earthdata.nasa.gov/esdis/eso/standards-and-references/dataset-interoperability-recommendations-for-earth-science"
>Recommendations by NASA for Earth Science</a>. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>A good starting point is the <a
class="gblog-markdown__link"
href="https://docs.hdfgroup.org/hdf5/v1_12/_r_m.html"
>HDF5 documentation</a>. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The full code of the HDF5 example can be found <a
class="gblog-markdown__link"
href="hdf5-example.c"
>here</a>. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Further information on I/O tuning can be found in the <a
class="gblog-markdown__link"
href="https://confluence.hdfgroup.org/display/HDF5/Parallel+HDF5"
>HDF5 documentation</a>. <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Rust for Python developers: Using Rust to optimize your Python codehttps://blog.parcio.de/posts/2022/07/rust-for-python/David Hausmann2022-07-20T00:00:00+00:002022-07-20T00:00:00+00:00
<p>This post covers how to use Rust and PyO3 to optimize existing Python projects.
It will also give you a basic introduction to Rust on the way.</p>
<div class="gblog-post__anchorwrap">
<h2 id="example-program">
Example program
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#example-program" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Example program" href="#example-program">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The following Python program creates a simple visualisation of the Mandelbrot set using matplotlib.
It takes about 20s to finish on my machine.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="k">def</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">imag</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">max_iterations</span><span class="p">:</span><span class="nb">int</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="n">zr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">zi</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
<span class="n">new_zr</span> <span class="o">=</span> <span class="n">zr</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">zi</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">real</span>
<span class="n">zi</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">zr</span> <span class="o">*</span> <span class="n">zi</span> <span class="o">+</span> <span class="n">imag</span>
<span class="n">zr</span> <span class="o">=</span> <span class="n">new_zr</span>
<span class="k">if</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">zr</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">zi</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">></span> <span class="mi">2</span><span class="p">:</span>
<span class="k">return</span> <span class="n">i</span>
<span class="k">return</span> <span class="n">max_iterations</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
<span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre></div><p>Most of the calculation time is spent in <code>simple_stability</code>, which makes it the performance-critical function of this program.
Any speedup we achieve for <code>simple_stability</code> will therefore have a large impact on the overall performance of our program.
With that in mind, let’s try translating this function into Rust.</p>
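<p>This claim can be checked with Python’s built-in profiler. The sketch below profiles a stripped-down copy of <code>simple_stability</code> over a much smaller grid so that it finishes quickly:</p>

```python
import cProfile
import io
import math
import pstats

def simple_stability(real, imag, max_iterations=100):
    zr = zi = 0.0
    for i in range(max_iterations):
        new_zr = zr ** 2 - zi ** 2 + real
        zi = 2 * zr * zi + imag
        zr = new_zr
        if math.sqrt(zr ** 2 + zi ** 2) > 2:
            return i
    return max_iterations

profiler = cProfile.Profile()
profiler.enable()
# A reduced 50x50 grid instead of 1000x1000, purely for a quick measurement.
for y in range(50):
    for x in range(50):
        simple_stability(x / 25 - 2, y / 25 - 2)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```

<p>In the output, <code>simple_stability</code> dominates the cumulative time, confirming where optimization effort should go.</p>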
<div class="gblog-post__anchorwrap">
<h2 id="first-steps-in-rust">
First steps in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#first-steps-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor First steps in Rust" href="#first-steps-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Rust is a compiled language, unlike Python, which is interpreted.
This means that we can’t just start writing <code>.rs</code> files and run them from the console (or IDE).
We have to compile them first.</p>
<p>Rust has an excellent tool called Cargo that takes care of all our compilation and dependency management needs.
To create a new <em>crate</em>, that is, a new Rust project using Cargo, run <code>cargo new --lib mandelbrot_module</code> in the directory of your choice.
(Install Rust and Cargo if you have not done so already.)
The contents of your new directory should look something like this:</p>
<pre tabindex="0"><code>mandelbrot_module/
├─ src/
│ ├─ lib.rs
├─ .gitignore
├─ Cargo.toml
</code></pre><p>This is the standard structure for all Rust crates.
<code>src</code> is where all our source code will be stored and Cargo requires a specific name for our main file.
If we were trying to write an executable, our main file would be <code>src/main.rs</code> and the execution of our compiled program would start in the <code>main</code> function of that file.
Since we want to write a library/module, our main file is going to be <code>lib.rs</code> and everything we might want to use from our library after compilation needs to be available from this file.</p>
<p>Since Cargo already wrote some test code into our <code>lib.rs</code>, let’s run it to see that everything works.
To do this, run <code>cargo test</code> anywhere within the main directory of the crate.</p>
<p>Test code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[cfg(test)]</span><span class="w">
</span><span class="w"></span><span class="k">mod</span> <span class="nn">tests</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="cp">#[test]</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">it_works</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">assert_eq!</span><span class="p">(</span><span class="mi">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Expected console output:</p>
<pre tabindex="0"><code>running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Doc-tests mandelbrot_module
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre><p>You should now have a <code>target</code> directory in your crate.
This directory contains all the files that get created during compilation, but we don’t actually need it for this project.</p>
<p>We do however need to add PyO3 to our crate’s dependencies before we can start using it, so let’s do that now.
Adding dependencies to a crate is normally pretty simple.
You just have to write the dependency name and version number under <code>[dependencies]</code> in your <code>Cargo.toml</code> file like this:</p>
<pre tabindex="0"><code>[dependencies]
threadpool = "1.8.1"
</code></pre><p>But PyO3 needs some extra configuration which I won’t explain in this post.
Just paste the following into your <code>Cargo.toml</code> file:</p>
<pre tabindex="0"><code>[package]
name = "mandelbrot_module"
version = "0.1.0"
edition = "2018"
[lib]
name = "mandelbrot_module"
crate-type = ["cdylib"]
[dependencies.pyo3]
version = "0.15.1"
features = ["extension-module"]
</code></pre><p>With that done, let’s write some actual Rust code in <code>lib.rs</code>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="rust-functions">
Rust functions
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#rust-functions" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Rust functions" href="#rust-functions">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>We’re going to start by just writing the function how you would in a pure Rust program.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">max_iterations</span>:<span class="kt">usize</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">usize</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="k">f64</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="k">f64</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">max_iterations</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zr</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">zi</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_zr</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">zr</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">zi</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">max_iterations</span><span class="p">;</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Let’s first look at the function declaration which looks pretty similar to its Python counterpart.
Rust uses the <code>fn</code> keyword instead of <code>def</code> to declare functions.
It also uses different names for its types.
<code>f64</code> is a 64-bit float, which is equivalent to Python floats and C’s double type.
32-bit floats are <code>f32</code>.
Integers in Rust use a similar naming scheme.
The <code>u</code> in <code>usize</code> tells us that we’re dealing with an unsigned integer.
The <code>size</code> means that the width of our integer matches the platform’s pointer size, so this type is equivalent to <code>u64</code> on a 64-bit system and to <code>u32</code> on a 32-bit system.
If we also wanted negative integers, we could use <code>isize</code>; the same naming scheme applies to the signed <code>i</code>-types.
There are also integer types that fit into a single byte, <code>i8</code> and <code>u8</code>.
Choosing a smaller type can make a big difference in your program’s memory usage and even its performance, which is one reason Rust takes typing so seriously.
While type annotations in function declarations are only recommended and not mandatory in Python, they are mandatory in Rust.
In fact, the type of every variable has to be known at compile time or Rust simply won’t compile your code.
This may sound like a lot of type annotations, but the compiler does a great job at inferring a variable’s type most of the time.
Note also that Rust functions do not support optional arguments, so we always have to specify <code>max_iterations</code> with our new function.</p>
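<p>The platform dependence of <code>usize</code> mirrors C’s <code>size_t</code>, and the corresponding byte sizes can be inspected from Python with the standard <code>struct</code> module. A small illustrative sketch:</p>

```python
import struct

# Fixed-width Rust types have fixed sizes; usize follows the platform's
# pointer width, like C's size_t.
print(struct.calcsize("B"))  # u8:  always 1 byte
print(struct.calcsize("q"))  # i64: always 8 bytes
print(struct.calcsize("N"))  # size_t (usize): 8 on a 64-bit platform
```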
<p>Let’s take a look at the declaration of <code>zr</code> and <code>zi</code> now.
They’re both <code>f64</code> as can be inferred from the right-hand side of their declaration.
The <code>let</code> keyword is used to declare a new variable and the <code>mut</code> keyword specifies that this variable is mutable.
Variables declared without <code>mut</code> are immutable.
This might seem weird at first, but it actually makes the code more readable by telling you which values will change throughout this function’s runtime.</p>
<p>The rest of this code looks remarkably similar to its Python equivalent, with the exception that Rust has no power operator and that <code>sqrt</code> is a method of the float types instead of being an import from the <code>math</code> module.</p>
<div class="gblog-post__anchorwrap">
<h2 id="using-pyo3">
Using PyO3
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#using-pyo3" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Using PyO3" href="#using-pyo3">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>We just need to make this function accessible in a Python module now.
First of all, let’s import <code>pyo3</code> using <code>use pyo3::prelude::*;</code>.
Importing external crates in Rust is done via the <code>use</code> keyword, and <code>::</code> separates namespaces (paths).
The <code>prelude</code> namespace is a Rust convention and bundles most of the functionality you’d need from a crate.
We import everything from the prelude with the <code>*</code> glob, much like <code>from module import *</code> would in Python.</p>
<p>Every function we want to include in our final Python module needs to be annotated with <code>#[pyfunction]</code>.
This is a <em>macro</em> that will make some changes to our code during compilation to make it compatible with Python.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pyfunction]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">max_iterations</span>:<span class="kt">usize</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">usize</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>It’s not always this simple, though, because some Rust types can’t be converted to and from Python types.
A list of the Rust types that implement <code>IntoPy</code>, and are therefore valid return types for PyO3 pyfunctions (argument types implement the complementary <code>FromPyObject</code> trait), can be found <a
class="gblog-markdown__link"
href="https://docs.rs/pyo3/latest/pyo3/conversion/trait.IntoPy.html"
>here</a>.</p>
<p>The last thing we need before compilation is a piece of boilerplate code to assemble our module.
Copy and paste the following at the end of your <code>lib.rs</code> file.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pymodule]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">mandelbrot_module</span><span class="p">(</span><span class="n">_py</span>: <span class="nc">Python</span><span class="p">,</span><span class="w"> </span><span class="n">m</span>: <span class="kp">&</span><span class="nc">PyModule</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">PyResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_function</span><span class="p">(</span><span class="n">wrap_pyfunction</span><span class="o">!</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="o">?</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(())</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>The <code>#[pymodule]</code> macro stitches our Python module together from the function we attach it to.
It’s important that your module has the same name as this function or Python won’t be able to find it.
The code for adding a function is a bit involved, and you don’t really need to understand what’s going on here.
To add another function to this module, just add another <code>m.add_function(...)</code> line and replace <code>simple_stability</code> with the name of your function.</p>
<p>We can now finally build our module and try using it in our Python program.
There are multiple ways of going about this, but we are going to use maturin in this post.
(Have a look at <a
class="gblog-markdown__link"
href="https://pyo3.rs/latest/building_and_distribution.html#manual-builds"
>https://pyo3.rs/latest/building_and_distribution.html#manual-builds</a> if maturin doesn’t suit your needs.)</p>
<p>To use maturin, we first need to create a virtual environment in our <code>mandelbrot_module</code> directory and then install and run maturin in said virtual environment.</p>
<pre tabindex="0"><code># Windows; on Linux/macOS use: python3 -m venv .env, then source .env/bin/activate
$ py -m venv .env
$ ./.env/Scripts/activate
$ pip install maturin
$ maturin develop
</code></pre><p>You should now see some build output in your console while maturin compiles your module.
And it should finish with:</p>
<pre tabindex="0"><code>🛠 Installed mandelbrot_module-0.1.0
</code></pre><p>Let’s confirm that our module actually works.
Copy the previous Python program into the <code>mandelbrot_module</code> directory and modify it so that it uses our new Rust module.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">mandelbrot_module</span> <span class="kn">import</span> <span class="n">simple_stability</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
<span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">main</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div><p>This new version of our program takes about 4.6s on my machine, which means we achieved a speedup of more than 400%!
This example is very simple and was specifically chosen to translate well into Rust, so our speedup is close to a best-case scenario. Still, it shows how powerful moving performance-critical tasks into Rust can be.</p>
<div class="gblog-post__anchorwrap">
<h2 id="writing-python-classes-in-rust">
Writing Python classes in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#writing-python-classes-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Writing Python classes in Rust" href="#writing-python-classes-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Your real-world code will most likely not be this simple.
You might, for instance, have many different functions that rely on one or two classes for some shared functionality.
In this case, you could translate such a class to improve your code’s performance.</p>
<p>We are going to implement a complex number class because our <code>simple_stability</code> function has been doing complex calculations all along.
<code>zr</code>, <code>zi</code>, <code>real</code> and <code>imag</code> are the real and imaginary components of two complex numbers <code>z</code> and <code>c</code>.
Our function iterates the formula
<link
rel="stylesheet"
href="/katex-e4de31b5.min.css"
/>
<script defer src="/js/katex-3c86c25a.bundle.min.js"></script>
<span class="gblog-katex ">
\(z(n+1) = z(n)^2 + c\)</span> with
<span class="gblog-katex ">
\(z(0) = 0 + 0i\)</span>.</p>
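<p>To make the connection explicit, here is a sketch of that iteration with separate real and imaginary parts. It computes the same kind of stability value, though the exact <code>simple_stability</code> implementation from earlier may differ in details.</p>

```rust
// Sketch (assumed, not the exact simple_stability from the post):
// iterate z(n+1) = z(n)^2 + c, starting at z(0) = 0 + 0i, and report
// how long z stays bounded. |z| > 2 means the sequence diverges.
fn stability(real: f64, imag: f64, max_iter: u32) -> f64 {
    let (mut zr, mut zi) = (0.0_f64, 0.0_f64); // z(0) = 0 + 0i
    for i in 0..max_iter {
        // (zr + zi*i)^2 + c, expanded into real and imaginary parts
        let new_zr = zr * zr - zi * zi + real;
        let new_zi = 2.0 * zr * zi + imag;
        zr = new_zr;
        zi = new_zi;
        if zr * zr + zi * zi > 4.0 {
            // |z|^2 > 4, i.e. |z| > 2: diverged after i steps
            return i as f64 / max_iter as f64;
        }
    }
    1.0 // stayed bounded: point is (likely) inside the Mandelbrot set
}

fn main() {
    assert_eq!(stability(0.0, 0.0, 100), 1.0); // 0 is in the set
    assert!(stability(2.0, 2.0, 100) < 0.01);  // diverges immediately
}
```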
<p>Let’s start with structs then, Rust’s rough equivalent to classes.
Here is the declaration for a complex number struct:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">struct</span> <span class="nc">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span>
<span class="p">}</span><span class="w">
</span></code></pre></div><p>Simply use the <code>struct</code> keyword followed by your struct’s name and declarations of the fields it stores, each with a name and a type.
We can now create objects of the type <code>Complex</code> with similar syntax.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">_example1</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">_origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="mf">0.0</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="mf">0.0</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Next up, we’re going to create an <code>impl</code> block to implement the methods we need for our calculations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">impl</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="nc">real</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="nc">imag</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">sub</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">mul</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_real</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_imag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="n">new_real</span><span class="p">,</span><span class="w"> </span><span class="n">new_imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">dist_from_origin</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">f64</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Notice that our methods use an uppercase <code>Self</code> and a lowercase <code>self</code>.
Lowercase <code>self</code> refers to the object that this method is called on just like in Python.
Uppercase <code>Self</code> is shorthand for the type that we’re implementing this method for.
So the <code>add</code> method takes an object of type <code>Complex</code> as an argument and also returns an object of type <code>Complex</code>.</p>
<p>Let’s try using these methods in some actual Rust code.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">complex_test</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span>::<span class="n">new</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span><span class="w"> </span><span class="mf">2.0</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span>::<span class="n">new</span><span class="p">(</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mf">2.0</span><span class="p">).</span><span class="n">add</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">.</span><span class="n">mul</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>If we try to compile this code we will get this error:</p>
<pre tabindex="0"><code>error[E0382]: use of moved value: `x`
--> src/lib.rs:82:23
|
80 | let x = Complex::new(1.0, 2.0);
| - move occurs because `x` has type `Complex`, which does not implement the `Copy` trait
81 | let y = Complex::new(-1.0, -2.0).add(x);
| - value moved here
82 | let z = y.mul(x);
| ^ value used here after move
</code></pre><p>This error is a result of Rust’s <em>ownership</em> rules I mentioned earlier.
So what is ownership?</p>
<div class="gblog-post__anchorwrap">
<h2 id="ownership-in-rust">
Ownership in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#ownership-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ownership in Rust" href="#ownership-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The basis of ownership is that every value has exactly one variable that <em>owns</em> it, and it gets automatically deallocated as soon as its owner variable leaves the current scope.
This enables Rust to have automatic deallocation without a garbage collector.</p>
<p>The following example code shows when values get dropped (that is, deallocated) in Rust and how ownership gets moved between two values.
The <code>DropMe</code> struct used in this example will print a message to the console as soon as its value gets dropped.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">drop_example</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">a</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">b</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">other_b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="c1">// takes ownership
</span><span class="c1"></span><span class="w"> </span><span class="c1">// other_b leaves scope here
</span><span class="c1"></span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"b has been dropped"</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"a drops after this"</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="c1">// a leaves scope here
</span><span class="c1"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>This will result in the following output:</p>
<pre tabindex="0"><code>dropping b
b has been dropped
a drops after this
dropping a
</code></pre><p>We can see that the ownership of the value we initially stored in <code>b</code> gets moved to <code>other_b</code>, which then leaves the inner scope delimited by <code>{}</code>.
This results in the value getting dropped and a message being written to the console.
After this, we print two more messages and then reach the end of the function.
At this point <code>a</code> leaves the current scope and its value also gets dropped.</p>
<p>It’s important to note that <code>b</code> becomes invalid after losing ownership of its value.
This is the reason for the error we just encountered.
We moved the value of <code>x</code> into the <code>add</code> function.
After this, <code>x</code> becomes invalid so we can’t use it again in the next line.</p>
<p>The reason we haven’t encountered this problem sooner is that primitive numerical values like floats and integers are so small that copying them is as fast as creating references to them.
They therefore simply get copied, and no ownership transfer takes place.
(This is the <code>Copy</code> trait the error message mentions.)</p>
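<p>A minimal sketch of this behavior (a hypothetical helper, not code from the post): since <code>f64</code> implements <code>Copy</code>, a function call duplicates the value instead of moving it.</p>

```rust
// f64 implements Copy, so passing x to a function copies it;
// the original variable stays valid afterwards.
fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    let x = 3.0;
    let y = square(x); // x is copied into the function, not moved
    println!("{} {}", x, y); // x is still usable — no "use of moved value" error
    assert_eq!(y, 9.0);
}
```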
<p>Giving up ownership to every function we call is obviously a problem, because we usually want to reuse our values afterwards.
We could simply copy our values before moving them into a function, but this quickly gets expensive with bigger structs.</p>
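<p>The explicit-copy workaround can be sketched like this: deriving <code>Clone</code> gives a struct a <code>.clone()</code> method we can call before handing the value away. The struct here mirrors our <code>Complex</code>, but this is an illustration, not the code we will actually end up using.</p>

```rust
// Deriving Clone gives the struct an explicit .clone() method.
// (Deriving Copy as well would make copies implicit, like for f64.)
#[derive(Clone)]
struct Complex {
    real: f64,
    imag: f64,
}

fn consume(c: Complex) -> f64 {
    c.real + c.imag // takes ownership of its argument
}

fn main() {
    let x = Complex { real: 1.0, imag: 2.0 };
    let sum = consume(x.clone()); // pass a copy, keep x
    assert_eq!(sum, 3.0);
    assert_eq!(x.real, 1.0); // x is still valid here
}
```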
<p>Instead, Rust has another system called borrowing.
Borrowing a value lets us create a reference to it without taking ownership of it.
While references exist, the owner’s use of the value is restricted (for example, it cannot be moved or mutated) until all references are dropped.</p>
<p>There are two types of references in Rust: immutable <code>&</code> references, which give read-only access to the value they reference, and mutable <code>&mut</code> references, which let you modify the referenced value.</p>
<p>You can either have arbitrarily many immutable references or only one mutable reference to a single value at any given point in time.
This is to ensure that there is never more than one variable in your program that can modify a given value, which prevents a lot of tricky errors and data races.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">foo</span><span class="p">(</span><span class="n">x</span>: <span class="nc">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">foo_immut</span><span class="p">(</span><span class="n">x</span>: <span class="kp">&</span><span class="nc">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo_immut {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">foo_mut</span><span class="p">(</span><span class="n">x</span>: <span class="kp">&</span><span class="nc">mut</span><span class="w"> </span><span class="n">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo_mut {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="sc">'m'</span><span class="p">;</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">borrowing_example</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">a</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">b</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">c</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="n">foo</span><span class="p">(</span><span class="n">a</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">foo_immut</span><span class="p">(</span><span class="o">&</span><span class="n">b</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">foo_mut</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="n">c</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"end."</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>This outputs:</p>
<pre tabindex="0"><code>foo a
dropping a
foo_immut b
foo_mut c
end.
dropping m
dropping b
</code></pre><p>You can see that <code>a</code> gets dropped as soon as <code>foo</code> finishes, because it takes ownership of its arguments.
The other two values only get dropped at the end of the main example function because their functions did not take ownership.
(You can also see that Rust drops values in the opposite order they were created in to not break any possible dependencies between them.)</p>
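<p>The exclusivity rule from above can also be seen directly in a small sketch (a hypothetical snippet, not from the post): several shared references may coexist, while a mutable reference must be the only live one.</p>

```rust
fn main() {
    let mut v = vec![1, 2, 3];

    let r1 = &v;
    let r2 = &v; // any number of immutable borrows is fine
    println!("{} {}", r1[0], r2[1]);

    // r1 and r2 are no longer used, so a mutable borrow is now allowed;
    // while m is live, no other reference to v may exist.
    let m = &mut v;
    m.push(4);

    assert_eq!(v.len(), 4); // the owner is usable again after m's last use
}
```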
<p>We can simply change all the arguments of our class methods to immutable references, because we don’t need to modify them.
This step is also necessary to make our methods compatible with PyO3: Rust can’t take ownership of Python values, because ownership doesn’t exist in Python.
We therefore have to either copy our method arguments or take references to them.</p>
<p>After adding the references and the necessary PyO3 macros, our code looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pyclass]</span><span class="w">
</span><span class="w"></span><span class="k">struct</span> <span class="nc">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span>
<span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="cp">#[pymethods]</span><span class="w">
</span><span class="w"></span><span class="k">impl</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="cp">#[new]</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="nc">real</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="nc">imag</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">sub</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">mul</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_real</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_imag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="n">new_real</span><span class="p">,</span><span class="w"> </span><span class="n">new_imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">dist_from_origin</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">f64</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p><code>#[pyclass]</code> and <code>#[pymethods]</code> perform the usual PyO3 magic of making our code compatible with Python.
<code>#[new]</code> designates our <code>new</code> method as our class constructor, meaning it will be called if we try to create a new <code>Complex</code> object from Python.</p>
<p>We then add our new class to our Python module:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pymodule]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">mandelbrot_module</span><span class="p">(</span><span class="n">_py</span>: <span class="nc">Python</span><span class="p">,</span><span class="w"> </span><span class="n">m</span>: <span class="kp">&</span><span class="nc">PyModule</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">PyResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_function</span><span class="p">(</span><span class="n">wrap_pyfunction</span><span class="o">!</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="o">?</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_class</span>::<span class="o"><</span><span class="n">Complex</span><span class="o">></span><span class="p">()</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(())</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Once again, you don’t need to understand what’s going on here.
Just copy and paste the <code>m.add_class(...)</code> line and replace <code>Complex</code> with the name you gave your struct.</p>
<p>Finally, we run <code>maturin develop</code> once again and integrate our new class into our example Python program.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">mandelbrot_module</span> <span class="kn">import</span> <span class="n">Complex</span>
<span class="k">def</span> <span class="nf">complex_stability</span><span class="p">(</span><span class="n">real</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">imag</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">max_iterations</span><span class="p">:</span><span class="nb">int</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">Complex</span><span class="p">(</span><span class="n">real</span><span class="p">,</span> <span class="n">imag</span><span class="p">)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">Complex</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">z</span><span class="o">.</span><span class="n">mul</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">z</span><span class="o">.</span><span class="n">dist_from_origin</span><span class="p">()</span> <span class="o">></span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">max_iterations</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
        <span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
            <span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">complex_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
        <span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div><p>This iteration of our program is actually much slower, taking about 2 minutes.
This is probably because we spend a lot of time switching between Rust and Python and creating new <code>Complex</code> objects, while the original program just ran plain floating-point operations, which have presumably already been heavily optimised in C.</p>
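<p>The shape of that overhead can be made tangible with a pure-Python stand-in (an illustrative sketch only; the class below mimics our Rust <code>Complex</code>, it is not PyO3 code):</p>

```python
# Pure-Python stand-in illustrating per-step overhead: the object
# version allocates a new object and dispatches a method on every
# iteration, the float version does bare arithmetic.
class PyComplex:
    def __init__(self, real, imag):
        self.real = real
        self.imag = imag

    def add(self, other):
        # every call allocates a fresh object
        return PyComplex(self.real + other.real, self.imag + other.imag)

def with_objects(steps=1000):
    z = PyComplex(0.0, 0.0)
    for _ in range(steps):
        z = z.add(PyComplex(1.0, 1.0))
    return z.real

def with_floats(steps=1000):
    zr = zi = 0.0
    for _ in range(steps):
        zr, zi = zr + 1.0, zi + 1.0
    return zr
```

<p>Both functions compute the same value, but the object version pays for two allocations and a method dispatch per step; with PyO3, each of those calls additionally crosses the Python/Rust boundary.</p>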
<p>I can say from experience though that translating bigger classes with more involved methods can significantly speed up your programs.</p>
<p>This concludes our Rust tutorial for Python programmers.
I hope that this post has sparked your interest in Rust and has given you ideas on how to use it in your existing projects.
If you want to learn more about Rust, check out the <a
class="gblog-markdown__link"
href="https://doc.rust-lang.org/stable/book/"
>Rust book</a>.
If you want to learn more about PyO3, check out its <a
class="gblog-markdown__link"
href="https://pyo3.rs/v0.15.1/"
>official user guide</a>.
The code for this post and the project it was based on can be found on GitHub:</p>
<ol>
<li><a
class="gblog-markdown__link"
href="https://github.com/DrunkJon/Rust-for-Python-Example"
>https://github.com/DrunkJon/Rust-for-Python-Example</a></li>
<li><a
class="gblog-markdown__link"
href="https://github.com/DrunkJon/MandelbrotViewer"
>https://github.com/DrunkJon/MandelbrotViewer</a></li>
</ol>
Clang/LLVM overviewhttps://blog.parcio.de/posts/2022/07/clang-llvm-overview/Hannes Winkler2022-07-04T00:00:00+00:002022-07-04T00:00:00+00:00
<p>Compilers are complex programs with complex requirements.
The two most widespread C compilers, GCC and Clang/LLVM, are <strong>10–15 million</strong> lines of code behemoths, designed to produce optimal machine code for whatever arbitrary target the user desires.
In this blog post I’m going to give an overview of how the Clang/LLVM C compiler works, from processing the source code to writing native binaries.</p>
<div class="gblog-post__anchorwrap">
<h1 id="1-introduction">
1. Introduction
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#1-introduction" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 1. Introduction" href="#1-introduction">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<p>First of all, what is a compiler?</p>
<p>A computer (or CPU, rather) executes binary <strong>machine code</strong>.
The human-readable form of machine code is called <strong>assembly code</strong>.
However, assembly code is very low-level and very unnatural to write for humans.
So we write our programs in higher-level programming languages like <em>C</em>, <em>C++</em>, <em>Rust</em>, etc. instead and let a compiler translate that source code into machine code.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_1800x0_resize_box_3.png"
alt="Compiler terminology"
/>
</picture>
</a>
<figcaption>
Compiler terminology
(<a
class="gblog-markdown__link"
href="https://cs.lmu.edu/~ray/images/staticcompilation.png"
>Ray Toal, Intro to Compilers</a> (edited) (License: unknown))
</figcaption>
</figure>
</div>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_1800x0_resize_box_3.png"
alt="Example compilation (C source code to x86-64 assembly)"
/>
</picture>
</a>
<figcaption>
Example compilation (C source code to x86-64 assembly)
</figcaption>
</figure>
</div>
</p>
<p>There are many more compilers than GCC and Clang, for a wide variety of programming languages.</p>
<p>One can distinguish between two kinds of compilers:</p>
<ol>
<li><strong>AOT (ahead-of-time) compilers.</strong>
These are compilers where all of the source code is compiled to target code before the program is run.
For example, basically every C/C++ compiler is an AOT compiler.</li>
<li><strong>JIT (just-in-time) compilers.</strong>
JIT compilers compile code even <em>while</em> the program is running.
Examples: <a
class="gblog-markdown__link"
href="https://v8.dev/"
>Chromium’s JavaScript engine</a>, <a
class="gblog-markdown__link"
href="https://dart.dev/overview"
>Dart</a>, <a
class="gblog-markdown__link"
href="https://luajit.org/"
>LuaJIT</a></li>
</ol>
<p>At first, this sounds like JIT compilation is a lot slower than AOT compilation, but that’s not necessarily true.
JIT compilers have more information about the machine/CPU they’re targeting and can take that into account when compiling.
AOT compilers, on the other hand, mostly produce code for the “lowest common denominator”<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> unless you explicitly tell them which target to tune the code for.
So even if you have the very latest Intel 12th-generation CPU with the very latest feature set, your compiler will not make use of those features when targeting “just any x86-64 machine”.</p>
<p>Additionally, JIT compilers have more information about the runtime behaviour of the program.
For example, if the JIT compiler sees “oh, this function is only called with an integer argument greater than 128”, it can use that to optimize the function.
An AOT compiler can deduce some information too, but in most cases that information is just very hard to find out without running the program.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h1 id="2-overview-of-the-clangllvm-pipeline">
2. Overview of the Clang/LLVM pipeline
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#2-overview-of-the-clangllvm-pipeline" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 2. Overview of the Clang/LLVM pipeline" href="#2-overview-of-the-clangllvm-pipeline">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_1800x0_resize_box_3.png"
alt="Clang/LLVM pipeline"
/>
</picture>
</a>
<figcaption>
Clang/LLVM pipeline
</figcaption>
</figure>
</div>
<p>As you can see in the above picture, there are roughly three phases of compilation:</p>
<ol>
<li><strong>Frontend</strong>
<ul>
<li>In this step, all the source code is processed and an intermediate representation (IR) is generated.</li>
<li>In our case, <code>Clang</code> is the frontend and <code>LLVM</code> is the middle- and backend.</li>
<li>There are many LLVM frontends for many programming languages, Clang is just the one for C/C++.</li>
</ul>
</li>
<li><strong>Middle-end</strong>
<ul>
<li>The middle-end is one of the great features of LLVM.
In this phase, the IR is optimized.
Overall, most of the optimizations are done here.
The cool thing is that LLVM IR is completely universal; all frontends produce IR and all backends consume IR.
That way, if you write an optimization pass for the middle end, it’ll work for many languages and many target CPUs.</li>
</ul>
</li>
<li><strong>Backend</strong>
<ul>
<li>The backend will now consume the optimized IR and produce machine code, which (after linking) can be executed on the target machine.
The backend will also apply some machine-specific optimization passes.</li>
</ul>
</li>
</ol>
<p>I’ll now go a bit more into detail about how these 3 parts work.</p>
<div class="gblog-post__anchorwrap">
<h1 id="3-frontend">
3. Frontend
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#3-frontend" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3. Frontend" href="#3-frontend">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="31-lexer">
3.1. Lexer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#31-lexer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.1. Lexer" href="#31-lexer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The first thing the frontend does is read the source code character by character and produce so-called <strong>tokens</strong>.</p>
<p>For example, for the following C source code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div><p>The list of tokens could look like this:</p>
<pre tabindex="0"><code>int,
identifier("main"),
lparen,
int,
identifier("argc"),
comma,
// and so on ...
</code></pre><p>Additionally, each token will also have its source location (that is, its file, line and column) associated with it.</p>
<p>In this phase, you basically get rid of all whitespace and comments, and transform the source code into something that can more easily be processed to produce the abstract syntax tree (AST) and, following that, the IR.</p>
<p>The lexer only does very simple recognition of the basic syntactical building blocks of the source code.
For example, if you forget the terminating <code>"</code> of a string, the lexer will complain; but not if you use a wrong type or an undefined symbol.</p>
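<p>The principle can be sketched in a few lines of Python (a hypothetical toy, not Clang’s actual lexer, which is hand-written C++ and far more involved):</p>

```python
# Toy lexer sketch: split C-like source into (kind, text) tokens.
# Illustrative only; real lexers also handle strings, comments,
# operators, preprocessor directives, source locations, and more.
import re

TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("lparen",     r"\("),
    ("rparen",     r"\)"),
    ("comma",      r","),
    ("skip",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"int", "char", "return"}

def tokenize(source):
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "skip":
            continue                    # whitespace is discarded
        if kind == "identifier" and text in KEYWORDS:
            kind = text                 # keywords get their own token kind
        tokens.append((kind, text))
    return tokens
```

<p>Calling <code>tokenize("int main(int argc,")</code> produces the same kind of token list as shown above, with whitespace already thrown away.</p>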
<div class="gblog-post__anchorwrap">
<h2 id="32-parsersemantic-analyzer">
3.2. Parser/semantic analyzer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#32-parsersemantic-analyzer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.2. Parser/semantic analyzer" href="#32-parsersemantic-analyzer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Using the list of tokens from the previous step, we’re now constructing a tree.
And not just any tree, we’re constructing a so-called <strong>abstract syntax tree</strong> (AST).</p>
<p>Basically, we’re now recognizing the language structures of the programming language, like definitions, declarations, control flow statements, expressions, type casts, etc.</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_1800x0_resize_box_3.png"
alt="Clang AST dump for the hello world C program"
/>
</picture>
</a>
<figcaption>
Clang AST dump for the hello world C program
</figcaption>
</figure>
</div>
<p>The above image shows the AST for the example C program in <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>.
<code><invalid sloc></code> means <code>invalid source location</code>.
The nodes with <code><invalid sloc></code> are “imaginary”, they don’t have any corresponding source location and were added by Clang after the fact.</p>
<p>In the AST, we can clearly recognize the structure of the hello world program above by looking at the nodes with valid source locations:</p>
<ul>
<li>The function declaration <code>int main(int, char **)</code> and the names of the two arguments, <code>argc</code> & <code>argv</code> (<code>FunctionDecl</code> and <code>ParmVarDecl</code>)</li>
<li>The compound statement <code>{ ... }</code> afterwards (<code>CompoundStmt</code>)</li>
<li>The call to <code>printf</code> with some implicit casts (<code>CallExpr</code>)</li>
<li>Finally, the <code>return 0</code> (<code>ReturnStmt</code>)</li>
</ul>
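<p>As a rough mental model, that recognized structure could be written down as nested nodes (a hypothetical miniature; Clang’s real AST nodes are C++ classes carrying types, source locations, and much more):</p>

```python
# Hypothetical miniature of the hello-world AST as nested tuples:
# node kind first, then its children/attributes.
hello_world_ast = (
    "FunctionDecl", "main",
    [
        ("ParmVarDecl", "argc", "int"),
        ("ParmVarDecl", "argv", "char **"),
    ],
    ("CompoundStmt", [
        ("CallExpr", "printf", ["hello, world!\n"]),
        ("ReturnStmt", 0),
    ]),
)
```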
<div class="gblog-post__anchorwrap">
<h3 id="ambiguity">
Ambiguity
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#ambiguity" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ambiguity" href="#ambiguity">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The parser has some predefined rules, like <code>function-declaration = type identifier ( parameter-list ) ...</code>.
It tries to match those rules against the tokens, and that is how Clang builds the AST.
However, in practice it’s not that easy.
For example, in C there are two ways you can parse <code>a * b</code>.</p>
<ul>
<li>Either <code>a</code> and <code>b</code> are variables and that expression is a multiplication,</li>
<li>or <code>a</code> is a type name, and <code>a * b</code> is the declaration of a variable <code>b</code> with type <code>a*</code> (pointer to <code>a</code>)</li>
</ul>
<p>So to be able to parse this correctly, you need to know beforehand if <code>a</code> is a type or a variable.
In C++ it’s even more complicated.
There’s a saying that “parsing C is hard and parsing C++ is impossible.” The C++ grammar is <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Most_vexing_parse"
>ambiguous</a>, <a
class="gblog-markdown__link"
href="http://port70.net/~nsz/c/c%2B%2B/turing.pdf"
>C++ templates are Turing complete</a> and parsing it is <a
class="gblog-markdown__link"
href="https://blog.reverberate.org/2013/08/parsing-c-is-literally-undecidable.html"
>literally undecidable</a>.
That’s one of the reasons why Clang has hand-written parsers for both C and C++.</p>
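<p>A toy decision rule makes the dependence on that knowledge explicit (a hypothetical sketch, not Clang’s implementation):</p>

```python
# Toy disambiguation of `a * b`: the parser consults a symbol table
# of known type names before deciding how to read the tokens.
# Hypothetical sketch only.
def classify(a, b, type_names):
    if a in type_names:
        return f"declaration of {b} with type {a}*"
    return f"multiplication of {a} and {b}"
```

<p>With <code>a</code> registered as a type name, <code>classify</code> reports a declaration; otherwise, a multiplication. A real C parser maintains similar state while it walks the token stream.</p>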
<p>The parser works closely together with the <strong>semantic analyzer</strong> (sema).
The sema will do things like inferring types, adding type casts, performing validity checks, and emitting warnings.
For example, warnings about unused code or infinite self-recursion are emitted by the sema.</p>
<div class="gblog-post__anchorwrap">
<h2 id="33-ir-generator">
3.3. IR generator
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#33-ir-generator" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.3. IR generator" href="#33-ir-generator">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The IR generator will now (surprise!) generate rough, unoptimized <strong>IR</strong> using the AST from the previous step.</p>
<p>LLVM IR is a full-fledged language with well-defined semantics.
The IR below was generated for the hello world program from Section <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>.
However, I’d say its workings are a bit out of scope for this blog post, so I’m not going to go into detail here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-llvm" data-lang="llvm"><span class="vg">@.str</span> <span class="p">=</span> <span class="k">private</span> <span class="k">unnamed_addr</span> <span class="k">constant</span> <span class="p">[</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">]</span> <span class="k">c</span><span class="s">"hello, world!\0A\00"</span><span class="p">,</span> <span class="k">align</span> <span class="m">1</span>
<span class="k">define</span> <span class="err">dso_local</span> <span class="k">i32</span> <span class="vg">@main</span><span class="p">(</span><span class="k">i32</span> <span class="n">%0</span><span class="p">,</span> <span class="k">i8</span><span class="p">**</span> <span class="n">%1</span><span class="p">)</span> <span class="vg">#0</span> <span class="nv">!dbg</span> <span class="n">!8</span> <span class="p">{</span>
  <span class="n">%3</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i32</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="n">%4</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i32</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="n">%5</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i8</span><span class="p">**,</span> <span class="k">align</span> <span class="m">8</span>
  <span class="k">store</span> <span class="k">i32</span> <span class="m">0</span><span class="p">,</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%3</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="k">store</span> <span class="k">i32</span> <span class="n">%0</span><span class="p">,</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%4</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%4</span><span class="p">,</span> <span class="kt">metadata</span> <span class="n">!17</span><span class="p">,</span> <span class="kt">metadata</span> <span class="nv">!DIExpression</span><span class="p">()),</span> <span class="nv">!dbg</span> <span class="n">!18</span>
  <span class="k">store</span> <span class="k">i8</span><span class="p">**</span> <span class="n">%1</span><span class="p">,</span> <span class="k">i8</span><span class="p">***</span> <span class="n">%5</span><span class="p">,</span> <span class="k">align</span> <span class="m">8</span>
  <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span> <span class="k">i8</span><span class="p">***</span> <span class="n">%5</span><span class="p">,</span> <span class="kt">metadata</span> <span class="n">!19</span><span class="p">,</span> <span class="kt">metadata</span> <span class="nv">!DIExpression</span><span class="p">()),</span> <span class="nv">!dbg</span> <span class="n">!20</span>
  <span class="n">%6</span> <span class="p">=</span> <span class="k">call</span> <span class="k">i32</span> <span class="p">(</span><span class="k">i8</span><span class="p">*,</span> <span class="p">...)</span> <span class="vg">@printf</span><span class="p">(</span><span class="k">i8</span><span class="p">*</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="p">([</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">],</span> <span class="p">[</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">]*</span> <span class="vg">@.str</span><span class="p">,</span> <span class="k">i64</span> <span class="m">0</span><span class="p">,</span> <span class="k">i64</span> <span class="m">0</span><span class="p">)),</span> <span class="nv">!dbg</span> <span class="n">!21</span>
  <span class="k">ret</span> <span class="k">i32</span> <span class="m">0</span><span class="p">,</span> <span class="nv">!dbg</span> <span class="n">!22</span>
<span class="p">}</span>
<span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span><span class="p">,</span> <span class="kt">metadata</span><span class="p">,</span> <span class="kt">metadata</span><span class="p">)</span> <span class="vg">#1</span>
<span class="k">declare</span> <span class="err">dso_local</span> <span class="k">i32</span> <span class="vg">@printf</span><span class="p">(</span><span class="k">i8</span><span class="p">*,</span> <span class="p">...)</span> <span class="vg">#2</span>
<span class="k">attributes</span> <span class="vg">#0</span> <span class="p">=</span> <span class="p">{</span> <span class="k">noinline</span> <span class="k">nounwind</span> <span class="k">optnone</span> <span class="k">uwtable</span> <span class="s">"frame-pointer"</span><span class="p">=</span><span class="s">"all"</span> <span class="s">"min-legal-vector-width"</span><span class="p">=</span><span class="s">"0"</span> <span class="s">"no-trapping-math"</span><span class="p">=</span><span class="s">"true"</span> <span class="s">"stack-protector-buffer-size"</span><span class="p">=</span><span class="s">"8"</span> <span class="s">"target-cpu"</span><span class="p">=</span><span class="s">"x86-64"</span> <span class="s">"target-features"</span><span class="p">=</span><span class="s">"+cx8,+fxsr,+mmx,+sse,+sse2,+x87"</span> <span class="s">"tune-cpu"</span><span class="p">=</span><span class="s">"generic"</span> <span class="p">}</span>
<span class="k">attributes</span> <span class="vg">#1</span> <span class="p">=</span> <span class="p">{</span> <span class="err">no</span><span class="k">free</span> <span class="err">nosyn</span><span class="k">c</span> <span class="k">nounwind</span> <span class="k">readnone</span> <span class="err">speculatable</span> <span class="err">willreturn</span> <span class="p">}</span>
<span class="k">attributes</span> <span class="vg">#2</span> <span class="p">=</span> <span class="p">{</span> <span class="s">"frame-pointer"</span><span class="p">=</span><span class="s">"all"</span> <span class="s">"no-trapping-math"</span><span class="p">=</span><span class="s">"true"</span> <span class="s">"stack-protector-buffer-size"</span><span class="p">=</span><span class="s">"8"</span> <span class="s">"target-cpu"</span><span class="p">=</span><span class="s">"x86-64"</span> <span class="s">"target-features"</span><span class="p">=</span><span class="s">"+cx8,+fxsr,+mmx,+sse,+sse2,+x87"</span> <span class="s">"tune-cpu"</span><span class="p">=</span><span class="s">"generic"</span> <span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h1 id="4-middle-end">
4. Middle-end
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#4-middle-end" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 4. Middle-end" href="#4-middle-end">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<p>Now that we have the IR, we can <strong>optimize</strong> it.
What does “optimizing” even mean, though?
When we optimize a program, we want to make it <strong>faster</strong> or <strong>smaller</strong>.</p>
<p>Making it run faster is the usual goal, achievable for example by combining operations, reducing function calls, and resolving recursion.
Making the finished program binaries smaller is often done in embedded environments, where you don’t have much space available.</p>
<p>A single “step” of optimization is called an <strong>optimization pass</strong>.
Usually when employing optimizations, you’ll bundle a bunch of these together in a chain.
The order is important too: Some optimization passes rely on the fact that some other optimization pass has run before them (that maybe annotated the IR with some analysis information), others produce better results when some other optimization pass has run before them.
But that’s mostly opaque to the user; Clang will do the right thing for you when you just use the <code>-O...</code> commandline argument.</p>
<p>There are three kinds of optimization passes:</p>
<ol>
<li><strong>Analysis passes</strong>
<ul>
<li>Those analyze the IR and try to deduce some information that’s useful (or required) for other optimization passes.</li>
<li>For example, one analysis pass will deduce information about the call graph, another will find memory dependencies, etc.</li>
</ul>
</li>
<li><strong>Transform passes</strong>
<ul>
<li>Transform passes are what actually optimizes the IR.
They use the information from the analysis passes to transform the IR in such a way that it’s either faster or smaller afterwards.</li>
<li>The <code>-inline</code> transform pass will do function inlining (more on that later), <code>-adce</code> will eliminate dead code, <code>-instcombine</code> combines redundant instructions, etc.</li>
</ul>
</li>
<li><strong>Utility passes</strong>
<ul>
<li>These are mostly used for debugging purposes.
These aren’t applied by Clang by default, only if you explicitly tell it to.</li>
<li>For example, the <code>-view-cfg</code> pass will use Graphviz to visualize the control flow graph.</li>
</ul>
</li>
</ol>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_1800x0_resize_box_3.png"
alt="Pie chart of the optimization passes of LLVM"
/>
</picture>
</a>
<figcaption>
Pie chart of the optimization passes of LLVM
(Based on <a
class="gblog-markdown__link"
href="https://llvm.org/docs/Passes.html"
>https://llvm.org/docs/Passes.html</a> (fetched on 2021-11-05))
</figcaption>
</figure>
</div>
<p>To use these optimizations, you can just use the <code>-O...</code> commandline argument for Clang.</p>
<p>For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="c1"># no optimizations at all (default)</span>
$ clang -O0 main.c
<span class="c1"># optimize for speed</span>
$ clang -O1 main.c
<span class="c1"># optimize even more for speed</span>
$ clang -O2 main.c
<span class="c1"># optimize even more more for speed</span>
$ clang -O3 main.c
<span class="c1"># optimize for fastest speed possible</span>
$ clang -Ofast main.c
<span class="c1"># optimize for size (basically -O2 with some optimizations that reduce size)</span>
$ clang -Os main.c
<span class="c1"># optimize even more for size</span>
$ clang -Oz main.c
</code></pre></div><p>As a developer, usually you want your programs to run fast.
So why don’t we always use <code>-Ofast</code>?</p>
<ul>
<li>The first reason is that it breaks strict standards compliance.
<code>-Ofast</code> is basically <code>-O3</code> with <a
class="gblog-markdown__link--code"
href="https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math"
><code>-ffast-math</code></a>.
Fast-math will do a lot of things that are not compatible with the C or C++ standards.
For example, it’ll replace a division by a floating-point constant (<code>a / 0.123</code>) with a multiplication by the reciprocal (<code>a * (1 / 0.123)</code>) since multiplication is usually faster; some floating-point errors won’t be reported anymore, some results are less accurate, etc.</li>
<li>For large projects, it’ll increase compile time.
In many cases, it might still be worth it, but others may not want to do that.</li>
<li>It might increase program binary size.</li>
</ul>
<p>Okay, so the problem with <code>-Ofast</code> is mainly non-compliant math.
So why don’t we just always use <code>-O3</code>?
Why is there a <code>-O2</code> then?
It turns out that’s a pretty good question.
<code>-O3</code> is basically the same as <code>-O2</code>.
In the LLVM version I tested, <code>-O3</code> enables two more optimization passes than <code>-O2</code>, and for one of them the code says “<a
class="gblog-markdown__link"
href="https://github.com/llvm/llvm-project/blob/2b46417aa2d42d5d2a14df1675cfee547fd46556/llvm/lib/Passes/PassBuilderPipelines.cpp#L755"
>FIXME: It isn’t at all clear why this should be limited to -O3</a>”.</p>
<p>Okay, now that we have optimized the IR, we can go on to the next step:</p>
<div class="gblog-post__anchorwrap">
<h1 id="5-backend">
5. Backend
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#5-backend" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5. Backend" href="#5-backend">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="51-instruction-selection">
5.1. Instruction Selection
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#51-instruction-selection" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.1. Instruction Selection" href="#51-instruction-selection">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>This is the first phase of the backend.
Everything (well, almost everything) target-specific happens here.
Now, we want to transform the optimized IR into something that’ll run on the target CPU, and as the first step we’re going to select the instructions for that.</p>
<p>LLVM has multiple instruction selectors:</p>
<ul>
<li><strong>SelectionDAG</strong> (produces the best results, best documented)</li>
<li><strong>FastISel</strong> (produces poor results, but runs quickly)</li>
<li><strong>GlobalISel</strong> (WIP, designed to combine the compilation speed of FastISel with the quality of SelectionDAG)</li>
</ul>
<p>Since SelectionDAG is best documented and produces the best results, I’m going to use that as an example.
Actually, SelectionDAG is not only the name of this instruction selector but also the output of it.
That is, the output of this instruction selector is also called a <strong>SelectionDAG</strong> (selection directed acyclic graph).
In this graph, each node is an instruction and the edges are data or control dependencies.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>A finished DAG (for the program from <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>) looks like this:</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_1800x0_resize_box_3.png"
alt="Final SelectionDAG"
/>
</picture>
</a>
<figcaption>
Final SelectionDAG
</figcaption>
</figure>
</div>
<p>But how is that graph built?
There are multiple steps involved here:</p>
<ol>
<li>
<p>First of all, we’re using static mappings from IR instruction to SelectionDAG node, and the control and dataflow dependencies we can infer from the IR to build an initial, naive SelectionDAG.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
</li>
<li>
<p>Now we’re applying some basic optimizations to it.</p>
</li>
<li>
<p>The (still naive) SelectionDAG we now have might not even be runnable on the target CPU.
Maybe it contains operations that aren’t supported or some type doesn’t work with the operation used, etc.
In other words, it might be an <strong>illegal</strong> SelectionDAG.
So now, as the first step of making it a legal graph, we’re going to <strong>legalize</strong> the types.
There are two kinds of modifications we can make to the types here:</p>
<ul>
<li><strong>Type promotion</strong> (converting a small type to a larger one)</li>
<li><strong>Type expansion</strong> (splitting up a larger type into multiple smaller ones)</li>
</ul>
<p>For example: If the target doesn’t support 16-bit integers, we’re just going to <em>promote</em> it to a 32-bit integer instead.
Likewise, if it doesn’t support 64-bit integers, we’re just going to <em>expand</em> it to two 32-bit ints instead.</p>
</li>
<li>
<p>Now, we optimize that again, mostly to get rid of redundant operations introduced by type promotion/expansion.</p>
</li>
<li>
<p>After that, we’re going to legalize the operations.
Targets sometimes have weird, arbitrary constraints for the types that can be used for some operations.
(x86 does not support byte-conditional moves, PowerPC does not support sign-extending loads from a 16-bit memory location.) So we’ll apply type promotion and type expansion or some custom, target-specific modifications here to make the SelectionDAG legal.</p>
</li>
<li>
<p>Optimize again.</p>
</li>
<li>
<p>Actually select the instructions.
This phase is a bit more complicated, but in a nutshell, LLVM will take the instructions in the SelectionDAG we have (which are target-independent instructions that just happen to be executable on the target machine) and translate them into target-specific instructions, while also using pattern-matching to combine instructions where possible.</p>
</li>
</ol>
<div class="gblog-post__anchorwrap">
<h2 id="52-scheduling-and-formation">
5.2. Scheduling and Formation
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#52-scheduling-and-formation" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.2. Scheduling and Formation" href="#52-scheduling-and-formation">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now we have a SelectionDAG of machine instructions.
However, CPUs don’t run DAGs.
So we need to linearize the SelectionDAG, that is, form it into a list.
There are many ways to do that; LLVM will just use some heuristics so that we, for instance, always have enough registers available, but you could also take instruction latencies into account, etc.
You can print the linearized SelectionDAG for some LLVM IR using <code>llc -print-machineinstrs ...</code>.</p>
<p>After this, there’s a machine code (actually MIR) based optimization phase.</p>
<div class="gblog-post__anchorwrap">
<h2 id="53-register-allocation">
5.3. Register Allocation
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#53-register-allocation" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.3. Register Allocation" href="#53-register-allocation">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Up until this point, even if you might not have realized it, we acted like the target machine has an infinite amount of one-time assignable registers.</p>
<p>In other words, the IR and SelectionDAG were in the so-called <strong>SSA</strong> (static single assignment) form.
The SSA form simplifies many analyses of the control flow graph of the IR.
At this point, we want to select the actual target registers we will use for the previously virtual SSA registers.
Most targets only have 16, maybe 32 registers and many of those are reserved for special purposes.
It’s possible we don’t have enough physical registers to accommodate all the virtual registers.
That means we have to put some of them into main memory instead, which is called <strong>spilling</strong>.</p>
<p>Now that we’ve selected the physical registers, we’re adding some prologue and epilogue instructions to the function, that is, push some registers on the stack and pop them again later.</p>
<p>After that comes a machine code based optimization phase.</p>
<div class="gblog-post__anchorwrap">
<h2 id="54-code-emission">
5.4. Code Emission
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#54-code-emission" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.4. Code Emission" href="#54-code-emission">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now we can finally emit the optimized machine code, in whatever format the user desires.
Some targets support writing <code>.o</code> files directly, for others assembly will be written and assembled into an <code>.o</code> file as an intermediate step.
Note that to be able to run this file, we also need to link it, which Clang can do for you as well.
(Not Clang itself, Clang will just call a linker.)</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_1800x0_resize_box_3.png"
alt="Feature matrix of the different target code generators"
/>
</picture>
</a>
<figcaption>
Feature matrix of the different target code generators
(From <a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#target-feature-matrix"
>https://llvm.org/docs/CodeGenerator.html#target-feature-matrix</a>)
</figcaption>
</figure>
</div>
<div class="gblog-post__anchorwrap">
<h1 id="6-optimizations">
6. Optimizations
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#6-optimizations" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 6. Optimizations" href="#6-optimizations">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="loop-unrolling">
Loop unrolling
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#loop-unrolling" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Loop unrolling" href="#loop-unrolling">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As an example function, take this piece of code which will just copy over 16 integers from <code>s</code> to <code>d</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>When we execute this, we basically do:</p>
<pre tabindex="0"><code>i := 0,
check:
is i < 16? if no goto end
d[i] = s[i]
i++
goto check
end:
return
</code></pre><p>So we’ll check 16 times if <code>i < 16</code> and we’ll increment <code>i</code> 16 times, which is quite a bit of overhead, given that we know we want <em>exactly</em> 16 iterations.</p>
<p>Clang/LLVM will use loop unrolling to transform it into this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span>
</code></pre></div><p>So basically we just repeat the loop body 16 times and save the overhead.</p>
<div class="gblog-post__anchorwrap">
<h2 id="vectorizing">
Vectorizing
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#vectorizing" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Vectorizing" href="#vectorizing">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Even if we don’t have a fixed upper bound for the loop, LLVM can still do something about it.
In this example we do the same thing but iterate up to <code>n</code>, which is a parameter, so not known at compile time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>Many CPUs have instructions that allow copying many more bytes at once, which is faster than copying only 4 bytes at a time.
So the compiler will try to use those instructions for the bulk of the copying and do the rest one-by-one again.
That’s called vectorization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="o">-</span><span class="mi">127</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">128</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// copy 128 ints at once
</span><span class="c1"></span> <span class="p">}</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h2 id="function-inlining-and-loop-unrolling">
Function inlining and loop unrolling
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#function-inlining-and-loop-unrolling" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Function inlining and loop unrolling" href="#function-inlining-and-loop-unrolling">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>In this example, we have a mixture of the above.
There’s <code>copy_n</code>, which still takes the upper loop limit as a parameter, and it is used in <code>copy_16_v2</code>, which unconditionally calls copy_n with <code>n=16</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">copy_16_v2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">copy_n</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>The compiler will now basically copy and paste the implementation of <code>copy_n</code> into the other function, which is called function inlining.
Then it can deduce that the for loop limit is 16, so it’ll make use of loop unrolling again.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16_v2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h1 id="7-sources">
7. Sources
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#7-sources" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 7. Sources" href="#7-sources">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<ul>
<li>[Ray Toal, Intro to Compilers] Toal, R. Intro to Compilers. <a
class="gblog-markdown__link"
href="https://www.cs.cornell.edu/~asampson/blog/llvm.html"
>https://www.cs.cornell.edu/~asampson/blog/llvm.html</a></li>
<li>[Finkel, 2017] Finkel, H. and Horváth, G. (2017). Code Transformation and analysis using Clang and LLVM. <a
class="gblog-markdown__link"
href="https://llvm.org/devmtg/2017-06/2-Hal-Finkel-LLVM-2017.pdf"
>https://llvm.org/devmtg/2017-06/2-Hal-Finkel-LLVM-2017.pdf</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/6319086/are-gcc-and-clang-parsers-really-handwritten"
>https://stackoverflow.com/questions/6319086/are-gcc-and-clang-parsers-really-handwritten</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/11510792/is-the-semantic-analysis-step-in-clang-an-essential-part-of-the-compiler"
>https://stackoverflow.com/questions/11510792/is-the-semantic-analysis-step-in-clang-an-essential-part-of-the-compiler</a></li>
<li><a
class="gblog-markdown__link"
href="https://cppdepend.com/blog/?p=321"
>https://cppdepend.com/blog/?p=321</a></li>
<li><a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#legalize-operations"
>https://llvm.org/docs/CodeGenerator.html#legalize-operations</a></li>
<li><a
class="gblog-markdown__link"
href="https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1"
>https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/845355/do-programming-language-compilers-first-translate-to-assembly-or-directly-to-mac"
>https://stackoverflow.com/questions/845355/do-programming-language-compilers-first-translate-to-assembly-or-directly-to-mac</a></li>
<li><a
class="gblog-markdown__link"
href="https://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/"
>https://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/</a></li>
<li><a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#target-feature-matrix"
>https://llvm.org/docs/CodeGenerator.html#target-feature-matrix</a></li>
<li><a
class="gblog-markdown__link"
href="https://blog.regehr.org/archives/1603"
>https://blog.regehr.org/archives/1603</a></li>
<li><a
class="gblog-markdown__link"
href="https://github.com/llvm/llvm-project/blob/7175886a0f612aded1430ae240ca7ffd53d260dd/llvm/lib/Passes/PassBuilderPipelines.cpp#L717"
>https://github.com/llvm/llvm-project/blob/7175886a0f612aded1430ae240ca7ffd53d260dd/llvm/lib/Passes/PassBuilderPipelines.cpp#L717</a></li>
<li><a
class="gblog-markdown__link"
href="https://clang.llvm.org/docs/CommandGuide/clang.html"
>https://clang.llvm.org/docs/CommandGuide/clang.html</a></li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Some compilers translate into machine code directly (LLVM, mostly), other translate into assembly and use an assembler to compile it into machine code (GCC). <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Not necessarily, there’s something called <a
class="gblog-markdown__link"
href="https://hannes.hauswedell.net/post/2017/12/09/fmv/"
>Function Multiversioning</a>, but that’s not automatic. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Profile-guided_optimization"
>Profile guided optimization</a> might help in that case, though. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>There’s one other type of edge called <code>glue</code>, that’ll make the instructions stick together through scheduling (see <a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/33005061/what-are-glue-and-chain-dependencies-in-an-llvm-dag"
>https://stackoverflow.com/questions/33005061/what-are-glue-and-chain-dependencies-in-an-llvm-dag</a>). <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>This mapping is not entirely static. Target-specific interfaces are used to map things like returns, calls, varargs, etc. (see <a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#initial-selectiondag-construction"
>https://llvm.org/docs/CodeGenerator.html#initial-selectiondag-construction</a>). <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
An introduction to performance analysis and understanding profilers, by Kevin Kulot (2022-06-14): https://blog.parcio.de/posts/2022/06/performance-analysis/
<p>Where performance matters, we want to make sure we know what to look for and what to optimize.
For that we need to measure our code by analyzing its performance with tools either provided by the language or external ones called “profilers”.
This post intends to give an overview of both of these methods while introducing some tools – like <code>Google Benchmark</code> and <code>perf</code> – and explaining their functionality.</p>
<blockquote>
<p>“<em>Premature optimization is the root of all evil.</em>”</p>
</blockquote>
<p>A famous quote by Tony Hoare, later popularised by Donald Knuth, shows why measuring performance is so important.
If we try to optimize our code before knowing <em>where</em> it may be needed, it might end up hindering us in the long run.
Trying to be unnecessarily clever by, for example, replacing divisions and multiplications by powers of two with bit shifts, might just end up hurting readability while not providing any gains in performance as most compilers nowadays are able to do such trivial optimizations.</p>
<p>We want to do <strong>benchmarks</strong>, more precisely <strong>micro benchmarks</strong>.
Micro benchmarks measure small parts of our code, like single functions or routines, inspecting hot loops and investigating details such as cache misses or assembly code generation.
Before we perform these micro benchmarks, let’s introduce a dichotomy<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> of tools.</p>
<p><strong>In-Code Benchmarking:</strong> Measuring/performing benchmarks within the language by leveraging existing functions and libraries.</p>
<p><strong>Profilers:</strong> External, language-agnostic tools which measure/perform benchmarks while utilizing the compiled binary and system calls.</p>
<div class="gblog-post__anchorwrap">
<h2 id="in-code-benchmarking">
In-Code Benchmarking
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#in-code-benchmarking" class="gblog-post__anchor clip flex align-center" aria-label="Anchor In-Code Benchmarking" href="#in-code-benchmarking">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Let’s look at and compare three different languages and what they natively provide for benchmarking and measuring performance.
C, C++ and Rust – compiled, systems programming languages – all provide at least some tools necessary for capturing the current time (with varying precision), something we definitely need to begin measuring our code.
By going from an older, relatively low-level language<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, like C, to a newer one like Rust with more high-level abstractions, we will quickly notice a difference in ease of use and amount of options available to us while coding (micro) benchmarks.</p>
<div class="gblog-post__anchorwrap">
<h3 id="as-old-as-timeh-c-and-posix">
As old as <code><time.h></code>: C and POSIX
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#as-old-as-timeh-c-and-posix" class="gblog-post__anchor clip flex align-center" aria-label="Anchor As old as <time.h>: C and POSIX" href="#as-old-as-timeh-c-and-posix">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Let’s begin with C and its <a
class="gblog-markdown__link--code"
href="https://en.cppreference.com/w/c/chrono"
><code>time.h</code></a> facilities:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="hl"><span class="lnt"> 5
</span></span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf"><stdio.h></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><time.h></span><span class="cp">
</span><span class="cp"></span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">time_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="n">time_t</span> <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.2f seconds</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p><code>time(...)</code> returns the current calendar time, which is almost always represented as the number of seconds since <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Unix_time"
>the Epoch (00:00:00 UTC 01/01/1970)</a>, as a <code>time_t</code> object.
The full function signature is <code>time_t time(time_t* arg)</code> where <code>arg</code> acts as an out-parameter storing the same information as the return value, which is why we pass in <code>NULL</code>.
Since <code>time_t</code> is just a typedef for an unspecified (implementation-defined) real type, we can compute the difference between <code>start</code> and <code>end</code> to get the elapsed time in seconds.
This essentially measures so-called “wall clock time”, something we will come back to later.</p>
<p>One alternative time measuring facility is the <code>clock_t clock(void)</code> function.
Unlike <code>time()</code>, it returns the approximate processor time, or “CPU time”, of the current process.
Similar to <code>time_t</code>, the returned value is also an implementation-defined real type<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> from which we can calculate a difference.
This “CPU time” may differ from “wall clock time” as it may advance faster or slower depending on the resources allocated by the operating system.</p>
<p>If we want a little more precision while still using C, the C POSIX library offers additional functionality.
For example, <a
class="gblog-markdown__link--code"
href="https://man7.org/linux/man-pages/man2/gettimeofday.2.html"
><code>gettimeofday()</code></a>, provided by the <code>sys/time.h</code> header, lets us measure with microsecond accuracy:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf"><stdio.h></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><sys/time.h></span><span class="cp">
</span><span class="cp"></span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv_start</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv_end</span><span class="p">;</span>
<span class="hl"> <span class="n">gettimeofday</span><span class="p">(</span><span class="o">&</span><span class="n">tv_start</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="n">gettimeofday</span><span class="p">(</span><span class="o">&</span><span class="n">tv_end</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.2f µs</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">tv_end</span><span class="p">.</span><span class="n">tv_usec</span> <span class="o">-</span> <span class="n">tv_start</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>We declare two structs of type <code>timeval</code> as out-parameters for <code>gettimeofday()</code> which takes two arguments: <code>struct timeval* tv</code> and <code>struct timezone* tz</code>.
The <code>timeval</code> struct holds the following information:</p>
<div class="gblog-columns gblog-columns--regular flex flex-gap flex-mobile-column">
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">struct</span> <span class="n">timeval</span> <span class="p">{</span>
<span class="n">time_t</span> <span class="n">tv_sec</span><span class="p">;</span> <span class="cm">/* seconds */</span>
<span class="n">suseconds_t</span> <span class="n">tv_usec</span><span class="p">;</span> <span class="cm">/* microseconds */</span>
<span class="p">};</span>
</code></pre></div>
</div>
<div class="gblog-columns__content gblog-markdown--nested flex-even">
The <code>tv_usec</code> field holds the microseconds elapsed within the current second (not a full microsecond count since the Epoch), which, combined with <code>tv_sec</code>, gives us much greater measuring precision.
The type used to represent the microseconds, <code>suseconds_t</code>, is usually defined as a <code>long</code>, which can hold at least 32 bits.
</div>
</div>
<p>The second argument of <code>gettimeofday()</code> can be used to obtain timezone information, though this mechanism is obsolete and flawed, which is why passing in <code>NULL</code> is recommended<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h3 id="a-new-dawn-c-and-chrono">
A new dawn: C++ and <code><chrono></code>
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#a-new-dawn-c-and-chrono" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A new dawn: C++ and <chrono>" href="#a-new-dawn-c-and-chrono">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>With the arrival of C++11, a more flexible collection of types for time tracking was added to the standard.
Including the most recent changes to <a
class="gblog-markdown__link--code"
href="https://en.cppreference.com/w/cpp/chrono"
><code><chrono></code></a> as part of C++20, the following clock types are available:</p>
<div class="gblog-columns gblog-columns--regular flex flex-gap flex-mobile-column">
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<p><strong>C++11</strong></p>
<ul>
<li><code>std::chrono::system_clock</code></li>
<li><code>std::chrono::steady_clock</code></li>
<li><code>std::chrono::high_resolution_clock</code></li>
</ul>
</div>
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<p><strong>C++20</strong></p>
<ul>
<li><code>std::chrono::utc_clock</code></li>
<li><code>std::chrono::tai_clock</code></li>
<li><code>std::chrono::gps_clock</code></li>
<li><code>std::chrono::file_clock</code></li>
<li><code>std::chrono::local_t</code></li>
</ul>
</div>
</div>
<p>Now, this might seem overwhelming at first, but when it comes to benchmarking, the only clock types we need to look at are the ones added in C++11.
Out of these, <code>std::chrono::steady_clock</code> is the most suitable for measuring intervals.
To understand why, let’s take a quick detour and talk about what a <code>Clock</code> is according to the C++ standard and the differences between the aforementioned three types of clocks.
In its most basic form, a clock type needs to have a starting point and a tick rate.
A more precise definition of the requirements needed to satisfy being a <code>Clock</code> type can be found <a
class="gblog-markdown__link"
href="https://en.cppreference.com/w/cpp/named_req/Clock"
>here</a>.</p>
<p>Now, let’s compare the different clocks:</p>
<table>
<thead>
<tr>
<th><code>std::chrono::system_clock </code></th>
<th><code>std::chrono::steady_clock </code></th>
<th><code>std::chrono::high_resolution_clock </code></th>
</tr>
</thead>
<tbody>
<tr>
<td>- system wide <em>wall clock time</em></td>
<td>- <strong>monotonic</strong> clock</td>
<td>- clock with the smallest tick period</td>
</tr>
<tr>
<td>- system time can be adjusted</td>
<td>- tick frequency constant</td>
<td>- alias of one of the other two</td>
</tr>
<tr>
<td>- maps to C-style time</td>
<td>- not related to wall clock time</td>
<td>- should be avoided (implementation-defined)</td>
</tr>
</tbody>
</table>
<p><strong>Wall clock time</strong> is the actual, real time a physical clock (be it a watch or an actual wall clock) would measure.
A wall clock may be subject to unexpected changes which would invalidate any measurements taken with it.
It has the ability to jump backward or forward in time through manual adjustments or automatic synchronization with NTP (Network Time Protocol).
This makes <code>std::chrono::system_clock</code> unreliable for interval measurements and unfit for anything but giving us the current time.</p>
<p><strong>Monotonic clocks</strong>, like <code>std::chrono::steady_clock</code>, on the other hand cannot jump forward or backward in time and their tick rate is constant.
<code>std::chrono::steady_clock</code> typically uses the system startup time as its epoch (the exact epoch is implementation-defined) and will never be adjusted.
It acts like a stopwatch, perfect for measuring intervals but not for telling time.</p>
<p>Let’s look at an example:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="hl"><span class="lnt">12
</span></span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="cp">#include</span> <span class="cpf"><chrono></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp">
</span><span class="cp"></span>
<span class="c1">// to save us from typing std::chrono everytime
</span><span class="c1"></span><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="hl"> <span class="k">auto</span> <span class="n">start</span> <span class="o">=</span> <span class="n">steady_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="k">auto</span> <span class="n">end</span> <span class="o">=</span> <span class="n">steady_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
</span>
<span class="hl"> <span class="k">auto</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">duration_cast</span><span class="o"><</span><span class="n">milliseconds</span><span class="o">></span><span class="p">(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">).</span><span class="n">count</span><span class="p">();</span>
</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="n">duration</span> <span class="o"><<</span> <span class="s">"ms</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="k">return</span> <span class="n">EXIT_SUCCESS</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Just like before, we define a start and end time point with the <code>now()</code> function of the clock.
This static member function returns a <code>std::chrono::time_point</code> with the current time.
In line 12, we first compute the difference between these two, returning a <code>std::chrono::duration</code> type, which we can then cast to actual time units with <code>std::chrono::duration_cast()</code>.
The available units range from nanoseconds to years and are passed in as a template parameter.
Finally, <code>count()</code> converts the chosen time unit to the underlying arithmetic type which we can then output.</p>
<div class="gblog-post__anchorwrap">
<h3 id="a-tale-of-abstractions-rust-et-al">
A tale of abstractions: Rust et al.
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#a-tale-of-abstractions-rust-et-al" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A tale of abstractions: Rust et al." href="#a-tale-of-abstractions-rust-et-al">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Now that we know how C and C++ handle time, how clocks work and the differences between wall clock time and monotonic clocks, let’s look at one final systems programming language with an even higher level of abstraction.
Unlike C++, Rust hides most of the implementation details (and spares us from typing <code>std::chrono</code> or verbose casts) while still providing the same level of precision:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="hl"><span class="lnt">6
</span></span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">time</span>::<span class="n">Instant</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="hl"><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Instant</span>::<span class="n">now</span><span class="p">();</span><span class="w">
</span></span><span class="w"> </span><span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start</span><span class="p">.</span><span class="n">elapsed</span><span class="p">();</span><span class="w">
</span></span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"{}ms"</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">.</span><span class="n">as_millis</span><span class="p">());</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></td></tr></table>
</div>
</div><p>All that is needed is to take the current time with <code>Instant::now()</code> as a start point and then define an end point.
An <a
class="gblog-markdown__link--code"
href="https://doc.rust-lang.org/std/time/struct.Instant.html"
><code>Instant</code></a> type in Rust always represents a non-decreasing monotonic clock.
This end point could be – just like in C or C++ – defined as another <code>Instant::now()</code> from which we could compute the difference, but Rust also allows us to just call the <code>elapsed()</code> method on the start point.
This returns a <a
class="gblog-markdown__link--code"
href="https://doc.rust-lang.org/std/time/struct.Duration.html"
><code>Duration</code></a> and is arguably more readable and declarative than a minus sign between two non-arithmetic types while also being shorter.</p>
<p>Finally, we can convert the <code>Duration</code> to time units with the corresponding methods similar to <code>duration_cast<>()</code> in C++.</p>
<div class="gblog-post__anchorwrap">
<h3 id="going-from-measuring-to-benchmarking-google-benchmark">
Going from measuring to benchmarking: Google Benchmark
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#going-from-measuring-to-benchmarking-google-benchmark" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Going from measuring to benchmarking: Google Benchmark" href="#going-from-measuring-to-benchmarking-google-benchmark">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>So far, we only looked at what C, C++ and Rust offer us in terms of measuring time.
Of course, most languages offer similar features and functions, each with a slightly different syntax.
The goal is not to present all of these different ways of calling such functions – that’s what the docs are for – but rather to show different levels of abstraction while also highlighting similarities in the methodology.</p>
<p>Things like defining start and end points, computing their difference and being aware of clock types are worth keeping in mind regardless of the language used.
But what we did so far was not <em>benchmarking</em>.
Before we can claim to have performed a proper benchmark, we need to measure not just once but many times.
We need to calculate the mean or median of these multiple measurements.
We want to rule out any random or statistical errors.
Ideally, we also want to not have to define start and end points manually like we did before for every single measurement.</p>
<p>This is where (micro) benchmarking libraries come in handy.
While many exist for every language, we are going to stick with C++ for now and take a look at <a
class="gblog-markdown__link"
href="https://github.com/google/benchmark"
><strong>Google Benchmark</strong></a>.
This open-source micro benchmarking library from Google makes timing of small code snippets much easier and allows us to get good statistical averages through repeated sampling of said snippets.</p>
<p>The example in their <code>README.md</code> shows the basic idea:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="cp">#include</span> <span class="cpf"><benchmark/benchmark.h></span><span class="cp">
</span><span class="cp"></span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCreation</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">empty_string</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Register the function as a benchmark
</span><span class="c1"></span><span class="n">BENCHMARK</span><span class="p">(</span><span class="n">BM_StringCreation</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCopy</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">x</span> <span class="o">=</span> <span class="s">"hello"</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">copy</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Register another function as a benchmark
</span><span class="c1"></span><span class="n">BENCHMARK</span><span class="p">(</span><span class="n">BM_StringCopy</span><span class="p">);</span>
<span class="n">BENCHMARK_MAIN</span><span class="p">();</span>
</code></pre></td></tr></table>
</div>
</div><p>Any function that we wish to benchmark is conventionally marked as <code>static</code> (giving it internal linkage).
Furthermore, it needs to take a mutable reference to a <code>benchmark::State</code> as its argument.
Each iteration of the loop over the state object runs the code we wish to benchmark once and contributes a sample to the measurement.
The <code>BENCHMARK</code> macro registers the functions as a benchmark while the <code>BENCHMARK_MAIN()</code> macro generates an appropriate <code>main()</code> function.</p>
<p>If we compile the code as follows:</p>
<pre tabindex="0"><code>$ g++ main.cpp -std=c++11 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -o main
</code></pre><p>We get this output:</p>
<pre tabindex="0"><code>2022-01-06T00:26:34+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 4.27 ns 4.33 ns 165925926
BM_StringCopy 7.84 ns 7.85 ns 89600000
</code></pre><p>Google Benchmark creates a table displaying wall clock time, CPU time and how often each function was sampled for us.
It is able to sample a function up to a billion times.
Now, to make this code faster, the first step would be to turn on optimizations.
Currently, we compile with no optimizations (<code>-O0</code>), so let’s use <code>-O3</code> and see what happens:</p>
<pre tabindex="0"><code>2022-01-07T21:27:42+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 0.000 ns 0.000 ns 1000000000
BM_StringCopy 4.14 ns 4.17 ns 172307692
</code></pre><p>At first glance, we seem to have created the world’s fastest string creation function, though sadly, that is not what happened.
Taking a look at line 5 of the example code reveals the problem.
<code>std::string empty_string</code> is declared but never used anywhere else in the code, so the compiler sees that removing it has no side effects on the program and does exactly that.
Most of the time, this is the behavior we expect and want from our compiler but in this case we actually do want to keep this unused variable around.</p>
<p>Luckily, Google Benchmark has functions that can pretend to use a variable so the compiler can’t just remove it anymore:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCreation</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">empty_string</span><span class="p">;</span>
<span class="hl"> <span class="n">benchmark</span><span class="o">::</span><span class="n">DoNotOptimize</span><span class="p">(</span><span class="n">empty_string</span><span class="p">);</span>
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This allows us to still benchmark the creation of an empty string while using <code>-O3</code>.
There are many other ways to prevent certain optimizations from happening that would invalidate a benchmark, though <code>DoNotOptimize()</code> and <code>benchmark::ClobberMemory()</code> are the ones Google Benchmark provides.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>
Now, our table looks much more sensible:</p>
<pre tabindex="0"><code>2022-01-07T21:26:18+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 0.681 ns 0.688 ns 1000000000
BM_StringCopy 4.04 ns 3.99 ns 172307692
</code></pre><p>Unwanted optimizations are just one thing to be aware of when doing benchmarks.
Depending on what is tested, we might want to clear our caches before a run, do warmup runs if I/O is involved or compare the same function with differing parameters – so-called “parameterized benchmarks”.
Many benchmarking libraries will have all of these advanced features and more but they are out of scope for this post.
For quick tests and comparisons, <a
class="gblog-markdown__link"
href="https://quick-bench.com/#"
><strong>Quick Bench</strong></a>, an online compiler using Google Benchmark, is a great alternative.</p>
<div class="gblog-post__anchorwrap">
<h2 id="profilers----more-than-just-time">
Profilers – more than just time
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#profilers----more-than-just-time" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Profilers – more than just time" href="#profilers----more-than-just-time">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>So far, we only measured the time it took for a program or a certain function to run.
Going back to the quote presented at the beginning: in order for us to optimize our code, we need to know <em>what</em> to optimize.
Just benchmarking some functions will not tell us where a potential bottleneck might be – we need more information about our program.</p>
<p>This is where <strong>profilers</strong> shine brightest.
Profilers are usually external, language-agnostic tools that operate on a binary and not on source code.
They usually offer:</p>
<ul>
<li>(relative) timing of every function call</li>
<li>generating call graphs (who called what) and flamegraphs</li>
<li>frequency of instruction calls</li>
<li>frequency of generated assembly function calls</li>
<li>in-depth performance counter stats, e.g., branch misses, CPU cycles and many more</li>
</ul>
<p>There are many different profilers out there, each specialized for their own use-case<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> and it can help to know how to classify different types of profilers.
The most common types are:</p>
<ul>
<li>Flat profilers – compute average call times</li>
<li>Call-graph profilers – show call times and call frequencies, and create a call-chain graph</li>
<li>Input-sensitive profilers – generate profiles for different inputs, showing how a function scales with its input</li>
<li>Event-based profilers – only collect statistics when certain pre-defined events happen</li>
<li>Statistical profilers – operate via sampling by probing the call stack through interrupts</li>
</ul>
<p>This classification is not mutually exclusive – a profiler can offer any one or all of these features.
We are going to take a look at one tool in particular: <code>perf</code>.</p>
<div class="gblog-post__anchorwrap">
<h3 id="perf-jack-of-all-trades">
<code>perf</code>: jack of all trades
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#perf-jack-of-all-trades" class="gblog-post__anchor clip flex align-center" aria-label="Anchor perf: jack of all trades" href="#perf-jack-of-all-trades">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The reason why <code>perf</code> is a good first choice is that almost everyone (that uses Linux<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup>) already has it, as it is part of the kernel.
It also does many of the aforementioned things out of the box.
Let’s start by creating a statistical profile:</p>
<pre tabindex="0"><code>$ perf stat ./peekng encode 123.png teSt secret
</code></pre><pre tabindex="0"><code> Performance counter stats for './peekng encode 123.png teSt secret':
          2,54 msec task-clock:u              #    0,287 CPUs utilized
             0      context-switches:u        #    0,000 /sec
             0      cpu-migrations:u          #    0,000 /sec
           187      page-faults:u             #   73,490 K/sec
     4.724.914      cycles:u                  #    1,857 GHz
     5.977.214      instructions:u            #    1,27  insn per cycle
     1.103.539      branches:u                #  433,686 M/sec
        18.837      branch-misses:u           #    1,71% of all branches

   0,008852233 seconds time elapsed

   0,003009000 seconds user
   0,000000000 seconds sys
</code></pre><p><a
class="gblog-markdown__link--code"
href="https://github.com/xkevio/peekng"
><code>peekng</code></a> here just acts as an example program; all it does is encode the word <code>secret</code> into <code>123.png</code>.
As we can see, we received a lot of additional information about our program which may help with identifying optimization opportunities.
For example, we observe 187 page faults meaning <code>peekng</code> might have tried to access a memory page which was not loaded into RAM and had to be loaded from a disk 187 times (this is just one reason why a page fault might occur).
This may lead us to look at how we handle file reads and writes in our program.
Additionally, <code>perf stat</code> shows us the time it took to execute the program as wall clock time and CPU time (user + sys).</p>
<p>Another thing <code>perf</code> offers us is creating an interactive call graph.
This happens with a combination of <code>perf record</code> and <code>perf report</code>, though we need to be careful not to eliminate certain debug symbols.
Earlier, we looked at how we might want to prevent certain optimizations from happening while still using <code>-O3</code> (or your preferred language’s equivalent) in source code – now we need to fiddle with some compiler flags.</p>
<p>GCC, for example, has <code>-Og</code> as an additional optimization level which optimizes for debugging experience.
It enables some compiler passes for collecting debug information while only optimizing at a level close to <code>-O1</code>.
Having this additional information still in the binary will make reading and following the call graph much easier.
Another important thing is to keep the frame pointer register.
The frame pointer is a reserved register that holds the address of the current function’s stack frame.
It allows us to get additional information about how the stack was used during runtime.
By default, most compilers omit the frame pointer to free up an additional register, but this can be disabled via <code>-fno-omit-frame-pointer</code> in GCC.</p>
<p>Languages that do not use the GCC backend may have similar options though under different names.
Keeping the frame pointer under Rust, for example, actually requires us to modify the <code>perf record</code> command slightly.
Let’s look at how it works:</p>
<pre tabindex="0"><code>$ perf record [-g / --call-graph=dwarf] &lt;COMMAND&gt;
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,135 MB perf.data (16 samples) ]
</code></pre><p>Usually, the <code>-g</code> flag will suffice to create the graph but to keep the frame pointer intact for languages like Rust, the second flag is required instead.
To generate the call graph from the <code>perf.data</code> file, <code>perf report [-g / -G]</code> is used.
Depending on whether we want the function hierarchy to go from callee to caller or vice versa, either <code>-g</code> or <code>-G</code> is required.
The call graph, going from caller to callee, looks like this:</p>
<center>
<img src="perf-example.png" width="75%" height="75%">
</center>
<p>We can now see relative timings for every function call made, see which functions call other functions and get a more general idea of where a bottleneck might be.
Plus, <code>perf</code> also allows us to look at the assembly of a chosen function by pressing the <kbd>A</kbd> key, with each instruction annotated with its execution frequency.
Had we not done the previous steps of preventing certain optimizations and keeping debug symbols, this graph would be full of mangled function names and call hierarchies going deep into system calls.</p>
<p><code>perf</code> provides many other options like setting tracepoints and even doing kernel microbenchmarks, which we aren’t going to look at in this post, though it is certainly worth looking at what else it has to offer.</p>
<div class="gblog-post__anchorwrap">
<h3 id="the-underlying-kernel-interface-perf_events">
The underlying kernel interface: <code>perf_events</code>
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#the-underlying-kernel-interface-perf_events" class="gblog-post__anchor clip flex align-center" aria-label="Anchor The underlying kernel interface: perf_events" href="#the-underlying-kernel-interface-perf_events">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Since the profiler tool is part of the Linux kernel, it has mostly direct access to any events the kernel picks up on.
This is done via something called <code>perf_events</code> – an interface exported by the Linux kernel.
It can measure events from different sources depending on the subcommand that was run.</p>
<p><strong>Software events</strong> are pure kernel counters, utilized in part for <code>perf stat</code>.
They include such things as context-switches, page faults, etc.</p>
<p><strong>Hardware events</strong> are events stemming from the processor itself and its PMU (Performance Monitoring Unit).
The PMU provides a list of micro-architectural events like CPU cycles, cache misses and others.</p>
<p><strong>Tracepoint events</strong>, implemented via the <code>ftrace</code> kernel infrastructure, provide a way to interface with certain syscalls when tracing is required.</p>
<p>For a full list of possible events, see the <a
class="gblog-markdown__link"
href="https://perf.wiki.kernel.org/index.php/Tutorial"
>perf wiki</a>.
The statistical profile we generated earlier came together through <code>perf</code> keeping a running count of these supported events during execution.
<code>perf_events</code> uses, as the name suggests, <em>event-based sampling</em>.
This means that every time a certain event happens, the sampling counter is increased.
Which event is chosen depends on how we intend to use <code>perf</code>.
The <code>record</code> subcommand, for example, uses something called the <code>cycles</code> event as its sampling event.
The kernel maps this event to a hardware event on the PMU which depends on the manufacturer of the processor.</p>
<p>Once the sampling counter overflows, a sample is recorded.
The instruction pointer then stores where the program was interrupted.
Unfortunately, the instruction pointer may not point at where the overflow happened but rather at where the PMU was interrupted, making it possible that the wrong instructions get counted.
This is why one always needs to be cautious when looking at graphs such as generated assembly annotated with the frequency of its execution as it might just be one or two instructions off.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>“A division or contrast between two things that are or are represented as being opposed or entirely different.” <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://queue.acm.org/detail.cfm?id=3212479"
>https://queue.acm.org/detail.cfm?id=3212479</a> <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Represented as clock ticks, not seconds – convert via division by <code>CLOCKS_PER_SEC</code> <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://man7.org/linux/man-pages/man2/gettimeofday.2.html#NOTES"
>https://man7.org/linux/man-pages/man2/gettimeofday.2.html#NOTES</a> <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>For more details on that, see this great talk: <a
class="gblog-markdown__link"
href="https://www.youtube.com/watch?v=nXaxk27zwlk"
>https://www.youtube.com/watch?v=nXaxk27zwlk</a> <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>See <a
class="gblog-markdown__link"
href="http://pramodkumbhar.com/2017/04/summary-of-profiling-tools/"
>here</a> for a list of over a hundred different profilers and when to use which one <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><code>perf</code> is not available under Windows and also does not work with WSL (Windows Subsystem for Linux) <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Lossless data compressionhttps://blog.parcio.de/posts/2022/05/lossless-data-compression/Yolanda Thiel2022-05-24T00:00:00+00:002022-05-24T00:00:00+00:00
<p>This post is an introduction to lossless data compression in which we will explore the approaches of entropy-based/statistical as well as dictionary-based compression and explain some of the most common algorithms.</p>
<div class="gblog-post__anchorwrap">
<h2 id="but-first-of-all-why-do-we-even-need-data-compression">
But first of all, why do we even need data compression?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#but-first-of-all-why-do-we-even-need-data-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor But first of all, why do we even need data compression?" href="#but-first-of-all-why-do-we-even-need-data-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Audio and video communication as well as large multimedia platforms as we know them today are only possible because of data compression.
Usually, every single photo or video posted on social media or video/streaming platforms has to be compressed.
Otherwise, the size of the data would be too large to deal with effectively.
Of course, in scientific research, there are also fields of application where we generate or measure large amounts of data.
To store all this data, we need data compression.</p>
<div class="gblog-post__anchorwrap">
<h2 id="basics">
Basics
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#basics" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Basics" href="#basics">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>A data compression technique usually contains two algorithms:</p>
<ol>
<li>A compression algorithm, which takes the original input A and generates a representation A' of it which (ideally) requires fewer bits than A.</li>
<li>A reconstruction/decoding algorithm, which operates on the compressed representation A' and generates the reconstruction B.</li>
</ol>
<p>If B is identical to A, the compression is called lossless.
If B differs from A, the compression is called lossy.
To compare different compression algorithms, we can use the data compression ratio, which is calculated by dividing the uncompressed size of the data by its compressed size.</p>
<link
rel="stylesheet"
href="/katex-e4de31b5.min.css"
/>
<script defer src="/js/katex-3c86c25a.bundle.min.js"></script>
<span class="gblog-katex ">
\(\text{data compression ratio} = \frac{\text{uncompressed size of data}}{\text{compressed size of data}} = \frac{\text{size of A}}{\text{size of A'}}\)</span>
<p>Of course, this is only one of several useful metrics, and the performance of compression algorithms is highly dependent on the input data.
But if there are several algorithms that are suitable for the data which is to be compressed, comparing the compression ratio could be sensible.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="a-first-compression-algorithm-run-length-encoding-rle">
A first compression algorithm: Run-Length Encoding (RLE)
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#a-first-compression-algorithm-run-length-encoding-rle" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A first compression algorithm: Run-Length Encoding (RLE)" href="#a-first-compression-algorithm-run-length-encoding-rle">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Let us have a look at this rather easy compression algorithm, called <strong>run-length encoding</strong>:
It stores <strong>runs</strong> of data as single data value and count.
A <strong>run</strong> is a sequence in which the same data value occurs in consecutive data elements.</p>
<p>Let’s consider a line of 10 pixels, where the pixels can either be white or black.
If W stands for a white pixel and B for a black pixel, we could have data which looks like this: <code>BBBBBWWWWW</code>.
A run-length encoding algorithm could compress this input as follows: <code>5B5W</code>, because there are 5 black pixels followed by 5 white pixels.
So instead of saving 10 characters, the output of RLE would only need 4 characters.</p>
<p><em>Of course, we are not restricted to chars; other data types can be used to store and compress data as well. We use chars in this example to make the concept easier to understand.</em></p>
<p>This approach works best if there are many longer runs in the data.
Therefore, the best case scenario of the input for our example would be <code>WWWWWWWWWW</code> or <code>BBBBBBBBBB</code>, because this input can be compressed as <code>10W</code> or <code>10B</code>, which is the shortest possible output for this example.
In this case we would have a compression ratio of
<span class="gblog-katex ">
\(\frac{10}{3} = 3.\overline{3}\)</span>(also sometimes displayed as 10:3).</p>
<p>But if there aren’t many runs in the file to be compressed, the output of the algorithm might be larger than the input.
In our example, the worst case would be one of these inputs: <code>WBWBWBWBWB</code> or <code>BWBWBWBWBW</code>.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<hr>
<details style="cursor: pointer;">
<summary>What would be the compression rate in this worst case? <i>Click to show the answer.</i></summary>
<p>0.5, because the uncompressed size is 10 chars and the 'compressed' size is 20 chars.</p>
</details>
<hr>
<div class="gblog-post__anchorwrap">
<h2 id="entropy-basedstatistical-compression">
Entropy-based/statistical compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#entropy-basedstatistical-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Entropy-based/statistical compression" href="#entropy-basedstatistical-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The next approaches we want to get to know are called entropy-based because they make use of the entropy of the given data.
The entropy of data depends on the probabilities of certain symbols to occur in the given data.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
While Run-Length Encoding assigns a fixed-size code to the symbols it operates on, entropy-based approaches have variable-sized codes.
Entropy-based approaches work by replacing unique symbols within the input data with a unique and shorter prefix code.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
To ensure a good compression ratio, the prefix code assigned to a symbol should be shorter the more often that symbol occurs.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>
Examples of this approach are arithmetic coding, Shannon-Fano coding and Huffman coding.</p>
<div class="gblog-post__anchorwrap">
<h3 id="huffman-coding">
Huffman coding
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#huffman-coding" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Huffman coding" href="#huffman-coding">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>We will explore some of the previously mentioned properties of entropy-based compression algorithms using the example of Huffman coding.</p>
<div align="center">
<figure>
<img src="huffman-coding-toennies.png" alt="Huffman Coding example Tönnies">
<figcaption>This example is taken from page 136 of Tönnies' book "Grundlagen der Bildverarbeitung".<sup>6</sup></figcaption>
</figure>
</div>
<p>In this example we want to compress the image shown on the top.
To do so, we first create a normalized histogram of all values.
Both in the image itself and in its histogram, we can see that the darkest possible greyscale value occurs quite often while the lighter greyscale values have lower frequencies.
The algorithm now merges the symbols according to their frequency until there are only 2 symbols left.
So in the case of the example, the two least frequent greyscale values are merged in every step.
Then the original symbols are given new prefix codes.
Symbols which were previously merged are broken down into segments, and for every segment the code is extended.
Therefore, the most frequent symbol gets the shortest prefix code and the least frequent symbol gets the longest one.
The prefix code assignment is represented in a binary tree, which is also traversed to decode the information.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup>
This approach produces the best code when the probabilities of symbols are negative powers of 2.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="dictionary-based-compression">
Dictionary-based compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#dictionary-based-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Dictionary-based compression" href="#dictionary-based-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Dictionary-based approaches are the last group of lossless data compression algorithms we will cover in this article.
Unlike entropy-based approaches, dictionary-based ones do <strong>not</strong> use a statistical model or a variable-sized code.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>Dictionary-based algorithms partition the data into phrases which are non-overlapping subsets of the original data.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Each phrase is then encoded as a token using a dictionary.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
Accordingly, there are two stages in a dictionary-based compression algorithm:</p>
<ol>
<li>The dictionary construction stage: In this stage the algorithm finds phrases and codewords.</li>
<li>The parsing stage: In this stage the phrases are replaced by codewords.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></li>
</ol>
<p>There are <strong>static dictionary codes</strong> and <strong>dynamic/adaptive dictionary codes</strong>.
<strong>Static</strong> dictionaries are created before input processing and stay the same for the complete run, while <strong>dynamic</strong> dictionaries are updated during parsing, which means that in this case the two stages (dictionary construction and parsing) are interleaved.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
After these rather theoretical basics about dictionary-based compression, we will now dive into different LZ-family algorithms to explain these things on a few examples.</p>
<div class="gblog-post__anchorwrap">
<h3 id="lz-family">
LZ family
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz-family" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ family" href="#lz-family">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>These algorithms are named after their creators Abraham Lempel and Jacob Ziv and are some of the most known dictionary compression methods.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
The algorithms we will go into detail about are LZ77 and LZ78, which are the two original algorithms developed by Lempel and Ziv, as well as a few of their variants.</p>
<div class="gblog-post__anchorwrap">
<h3 id="lz77-and-its-variants">
LZ77 and its variants
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz77-and-its-variants" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ77 and its variants" href="#lz77-and-its-variants">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>LZ77 assumes and exploits that data is most likely to be repeated.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
The principle is to use a part of the previously-seen input stream as the dictionary.
Thus the input is analyzed through a sliding window:</p>
<div align="center">
<figure>
<img src="lz77-sliding-window-salomon.png" alt="LZ77 Sliding Window Salomon">
<figcaption>This example is taken from page 176 of Salomon's book "Data Compression: The Complete Reference".<sup>3</sup></figcaption>
</figure>
</div>
<p>As seen in the figure above, the window is divided in two parts:</p>
<ul>
<li>The search buffer: The current dictionary which includes symbols that have previously been input and encoded.</li>
<li>The look-ahead buffer which contains data yet to be encoded.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> When a word is repeated, it can be replaced by a pointer to the last occurrence accompanied by the number of matched characters.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup></li>
</ul>
<p>I will explain this further using the example shown in the image above:</p>
<ul>
<li>The encoder scans the search buffer from <strong>right to left</strong>.</li>
<li>It looks for a match in the dictionary (search buffer) for the first symbol <strong>e</strong> which is in the front of the look-ahead buffer.</li>
<li>It finds an <strong>e</strong> in “<strong>easily</strong>” at a distance of <strong>8</strong> from the end of the search buffer (you have to count from right to left, distance of 1 would be the symbol left of the currently selected symbol).</li>
<li>The encoder then matches as many symbols as possible starting at those 2 e’s; in this case the match consists of the 3 symbols “<strong>eas</strong>”.</li>
<li>The length of the match is therefore <strong>3</strong>.</li>
<li>The encoder then continues its backward scan to find a longer match.</li>
<li>In this case there is no longer match, but a same length match in “<strong>eastman</strong>”.</li>
</ul>
<p>Generally, the encoder selects the longest match – or, if several matches have the same length, the last one found – and prepares the token.
Why does it use the last one found?
The answer is quite simple: the algorithm then doesn’t have to keep track of all matches found and can save memory.
In practical implementations the search buffer is some thousands of bytes long whereas the look-ahead buffer is some tens of bytes long.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p><strong>Here you can see what the first 5 steps and tokens look like for the example in the image above:</strong><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<table>
<thead>
<tr>
<th style="text-align:right">Search Buffer</th>
<th style="text-align:left">Look-Ahead Buffer</th>
<th style="text-align:center">Token</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right"></td>
<td style="text-align:left">sir_sid_eastman_</td>
<td style="text-align:center">(0,0,“s”)</td>
</tr>
<tr>
<td style="text-align:right">s</td>
<td style="text-align:left">ir_sid_eastman_e</td>
<td style="text-align:center">(0,0,“i”)</td>
</tr>
<tr>
<td style="text-align:right">si</td>
<td style="text-align:left">r_sid_eastman_ea</td>
<td style="text-align:center">(0,0,“r”)</td>
</tr>
<tr>
<td style="text-align:right">sir</td>
<td style="text-align:left">_sid_eastman_eas</td>
<td style="text-align:center">(0,0,"_")</td>
</tr>
<tr>
<td style="text-align:right">sir_</td>
<td style="text-align:left">sid_eastman_easi</td>
<td style="text-align:center">(4,2,“d”)</td>
</tr>
</tbody>
</table>
<p>The token always consists of 3 elements:</p>
<ul>
<li>The first element is the distance of the found match. If there is no match, this element is 0.</li>
<li>The second element is the length of the found match. If there is no match, this is again 0.</li>
<li>The third and last element is the new symbol which is to be appended.</li>
</ul>
<p>This approach is suffix-complete, meaning that any suffix of a phrase is a phrase itself. So if the phrase “cold” is in the dictionary, so are “old”, “ld” and “d”.
The performance of this algorithm is limited by the number of comparisons needed for finding a matching pattern.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
While the encoder is a bit more complicated, the decoder is rather simple, meaning that LZ77 and its variants are useful in cases where data has to be compressed once but decompressed very often.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<hr>
<p>Let’s try to decode some data compressed by LZ77:
The 3 tokens (from left <em>[first]</em> to right <em>[last]</em>) are: (0,0,“y”), (0,0,“a”) and
(2,1,"!").</p>
<details style="cursor: pointer;">
<summary><i>Click to show a tip.</i></summary>
<p>You have to "fill up" a buffer from right to left, using one token at a time and "pushing" the entries one space to the left in every step.</p>
</details>
<details style="cursor: pointer;">
<summary>Did you find out the decoded text? <i>Click to show the solution.</i></summary>
<p>yay!</p>
</details>
<hr>
<p>Let’s have a short look at a few <strong>LZ77 variants</strong>:</p>
<p>(Please note that this part is just a small overview and does not fully explain
how these variants work since that would go beyond the scope of this article.)</p>
<div class="gblog-post__anchorwrap">
<h4 id="lzss">
LZSS
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzss" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZSS" href="#lzss">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This derivative algorithm was developed by Storer and Szymanski.
It improves on LZ77 by holding the look-ahead buffer in a circular queue and the search buffer in a binary search tree.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
In addition, its tokens have only 2 fields instead of 3.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h4 id="deflate">
DEFLATE
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#deflate" class="gblog-post__anchor clip flex align-center" aria-label="Anchor DEFLATE" href="#deflate">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This algorithm – developed by Phil Katz – was originally used in the Zip and Gzip software and has since been adopted by many applications and formats, including HTTP compression and PNG.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
It is based on LZSS but uses a chained hash table to find duplicates.
The matched lengths and distances are further compressed with two Huffman trees.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
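<p>DEFLATE is easy to experiment with, since Python's standard-library <code>zlib</code> module implements it; the following quick demonstration (not part of the original article) round-trips a repetitive string.</p>

```python
import zlib

# DEFLATE in practice: zlib wraps the DEFLATE algorithm described above.
data = b"sir_sid_eastman_easily_teases_sea_sick_seals" * 100

compressed = zlib.compress(data, 9)   # level 9 = best compression
restored = zlib.decompress(compressed)

assert restored == data               # lossless round trip
assert len(compressed) < len(data)    # repetitive input shrinks considerably
```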
<div class="gblog-post__anchorwrap">
<h4 id="lzma">
LZMA
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzma" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZMA" href="#lzma">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>The last LZ77 variant of this short overview is the Lempel–Ziv–Markov chain algorithm, which is the default compression algorithm of 7-Zip.
Its principle is similar to that of DEFLATE, but instead of Huffman coding it uses range encoding, an integer-based variant of arithmetic coding (an entropy-based compression method).
This complicates the encoder but also results in better compression.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
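<p>LZMA can likewise be tried directly from Python via the standard-library <code>lzma</code> module (which by default produces the .xz container format):</p>

```python
import lzma

# LZMA in practice: Python's lzma module produces .xz output by default.
data = b"sir_sid_eastman_easily_teases_sea_sick_seals" * 100

compressed = lzma.compress(data)
assert lzma.decompress(compressed) == data  # lossless round trip
assert len(compressed) < len(data)          # repetitive input compresses well
```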
<div class="gblog-post__anchorwrap">
<h3 id="lz78-and-its-variants">
LZ78 and its variants
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz78-and-its-variants" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ78 and its variants" href="#lz78-and-its-variants">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>LZ78 constructs its dictionary differently than LZ77 and therefore uses no search buffer, look-ahead buffer or sliding window.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Compared to LZ77’s three-field tokens, the LZ78 encoder outputs two-field tokens, each consisting of a pointer into the dictionary and the code of a symbol.
Since the length of a phrase is implied by its dictionary entry, it does not need to be part of the token.</p>
<p>Each token corresponds to a phrase of input symbols.
That phrase is added to the dictionary after the token is written on the compressed stream.
The size of LZ78’s dictionary is only limited by the amount of available memory, because unlike in LZ77 nothing is ever deleted from the dictionary in LZ78.
On the one hand, this can be an advantage, since future phrases can be compressed by dictionary phrases which occurred a lot earlier.
On the other hand, this can also be a disadvantage because the dictionary tends to grow fast and can fill up the entire available memory.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The LZ78 algorithm begins with a single entry in its dictionary: the null string at position zero.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
It then concatenates the first symbol of the following input after every parsing step.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>Let’s try to further understand the algorithm by going through an example:
we want to compress the input <code>a_b_a</code>.
The initial state can be displayed like this:</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
</table>
<p>As previously mentioned, the algorithm starts with the null string as the dictionary’s entry at position 0.</p>
<p>First, the dictionary is searched for “a”.
Since “a” is not found, the algorithm adds “a” to the dictionary at position 1 and outputs the token (0, “a”), because “a” is the concatenation of the null string and the symbol “a”.</p>
<p>a<code>_b_a</code></p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
</table>
<p>Now the dictionary is searched for “_”. Since this symbol is also not yet part of the dictionary, it is added analogously at position 2 and the output token is (0, “_”).
This then happens again for “b” at position 3.</p>
<p>a_b<code>_a</code></p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>2</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>3</td>
<td>"b"</td>
<td>(0, "b")</td>
</tr>
</table>
<p>Now that the first 3 symbols of our input <code>a_b_a</code> are in the dictionary, the encoder finds a dictionary entry for the next symbol “_”, but not for “_a”. It therefore adds “_a” to the dictionary at position 4 and outputs the token (2, “a”), because 2 is the position of “_”.</p>
<p>a_b_a</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>2</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>3</td>
<td>"b"</td>
<td>(0, "b")</td>
</tr>
<tr>
<td>4</td>
<td>"_a"</td>
<td>(2, "a")</td>
</tr>
</table>
<p>Here is another, longer example taken from Salomon’s “Data Compression: The Complete Reference”<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>:
it shows the first 14 steps of compressing the string <code>sir_sid_eastman_easily_teases_sea_sick_seals</code>.</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"s"</td>
<td>(0, "s")</td>
</tr>
<tr>
<td>2</td>
<td>"i"</td>
<td>(0, "i")</td>
</tr>
<tr>
<td>3</td>
<td>"r"</td>
<td>(0, "r")</td>
</tr>
<tr>
<td>4</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>5</td>
<td>"si"</td>
<td>(1, "i")</td>
</tr>
<tr>
<td>6</td>
<td>"d"</td>
<td>(0, "d")</td>
</tr>
<tr>
<td>7</td>
<td>"_e"</td>
<td>(4, "e")</td>
</tr>
<tr>
<td>8</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>9</td>
<td>"st"</td>
<td>(1, "t")</td>
</tr>
<tr>
<td>10</td>
<td>"m"</td>
<td>(0, "m")</td>
</tr>
<tr>
<td>11</td>
<td>"an"</td>
<td>(8, "n")</td>
</tr>
<tr>
<td>12</td>
<td>"_ea"</td>
<td>(7, "a")</td>
</tr>
<tr>
<td>13</td>
<td>"sil"</td>
<td>(5, "l")</td>
</tr>
<tr>
<td>14</td>
<td>"y"</td>
<td>(0, "y")</td>
</tr>
</table>
<p>So let us quickly go through the procedure of LZ78 again:
Generally, the current symbol is read and becomes a one-symbol phrase.
Then the encoder tries to find it in the dictionary.
If the symbol is found in the dictionary, the next symbol is read and concatenated to the first, and this two-symbol phrase is searched for in the dictionary.
As long as those phrases are found, the process repeats.
At some point the phrase is not found in the dictionary; it is then added to it, and the output is a token consisting of the last dictionary match and the last symbol of the phrase that could not be found.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
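<p>The procedure just described fits in a few lines of Python. The sketch below is illustrative (a trailing, already-known phrase is flushed as a token with an empty symbol, a detail not covered above) but reproduces the tokens from both examples:</p>

```python
def lz78_encode(text):
    dictionary = {}   # phrase -> position; position 0 is the null string
    tokens = []
    phrase = ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol          # keep extending the current match
        else:
            # output (position of the longest match, mismatching symbol) ...
            tokens.append((dictionary.get(phrase, 0), symbol))
            # ... and add the new phrase to the dictionary
            dictionary[phrase + symbol] = len(dictionary) + 1
            phrase = ""
    if phrase:  # input ended inside a known phrase: flush with an empty symbol
        tokens.append((dictionary[phrase], ""))
    return tokens

def lz78_decode(tokens):
    phrases = [""]                    # position 0 holds the null string
    out = []
    for position, symbol in tokens:
        phrase = phrases[position] + symbol
        phrases.append(phrase)        # rebuild the dictionary on the fly
        out.append(phrase)
    return "".join(out)
```

<p>Running <code>lz78_encode("a_b_a")</code> yields [(0, “a”), (0, “_”), (0, “b”), (2, “a”)], matching the tables above.</p>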
<p>This approach is called greedy parsing because the longest phrase with a prefix match is replaced by a codeword.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Therefore, LZ78 is prefix-complete, meaning any prefix of a phrase is a phrase itself.
So if “hello” is part of the dictionary, so are “hell”, “hel”, “he” and “h”.</p>
<p>Since we’ve now gone through the base algorithm, let us have a look at variants of LZ78.</p>
<div class="gblog-post__anchorwrap">
<h4 id="lzw">
LZW
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzw" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZW" href="#lzw">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This variant was developed by Terry Welch.
Its main feature is that it eliminates the second field of a token.
The LZW token consists only of a pointer to the dictionary.
This is possible because the dictionary is initialized with all the symbols in the alphabet.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
The GIF encoding algorithm is based on LZW.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
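<p>A minimal LZW encoder/decoder pair could look as follows in Python (an illustrative sketch; the dictionary is pre-filled with the alphabet, so the output consists of bare indices). The decoder needs one special case: a code may refer to the phrase being built in that very step, which then equals the previous phrase plus its own first symbol.</p>

```python
def lzw_encode(text, alphabet):
    # the dictionary starts with every single symbol, so no symbol field
    # is needed in the output: tokens are bare dictionary indices
    dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
    codes = []
    phrase = text[0]
    for symbol in text[1:]:
        if phrase + symbol in dictionary:
            phrase += symbol
        else:
            codes.append(dictionary[phrase])
            dictionary[phrase + symbol] = len(dictionary)
            phrase = symbol  # unlike LZ78, restart from the mismatching symbol
    codes.append(dictionary[phrase])
    return codes

def lzw_decode(codes, alphabet):
    phrases = list(alphabet)
    previous = phrases[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(phrases):
            current = phrases[code]
        else:                # code refers to the phrase built in this step
            current = previous + previous[0]
        phrases.append(previous + current[0])
        out.append(current)
        previous = current
    return "".join(out)
```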
<div class="gblog-post__anchorwrap">
<h4 id="lzmw">
LZMW
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzmw" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZMW" href="#lzmw">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>The second variant we will briefly mention is LZMW, which was developed by V. Miller and M. Wegman.
It is based on two principles:</p>
<ul>
<li>When the dictionary is full, the least-recently-used dictionary phrase is deleted.</li>
<li>Each phrase which is added to the dictionary is a concatenation of two phrases. This means that a dictionary phrase can grow by more than one symbol at a time (unlike in the base LZ78 algorithm).</li>
</ul>
<p>A drawback of this approach is that it complicates the choice of data structure for the dictionary: the principles of LZMW lead to a dictionary that is not prefix-complete, and a phrase may be added twice, because the least-recently-used phrase is deleted once the dictionary is full.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="limitations-of-lossless-data-compression">
Limitations of lossless data compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#limitations-of-lossless-data-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Limitations of lossless data compression" href="#limitations-of-lossless-data-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>After all these different algorithms, there is one last topic to cover:
the limitations of lossless data compression.
There are, of course, many more lossless compression algorithms, some of which are highly specialised for a specific area like image or audio compression.
These algorithms can perform badly when used outside of their designated area.
This brings us to the question of whether a perfect compression algorithm could exist.
Perfect in this case means that the compressed file will <em><strong>always</strong></em> be smaller than the original file.</p>
<p>We can find this out by using a counting argument, the pigeonhole principle.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup>
The pigeonhole principle states that if <strong>n items</strong> are put into <strong>m containers</strong> while <strong>n is greater than m</strong>, then <strong>at least one container</strong> must contain <strong>more than one item</strong> (n and m are natural numbers).<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup>
So if we consider that there are 10 pigeons but only 9 holes, at least one hole must contain more than one pigeon.</p>
<div align="center">
<figure style="max-width: 50%;">
<img src="TooManyPigeons.jpg" alt="Pigeon_Hole_Principle_Image">
<figcaption>
<a href="https://commons.wikimedia.org/wiki/File:TooManyPigeons.jpg">
Pigeons-in-holes.jpg by en:User:BenFrantzDale; this image by en:User:McKay</a>,
<a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a>,
via Wikimedia Commons
</figcaption>
</figure>
</div>
<p>Let’s go through this proof from a Stanford University lecture<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup>:
We already know that in lossless data compression we have a compression function C and a decompression function D.
To ensure that we can uniquely encode and decode a bitstring, these functions must be inverses of each other:
<span class="gblog-katex ">
\(D(C(x)) = x\)</span>.
This means that C must be injective.</p>
<div align= "center">
<figure style="max-width: 200px; text-align: center;">
<img src="injection.png" alt="Injective Function" style="mix-blend-mode: difference;">
<figcaption>An injective function is a function where distinct inputs map to distinct outputs.</figcaption>
</figure>
</div>
<p>Ideally, the compressed version of a bitstring would always be shorter than the input bitstring.</p>
<p>Let
<span class="gblog-katex ">
\(B^n\)</span> be the set of bitstrings of length n and
<span class="gblog-katex ">
\(B^{<n}\)</span> be the set of bitstrings of length less than n.
There are
<span class="gblog-katex ">
\(2^n\)</span> bitstrings of length n and there are
<span class="gblog-katex ">
\(2^0 + 2^1 + ... + 2^{n-1} = 2^n - 1\)</span> bitstrings of length less than n.
Since
<span class="gblog-katex ">
\(B^{<n}\)</span> has fewer elements than
<span class="gblog-katex ">
\(B^n\)</span>, there cannot be an injection from
<span class="gblog-katex ">
\(B^n\)</span> to
<span class="gblog-katex ">
\(B^{<n}\)</span>.</p>
<p>And because a perfect compression function would have to be an injection from
<span class="gblog-katex ">
\(B^n\)</span> to
<span class="gblog-katex ">
\(B^{<n}\)</span>, there is no perfect compression function: to keep every output unique, any lossless compression function must produce a larger output file for certain input data.
Otherwise, two different inputs would share the same output, and the compression would be lossy.<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup></p>
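<p>The counting argument is easy to check mechanically; the snippet below (a quick sanity check, not from the lecture) computes both set sizes and confirms them by enumeration for a small n.</p>

```python
from itertools import product

# |B^n| versus |B^{<n}| for n = 10
n = 10
longer = 2 ** n                           # bitstrings of exactly length n
shorter = sum(2 ** k for k in range(n))   # 2^0 + 2^1 + ... + 2^(n-1)

assert longer == 1024
assert shorter == longer - 1  # one element short: no injection can exist

# brute-force confirmation for a tiny n
m = 4
b_m = {"".join(bits) for bits in product("01", repeat=m)}
b_lt_m = {"".join(bits) for k in range(m) for bits in product("01", repeat=k)}
assert len(b_m) == 2 ** m
assert len(b_lt_m) == 2 ** m - 1
```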
<p>This means that for every lossless data compression algorithm there is input data which cannot be compressed.
Therefore, a check whether the compressed file is in fact smaller than the input file is necessary.
Furthermore, it is always useful to know what kind of data is to be compressed, so that the algorithm can be chosen based on this information.</p>
<div class="gblog-post__anchorwrap">
<h2 id="further-reading">
Further reading
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#further-reading" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Further reading" href="#further-reading">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>I hope this article provided a good overview of some of the most common algorithms in lossless data compression and maybe even sparked your interest in data compression.
This article did not nearly exhaust the topics covered in its sources.
In particular, “Data Compression: The Complete Reference”<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> is a book which explains a lot of different data compression algorithms quite thoroughly.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Sayood, K. (2006). Introduction to Data Compression. Third Edition. p. 1-5. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Run-length encoding. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Run-length_encoding"
>https://en.wikipedia.org/wiki/Run-length_encoding</a>. Accessed on: 2021-11-02. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Salomon, D. (2007) Data Compression: The Complete Reference. Fourth Edition. p. 47-51, 74, 174-179, 189-190, 199, 209-210, 230, 242-243. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Duwe, K., Lüttgau, J., Mania, G., Squar, J., Fuchs, A., Kuhn, M., Betke, E., & Ludwig, T. (2020). State of the Art and Future Trends in Data Reduction for High-Performance Computing. Supercomputing Frontiers and Innovations, 7(1), p. 4–36. <a
class="gblog-markdown__link"
href="https://doi.org/10.14529/jsfi200101"
>https://doi.org/10.14529/jsfi200101</a> <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Lu, Z.M., Guo, S.Z.: Chapter 1 - Introduction. In: Lu, Z.M., Guo, S.Z. (eds.) Lossless Information Hiding in Images, pp. 1–68. Syngress (2017), DOI: 10.1016/B978-0-12-812006-4.00001-2 <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Tönnies, K. (2005). Grundlagen der Bildverarbeitung. Chapter 6 - Bildkompression. p. 136. <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Sahinalp, S.C., Rajpoot, N.M.: Chapter 6 - Dictionary-Based Data Compression: An Algorithmic Perspective. In: Sayood, K. (ed.) Lossless Compression Handbook, pp. 153–167. Communications, Networking and Multimedia, Academic Press, San Diego (2003),DOI: 10.1016/B978-012620861-0/50007-3 <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Shanmugasundaram, S. , Lourdusamy, R. (2011). A Comparative Study Of Text Compression Algorithms. ICTACT Journal on Communication Technology 1(3), p. 68–76. <a href="#fnref:8" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>Lossless Compression. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Lossless_compression"
>https://en.wikipedia.org/wiki/Lossless_compression</a>. Accessed on 2021-11-04. <a href="#fnref:9" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>Pigeonhole Principle. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Pigeonhole_principle"
>https://en.wikipedia.org/wiki/Pigeonhole_principle</a>. Accessed on 2021-11-04. <a href="#fnref:10" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>The Pigeonhole Principle. <a
class="gblog-markdown__link"
href="https://web.stanford.edu/class/archive/cs/cs103/cs103.1132/lectures/08/Small08.pdf"
>https://web.stanford.edu/class/archive/cs/cs103/cs103.1132/lectures/08/Small08.pdf</a>. Accessed on: 2021-11-04 <a href="#fnref:11" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Libfabric: A generalized way for fabric communicationhttps://blog.parcio.de/posts/2022/04/libfabric/Julian Benda2022-04-25T00:00:00+00:002022-04-25T00:00:00+00:00
<p>In this post, we will look at the challenges of efficient communication between processes and how Libfabric abstracts them.
We will see how OFI (Open Fabrics Interfaces) enables fast and generalized communication.</p>
<style>
@media(prefers-color-scheme: dark) {
html.color-toggle-auto .light-only {
display: none;
}
}
@media(prefers-color-scheme: light) {
html.color-toggle-auto .dark-only {
display: none;
}
}
html.color-toggle-dark .light-only {
display: none;
}
html.color-toggle-light .dark-only {
display: none;
}
</style>
<div class="gblog-post__anchorwrap">
<h2 id="what-is-a-fabric-and-how-to-communicate-in-it">
What is a fabric and how to communicate in it?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#what-is-a-fabric-and-how-to-communicate-in-it" class="gblog-post__anchor clip flex align-center" aria-label="Anchor What is a fabric and how to communicate in it?" href="#what-is-a-fabric-and-how-to-communicate-in-it">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>A fabric is nothing more or less than several, more or less uniform, nodes connected via links or, in other words, the typical HPC or cloud computing landscape.</p>
<p>Nodes can be linked via different physical media (e.g., copper or optical fiber) and various communication protocols.
While the physical medium is hidden behind the network cards, the communication protocol is something we still need to manage in user space, because different protocols require different interactions with the network.</p>
<p>A unified interface for typical message-based data transfers would be nice, though not necessarily a game changer.
For RDMA, however, the picture is different.</p>
<div class="gblog-post__anchorwrap">
<h2 id="rdma">
RDMA
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#rdma" class="gblog-post__anchor clip flex align-center" aria-label="Anchor RDMA" href="#rdma">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Remote direct memory access (RDMA) sounds counterintuitive at first: how would you access remote memory directly?
Directly in this context means without involving the operating system and CPU.
Instead, the data transfer is entirely managed by the NIC.
Therefore, we only need to signal we want to read data X from source Y to the memory segment Z, and the NIC does the rest.</p>
<p>In contrast, with normal kernel-mode networking, the buffer is copied multiple times and runs through various layers of code (e.g., the socket layer, the TCP protocol implementation, and the driver).
This causes load on the CPU and bus, while RDMA, thanks to the kernel bypass to the NIC, can offload a huge part of the network stack.</p>
<p>This opens many questions, to name a few:</p>
<ul>
<li>When is the memory transfer finished?</li>
<li>How to avoid inconsistency due to invalidated caches?</li>
<li>Is RDMA even possible with this NIC?</li>
<li>How to queue RDMA requests?</li>
</ul>
<p>The answers to these questions depend strongly on the implementation and the network protocol.
Therefore, a unified solution is quite welcome if you want the flexibility to change your link type.</p>
<p>A short reminder: RDMA still uses the same network as typical network messages, so bandwidth and latency will not change much. However, it reduces the work done by the CPU, which leads to fewer interrupts and more processing time for your computation.</p>
<div class="gblog-post__anchorwrap">
<h2 id="libfabric-abstraction">
Libfabric abstraction
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#libfabric-abstraction" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Libfabric abstraction" href="#libfabric-abstraction">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Libfabric offers a unified interface for using different communication types over different communication protocols, while trying to minimize the overhead in each case.</p>
<p>The supported communication types are:</p>
<ul>
<li>Message Queue: Message-based FIFO queue</li>
<li>Tagged Message Queue: Similar to Message Queue but enables operations based on a 64-bit tag attached to each message</li>
<li>RMA (remote memory access): Abstraction of RDMA to enable it also on systems that are not RDMA-capable</li>
<li>Atomic: Allow atomic operations at the network level</li>
</ul>
<p><a
class="gblog-markdown__link"
href="https://github.com/parcio/julea"
>JULEA</a> is a flexible storage framework for clusters that allows offering arbitrary I/O interfaces to applications.
It runs completely in user space, which eases development and debugging.
Because it runs on a cluster, a lot of network communication must be handled.
Until now, it used TCP (via <code>GSocket</code>).
While TCP connections normally work everywhere, the cluster may provide better fabrics, which we were unable to use.
Now, with Libfabric, we can use a huge variety of other fabrics like InfiniBand.</p>
<p>For JULEA, Message Queue and RMA are the most interesting.
Message Queue fits the communication structure currently used in JULEA.
RMA enables processing many data transfers in parallel.
With RMA, we can, for example, process a message with multiple read access and tell the link that the data have no specific order.</p>
<p>To achieve this, Libfabric uses several abstracted modules, each of which takes optional arguments that can restrict it to a single protocol, or simply let Libfabric decide what is best.</p>
<p>Each module enables us to create the next one in the chain until we achieve the connection we want.
The modules of interest are:</p>
<ul>
<li>Fabric information: List of available networks, which can be filtered and is sorted by performance</li>
<li>Fabric: All resources needed to use a network</li>
<li>Domain: Represents a connection in a fabric (e.g., a port or a NIC)</li>
<li>Endpoint: Communication portal to a domain</li>
<li>Event queue: Reports asynchronous meta events for an endpoint, like connection established/shutdown</li>
<li>Completion queue/counter: High-performance queue reports completed data transfers or just a counter</li>
</ul>
<p>If we want, for example, to build a connection to a server (with a known address), we can use <code>fi_getinfo</code> to request all available fabrics which are capable of connecting to the server.</p>
<p>Then we pick the first of them (because it is likely the most performant) and construct a fabric.
After this, because we do not have special requirements (and have already defined our communication destination), we simply create a domain on that fabric, and then an endpoint with an event queue and completion counter attached.</p>
<p>With the endpoint, we issue a connection request that needs to be accepted by the server and is confirmed via a <code>FI_CONNECTED</code> event in the event queue.</p>
<p>Now each time the completion counter increases, we know something has happened; for simple communication, this is enough.
We can bind different counters or queues if we want to distinguish between incoming and outgoing completions.
Queues also enable us to keep track of an action based on a context we may freely choose (it is basically an ID).</p>
<p>If you want a more detailed explanation, the official introduction to the interface can be found <a
class="gblog-markdown__link"
href="https://ofiwg.github.io/libfabric/v1.13.2/man/fabric.7.html"
>here</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="conclusion-and-first-measurements">
Conclusion and first measurements
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#conclusion-and-first-measurements" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Conclusion and first measurements" href="#conclusion-and-first-measurements">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Libfabric allows using different fabrics with the same interface.
This way, you can write RDMA-compatible code, and Libfabric also makes it work on systems that do not support RDMA.</p>
<p><figure class="light-only"><img src="julea-gsocket-vs-libfabric-operations.png"
alt="Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket."/><figcaption>
<p>Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket.</p>
</figcaption>
</figure>
<figure class="light-only"><img src="julea-gsocket-vs-libfabric-throughput.png"
alt="Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket."/><figcaption>
<p>Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket.</p>
</figcaption>
</figure>
</p>
<p><figure class="dark-only"><img src="julea-gsocket-vs-libfabric-operations-dark.png"
alt="Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket."/><figcaption>
<p>Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket.</p>
</figcaption>
</figure>
<figure class="dark-only"><img src="julea-gsocket-vs-libfabric-throughput-dark.png"
alt="Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket."/><figcaption>
<p>Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket.</p>
</figcaption>
</figure>
</p>
<p>We already tested it in JULEA.
We rewrote the <code>GSocket</code> network code with Libfabric.
This resulted in working InfiniBand and RDMA support.
But even without RDMA, its performance is still similar to the <code>GSocket</code> implementation.</p>
<p>Therefore, Libfabric enables us to use the most efficient fabric available without having to modify the code.</p>
heimdallr: Compile time correctness checking for message passing in Rusthttps://blog.parcio.de/posts/2021/11/heimdallr/Michael Blesel2021-11-18T00:00:00+00:002021-11-18T00:00:00+00:00
<p>In this post we will look at how the Rust programming language and its built-in correctness features can be applied to the message passing parallelization method.
We will see how Rust’s memory safety features can be leveraged to design a message passing library which we call heimdallr.
It is able to detect parallelization errors at compile time that would go unnoticed by the compiler when using the prevalent message passing interface MPI.</p>
<p>For readers who are new to this topic we will start with a very brief synopsis of message passing.
In the field of high performance computing (HPC), parallel programs are executed on large computing clusters with often hundreds of computing nodes.
Running an application in parallel on more than one computing node requires different parallelization techniques than multi-threading because the computing nodes do not have shared memory.
Therefore, a mechanism for sharing data between processes running on different nodes is needed.
In HPC, the standard method of achieving this is called message passing.
The applications have to explicitly send and receive the data that needs to be shared over a network.
The most commonly used library for this is called MPI which stands for Message Passing Interface.</p>
<p>At the start of an MPI application every participating process is given an ID (often called rank) that can be used to differentiate between them in the code.
MPI then provides many different send and receive functions with varying semantics such as blocking/non-blocking and synchronous/asynchronous.
Additionally collective operations such as barriers for synchronization or broadcast/gather operations are provided.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="hl"><span class="lnt"> 1
</span></span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="hl"><span class="lnt"> 4
</span></span><span class="hl"><span class="lnt"> 5
</span></span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="hl"><span class="lnt">13
</span></span><span class="lnt">14
</span><span class="lnt">15
</span><span class="hl"><span class="lnt">16
</span></span><span class="lnt">17
</span><span class="lnt">18
</span><span class="hl"><span class="lnt">19
</span></span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="hl"><span class="n">MPI_Init</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="kt">int</span> <span class="n">rank</span><span class="p">,</span><span class="n">size</span><span class="p">;</span>
<span class="hl"><span class="n">MPI_Comm_rank</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">rank</span><span class="p">);</span>
</span><span class="hl"><span class="n">MPI_Comm_size</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">size</span><span class="p">);</span>
</span>
<span class="kt">double</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="o">*</span> <span class="n">BUF_SIZE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">BUF_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="hl"> <span class="n">MPI_Send</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_FLOAT</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">);</span>
</span><span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Recv</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_FLOAT</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
</span><span class="p">}</span>
<span class="hl"><span class="n">MPI_Finalize</span><span class="p">();</span>
</span></code></pre></td></tr></table>
</div>
</div><p>Here we can see a simple MPI program.
After MPI’s initialization in line 1, each process asks for its own <code>rank</code> and the total number of participating processes (here called <code>size</code>) in lines 4-5.
The goal of the program is to send a message containing the contents of the <code>buf</code> array from process 0 to process 1.
This message exchange happens in lines 13 and 16, where process 0 uses the <code>MPI_Send</code> function to send the message and process 1 receives it with the <code>MPI_Recv</code> function.</p>
<p>As we can see, the MPI functions take a lot of arguments but only the first four are important to follow this example.
First comes a pointer to the buffer that is being sent from and received into.
The next two arguments specify the number of elements that are sent and their data type, which is needed to calculate the correct number of bytes that will be sent.
Lastly, the target or source process rank for the operation is specified.
In this example, process 0 targets process 1 with its send operation, and process 1 tries to receive the data from process 0.</p>
<p>An avid reader might already have spotted a problem in the example code.
The data type of the <code>buf</code> array is <code>double</code>, but <code>MPI_FLOAT</code> is specified in the MPI function calls.
This is in fact a bug: since a <code>float</code> is only half the size of a <code>double</code> on typical platforms, only half of the array’s data is transmitted.</p>
<p>These kinds of parallelization errors can be hard to track down in real programs because no crash occurs; the results of the program are simply wrong.
Furthermore, neither the C compiler nor the MPI library is able to detect this error and warn the user.
Programming with MPI has many such pitfalls, which are often due to MPI’s low-level nature combined with the dangers of C memory management with <code>void</code> pointers.</p>
<div class="gblog-post__anchorwrap">
<h2 id="compile-time-correctness-through-rust">
Compile time correctness through Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#compile-time-correctness-through-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Compile time correctness through Rust" href="#compile-time-correctness-through-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Rust is a modern system programming language that focuses on memory and concurrency safety with strong compile time correctness checks.
In recent times, Rust has garnered more and more attention in circles where C is currently the predominant language but a safer solution is desired.
In the field of HPC, C/C++ and Fortran are by far the most widely used languages.
They provide great performance, have been around for a long time and there exists a lot of infrastructure in the form of libraries and tools for them.
However, these languages do come with their drawbacks which can often be found in aspects like usability, programmability and a general lack of modern features.</p>
<p>Developing massively parallel programs for HPC is a complicated task, and in our opinion the languages and libraries used should provide developers with as much help as possible.
Therefore we asked ourselves whether a language like Rust could provide an easier programming experience for message passing applications by avoiding and detecting as many errors in parallel code as possible at compile time.</p>
<p>Out of this research a Rust message passing library called <a
class="gblog-markdown__link"
href="https://github.com/parcio/heimdallr"
>heimdallr</a> was developed.
heimdallr should currently be seen as a prototype implementation, but it already demonstrates correctness checks that are nonexistent for MPI.</p>
<div class="gblog-post__anchorwrap">
<h2 id="eliminating-type-safety-errors-with-generics">
Eliminating type safety errors with generics
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#eliminating-type-safety-errors-with-generics" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Eliminating type safety errors with generics" href="#eliminating-type-safety-errors-with-generics">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>In the previous example, one might ask why it is necessary for the user to manually specify the concrete data type of a buffer when this is information that the compiler should absolutely be able to derive by itself.
The type safety problems with MPI stem from the fact that the whole API works on untyped memory addresses for data buffers, using C’s <code>void</code> pointers so that the MPI functions can accept any type of data.
The type information is therefore explicitly discarded and must be manually passed to an MPI function call by the user.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="hl"><span class="lnt"> 1
</span></span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="hl"><span class="n">let</span> <span class="n">client</span> <span class="o">=</span> <span class="n">HeimdallrClient</span><span class="o">::</span><span class="n">init</span><span class="p">(</span><span class="n">env</span><span class="o">::</span><span class="n">args</span><span class="p">()).</span><span class="n">unwrap</span><span class="p">();</span>
</span><span class="n">let</span> <span class="n">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="n">vec</span><span class="o">!</span><span class="p">[</span><span class="mf">0.0</span><span class="p">;</span><span class="n">BUF_SIZE</span><span class="p">];</span>
<span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mf">0.</span><span class="p">.</span><span class="n">BUF_SIZE</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="hl"> <span class="n">client</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="o">&</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">1</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">receive</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Here we see an equivalent program written in Rust with our heimdallr message passing library.
First of all, it is apparent that the message passing code is less verbose when compared to its MPI counterpart.
Our design principles with heimdallr are safety and usability.
From the usability perspective we can see that some of the boilerplate code that is necessary in MPI, like for example manually asking for and storing a process’s rank variable, is not required with heimdallr.</p>
<p>More importantly, the previously discussed type safety issue for sending a data buffer does not come up with heimdallr.
We are making use of the language’s generic programming features to let the compiler handle the type deduction of a transmitted variable.
This not only makes it safer but also easier to use for a developer.</p>
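<p>To make this concrete, here is a minimal Rust sketch (not heimdallr’s actual API; the <code>send</code> function shown is hypothetical) of such a generic signature: the element type is fixed by the buffer argument itself, so the caller has no way to claim a wrong type.</p>

```rust
// Hypothetical generic send signature: the compiler infers T from the buffer,
// so the declared element type can never disagree with the data, unlike
// MPI's untyped void* buffers.
fn send<T>(buf: &[T], dest: u32, tag: u32) -> usize {
    // A real implementation would serialize and transmit `buf`; here we only
    // compute the transfer size to show it is derived, not user-supplied.
    let bytes = buf.len() * std::mem::size_of::<T>();
    println!("sending {} bytes to {} (tag {})", bytes, dest, tag);
    bytes
}

fn main() {
    let buf = vec![42.0_f64; 4];
    // T = f64 is inferred from `buf`; an MPI_FLOAT-style mismatch cannot happen.
    let bytes = send(&buf, 1, 0);
    assert_eq!(bytes, 4 * 8);
}
```

<p>Note that the byte count is computed inside the library from <code>size_of::&lt;T&gt;()</code>, so the whole class of count/type mismatches from the earlier C example disappears.</p>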
<p>Of course, Rust is by no means the only modern language to provide generic programming features, and this interface change to the <code>send</code> and <code>receive</code> functions could have been done in a myriad of languages.
Therefore we should go on to an example where some of Rust’s unique features allow us to provide a safer message passing interface to the users.</p>
<div class="gblog-post__anchorwrap">
<h2 id="ensuring-buffer-safety-for-non-blocking-communication">
Ensuring buffer safety for non-blocking communication
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#ensuring-buffer-safety-for-non-blocking-communication" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ensuring buffer safety for non-blocking communication" href="#ensuring-buffer-safety-for-non-blocking-communication">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As previously mentioned, MPI provides multiple send and receive functions with varying semantics.
The most basic form of message passing is called <em>blocking</em>.
When a message passing function is called in this context the sender process is blocked until the data buffer that is being sent is guaranteed to have been processed by the message passing library.
The receiving process is also blocked until the contents of the incoming message have been safely copied into the receiving data buffer.
This form of message passing is the most intuitive from a user’s perspective but it can also be subpar from a performance perspective due to the resulting idle times for both processes.</p>
<p>A solution that is often better suited from the performance perspective is the use of so called <em>non-blocking</em> communication.
Here the process of passing the message is handled in the background and the program can continue with its execution almost immediately.
This type of message passing however does not come without dangers, as we will see in the following code snippet.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="hl"><span class="lnt">2
</span></span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="hl"><span class="lnt">8
</span></span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Isend</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_DOUBLE</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">req</span><span class="p">);</span>
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">BUF_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Recv</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_DOUBLE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">status</span><span class="p">);</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>In this example process 0 tries to send a buffer to process 1 using MPI’s non-blocking send function <code>MPI_Isend</code>.
The non-blocking send operation in line 2 allows process 0 to continue its execution before the sending of the message has concluded.
The problem arises in lines 3-4 where process 0 also modifies the contents of the data buffer that is being sent.
Since the message transfer might still be in progress, this may also modify the contents of the sent message and thereby cause erroneous behavior that the programmer did not intend.</p>
<p>This is a known safety issue with the use of non-blocking communication in MPI.
A data buffer that is used in a non-blocking operation is in an <em>unsafe</em> state until it has been made sure that the message passing operation on it has concluded.
To check the status of a non-blocking operation and thereby the safety status of its data buffer, MPI provides functions like <code>MPI_Wait</code> that block the current process until the referenced message passing operation is confirmed to be finished.
The MPI standard requires such a function to be called before a data buffer that has been used in non-blocking communication is accessed again.
Adding an <code>MPI_Wait</code> call between lines 2 and 3 of the example code would make this program work correctly.</p>
<p>The problem with all of this is that MPI requires the programmer to always remember this behavior: neither the library nor the compiler is able to detect buffer safety errors in non-blocking communication and warn the user.</p>
<div class="gblog-post__anchorwrap">
<h2 id="leveraging-rusts-ownership-for-buffer-safety">
Leveraging Rust’s ownership for buffer safety
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#leveraging-rusts-ownership-for-buffer-safety" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Leveraging Rust’s ownership for buffer safety" href="#leveraging-rusts-ownership-for-buffer-safety">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The core concept of Rust’s memory management is the so called <em>ownership</em> feature.
Every data object in Rust has exactly one owner.
Once the owning variable goes out of scope, the data is automatically deallocated.
There can be references to an object but only within a limited rule-set.
A variable can either have an unlimited number of immutable (read-only) references or exactly <strong>one</strong> mutable reference.
These limitations allow the Rust compiler to reason about correct memory usage.</p>
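<p>The ownership move that heimdallr relies on can be demonstrated in plain Rust without any actual message passing. In the following sketch, <code>fake_send_nb</code> is a hypothetical stand-in for a non-blocking send: once the buffer is passed by value, the caller loses access to it until the handle gives it back.</p>

```rust
// Stand-in for a non-blocking send: it takes ownership of the buffer and
// returns a handle from which ownership can later be reclaimed.
struct Handle {
    buf: Vec<f64>,
}

impl Handle {
    // Equivalent in spirit to MPI_Wait: blocks until the transfer is done
    // (trivially so in this sketch) and returns ownership of the buffer.
    fn data(self) -> Vec<f64> {
        self.buf
    }
}

fn fake_send_nb(buf: Vec<f64>) -> Handle {
    Handle { buf }
}

fn main() {
    let mut buf = vec![42.0; 4];
    let handle = fake_send_nb(buf);
    // buf[0] = 0.0; // would not compile: `buf` was moved into fake_send_nb
    buf = handle.data(); // ownership returned; access is safe again
    buf[0] = 0.0;
    println!("{:?}", buf);
}
```

<p>Uncommenting the marked line makes the program fail to compile, which is precisely the compile-time protection that the MPI version of this code lacks.</p>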
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="hl"><span class="lnt">2
</span></span><span class="hl"><span class="lnt">3
</span></span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="hl"><span class="lnt">8
</span></span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="hl"> <span class="n">let</span> <span class="n">handle</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">send_nb</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">handle</span><span class="p">.</span><span class="n">data</span><span class="p">()</span><span class="o">?</span><span class="p">;</span>
</span> <span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mf">0.</span><span class="p">.</span><span class="n">BUF_SIZE</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">1</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">receive</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This is the heimdallr equivalent of the non-blocking MPI code that we have seen previously.
The send operation in line 2 makes use of Rust’s ownership concept to protect the data buffer that is being sent.
Since there can be only one owner of the <code>buf</code> variable, passing it directly to a function call means that the ownership is moved into the function.
This has the side effect that <code>buf</code> is no longer accessible from outside the function.
Therefore it is impossible to modify the data buffer while the message passing operation is running.
Trying to do so would lead to a compilation error.
For a user to access the data again they need to request ownership back from the message passing operation, which happens in line 3.
The <code>data</code> function called there on the <code>handle</code> that was returned by the non-blocking send function is an equivalent to <code>MPI_Wait</code>.
It blocks until the used data buffer is safe to be accessed again and then returns the ownership to the caller.</p>
<p>So in essence it is the same workflow as for an MPI application, but Rust’s ownership rules allow the library to be designed in a way where correct and safe usage of non-blocking communication can be enforced at compile time.
This is a big step up in usability and correctness: it is no longer the user’s task to remember the implicit rules of non-blocking communication; instead, violating them is a compile-time error.</p>
<p>This is of course just one small example of how Rust’s safety features can be used to design safer interfaces, but in our opinion it showcases the possibilities very well.</p>
<div class="gblog-post__anchorwrap">
<h2 id="conclusion-and-further-reading">
Conclusion and further reading
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#conclusion-and-further-reading" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Conclusion and further reading" href="#conclusion-and-further-reading">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>This blog post is meant to give a brief overview of the challenges of message passing parallelization and how the programming interfaces used for it could be designed in a safer way.
Parallel programming is a complex topic and introduces a variety of new error classes.
Therefore we find it very important that the libraries and tools used for it offer as much help as possible to developers by enforcing correctness and detecting possible errors.</p>
<p>The heimdallr library introduced in this post is a prototype implementation of a message passing library that concentrates on the compile time correctness aspects.
It is not yet feature-complete and is mainly meant to show some of the possibilities for better usability and safety in message passing.</p>
<p>To keep this post brief, we have not gone into much detail about the implementation or the open problems that remain with this solution.
We also did not talk about performance, which is an important topic in the context of HPC.</p>
<p>If your interest was piqued, a more detailed discussion about the pros and cons of heimdallr can be found in our <a
class="gblog-markdown__link"
href="https://doi.org/10.1007/978-3-030-90539-2_13"
>heimdallr paper</a>.
There, we also discuss some of the problems with the current implementation and show benchmark results where heimdallr’s performance is compared to MPI.</p>
<p>If you would like to try out heimdallr or have a look at the code, you can visit our <a
class="gblog-markdown__link"
href="https://github.com/parcio/heimdallr"
>GitHub</a> repository.</p>
Performance of conditional operator vs. fabshttps://blog.parcio.de/posts/2021/09/conditional-vs-fabs/Michael Kuhn2021-09-21T00:00:00+00:002021-09-21T00:00:00+00:00
<p>Today, we will take a look at potential performance problems when using the conditional operator <code>?:</code>.
Specifically, we will use it to calculate a variable’s absolute value and compare its performance with that of the function <code>fabs</code>.</p>
<p>Assume the following numerical code written in C, where we need to calculate the absolute value of a <code>double</code> variable called <code>residuum</code>.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
Since we want to perform this operation within the inner loop, we will have to keep performance overhead as low as possible.
To reduce dependencies on math libraries and avoid function call overhead, we manually get the absolute value by first checking whether <code>residuum</code> is less than <code>0</code> and, if it is, negating it using the <code>-</code> operator.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="hl"> <span class="n">residuum</span> <span class="o">=</span> <span class="p">(</span><span class="n">residuum</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="o">-</span><span class="nl">residuum</span> <span class="p">:</span> <span class="n">residuum</span><span class="p">;</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This looks easy enough and, in theory, should provide satisfactory performance.
Just to be sure, let’s do the same using the <code>fabs</code> function from the math library, which returns the absolute value of a floating-point number.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="hl"> <span class="n">residuum</span> <span class="o">=</span> <span class="n">fabs</span><span class="p">(</span><span class="n">residuum</span><span class="p">);</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Let’s compare the two implementations using <a
class="gblog-markdown__link"
href="https://github.com/sharkdp/hyperfine"
>hyperfine</a>.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="hl"><span class="lnt">11
</span></span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Benchmark #1: ./conditional
Time (mean ± σ): 476.3 ms ± 0.4 ms [User: 474.5 ms, System: 0.7 ms]
Range (min … max): 475.6 ms … 476.8 ms 10 runs
Benchmark #2: ./fabs
Time (mean ± σ): 243.8 ms ± 2.0 ms [User: 242.2 ms, System: 0.8 ms]
Range (min … max): 242.1 ms … 249.0 ms 12 runs
Summary
'./fabs' ran
<span class="hl"> 1.95 ± 0.02 times faster than './conditional'
</span></code></pre></td></tr></table>
</div>
</div><p>As we can see, the <code>fabs</code> implementation ran more than 1.9 times faster!
Where does this massive performance difference come from?
Let’s use <code>perf stat</code> to analyze the two implementations in a bit more detail.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Performance counter stats for './conditional':
478,51 msec task-clock:u # 0,998 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
55 page-faults:u # 114,940 /sec
<span class="hl"> 2.035.211.626 cycles:u # 4,253 GHz (83,28%)
</span> 1.592.587 stalled-cycles-frontend:u # 0,08% frontend cycles idle (83,28%)
223.899 stalled-cycles-backend:u # 0,01% backend cycles idle (83,28%)
<span class="hl"> 4.009.332.175 instructions:u # 1,97 insn per cycle
</span> # 0,00 stalled cycles per insn (83,32%)
2.001.712.079 branches:u # 4,183 G/sec (83,49%)
1.503.325 branch-misses:u # 0,08% of all branches (83,34%)
0,479296441 seconds time elapsed
0,474423000 seconds user
0,001996000 seconds sys
</code></pre></td></tr></table>
</div>
</div><p>The most important metrics here are the numbers of instructions and cycles.
At 1.97 instructions per cycle, the roughly 4,000,000,000 instructions require about 2,000,000,000 cycles; since our processor runs at around 4,250,000,000 cycles per second, this results in the measured runtime of 0.48 seconds.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Performance counter stats for './fabs':
245,48 msec task-clock:u # 0,997 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
51 page-faults:u # 207,757 /sec
<span class="hl"> 1.039.265.407 cycles:u # 4,234 GHz (83,31%)
</span> 1.720.716 stalled-cycles-frontend:u # 0,17% frontend cycles idle (83,30%)
356.067 stalled-cycles-backend:u # 0,03% backend cycles idle (83,30%)
<span class="hl"> 3.007.112.338 instructions:u # 2,89 insn per cycle
</span> # 0,00 stalled cycles per insn (83,29%)
1.003.303.373 branches:u # 4,087 G/sec (83,46%)
1.662.984 branch-misses:u # 0,17% of all branches (83,34%)
0,246272015 seconds time elapsed
0,243024000 seconds user
0,000977000 seconds sys
</code></pre></td></tr></table>
</div>
</div><p>The reduction from 2,000,000,000 to 1,000,000,000 cycles corresponds to the performance improvement of 1.95.
Using the <code>fabs</code> function reduced the number of instructions by roughly 25% and, at the same time, increased the number of instructions per cycle to 2.89 (a factor of 1.47).
Getting rid of the conditional operator reduced the number of branches by half, allowing the processor to process more instructions per cycle.
The conditional operator is more or less shorthand for an <code>if</code> statement and introduced a significant number of branches into our inner loop.</p>
<p>Running three nested loops with 1,000 iterations each resulted in 1,000,000,000 inner loop iterations, that is, we saved one instruction per inner loop iteration.
These branch and instruction differences can be checked in even more detail using <code>objdump -S</code>; this is left as an exercise for the reader.</p>
<p>The magnitude of these performance differences is rather surprising and shows that it makes sense to check even seemingly simple code for potential performance problems.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>The code shown is only an excerpt, the full code is available <a
class="gblog-markdown__link"
href="conditional-vs-fabs.c"
>here</a>. It was compiled with GCC 11.2 using the <code>-O2 -Wall -Wextra -Wpedantic</code> flags and linked against the math library with <code>-lm</code>. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>hyperfine performs a statistical performance analysis. It runs the provided commands multiple times to reduce the influence of random errors and calculates derived metrics such as the mean and standard deviation. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>