Parallel Computing and I/O Blog (https://blog.parcio.de/): We conduct research and development on parallel systems.
HDF5: Self-describing data in modern storage architectures. Timm Erxleben, 2022-08-02. https://blog.parcio.de/posts/2022/08/hdf5/
<p>In today’s post, we will discuss the advantages of self-describing data formats.
As a case study, we will examine the popular self-describing data format HDF5.
After a description of HDF5’s basic features and its data model, we will trace how support for modern storage architectures has developed over time.</p>
<p>To understand the advantages of self-describing data formats, we first need to understand what self-describing data formats are.
Taking the <a
class="gblog-markdown__link"
href="https://ops.aps.anl.gov/manuals/SDDStoolkit/SDDStoolkitse1.html"
>definition by Argonne National Laboratory</a>, self-describing data formats have the following two properties:</p>
<ol>
<li>The data is accessed by name and by class. Instead of reading 20 bytes starting at offset 1337, one would request to read the dataset named XYZ.</li>
<li>Various data attributes that may be necessary for interpretation are available. For example, data types, units, and file contents can be discovered by a user without prior knowledge.</li>
</ol>
<p>The first point is only possible if the data format is paired with a programming library to access it.
Otherwise, users would need prior knowledge to parse the file’s structure.
Another advantage is that the file format can be updated without dropping support for older applications because the data model is abstracted from the actual file layout.</p>
<div class="gblog-post__anchorwrap">
<h2 id="why-do-we-need-self-describing-data-formats">
Why do we need self-describing data formats?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#why-do-we-need-self-describing-data-formats" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Why do we need self-describing data formats?" href="#why-do-we-need-self-describing-data-formats">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As explained above, abstracting the data model from files is beneficial for the maintainability of code.
Nevertheless, there is more to self-describing data.</p>
<p>The history of self-describing formats started in the 1980s<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> as the amount of scientific data produced by simulations increased.
For global exchangeability of datasets, standards were needed to abstract from architecture-dependent data types and software-dependent storage layouts.
Take the following image as an example:</p>
<p><img
src="motivation.png"
alt="Motivation"
/></p>
<p>Imagine you receive the file on the left without information on how to interpret it.
You would have to invest some time before realizing that it contains an ASCII-encoded string.
Understanding more complex data (especially some architecture-dependent float data types) would be practically impossible without further hints.</p>
<p>However, the file on the right side contains the data type of the file content and even a comment describing its content.
Using this information, it is easy to read the file’s actual content, independent of the complexity of the data.
The ability to annotate data with units and comments further supports exchangeability.</p>
<p>Examples of self-describing formats for scientific data are HDF5 and NetCDF.
In this post, we will look at HDF5 as it is one of the most popular formats and has meanwhile become the basis of NetCDF4.</p>
<div class="gblog-post__anchorwrap">
<h2 id="basics-of-hdf5">
Basics of HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#basics-of-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Basics of HDF5" href="#basics-of-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>HDF5 offers a complex and feature-rich data model.
Files can be understood as containers that can hold many different types of data.</p>
<p>Take a simulation experiment as an example.
During the experiment, a lot of data is created: a model describing the problem, a mesh discretizing the simulated space, initial and boundary conditions, the solver in use, parameters to the solver, the time series of the solution, and some visualizations of the result.
For reproducibility, you want to keep track of all the metadata describing how you obtained your results.
Those heterogeneous but logically related datasets may be stored in the same HDF5 file.
In doing so, the data and metadata are guaranteed not to be separated by accident.
Whoever receives a copy of this file will fully understand the details of your simulation and will be able to reproduce your results.</p>
<p><img
src="highlevel.png"
alt="Conceptional example of an HDF5 file"
/></p>
<div class="gblog-post__anchorwrap">
<h3 id="so-how-is-data-modeled-in-hdf5">
So, how is data modeled in HDF5?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#so-how-is-data-modeled-in-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor So, how is data modeled in HDF5?" href="#so-how-is-data-modeled-in-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The most common objects in HDF5 files are groups, data types, dataspaces, datasets, and attributes.</p>
<p><strong>Groups</strong> act like directories in file systems by mapping names to objects.
Nesting groups creates a hierarchical namespace in which objects are identified by a path.
The same object can be part of multiple groups, either via hard or soft links.
Care must be taken not to create loops, as HDF5 does not prevent them.
Every file has a root group denoted as <code>/</code>.
One could say that HDF5 creates a file system within a file.</p>
<p>In addition to built-in <strong>data types</strong>, e.g., floats and integers of different flavors, users may define their own complex data types.
Apart from creating arrays of a particular data type or packing different data types into a compound, it is also possible to create new atomic data types.
The definition of a user-defined data type needs to be saved to the file so that it can be reread without prior knowledge.
This happens automatically when the data type is used, resulting in an unnamed (i.e., <em>transient</em>) data type, but it is also possible to name the data type and save it to a group (i.e., store it as a <em>committed</em> type).
Conversion functions can be registered and saved to the HDF5 file for user-defined atomic types.</p>
<p>In contrast to the POSIX data model, where a file is understood as a stream of bytes, elements contained in a dataset are addressed according to an associated <strong>dataspace</strong>.
The dataspace describes the number of dimensions and each dimension’s size and maximum size.
If the size and maximum size are not equal, the dataset can grow in that dimension.
This is especially useful for time series, which might grow when more data is collected.
Growth may be unbounded when the maximum size is set to infinity.
Unlike data types, dataspaces are always saved implicitly, i.e., they do not have a name.</p>
<p><strong>Datasets</strong> hold the actual data in HDF5 files.
Their most important properties are the dataspace describing their shape and the data type of their elements.
Nevertheless, datasets have many more settings and properties.
For example, the fill value for elements can be modified.
Reading from a new dataset that has not yet been written to will return the fill value.</p>
<p>Another vital setting controls whether datasets are stored contiguously or in chunks.
If the dataspace contains a dimension that allows growth, the dataset must be stored in chunks.
When more data is added, chunks can be appended without moving existing data.
The chunk size is set at the creation of the dataset.
While writing or reading, chunked data can be passed to a filter pipeline, transforming the data stream.
The most popular (and probably most useful) filter class is compression.
However, the user may define their own filter functions.</p>
<p>There is no limit to the size of datasets.
Nevertheless, it is neither practical nor always possible to write or read several terabytes at once.
To solve this problem, HDF5 provides fine-grained partial access.
For every write and read, a selection of the dataset’s dataspace needs to be passed.
This selection can be the original dataspace to access the whole set.
Basic selection types are point and hyperslab selections.
Point selections are created by supplying a list of coordinates that should be included.
Hyperslabs are regular patterns of arbitrarily sized blocks.
When dealing with large and dense matrices, hyperslabs can reflect the distribution of matrix parts to different clients.
Combining multiple selections using set operators provides an intuitive way to construct complex selections.</p>
<p><strong>Attributes</strong> are metadata objects that may be attached to all named objects except other attributes.
They are similar to datasets as they are named objects (i.e., are referred to by a path) and have a dataspace and a data type.
However, there are some key differences:</p>
<ul>
<li>They do not support partial I/O, so they need to be written/read at once.</li>
<li>They do not support chunked storage and are therefore of fixed size.</li>
<li>They do not support compression.</li>
<li>They are stored as part of the header of other objects inside the HDF5 file.</li>
</ul>
<p>Attributes not only explain the file’s content to users but also enable visualization or search tools to interact with the data based on its meaning.
Several domain-specific conventions exist for this purpose.
One of the most popular sets of conventions are the <a
class="gblog-markdown__link"
href="http://cfconventions.org/cf-conventions/cf-conventions.html"
><em>Climate and Forecast (CF) Conventions</em></a><sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.
If a file uses a specific set of conventions, it is automatically compatible with tools using the same conventions.</p>
<p>All relations between the objects explained above are summarized in the following diagram.
Please note that this is a simplified version to highlight the core concepts.</p>
<p><img
src="data-model.png"
alt="HDF5 data model"
/></p>
<p>Only a short introduction to HDF5’s features and concepts can be given in this post.
The nitty-gritty details of the concepts explained above, as well as additional features such as maps and tables, are left for the curious reader to explore<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="programming-with-hdf5">
Programming with HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#programming-with-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Programming with HDF5" href="#programming-with-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now that we know the basics of the HDF5 data model, let us look at the practical usage of HDF5.
<p>The library is shipped with C, C++, Fortran, and Java interfaces.
Apart from those, there are bindings for most popular programming languages, including Rust, Go, Python, Julia, and Matlab.
As most scientific software is written in C or Fortran, examples will be given in C.</p>
<p>The interface is grouped into several modules for a better overview:</p>
<ul>
<li><code>H5A</code> - Attributes</li>
<li><code>H5D</code> - Datasets</li>
<li><code>H5S</code> - Dataspaces</li>
<li><code>H5T</code> - Data types</li>
<li><code>H5F</code> - Files</li>
<li><code>H5G</code> - Groups</li>
<li><code>H5P</code> - Property Lists</li>
<li>etc.</li>
</ul>
<p>The general workflow is similar for all objects in HDF5.
First, objects are created or opened, returning a unique handle for that object.
Using the handle, objects can be manipulated.
When everything is done, the object needs to be closed.
The handle will then be invalid.</p>
<p>Let us make a short example of how to write a dataset and some attributes:</p>
<p>First, we need to create a file and a group.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create a file and a group
</span><span class="c1"></span><span class="n">hid_t</span> <span class="n">file_id</span> <span class="o">=</span> <span class="n">H5Fcreate</span><span class="p">(</span><span class="s">"solution.h5"</span><span class="p">,</span> <span class="n">H5F_ACC_TRUNC</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">group_id</span> <span class="o">=</span> <span class="n">H5Gcreate</span><span class="p">(</span><span class="n">file_id</span><span class="p">,</span> <span class="s">"important_data"</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>Though the code is mostly self-explanatory, you may have noticed some mysterious <code>H5P_DEFAULT</code> constants.
Those are default property lists.
Property lists contain many parameters controlling the fine details of operations and are manipulated using the <code>H5P</code> module.
Most functions accept several property lists for different purposes.
<code>H5Fcreate</code>, for example, takes a file creation property list and a file access property list.
We will later see how the file access property list is used to access files via specific HDF5 plugins.
In most cases, however, the defaults are sufficient.</p>
<p>After creating the file and a group, we should write some data.
To do so, we create a dataspace for a 3x3 matrix, which will be used to store <code>important_numbers</code>.
Using our new dataspace we create the dataset named <code>my_cool_data</code> in the group created above.
The data type for the numbers on the disk will be the native float type of the machine.</p>
<p>Everything is set to actually write the matrix to the file.
As explained above, the dataspace is again given for partial I/O.
As we pass the original dataspace, the whole matrix will be written.</p>
<p>In addition, <code>H5Dwrite</code> takes the data type and the dataspace again.
You may wonder why type and space need to be passed twice.
The reason is that the data in memory may have a different data type and shape than the data on disk; HDF5 converts between the two.
For example, it is possible to take only the main diagonal of a double-precision matrix from memory and write it to disk as a single-precision vector.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create and write to a dataset
</span><span class="c1"></span><span class="kt">float</span> <span class="n">important_numbers</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="p">{{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
<span class="p">{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
<span class="p">{</span><span class="mi">42</span><span class="p">,</span> <span class="mi">42</span><span class="p">,</span> <span class="mf">42.42</span><span class="p">}};</span>
<span class="n">hsize_t</span> <span class="n">dims</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">};</span>
<span class="n">hsize_t</span><span class="o">*</span> <span class="n">max_dims</span> <span class="o">=</span> <span class="n">dims</span><span class="p">;</span>
<span class="n">hid_t</span> <span class="n">space_matrix_id</span> <span class="o">=</span> <span class="n">H5Screate_simple</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">dims</span><span class="p">,</span> <span class="n">max_dims</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">set_id</span> <span class="o">=</span> <span class="n">H5Dcreate</span><span class="p">(</span><span class="n">group_id</span><span class="p">,</span> <span class="s">"my_cool_data"</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Dwrite</span><span class="p">(</span><span class="n">set_id</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span> <span class="n">space_matrix_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="o">&</span><span class="n">important_numbers</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>The following code snippet shows how to add metadata in the form of attributes to the file.
Writing attributes is mostly similar to writing datasets.
Nevertheless, as no partial I/O is supported for attributes, the write function takes no selection of a dataspace.</p>
<p>It is also shown how to use strings in HDF5.
The built-in type <code>H5T_C_S1</code> is copied, and its size is modified, because the default size is only one character.
To get a variable-length string instead, you can pass <code>H5T_VARIABLE</code> as the size.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span><span class="lnt">21
</span><span class="lnt">22
</span><span class="lnt">23
</span><span class="lnt">24
</span><span class="lnt">25
</span><span class="lnt">26
</span><span class="lnt">27
</span><span class="lnt">28
</span><span class="lnt">29
</span><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// create some attributes
</span><span class="c1"></span><span class="n">hid_t</span> <span class="n">space_scalar_id</span> <span class="o">=</span> <span class="n">H5Screate</span><span class="p">(</span><span class="n">H5S_SCALAR</span><span class="p">);</span>
<span class="kt">float</span> <span class="n">mean</span> <span class="o">=</span> <span class="mf">42.05</span><span class="p">;</span>
<span class="kt">char</span> <span class="n">content_description</span><span class="p">[]</span> <span class="o">=</span> <span class="s">"Contains a dataset with the answer to everything!"</span><span class="p">;</span>
<span class="n">hid_t</span> <span class="n">string_type</span> <span class="o">=</span> <span class="n">H5Tcopy</span><span class="p">(</span><span class="n">H5T_C_S1</span><span class="p">);</span>
<span class="n">H5Tset_size</span><span class="p">(</span><span class="n">string_type</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">content_description</span><span class="p">));</span>
<span class="n">hid_t</span> <span class="n">attr_group</span> <span class="o">=</span> <span class="n">H5Acreate</span><span class="p">(</span><span class="n">group_id</span><span class="p">,</span> <span class="s">"content"</span><span class="p">,</span> <span class="n">string_type</span><span class="p">,</span> <span class="n">space_scalar_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Awrite</span><span class="p">(</span><span class="n">attr_group</span><span class="p">,</span> <span class="n">string_type</span><span class="p">,</span> <span class="n">content_description</span><span class="p">);</span>
<span class="n">hid_t</span> <span class="n">attr_set</span> <span class="o">=</span> <span class="n">H5Acreate</span><span class="p">(</span><span class="n">set_id</span><span class="p">,</span> <span class="s">"mean"</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="n">space_scalar_id</span><span class="p">,</span>
<span class="n">H5P_DEFAULT</span><span class="p">,</span> <span class="n">H5P_DEFAULT</span><span class="p">);</span>
<span class="n">H5Awrite</span><span class="p">(</span><span class="n">attr_set</span><span class="p">,</span> <span class="n">H5T_NATIVE_FLOAT</span><span class="p">,</span> <span class="o">&</span><span class="n">mean</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>At last, every opened object can be closed.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">30
</span><span class="lnt">31
</span><span class="lnt">32
</span><span class="lnt">33
</span><span class="lnt">34
</span><span class="lnt">35
</span><span class="lnt">36
</span><span class="lnt">37
</span><span class="lnt">38
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="c1">// close all objects
</span><span class="c1"></span><span class="n">H5Tclose</span><span class="p">(</span><span class="n">string_type</span><span class="p">);</span>
<span class="n">H5Dclose</span><span class="p">(</span><span class="n">set_id</span><span class="p">);</span>
<span class="n">H5Aclose</span><span class="p">(</span><span class="n">attr_group</span><span class="p">);</span>
<span class="n">H5Aclose</span><span class="p">(</span><span class="n">attr_set</span><span class="p">);</span>
<span class="n">H5Sclose</span><span class="p">(</span><span class="n">space_scalar_id</span><span class="p">);</span>
<span class="n">H5Sclose</span><span class="p">(</span><span class="n">space_matrix_id</span><span class="p">);</span>
<span class="n">H5Gclose</span><span class="p">(</span><span class="n">group_id</span><span class="p">);</span>
<span class="n">H5Fclose</span><span class="p">(</span><span class="n">file_id</span><span class="p">);</span>
</code></pre></td></tr></table>
</div>
</div><p>Putting all those snippets together into a valid C program<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> and executing it yields the file <code>solution.h5</code>.
Using <code>h5dump</code> we can verify that the file indeed contains our data and metadata:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="gp">$ </span>h5dump solution.h5
<span class="go">HDF5 "solution.h5" {
</span><span class="go">GROUP "/" {
</span><span class="go"> GROUP "important_data" {
</span><span class="go"> ATTRIBUTE "content" {
</span><span class="go"> DATATYPE H5T_STRING {
</span><span class="go"> STRSIZE 50;
</span><span class="go"> STRPAD H5T_STR_NULLTERM;
</span><span class="go"> CSET H5T_CSET_ASCII;
</span><span class="go"> CTYPE H5T_C_S1;
</span><span class="go"> }
</span><span class="go"> DATASPACE SCALAR
</span><span class="go"> DATA {
</span><span class="go"> (0): "Contains a dataset with the answer to everything!"
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> DATASET "my_cool_data" {
</span><span class="go"> DATATYPE H5T_IEEE_F32LE
</span><span class="go"> DATASPACE SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
</span><span class="go"> DATA {
</span><span class="go"> (0,0): 42, 42, 42,
</span><span class="go"> (1,0): 42, 42, 42,
</span><span class="go"> (2,0): 42, 42, 42.42
</span><span class="go"> }
</span><span class="go"> ATTRIBUTE "mean" {
</span><span class="go"> DATATYPE H5T_IEEE_F32LE
</span><span class="go"> DATASPACE SCALAR
</span><span class="go"> DATA {
</span><span class="go"> (0): 42.05
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go"> }
</span><span class="go">}
</span><span class="go">}
</span></code></pre></div><p>Examples for reads are omitted as they are conceptually similar to writes.
More examples of short HDF5 programs can be found <a
class="gblog-markdown__link"
href="http://web.mit.edu/fwtools_v3.1.0/www/Intro/IntroExamples.html"
>here</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="parallelism-in-hdf5">
Parallelism in HDF5
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#parallelism-in-hdf5" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Parallelism in HDF5" href="#parallelism-in-hdf5">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>All we have seen so far is how to write data using a single process on a single client.
In the context of HPC, parallel access to HDF5 files is necessary.
Otherwise, the I/O performance would be limited by the throughput of a single process on a single client.
Multiple approaches exist for parallel access.</p>
<p>The most straightforward way is to write one HDF5 file per process and “stitch” them together using external links in a central file.
Even though this approach is older than HDF5 itself, it gained further support with <em>Virtual Datasets</em> (VDS), added in release 1.10.
A VDS is an object which behaves similarly to a single dataset.
In reality, however, it is a mapping to other datasets that may be part of another file.</p>
<p>Nevertheless, using multiple files contradicts the idea of a single container holding all necessary data.
For truly parallel access to a single file, <em>Parallel HDF5</em> (PHDF5) was added in version 1.0.1, based on MPI-IO.
Files are accessed with PHDF5 by passing a modified file access property list containing a reference to an MPI communicator at open or create time:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="n">hid_t</span> <span class="n">plist_id</span> <span class="o">=</span> <span class="n">H5Pcreate</span><span class="p">(</span><span class="n">H5P_FILE_ACCESS</span><span class="p">);</span>
<span class="n">H5Pset_fapl_mpio</span><span class="p">(</span><span class="n">plist_id</span><span class="p">,</span> <span class="n">comm</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
<span class="n">H5Fopen</span><span class="p">(</span><span class="s">"my_file.h5"</span><span class="p">,</span> <span class="n">H5F_ACC_RDWR</span><span class="p">,</span> <span class="n">plist_id</span><span class="p">);</span>
</code></pre></div><p>Reads and writes are performed using the regular functions and appropriate dataspace selections.
Care must be taken regarding which operations are <em>collective</em> (i.e., all processes must participate) and which are <em>independent</em>.
All modifications of the file’s structural metadata, such as creating or linking objects, are always collective.
Reads and writes can either be collective or independent, which is controlled by the data transfer property list.
In most cases, collective I/O leads to higher throughput.</p>
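<p>As a hedged sketch (a code fragment, not a complete program: <code>dset_id</code>, the dataspace selections, and <code>buf</code> are assumed to come from a running PHDF5 program), requesting collective transfer for a write could look like this:</p>

```c
// Fragment: request collective I/O via a data transfer property list.
// dset_id, mem_space, file_space, and buf are assumed to exist already.
hid_t xfer_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer_id, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, xfer_id, buf);
H5Pclose(xfer_id);
```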
<p>Despite its easy usage, achieving good performance with PHDF5 is hard.
I/O on a parallel distributed file system alone is a complex task where throughput is influenced by many factors.
The introduction of additional I/O layers further complicates I/O tuning<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="virtual-file-layer">
Virtual File Layer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#virtual-file-layer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Virtual File Layer" href="#virtual-file-layer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>For PHDF5, MPI-IO was added as an additional storage interface next to POSIX.
This gave rise to the idea of a plugin system for different storage backends.
Consequently, the structure of the HDF5 library was changed, and the <em>Virtual File Layer</em> (VFL) was introduced in version 1.4.
Instead of using POSIX or MPI-IO directly, all I/O calls are abstracted and passed to a <em>Virtual File Driver</em> (VFD).
The VFD, in turn, will map the linear address space of an HDF5 file to the address space of a storage backend.
VFDs are used by manipulating the file access property list and setting the respective driver, which must be registered beforehand.
For details on registering a VFD with the HDF5 library, please refer to HDF5’s documentation.
HDF5 provides several pre-defined VFDs.
Some interesting examples are:</p>
<ul>
<li><code>H5FD_CORE</code>: perform I/O to RAM</li>
<li><code>H5FD_SEC2</code>: default VFD using POSIX</li>
<li><code>H5FD_MPIIO</code>: parallel access via MPI-IO</li>
<li><code>HDF5_HDFS</code>: direct access to files in Hadoop Distributed File System</li>
<li><code>H5FD_ROS3</code>: direct read-only access to files stored in Amazon S3</li>
<li><code>H5FD_MULTI</code>: call different underlying VFDs depending on the address range accessed</li>
</ul>
<p>In addition, users can implement their own VFD to support their specific storage needs.
Currently, work is being done to enable the dynamic loading of VFD plugins at runtime.</p>
<div class="gblog-post__anchorwrap">
<h3 id="limits-of-vfds">
Limits of VFDs
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#limits-of-vfds" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Limits of VFDs" href="#limits-of-vfds">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>VFDs only abstract I/O calls (i.e., they only handle byte streams) and are therefore unaware of the HDF5 data model.
Though decisions can be made based on address ranges (e.g., as in <code>H5FD_MULTI</code>), the file’s structure cannot be changed to leverage features of modern storage technologies.
In practice, this approach excludes storage types that could (more or less) directly map the data model like, for example, <a
class="gblog-markdown__link"
href="https://github.com/daos-stack/daos"
>DAOS</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="new-architecture-and-virtual-object-layer">
New architecture and Virtual Object Layer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#new-architecture-and-virtual-object-layer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor New architecture and Virtual Object Layer" href="#new-architecture-and-virtual-object-layer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>To address this limitation of the VFD, the <em>Virtual Object Layer</em> (VOL) was introduced in version 1.12.
It provides another interface through which plugins can interact with HDF5.
Unlike the VFL, the VOL operates at the level of the data model abstraction and defines callbacks corresponding to the public HDF5 interface functions.</p>
<p>For the VOL’s implementation, the library was yet again restructured.
The default VOL plugin implements the HDF5 file format specification and uses the VFL to interact with storage backends.
The following figure summarizes the layers used in the library, together with some example VOL plugins that are not included with HDF5.</p>
<p><img
src="architecture.png"
alt="HDF5 architecture"
/></p>
<p>There are multiple ways to use VOL plugins.
The easiest is to set environment variables so that a plugin is loaded dynamically at program start.
However, just like VFDs, they can also be set via the file access property list.</p>
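<p>The environment variables in question are <code>HDF5_VOL_CONNECTOR</code> and <code>HDF5_PLUGIN_PATH</code>. A minimal sketch; the plugin path and connector name below are hypothetical placeholders for whatever your plugin provides:</p>

```shell
# Tell HDF5 where to find loadable plugins and which VOL connector to use.
# Both values are hypothetical; substitute your plugin's path and name.
export HDF5_PLUGIN_PATH=/path/to/vol/plugins
export HDF5_VOL_CONNECTOR="my_connector"
# Then run your HDF5 application as usual, e.g.:
# ./my_hdf5_application
```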
<p>The VOL enables interesting new possibilities.
For example, plugins can be stacked into a VOL chain.
Such passthrough connectors make it easy to trace I/O behavior.
Another use case is transforming data as it passes through the chain.</p>
<p>Arguably the most interesting use case of the VOL, however, is mapping HDF5 files onto modern storage backends in a more natural way.
Metadata, for example, might be separated and stored in a key-value store or database, while datasets might be stored in an object store.
This is the case for the two VOL plugins currently under development in the <a
class="gblog-markdown__link"
href="https://github.com/parcio/julea"
>JULEA storage framework</a>.
The goal is to make use of the enhanced query capabilities of those backends to speed up the analysis of data.
Another example is given by the <a
class="gblog-markdown__link"
href="https://github.com/HDFGroup/vol-daos"
>DAOS VOL plugin</a>, where the data model is mapped to the modern object store DAOS, which is designed for use with persistent RAM and NVMe SSDs.</p>
<p>In the current version 1.13, the VOL interface was changed based on the experience gained so far.
It will remain unstable until version 1.14, which is yet to be released.</p>
<div class="gblog-post__anchorwrap">
<h2 id="summary-and-conclusion">
Summary and conclusion
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#summary-and-conclusion" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Summary and conclusion" href="#summary-and-conclusion">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Self-describing data formats are essential standards for exchanging scientific data as they abstract technical details from the user and enable the annotation of data with important metadata such as units.
HDF5 offers a feature-rich data model based on groups, datasets, data types, and dataspaces.
We have seen how HDF5 changed to fulfill growing requirements on storage systems.
Based on the idea of exchangeable backends, the VFL was created.
With the introduction of the Virtual Object Layer, the actual HDF5 file format has become only one implementation of the HDF5 data model among many variants.
Classical files and file systems are thus being challenged, and new ways to model and access scientific data have emerged.</p>
<p>Of course, only the basics of HDF5 could be covered in this post, and many details had to be left out.
Because the VFL and VOL APIs are currently in flux, only their high-level concepts were featured.
If you would like to gain further insight and hands-on experience with VOL plugins, the <a
class="gblog-markdown__link"
href="https://www.hdfgroup.org/category/webinar/"
>webinars</a> offered by the HDF Group might be something for you.</p>
<div class="gblog-post__anchorwrap">
<h2 id="sources">
Sources
<a data-clipboard-text="https://blog.parcio.de/posts/2022/08/hdf5/#sources" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Sources" href="#sources">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>All information is taken from the HDF5 documentation and the <a
class="gblog-markdown__link"
href="http://web.mit.edu/fwtools_v3.1.0/www/ADGuide/HISTORY.txt"
>HDF5 changelog</a> if not stated otherwise.</p>
<p>The graphics were made using <a
class="gblog-markdown__link"
href="https://app.diagrams.net/"
>draw.io</a> and the <a
class="gblog-markdown__link"
href="https://commons.wikimedia.org/wiki/GNOME_Desktop_icons"
>Gnome desktop icons</a> which are licensed under the <a
class="gblog-markdown__link"
href="https://opensource.org/licenses/gpl-2.0.php"
>GPLv2</a>.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Development of <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Common_Data_Format"
>CDF</a> started in 1985. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Technically speaking, those conventions apply to the NetCDF self-describing data format. However, the naming of attributes can be transferred to HDF5 as done in the <a
class="gblog-markdown__link"
href="https://earthdata.nasa.gov/esdis/eso/standards-and-references/dataset-interoperability-recommendations-for-earth-science"
>Recommendations by NASA for Earth Science</a>. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>A good starting point is the <a
class="gblog-markdown__link"
href="https://docs.hdfgroup.org/hdf5/v1_12/_r_m.html"
>HDF5 documentation</a>. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The full code of the HDF5 example can be found <a
class="gblog-markdown__link"
href="hdf5-example.c"
>here</a>. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Further information on I/O tuning can be found in the <a
class="gblog-markdown__link"
href="https://confluence.hdfgroup.org/display/HDF5/Parallel+HDF5"
>HDF5 documentation</a>. <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Rust for Python developers: Using Rust to optimize your Python codehttps://blog.parcio.de/posts/2022/07/rust-for-python/David Hausmann2022-07-20T00:00:00+00:002022-07-20T00:00:00+00:00
<p>This post covers how to use Rust and PyO3 to optimize existing Python projects.
It will also give you a basic introduction to Rust on the way.</p>
<div class="gblog-post__anchorwrap">
<h2 id="example-program">
Example program
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#example-program" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Example program" href="#example-program">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The following Python program creates a simple visualisation of the Mandelbrot set using matplotlib.
It takes about 20s to finish on my machine.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="k">def</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">imag</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">max_iterations</span><span class="p">:</span><span class="nb">int</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="n">zr</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">zi</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
<span class="n">new_zr</span> <span class="o">=</span> <span class="n">zr</span><span class="o">**</span><span class="mi">2</span> <span class="o">-</span> <span class="n">zi</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">real</span>
<span class="n">zi</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">zr</span> <span class="o">*</span> <span class="n">zi</span> <span class="o">+</span> <span class="n">imag</span>
<span class="n">zr</span> <span class="o">=</span> <span class="n">new_zr</span>
<span class="k">if</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">zr</span><span class="o">**</span><span class="mi">2</span> <span class="o">+</span> <span class="n">zi</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span> <span class="o">></span> <span class="mi">2</span><span class="p">:</span>
<span class="k">return</span> <span class="n">i</span>
<span class="k">return</span> <span class="n">max_iterations</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
<span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre></div><p>Most of the calculation time is spent in <code>simple_stability</code>, which makes it the performance-critical function of this program.
Any speedup we achieve for <code>simple_stability</code> will therefore have a large impact on the overall performance of our program.
With that in mind, let’s try translating this function into Rust.</p>
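<p>This claim can be checked with Python’s built-in profiler. The sketch below profiles a stripped-down copy of <code>simple_stability</code> over a much smaller grid so that it finishes quickly:</p>

```python
import cProfile
import io
import math
import pstats

def simple_stability(real, imag, max_iterations=100):
    zr = zi = 0.0
    for i in range(max_iterations):
        new_zr = zr ** 2 - zi ** 2 + real
        zi = 2 * zr * zi + imag
        zr = new_zr
        if math.sqrt(zr ** 2 + zi ** 2) > 2:
            return i
    return max_iterations

profiler = cProfile.Profile()
profiler.enable()
# A reduced 50x50 grid instead of 1000x1000, purely for a quick measurement.
for y in range(50):
    for x in range(50):
        simple_stability(x / 25 - 2, y / 25 - 2)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```

<p>In the output, <code>simple_stability</code> dominates the cumulative time, confirming where optimization effort should go.</p>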
<div class="gblog-post__anchorwrap">
<h2 id="first-steps-in-rust">
First steps in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#first-steps-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor First steps in Rust" href="#first-steps-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Rust is a compiled language, unlike Python, which is interpreted.
This means that we can’t just start writing <code>.rs</code> files and run them from the console (or IDE).
We have to compile them first.</p>
<p>Rust has an excellent tool called Cargo that takes care of all our compilation and dependency management needs.
To create a new <em>crate</em>, that is, a new Rust project using Cargo, run <code>cargo new --lib mandelbrot_module</code> in the directory of your choice.
(Install Rust and Cargo if you have not done so already.)
The contents of your new directory should look something like this:</p>
<pre tabindex="0"><code>mandelbrot_module/
├─ src/
│ ├─ lib.rs
├─ .gitignore
├─ Cargo.toml
</code></pre><p>This is the standard structure for all Rust crates.
<code>src</code> is where all our source code will be stored and Cargo requires a specific name for our main file.
If we were trying to write an executable, our main file would be <code>src/main.rs</code> and the execution of our compiled program would start in the <code>main</code> function of that file.
Since we want to write a library/module, our main file is going to be <code>lib.rs</code> and everything we might want to use from our library after compilation needs to be available from this file.</p>
<p>Since Cargo already wrote some test code into our <code>lib.rs</code>, let’s run it to see that everything works.
To do this, run <code>cargo test</code> anywhere within the main directory of the crate.</p>
<p>Test code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[cfg(test)]</span><span class="w">
</span><span class="w"></span><span class="k">mod</span> <span class="nn">tests</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="cp">#[test]</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">it_works</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">assert_eq!</span><span class="p">(</span><span class="mi">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Expected console output:</p>
<pre tabindex="0"><code>running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Doc-tests mandelbrot_module
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
</code></pre><p>You should now have a <code>target</code> directory in your crate.
This directory contains all the files that get created during compilation, but we don’t actually need it for this project.</p>
<p>We do however need to add PyO3 to our crate’s dependencies before we can start using it, so let’s do that now.
Adding dependencies to a crate is normally pretty simple.
You just have to write the dependency name and version number under <code>[dependencies]</code> in your <code>Cargo.toml</code> file like this:</p>
<pre tabindex="0"><code>[dependencies]
threadpool = "1.8.1"
</code></pre><p>But PyO3 needs some extra configuration which I won’t explain in this post.
Just paste the following into your <code>Cargo.toml</code> file:</p>
<pre tabindex="0"><code>[package]
name = "mandelbrot_module"
version = "0.1.0"
edition = "2018"
[lib]
name = "mandelbrot_module"
crate-type = ["cdylib"]
[dependencies.pyo3]
version = "0.15.1"
features = ["extension-module"]
</code></pre><p>With that done, let’s write some actual Rust code in <code>lib.rs</code>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="rust-functions">
Rust functions
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#rust-functions" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Rust functions" href="#rust-functions">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>We’re going to start by just writing the function how you would in a pure Rust program.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">max_iterations</span>:<span class="kt">usize</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">usize</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="k">f64</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="k">f64</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="o">..</span><span class="n">max_iterations</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zr</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">zi</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">zr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_zr</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">zr</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">zi</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">i</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">max_iterations</span><span class="p">;</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Let’s first look at the function declaration which looks pretty similar to its Python counterpart.
Rust uses the <code>fn</code> keyword instead of <code>def</code> to declare functions.
It also uses different names for its types.
<code>f64</code> is a 64-bit float, which is equivalent to Python floats and C’s double type.
32-bit floats are <code>f32</code>.
Integers in Rust use a similar naming scheme.
The <code>u</code> in <code>usize</code> tells us that we’re dealing with an unsigned integer.
The <code>size</code> means that the width of our integer matches the platform’s pointer size, so this type is equivalent to <code>u64</code> on a 64-bit system and to <code>u32</code> on a 32-bit system.
If we also wanted negative integers, we could use <code>isize</code>; the same naming scheme applies to the signed <code>i</code>-types.
There are also integer types that fit into a single byte, <code>i8</code> and <code>u8</code>.
Choosing a smaller type can make a big difference in your program’s memory usage and even its performance, which is one reason Rust takes typing so seriously.
While type annotations in function declarations are only recommended and not mandatory in Python, they are mandatory in Rust.
In fact, the type of every variable has to be known at compile time or Rust simply won’t compile your code.
This may sound like a lot of type annotations, but the compiler does a great job at inferring a variable’s type most of the time.
Note also that Rust functions do not support optional arguments, so we always have to specify <code>max_iterations</code> with our new function.</p>
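<p>The platform dependence of <code>usize</code> mirrors C’s <code>size_t</code>, and the corresponding byte sizes can be inspected from Python with the standard <code>struct</code> module. A small illustrative sketch:</p>

```python
import struct

# Fixed-width Rust types have fixed sizes; usize follows the platform's
# pointer width, like C's size_t.
print(struct.calcsize("B"))  # u8:  always 1 byte
print(struct.calcsize("q"))  # i64: always 8 bytes
print(struct.calcsize("N"))  # size_t (usize): 8 on a 64-bit platform
```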
<p>Let’s take a look at the declaration of <code>zr</code> and <code>zi</code> now.
They’re both <code>f64</code> as can be inferred from the right-hand side of their declaration.
The <code>let</code> keyword is used to declare a new variable and the <code>mut</code> keyword specifies that this variable is mutable.
Variables declared without <code>mut</code> are immutable.
This might seem weird at first, but it actually makes the code more readable by telling you which values will change throughout this function’s runtime.</p>
<p>The rest of this code looks remarkably similar to its Python equivalent, with the exception that Rust has no power operator and that <code>sqrt</code> is a method of the float types instead of being an import from the <code>math</code> module.</p>
<div class="gblog-post__anchorwrap">
<h2 id="using-pyo3">
Using PyO3
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#using-pyo3" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Using PyO3" href="#using-pyo3">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>We just need to make this function accessible in a Python module now.
First of all, let’s import <code>pyo3</code> using <code>use pyo3::prelude::*;</code>.
Importing external crates in Rust is done via the <code>use</code> keyword, and <code>::</code> separates namespaces (paths).
The <code>prelude</code> namespace is a Rust convention and bundles most of the functionality you’d need from a crate.
We import everything from the prelude with the <code>*</code> glob, much like <code>from module import *</code> would in Python.</p>
<p>Every function we want to include in our final Python module needs to be annotated with <code>#[pyfunction]</code>.
This is a <em>macro</em> that will make some changes to our code during compilation to make it compatible with Python.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pyfunction]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">simple_stability</span><span class="p">(</span><span class="n">real</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>:<span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">max_iterations</span>:<span class="kt">usize</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">usize</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>It’s not always this simple, though, because some Rust types can’t be converted to and from Python types.
A list of the Rust types that implement <code>IntoPy</code>, and are therefore valid return types for PyO3 pyfunctions (argument types implement the complementary <code>FromPyObject</code> trait), can be found <a
class="gblog-markdown__link"
href="https://docs.rs/pyo3/latest/pyo3/conversion/trait.IntoPy.html"
>here</a>.</p>
<p>The last thing we need before compilation is a piece of boilerplate code to assemble our module.
Copy and paste the following at the end of your <code>lib.rs</code> file.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pymodule]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">mandelbrot_module</span><span class="p">(</span><span class="n">_py</span>: <span class="nc">Python</span><span class="p">,</span><span class="w"> </span><span class="n">m</span>: <span class="kp">&</span><span class="nc">PyModule</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">PyResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_function</span><span class="p">(</span><span class="n">wrap_pyfunction</span><span class="o">!</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="o">?</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(())</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>The <code>#[pymodule]</code> macro stitches our Python module together from the function we attach it to.
It’s important that your module has the same name as this function or Python won’t be able to find it.
The code for adding a function is a bit involved, and you don’t really need to understand what’s going on here.
To add another function to this module, just add another <code>m.add_function(...)</code> line and replace <code>simple_stability</code> with the name of your function.</p>
<p>We can now finally build our module and try using it in our Python program.
There are multiple ways of going about this, but we are going to use maturin in this post.
(Have a look at <a
class="gblog-markdown__link"
href="https://pyo3.rs/latest/building_and_distribution.html#manual-builds"
>https://pyo3.rs/latest/building_and_distribution.html#manual-builds</a> if maturin doesn’t suit your needs.)</p>
<p>To use maturin, we first need to create a virtual environment in our <code>mandelbrot_module</code> directory and then install and run maturin in said virtual environment.</p>
<pre tabindex="0"><code># Windows; on Linux/macOS use: python3 -m venv .env, then source .env/bin/activate
$ py -m venv .env
$ ./.env/Scripts/activate
$ pip install maturin
$ maturin develop
</code></pre><p>You should now see some build output in your console while maturin compiles your module.
And it should finish with:</p>
<pre tabindex="0"><code>🛠 Installed mandelbrot_module-0.1.0
</code></pre><p>Let’s confirm that our module actually works.
Copy the previous Python program into the <code>mandelbrot_module</code> directory and modify it so that it uses our new Rust module.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">mandelbrot_module</span> <span class="kn">import</span> <span class="n">simple_stability</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
<span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
<span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">main</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div><p>This new version of our program takes about 4.6s on my machine, which means we achieved a speedup of more than 400%!
This example is very simple and was specifically chosen to translate well into Rust, so our speedup is close to a best-case scenario. Still, it shows how powerful moving performance-critical tasks into Rust can be.</p>
<div class="gblog-post__anchorwrap">
<h2 id="writing-python-classes-in-rust">
Writing Python classes in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#writing-python-classes-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Writing Python classes in Rust" href="#writing-python-classes-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Your real-world code will most likely not be this simple.
You might, for instance, have many different functions that rely on one or two classes for some shared functionality.
In this case, you could translate such a class to improve your code’s performance.</p>
<p>We are going to implement a complex number class because our <code>simple_stability</code> function has been doing complex calculations all along.
<code>zr</code>, <code>zi</code>, <code>real</code> and <code>imag</code> are the real and imaginary components of two complex numbers <code>z</code> and <code>c</code>.
Our function iterates the formula
<link
rel="stylesheet"
href="/katex-e4de31b5.min.css"
/>
<script defer src="/js/katex-3c86c25a.bundle.min.js"></script>
<span class="gblog-katex ">
\(z(n+1) = z(n)^2 + c\)</span> with
<span class="gblog-katex ">
\(z(0) = 0 + 0i\)</span>.</p>
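<p>To make the connection explicit, here is a sketch of that iteration with separate real and imaginary parts. It computes the same kind of stability value, though the exact <code>simple_stability</code> implementation from earlier may differ in details.</p>

```rust
// Sketch (assumed, not the exact simple_stability from the post):
// iterate z(n+1) = z(n)^2 + c, starting at z(0) = 0 + 0i, and report
// how long z stays bounded. |z| > 2 means the sequence diverges.
fn stability(real: f64, imag: f64, max_iter: u32) -> f64 {
    let (mut zr, mut zi) = (0.0_f64, 0.0_f64); // z(0) = 0 + 0i
    for i in 0..max_iter {
        // (zr + zi*i)^2 + c, expanded into real and imaginary parts
        let new_zr = zr * zr - zi * zi + real;
        let new_zi = 2.0 * zr * zi + imag;
        zr = new_zr;
        zi = new_zi;
        if zr * zr + zi * zi > 4.0 {
            // |z|^2 > 4, i.e. |z| > 2: diverged after i steps
            return i as f64 / max_iter as f64;
        }
    }
    1.0 // stayed bounded: point is (likely) inside the Mandelbrot set
}

fn main() {
    assert_eq!(stability(0.0, 0.0, 100), 1.0); // 0 is in the set
    assert!(stability(2.0, 2.0, 100) < 0.01);  // diverges immediately
}
```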
<p>Let’s start with structs then, Rust’s rough equivalent to classes.
Here is the declaration for a complex number struct:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">struct</span> <span class="nc">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span>
<span class="p">}</span><span class="w">
</span></code></pre></div><p>Simply use the <code>struct</code> keyword followed by your struct’s name and declarations of the fields it stores, each with a name and a type.
We can now create objects of the type <code>Complex</code> with similar syntax.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">_example1</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">_origin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="mf">0.0</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="mf">0.0</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Next up, we’re going to create an <code>impl</code> block to implement the methods we need for our calculations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">impl</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="nc">real</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="nc">imag</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">sub</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">mul</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_real</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_imag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="n">new_real</span><span class="p">,</span><span class="w"> </span><span class="n">new_imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">dist_from_origin</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">f64</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Notice that our methods use an uppercase <code>Self</code> and a lowercase <code>self</code>.
Lowercase <code>self</code> refers to the object that this method is called on just like in Python.
Uppercase <code>Self</code> is shorthand for the type that we’re implementing this method for.
So the <code>add</code> method takes an object of type <code>Complex</code> as an argument and also returns an object of type <code>Complex</code>.</p>
<p>Let’s try using these methods in some actual Rust code.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">complex_test</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span>::<span class="n">new</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span><span class="w"> </span><span class="mf">2.0</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Complex</span>::<span class="n">new</span><span class="p">(</span><span class="o">-</span><span class="mf">1.0</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="mf">2.0</span><span class="p">).</span><span class="n">add</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">.</span><span class="n">mul</span><span class="p">(</span><span class="n">x</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>If we try to compile this code we will get this error:</p>
<pre tabindex="0"><code>error[E0382]: use of moved value: `x`
--> src/lib.rs:82:23
|
80 | let x = Complex::new(1.0, 2.0);
| - move occurs because `x` has type `Complex`, which does not implement the `Copy` trait
81 | let y = Complex::new(-1.0, -2.0).add(x);
| - value moved here
82 | let z = y.mul(x);
| ^ value used here after move
</code></pre><p>This error is a result of Rust’s <em>ownership</em> rules I mentioned earlier.
So what is ownership?</p>
<div class="gblog-post__anchorwrap">
<h2 id="ownership-in-rust">
Ownership in Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/rust-for-python/#ownership-in-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ownership in Rust" href="#ownership-in-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The basis of ownership is that every value has exactly one variable that <em>owns</em> it, and it gets automatically deallocated as soon as its owner variable leaves the current scope.
This enables Rust to have automatic deallocation without a garbage collector.</p>
<p>The following example code shows when values get dropped (that is, deallocated) in Rust and how ownership gets moved between two values.
The <code>DropMe</code> struct used in this example will print a message to the console as soon as its value gets dropped.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">drop_example</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">a</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">b</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">other_b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">;</span><span class="w"> </span><span class="c1">// takes ownership
</span><span class="c1"></span><span class="w"> </span><span class="c1">// other_b leaves scope here
</span><span class="c1"></span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"b has been dropped"</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"a drops after this"</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="c1">// a leaves scope here
</span><span class="c1"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>This will result in the following output:</p>
<pre tabindex="0"><code>dropping b
b has been dropped
a drops after this
dropping a
</code></pre><p>We can see that the ownership of the value we initially stored in <code>b</code> gets moved to <code>other_b</code>, which then leaves the inner scope delimited by <code>{}</code>.
This results in the value getting dropped and a message being written to the console.
After this, we print two more messages and then reach the end of the function.
At this point <code>a</code> leaves the current scope and its value also gets dropped.</p>
<p>It’s important to note that <code>b</code> becomes invalid after losing ownership of its value.
This is the reason for the error we just encountered.
We moved the value of <code>x</code> into the <code>add</code> function.
After this, <code>x</code> becomes invalid so we can’t use it again in the next line.</p>
<p>The reason we haven’t encountered this problem sooner is that primitive numerical values like floats and integers are so small that copying them is as fast as creating references to them.
They therefore simply get copied, and no ownership transfer takes place.
(This is the <code>Copy</code> trait the error message mentions.)</p>
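<p>A minimal sketch of this behavior (a hypothetical helper, not code from the post): since <code>f64</code> implements <code>Copy</code>, a function call duplicates the value instead of moving it.</p>

```rust
// f64 implements Copy, so passing x to a function copies it;
// the original variable stays valid afterwards.
fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    let x = 3.0;
    let y = square(x); // x is copied into the function, not moved
    println!("{} {}", x, y); // x is still usable — no "use of moved value" error
    assert_eq!(y, 9.0);
}
```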
<p>Giving up ownership to every function we call is obviously a problem, because we usually want to reuse our values afterwards.
We could simply copy our values before moving them into a function, but this quickly gets expensive with bigger structs.</p>
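<p>The explicit-copy workaround can be sketched like this: deriving <code>Clone</code> gives a struct a <code>.clone()</code> method we can call before handing the value away. The struct here mirrors our <code>Complex</code>, but this is an illustration, not the code we will actually end up using.</p>

```rust
// Deriving Clone gives the struct an explicit .clone() method.
// (Deriving Copy as well would make copies implicit, like for f64.)
#[derive(Clone)]
struct Complex {
    real: f64,
    imag: f64,
}

fn consume(c: Complex) -> f64 {
    c.real + c.imag // takes ownership of its argument
}

fn main() {
    let x = Complex { real: 1.0, imag: 2.0 };
    let sum = consume(x.clone()); // pass a copy, keep x
    assert_eq!(sum, 3.0);
    assert_eq!(x.real, 1.0); // x is still valid here
}
```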
<p>Instead, Rust has another system called borrowing.
Borrowing a value lets us create a reference to it without taking ownership of it.
While references exist, the owner’s use of the value is restricted (for example, it cannot be moved or mutated) until all references are dropped.</p>
<p>There are two types of references in Rust: immutable <code>&</code> references, which give read-only access to the value they reference, and mutable <code>&mut</code> references, which let you modify the referenced value.</p>
<p>You can either have arbitrarily many immutable references or only one mutable reference to a single value at any given point in time.
This is to ensure that there is never more than one variable in your program that can modify a given value, which prevents a lot of tricky errors and data races.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">fn</span> <span class="nf">foo</span><span class="p">(</span><span class="n">x</span>: <span class="nc">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">foo_immut</span><span class="p">(</span><span class="n">x</span>: <span class="kp">&</span><span class="nc">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo_immut {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">foo_mut</span><span class="p">(</span><span class="n">x</span>: <span class="kp">&</span><span class="nc">mut</span><span class="w"> </span><span class="n">DropMe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"foo_mut {}"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">val</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="sc">'m'</span><span class="p">;</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">borrowing_example</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">a</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">b</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">DropMe</span><span class="p">{</span><span class="n">val</span>: <span class="o">'</span><span class="na">c</span><span class="o">'</span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="n">foo</span><span class="p">(</span><span class="n">a</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">foo_immut</span><span class="p">(</span><span class="o">&</span><span class="n">b</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="n">foo_mut</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span><span class="w"> </span><span class="n">c</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"end."</span><span class="p">);</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>This outputs:</p>
<pre tabindex="0"><code>foo a
dropping a
foo_immut b
foo_mut c
end.
dropping m
dropping b
</code></pre><p>You can see that <code>a</code> gets dropped as soon as <code>foo</code> finishes, because it takes ownership of its arguments.
The other two values only get dropped at the end of the main example function because their functions did not take ownership.
(You can also see that Rust drops values in the opposite order they were created in to not break any possible dependencies between them.)</p>
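<p>The exclusivity rule from above can also be seen directly in a small sketch (a hypothetical snippet, not from the post): several shared references may coexist, while a mutable reference must be the only live one.</p>

```rust
fn main() {
    let mut v = vec![1, 2, 3];

    let r1 = &v;
    let r2 = &v; // any number of immutable borrows is fine
    println!("{} {}", r1[0], r2[1]);

    // r1 and r2 are no longer used, so a mutable borrow is now allowed;
    // while m is live, no other reference to v may exist.
    let m = &mut v;
    m.push(4);

    assert_eq!(v.len(), 4); // the owner is usable again after m's last use
}
```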
<p>We can simply change all the arguments of our class methods to immutable references, because we don’t need to modify them.
This step is also necessary to make our methods compatible with PyO3: Rust can’t take ownership of Python values, because ownership doesn’t exist in Python.
We therefore have to either copy our method arguments or take references to them.</p>
<p>After adding the references and the necessary PyO3 macros, our code looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pyclass]</span><span class="w">
</span><span class="w"></span><span class="k">struct</span> <span class="nc">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span>
<span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="cp">#[pymethods]</span><span class="w">
</span><span class="w"></span><span class="k">impl</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="cp">#[new]</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">new</span><span class="p">(</span><span class="n">real</span>: <span class="kt">f64</span><span class="p">,</span><span class="w"> </span><span class="n">imag</span>: <span class="kt">f64</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">Complex</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">real</span>: <span class="nc">real</span><span class="p">,</span><span class="w">
</span><span class="w"> </span><span class="n">imag</span>: <span class="nc">imag</span><span class="w">
</span><span class="w"> </span><span class="p">};</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">sub</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">,</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">mul</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">,</span><span class="w"> </span><span class="n">other</span>: <span class="kp">&</span><span class="nc">Self</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">Self</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_real</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">new_imag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">other</span><span class="p">.</span><span class="n">real</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="bp">Self</span>::<span class="n">new</span><span class="p">(</span><span class="n">new_real</span><span class="p">,</span><span class="w"> </span><span class="n">new_imag</span><span class="p">);</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="k">fn</span> <span class="nf">dist_from_origin</span><span class="p">(</span><span class="o">&</span><span class="bp">self</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">f64</span> <span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">real</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">imag</span><span class="p">.</span><span class="n">powi</span><span class="p">(</span><span class="mi">2</span><span class="p">)).</span><span class="n">sqrt</span><span class="p">()</span><span class="w">
</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p><code>#[pyclass]</code> and <code>#[pymethods]</code> perform the usual PyO3 magic of making our code compatible with Python.
<code>#[new]</code> designates our <code>new</code> method as our class constructor, meaning it will be called if we try to create a new <code>Complex</code> object from Python.</p>
<p>We then add our new class to our Python module:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="cp">#[pymodule]</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">mandelbrot_module</span><span class="p">(</span><span class="n">_py</span>: <span class="nc">Python</span><span class="p">,</span><span class="w"> </span><span class="n">m</span>: <span class="kp">&</span><span class="nc">PyModule</span><span class="p">)</span><span class="w"> </span>-> <span class="nc">PyResult</span><span class="o"><</span><span class="p">()</span><span class="o">></span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_function</span><span class="p">(</span><span class="n">wrap_pyfunction</span><span class="o">!</span><span class="p">(</span><span class="n">simple_stability</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="o">?</span><span class="p">)</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w"> </span><span class="n">m</span><span class="p">.</span><span class="n">add_class</span>::<span class="o"><</span><span class="n">Complex</span><span class="o">></span><span class="p">()</span><span class="o">?</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"> </span><span class="nb">Ok</span><span class="p">(())</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></div><p>Once again, you don’t need to understand what’s going on here.
Just copy and paste the <code>m.add_class(...)</code> line and replace <code>Complex</code> with the name you gave your struct.</p>
<p>Finally, we run <code>maturin develop</code> once again and integrate our new class into our example Python program.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">mandelbrot_module</span> <span class="kn">import</span> <span class="n">Complex</span>
<span class="k">def</span> <span class="nf">complex_stability</span><span class="p">(</span><span class="n">real</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">imag</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">max_iterations</span><span class="p">:</span><span class="nb">int</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">Complex</span><span class="p">(</span><span class="n">real</span><span class="p">,</span> <span class="n">imag</span><span class="p">)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">Complex</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">z</span><span class="o">.</span><span class="n">mul</span><span class="p">(</span><span class="n">z</span><span class="p">)</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">z</span><span class="o">.</span><span class="n">dist_from_origin</span><span class="p">()</span> <span class="o">></span> <span class="mi">2</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">i</span>
    <span class="k">return</span> <span class="n">max_iterations</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
    <span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
        <span class="n">line</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
            <span class="n">line</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">complex_stability</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span>
        <span class="n">values</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">values</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>

<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div><p>This iteration of our program is actually much slower, taking about 2 minutes.
This is probably because we spend a lot of time switching between Rust and Python and creating new <code>Complex</code> objects, while the original program just ran plain floating-point operations, which have presumably already been heavily optimised in C.</p>
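<p>The shape of that overhead can be made tangible with a pure-Python stand-in (an illustrative sketch only; the class below mimics our Rust <code>Complex</code>, it is not PyO3 code):</p>

```python
# Pure-Python stand-in illustrating per-step overhead: the object
# version allocates a new object and dispatches a method on every
# iteration, the float version does bare arithmetic.
class PyComplex:
    def __init__(self, real, imag):
        self.real = real
        self.imag = imag

    def add(self, other):
        # every call allocates a fresh object
        return PyComplex(self.real + other.real, self.imag + other.imag)

def with_objects(steps=1000):
    z = PyComplex(0.0, 0.0)
    for _ in range(steps):
        z = z.add(PyComplex(1.0, 1.0))
    return z.real

def with_floats(steps=1000):
    zr = zi = 0.0
    for _ in range(steps):
        zr, zi = zr + 1.0, zi + 1.0
    return zr
```

<p>Both functions compute the same value, but the object version pays for two allocations and a method dispatch per step; with PyO3, each of those calls additionally crosses the Python/Rust boundary.</p>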
<p>I can say from experience though that translating bigger classes with more involved methods can significantly speed up your programs.</p>
<p>This concludes our Rust tutorial for Python programmers.
I hope that this post has sparked your interest in Rust and has given you ideas on how to use it in your existing projects.
If you want to learn more about Rust, check out the <a
class="gblog-markdown__link"
href="https://doc.rust-lang.org/stable/book/"
>Rust book</a>.
If you want to learn more about PyO3, check out its <a
class="gblog-markdown__link"
href="https://pyo3.rs/v0.15.1/"
>official user guide</a>.
The code for this post and the project it was based on can be found on GitHub:</p>
<ol>
<li><a
class="gblog-markdown__link"
href="https://github.com/DrunkJon/Rust-for-Python-Example"
>https://github.com/DrunkJon/Rust-for-Python-Example</a></li>
<li><a
class="gblog-markdown__link"
href="https://github.com/DrunkJon/MandelbrotViewer"
>https://github.com/DrunkJon/MandelbrotViewer</a></li>
</ol>
Clang/LLVM overviewhttps://blog.parcio.de/posts/2022/07/clang-llvm-overview/Hannes Winkler2022-07-04T00:00:00+00:002022-07-04T00:00:00+00:00
<p>Compilers are complex programs with complex requirements.
The two most widespread C compilers, GCC and Clang/LLVM, are <strong>10–15 million</strong> lines of code behemoths, designed to produce optimal machine code for whatever arbitrary target the user desires.
In this blog post I’m going to give an overview of how the Clang/LLVM C compiler works, from processing the source code to writing native binaries.</p>
<div class="gblog-post__anchorwrap">
<h1 id="1-introduction">
1. Introduction
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#1-introduction" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 1. Introduction" href="#1-introduction">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<p>First of all, what is a compiler?</p>
<p>A computer (or CPU, rather) executes binary <strong>machine code</strong>.
The human-readable form of machine code is called <strong>assembly code</strong>.
However, assembly code is very low-level and very unnatural to write for humans.
So we write our programs in higher-level programming languages like <em>C</em>, <em>C++</em>, <em>Rust</em>, etc. instead and let a compiler translate that source code into machine code.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compiler-terminology_huce7e72b373dc32be6682a8c9068de3e7_41292_1800x0_resize_box_3.png"
alt="Compiler terminology"
/>
</picture>
</a>
<figcaption>
Compiler terminology
(<a
class="gblog-markdown__link"
href="https://cs.lmu.edu/~ray/images/staticcompilation.png"
>Ray Toal, Intro to Compilers</a> (edited) (License: unknown))
</figcaption>
</figure>
</div>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/compilation-example_hu3d5862152994eb6732f4cf05fbb2ce2f_28947_1800x0_resize_box_3.png"
alt="Example compilation (C source code to x86-64 assembly)"
/>
</picture>
</a>
<figcaption>
Example compilation (C source code to x86-64 assembly)
</figcaption>
</figure>
</div>
</p>
<p>There are many more compilers than GCC and Clang, for a wide variety of programming languages.</p>
<p>One can distinguish between two kinds of compilers:</p>
<ol>
<li><strong>AOT (ahead-of-time) compilers.</strong>
These are compilers where all of the source code is compiled to target code before the program is run.
For example, basically every C/C++ compiler is an AOT compiler.</li>
<li><strong>JIT (just-in-time) compilers.</strong>
JIT compilers compile code even <em>while</em> the program is running.
Examples: <a
class="gblog-markdown__link"
href="https://v8.dev/"
>Chromium’s JavaScript engine</a>, <a
class="gblog-markdown__link"
href="https://dart.dev/overview"
>Dart</a>, <a
class="gblog-markdown__link"
href="https://luajit.org/"
>LuaJIT</a></li>
</ol>
<p>At first, this sounds like JIT compilation is a lot slower than AOT compilation, but that’s not necessarily true.
JIT compilers have more information about the machine/CPU they’re targeting and can take that into account when compiling.
AOT compilers, on the other hand, mostly produce code for the “lowest common denominator”<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> unless you explicitly tell them which target to tune the code for.
So even if you have the very latest Intel 12th-generation CPU with the very latest feature set, your compiler will not make use of those features when targeting “just any x86-64 machine”.</p>
<p>Additionally, JIT compilers have more information about the runtime behaviour of the program.
For example, if the JIT compiler sees “oh, this function is only called with an integer argument greater than 128”, it can use that to optimize the function.
An AOT compiler can deduce some information too, but in most cases that information is just very hard to find out without running the program.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h1 id="2-overview-of-the-clangllvm-pipeline">
2. Overview of the Clang/LLVM pipeline
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#2-overview-of-the-clangllvm-pipeline" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 2. Overview of the Clang/LLVM pipeline" href="#2-overview-of-the-clangllvm-pipeline">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/clang-llvm-pipeline_hu0e8c4ac227e93cbbb96ebb6f87c9a53c_77177_1800x0_resize_box_3.png"
alt="Clang/LLVM pipeline"
/>
</picture>
</a>
<figcaption>
Clang/LLVM pipeline
</figcaption>
</figure>
</div>
<p>As you can see in the above picture, there are roughly three phases of compilation:</p>
<ol>
<li><strong>Frontend</strong>
<ul>
<li>In this step, all the source code is processed and an intermediate representation (IR) is generated.</li>
<li>In our case, <code>Clang</code> is the frontend and <code>LLVM</code> is the middle- and backend.</li>
<li>There are many LLVM frontends for many programming languages, Clang is just the one for C/C++.</li>
</ul>
</li>
<li><strong>Middle-end</strong>
<ul>
<li>The middle-end is one of the great features of LLVM.
In this phase, the IR is optimized.
Overall, most of the optimizations are done here.
The cool thing is that LLVM IR is completely universal; all frontends produce IR and all backends consume IR.
That way, if you write an optimization pass for the middle end, it’ll work for many languages and many target CPUs.</li>
</ul>
</li>
<li><strong>Backend</strong>
<ul>
<li>The backend will now consume the optimized IR and produce machine code, which (after linking) can be executed on the target machine.
The backend will also apply some machine-specific optimization passes.</li>
</ul>
</li>
</ol>
<p>I’ll now go a bit more into detail about how these 3 parts work.</p>
<div class="gblog-post__anchorwrap">
<h1 id="3-frontend">
3. Frontend
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#3-frontend" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3. Frontend" href="#3-frontend">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="31-lexer">
3.1. Lexer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#31-lexer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.1. Lexer" href="#31-lexer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The first thing the frontend does is read the source code character by character and produce so-called <strong>tokens</strong>.</p>
<p>For example, for the following C source code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"hello, world!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div><p>The list of tokens could look like this:</p>
<pre tabindex="0"><code>int,
identifier("main"),
lparen,
int,
identifier("argc"),
comma,
// and so on ...
</code></pre><p>Additionally, each token will also have its source location (that is, its file, line and column) associated with it.</p>
<p>In this phase, you basically get rid of all whitespace and comments, and transform the source code into something that can more easily be processed to produce the abstract syntax tree (AST) and, following that, the IR.</p>
<p>The lexer only does very simple recognition of the basic syntactical building blocks of the source code.
For example, if you forget the terminating <code>"</code> of a string, the lexer will complain; but not if you use a wrong type or an undefined symbol.</p>
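<p>The principle can be sketched in a few lines of Python (a hypothetical toy, not Clang’s actual lexer, which is hand-written C++ and far more involved):</p>

```python
# Toy lexer sketch: split C-like source into (kind, text) tokens.
# Illustrative only; real lexers also handle strings, comments,
# operators, preprocessor directives, source locations, and more.
import re

TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("lparen",     r"\("),
    ("rparen",     r"\)"),
    ("comma",      r","),
    ("skip",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"int", "char", "return"}

def tokenize(source):
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "skip":
            continue                    # whitespace is discarded
        if kind == "identifier" and text in KEYWORDS:
            kind = text                 # keywords get their own token kind
        tokens.append((kind, text))
    return tokens
```

<p>Calling <code>tokenize("int main(int argc,")</code> produces the same kind of token list as shown above, with whitespace already thrown away.</p>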
<div class="gblog-post__anchorwrap">
<h2 id="32-parsersemantic-analyzer">
3.2. Parser/semantic analyzer
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#32-parsersemantic-analyzer" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.2. Parser/semantic analyzer" href="#32-parsersemantic-analyzer">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Using the list of tokens from the previous step, we’re now constructing a tree.
And not just any tree, we’re constructing a so-called <strong>abstract syntax tree</strong> (AST).</p>
<p>Basically, we’re now recognizing the language structures of the programming language, like definitions, declarations, control flow statements, expressions, type casts, etc.</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/ast-dump_hudada9e4d69bef1f290b8b2db1f73f31c_128419_1800x0_resize_box_3.png"
alt="Clang AST dump for the hello world C program"
/>
</picture>
</a>
<figcaption>
Clang AST dump for the hello world C program
</figcaption>
</figure>
</div>
<p>The above image shows the AST for the example C program in <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>.
<code><invalid sloc></code> means <code>invalid source location</code>.
The nodes with <code><invalid sloc></code> are “imaginary”, they don’t have any corresponding source location and were added by Clang after the fact.</p>
<p>In the AST, we can clearly recognize the structure of the hello world program above by looking at the nodes with valid source locations:</p>
<ul>
<li>The function declaration <code>int main(int, char **)</code> and the names of the two arguments, <code>argc</code> & <code>argv</code> (<code>FunctionDecl</code> and <code>ParmVarDecl</code>)</li>
<li>The compound statement <code>{ ... }</code> afterwards (<code>CompoundStmt</code>)</li>
<li>The call to <code>printf</code> with some implicit casts (<code>CallExpr</code>)</li>
<li>Finally, the <code>return 0</code> (<code>ReturnStmt</code>)</li>
</ul>
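<p>As a rough mental model, that recognized structure could be written down as nested nodes (a hypothetical miniature; Clang’s real AST nodes are C++ classes carrying types, source locations, and much more):</p>

```python
# Hypothetical miniature of the hello-world AST as nested tuples:
# node kind first, then its children/attributes.
hello_world_ast = (
    "FunctionDecl", "main",
    [
        ("ParmVarDecl", "argc", "int"),
        ("ParmVarDecl", "argv", "char **"),
    ],
    ("CompoundStmt", [
        ("CallExpr", "printf", ["hello, world!\n"]),
        ("ReturnStmt", 0),
    ]),
)
```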
<div class="gblog-post__anchorwrap">
<h3 id="ambiguity">
Ambiguity
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#ambiguity" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ambiguity" href="#ambiguity">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The parser has some predefined rules, like <code>function-declaration = type identifier ( parameter-list ) ...</code>.
It tries to match those rules against the tokens, and that is how Clang builds the AST.
However, in practice it’s not that easy.
For example, in C there are two ways you can parse <code>a * b</code>.</p>
<ul>
<li>Either <code>a</code> and <code>b</code> are variables and that expression is a multiplication,</li>
<li>or <code>a</code> is a type name, and <code>a * b</code> is the declaration of a variable <code>b</code> with type <code>a*</code> (pointer to <code>a</code>)</li>
</ul>
<p>So to be able to parse this correctly, you need to know beforehand if <code>a</code> is a type or a variable.
In C++ it’s even more complicated.
There’s a saying that “parsing C is hard and parsing C++ is impossible.” The C++ grammar is <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Most_vexing_parse"
>ambiguous</a>, <a
class="gblog-markdown__link"
href="http://port70.net/~nsz/c/c%2B%2B/turing.pdf"
>C++ templates are Turing complete</a> and parsing it is <a
class="gblog-markdown__link"
href="https://blog.reverberate.org/2013/08/parsing-c-is-literally-undecidable.html"
>literally undecidable</a>.
That’s one of the reasons why Clang has hand-written parsers for both C and C++.</p>
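<p>A toy decision rule makes the dependence on that knowledge explicit (a hypothetical sketch, not Clang’s implementation):</p>

```python
# Toy disambiguation of `a * b`: the parser consults a symbol table
# of known type names before deciding how to read the tokens.
# Hypothetical sketch only.
def classify(a, b, type_names):
    if a in type_names:
        return f"declaration of {b} with type {a}*"
    return f"multiplication of {a} and {b}"
```

<p>With <code>a</code> registered as a type name, <code>classify</code> reports a declaration; otherwise, a multiplication. A real C parser maintains similar state while it walks the token stream.</p>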
<p>The parser works closely together with the <strong>semantic analyzer</strong> (sema).
The sema will do things like inferring types, adding type casts, performing validity checks, and emitting warnings.
For example, warnings about unused code or infinite self-recursion are emitted by the sema.</p>
<div class="gblog-post__anchorwrap">
<h2 id="33-ir-generator">
3.3. IR generator
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#33-ir-generator" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 3.3. IR generator" href="#33-ir-generator">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The IR generator will now (surprise!) generate rough, unoptimized <strong>IR</strong> using the AST from the previous step.</p>
<p>LLVM IR is a full-fledged language with well-defined semantics.
The IR below was generated for the hello world program from Section <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>.
However, I’d say its workings are a bit out of scope for this blog post, so I’m not going to go into detail here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-llvm" data-lang="llvm"><span class="vg">@.str</span> <span class="p">=</span> <span class="k">private</span> <span class="k">unnamed_addr</span> <span class="k">constant</span> <span class="p">[</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">]</span> <span class="k">c</span><span class="s">"hello, world!\0A\00"</span><span class="p">,</span> <span class="k">align</span> <span class="m">1</span>
<span class="k">define</span> <span class="err">dso_local</span> <span class="k">i32</span> <span class="vg">@main</span><span class="p">(</span><span class="k">i32</span> <span class="n">%0</span><span class="p">,</span> <span class="k">i8</span><span class="p">**</span> <span class="n">%1</span><span class="p">)</span> <span class="vg">#0</span> <span class="nv">!dbg</span> <span class="n">!8</span> <span class="p">{</span>
  <span class="n">%3</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i32</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="n">%4</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i32</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="n">%5</span> <span class="p">=</span> <span class="k">alloca</span> <span class="k">i8</span><span class="p">**,</span> <span class="k">align</span> <span class="m">8</span>
  <span class="k">store</span> <span class="k">i32</span> <span class="m">0</span><span class="p">,</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%3</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="k">store</span> <span class="k">i32</span> <span class="n">%0</span><span class="p">,</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%4</span><span class="p">,</span> <span class="k">align</span> <span class="m">4</span>
  <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span> <span class="k">i32</span><span class="p">*</span> <span class="n">%4</span><span class="p">,</span> <span class="kt">metadata</span> <span class="n">!17</span><span class="p">,</span> <span class="kt">metadata</span> <span class="nv">!DIExpression</span><span class="p">()),</span> <span class="nv">!dbg</span> <span class="n">!18</span>
  <span class="k">store</span> <span class="k">i8</span><span class="p">**</span> <span class="n">%1</span><span class="p">,</span> <span class="k">i8</span><span class="p">***</span> <span class="n">%5</span><span class="p">,</span> <span class="k">align</span> <span class="m">8</span>
  <span class="k">call</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span> <span class="k">i8</span><span class="p">***</span> <span class="n">%5</span><span class="p">,</span> <span class="kt">metadata</span> <span class="n">!19</span><span class="p">,</span> <span class="kt">metadata</span> <span class="nv">!DIExpression</span><span class="p">()),</span> <span class="nv">!dbg</span> <span class="n">!20</span>
  <span class="n">%6</span> <span class="p">=</span> <span class="k">call</span> <span class="k">i32</span> <span class="p">(</span><span class="k">i8</span><span class="p">*,</span> <span class="p">...)</span> <span class="vg">@printf</span><span class="p">(</span><span class="k">i8</span><span class="p">*</span> <span class="k">getelementptr</span> <span class="k">inbounds</span> <span class="p">([</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">],</span> <span class="p">[</span><span class="m">15</span> <span class="k">x</span> <span class="k">i8</span><span class="p">]*</span> <span class="vg">@.str</span><span class="p">,</span> <span class="k">i64</span> <span class="m">0</span><span class="p">,</span> <span class="k">i64</span> <span class="m">0</span><span class="p">)),</span> <span class="nv">!dbg</span> <span class="n">!21</span>
  <span class="k">ret</span> <span class="k">i32</span> <span class="m">0</span><span class="p">,</span> <span class="nv">!dbg</span> <span class="n">!22</span>
<span class="p">}</span>
<span class="k">declare</span> <span class="kt">void</span> <span class="vg">@llvm.dbg.declare</span><span class="p">(</span><span class="kt">metadata</span><span class="p">,</span> <span class="kt">metadata</span><span class="p">,</span> <span class="kt">metadata</span><span class="p">)</span> <span class="vg">#1</span>
<span class="k">declare</span> <span class="err">dso_local</span> <span class="k">i32</span> <span class="vg">@printf</span><span class="p">(</span><span class="k">i8</span><span class="p">*,</span> <span class="p">...)</span> <span class="vg">#2</span>
<span class="k">attributes</span> <span class="vg">#0</span> <span class="p">=</span> <span class="p">{</span> <span class="k">noinline</span> <span class="k">nounwind</span> <span class="k">optnone</span> <span class="k">uwtable</span> <span class="s">"frame-pointer"</span><span class="p">=</span><span class="s">"all"</span> <span class="s">"min-legal-vector-width"</span><span class="p">=</span><span class="s">"0"</span> <span class="s">"no-trapping-math"</span><span class="p">=</span><span class="s">"true"</span> <span class="s">"stack-protector-buffer-size"</span><span class="p">=</span><span class="s">"8"</span> <span class="s">"target-cpu"</span><span class="p">=</span><span class="s">"x86-64"</span> <span class="s">"target-features"</span><span class="p">=</span><span class="s">"+cx8,+fxsr,+mmx,+sse,+sse2,+x87"</span> <span class="s">"tune-cpu"</span><span class="p">=</span><span class="s">"generic"</span> <span class="p">}</span>
<span class="k">attributes</span> <span class="vg">#1</span> <span class="p">=</span> <span class="p">{</span> <span class="err">no</span><span class="k">free</span> <span class="err">nosyn</span><span class="k">c</span> <span class="k">nounwind</span> <span class="k">readnone</span> <span class="err">speculatable</span> <span class="err">willreturn</span> <span class="p">}</span>
<span class="k">attributes</span> <span class="vg">#2</span> <span class="p">=</span> <span class="p">{</span> <span class="s">"frame-pointer"</span><span class="p">=</span><span class="s">"all"</span> <span class="s">"no-trapping-math"</span><span class="p">=</span><span class="s">"true"</span> <span class="s">"stack-protector-buffer-size"</span><span class="p">=</span><span class="s">"8"</span> <span class="s">"target-cpu"</span><span class="p">=</span><span class="s">"x86-64"</span> <span class="s">"target-features"</span><span class="p">=</span><span class="s">"+cx8,+fxsr,+mmx,+sse,+sse2,+x87"</span> <span class="s">"tune-cpu"</span><span class="p">=</span><span class="s">"generic"</span> <span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h1 id="4-middle-end">
4. Middle-end
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#4-middle-end" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 4. Middle-end" href="#4-middle-end">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<p>Now that we have the IR, we can <strong>optimize</strong> it.
What does “optimizing” even mean, though?
When we optimize a program, we want to make it <strong>faster</strong> or <strong>smaller</strong>.</p>
<p>Making it run faster is the usual goal, achievable for example by combining operations, reducing function calls, and resolving recursion.
Making the finished program binaries smaller is often done in embedded environments, where you don’t have much space available.</p>
<p>A single “step” of optimization is called an <strong>optimization pass</strong>.
Usually when employing optimizations, you’ll bundle a bunch of these together in a chain.
The order is important too: Some optimization passes rely on the fact that some other optimization pass has run before them (that maybe annotated the IR with some analysis information), others produce better results when some other optimization pass has run before them.
But that’s mostly opaque to the user; Clang will do the right thing for you when you just use the <code>-O...</code> commandline argument.</p>
<p>There are three kinds of optimization passes:</p>
<ol>
<li><strong>Analysis passes</strong>
<ul>
<li>Those analyze the IR and try to deduce some information that’s useful (or required) for other optimization passes.</li>
<li>For example, one analysis pass will deduce information about the call graph, another will find memory dependencies, etc.</li>
</ul>
</li>
<li><strong>Transform passes</strong>
<ul>
<li>Transform passes are what actually optimizes the IR.
They use the information from the analysis passes to transform the IR in such a way that it’s either faster or smaller afterwards.</li>
<li>The <code>-inline</code> transform pass will do function inlining (more on that later), <code>-adce</code> will eliminate dead code, <code>-instcombine</code> combines redundant instructions, etc.</li>
</ul>
</li>
<li><strong>Utility passes</strong>
<ul>
<li>These are mostly used for debugging purposes.
These aren’t applied by Clang by default, only if you explicitly tell it to.</li>
<li>For example, the <code>-view-cfg</code> pass will use Graphviz to visualize the control flow graph.</li>
</ul>
</li>
</ol>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/optimization-passes_hua6c2e01a3ad55be301e16155216e0d4e_37763_1800x0_resize_box_3.png"
alt="Pie chart of the optimization passes of LLVM"
/>
</picture>
</a>
<figcaption>
Pie chart of the optimization passes of LLVM
(Based on <a
class="gblog-markdown__link"
href="https://llvm.org/docs/Passes.html"
>https://llvm.org/docs/Passes.html</a> (fetched on 2021-11-05))
</figcaption>
</figure>
</div>
<p>To use these optimizations, you can just use the <code>-O...</code> commandline argument for Clang.</p>
<p>For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="c1"># no optimizations at all (default)</span>
$ clang -O0 main.c
<span class="c1"># optimize for speed</span>
$ clang -O1 main.c
<span class="c1"># optimize even more for speed</span>
$ clang -O2 main.c
<span class="c1"># optimize even more more for speed</span>
$ clang -O3 main.c
<span class="c1"># optimize for fastest speed possible</span>
$ clang -Ofast main.c
<span class="c1"># optimize for size (basically -O2 with some optimizations that reduce size)</span>
$ clang -Os main.c
<span class="c1"># optimize even more for size</span>
$ clang -Oz main.c
</code></pre></div><p>As a developer, usually you want your programs to run fast.
So why don’t we always use <code>-Ofast</code>?</p>
<ul>
<li>The first reason is that it breaks strict standards compliance.
<code>-Ofast</code> is basically <code>-O3</code> with <a
class="gblog-markdown__link--code"
href="https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffast-math"
><code>-ffast-math</code></a>.
Fast-math will do a lot of things that are not compatible with the C or C++ standards.
For example, it’ll replace a division by a floating-point constant (<code>a / 0.123</code>) with a multiplication by the reciprocal (<code>a * (1 / 0.123)</code>) since multiplication is usually faster; some floating-point errors won’t be reported anymore, some results are less accurate, etc.</li>
<li>For large projects, it’ll increase compile time.
In many cases, it might still be worth it, but others may not want to do that.</li>
<li>It might increase program binary size.</li>
</ul>
<p>Okay, so the problem with <code>-Ofast</code> is mainly non-compliant math.
So why don’t we just always use <code>-O3</code>?
Why is there a <code>-O2</code> then?
It turns out that’s a pretty good question.
<code>-O3</code> is basically the same as <code>-O2</code>.
In the LLVM version I tested, <code>-O3</code> enables two more optimization passes than <code>-O2</code>, and for one of them the code says “<a
class="gblog-markdown__link"
href="https://github.com/llvm/llvm-project/blob/2b46417aa2d42d5d2a14df1675cfee547fd46556/llvm/lib/Passes/PassBuilderPipelines.cpp#L755"
>FIXME: It isn’t at all clear why this should be limited to -O3</a>”.</p>
<p>Okay, now that we have optimized the IR, we can go on to the next step:</p>
<div class="gblog-post__anchorwrap">
<h1 id="5-backend">
5. Backend
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#5-backend" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5. Backend" href="#5-backend">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="51-instruction-selection">
5.1. Instruction Selection
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#51-instruction-selection" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.1. Instruction Selection" href="#51-instruction-selection">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>This is the first phase of the backend.
Everything (well, almost everything) target-specific happens here.
Now, we want to transform the optimized IR into something that’ll run on the target CPU, and as the first step we’re going to select the instructions for that.</p>
<p>LLVM has multiple instruction selectors:</p>
<ul>
<li><strong>SelectionDAG</strong> (produces the best results, best documented)</li>
<li><strong>FastISel</strong> (produces poor results, but runs quickly)</li>
<li><strong>GlobalISel</strong> (WIP, designed to combine the compilation speed of FastISel with the quality of SelectionDAG)</li>
</ul>
<p>Since SelectionDAG is best documented and produces the best results, I’m going to use that as an example.
Actually, SelectionDAG is not only the name of this instruction selector but also the output of it.
That is, the output of this instruction selector is also called a <strong>SelectionDAG</strong> (selection directed acyclic graph).
In this graph, each node is an instruction and the edges are data or control dependencies.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>A finished DAG (for the program from <a
class="gblog-markdown__link"
href="#31-lexer"
>3.1</a>) looks like this:</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/selectiondag-final_hu6310ef32cd856d3dfd653fb57f03ae37_44586_1800x0_resize_box_3.png"
alt="Final SelectionDAG"
/>
</picture>
</a>
<figcaption>
Final SelectionDAG
</figcaption>
</figure>
</div>
<p>But how is that graph built?
There are multiple steps involved here:</p>
<ol>
<li>
<p>First of all, we’re using static mappings from IR instruction to SelectionDAG node, and the control and dataflow dependencies we can infer from the IR to build an initial, naive SelectionDAG.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
</li>
<li>
<p>Now we’re applying some basic optimizations to it.</p>
</li>
<li>
<p>The (still naive) SelectionDAG we now have might not even be runnable on the target CPU.
Maybe it contains operations that aren’t supported or some type doesn’t work with the operation used, etc.
In other words, it might be an <strong>illegal</strong> SelectionDAG.
So now, as the first step of making it a legal graph, we’re going to <strong>legalize</strong> the types.
There are two kinds of modifications we can make to the types here:</p>
<ul>
<li><strong>Type promotion</strong> (converting a small type to a larger one)</li>
<li><strong>Type expansion</strong> (splitting up a larger type into multiple smaller ones)</li>
</ul>
<p>For example: If the target doesn’t support 16-bit integers, we’re just going to <em>promote</em> it to a 32-bit integer instead.
Likewise, if it doesn’t support 64-bit integers, we’re just going to <em>expand</em> it to two 32-bit ints instead.</p>
</li>
<li>
<p>Now, we optimize that again, mostly to get rid of redundant operations introduced by type promotion/expansion.</p>
</li>
<li>
<p>After that, we’re going to legalize the operations.
Targets sometimes have weird, arbitrary constraints for the types that can be used for some operations.
(x86 does not support byte-conditional moves, PowerPC does not support sign-extending loads from a 16-bit memory location.) So we’ll apply type promotion and type expansion or some custom, target-specific modifications here to make the SelectionDAG legal.</p>
</li>
<li>
<p>Optimize again.</p>
</li>
<li>
<p>Actually select the instructions.
This phase is a bit more complicated, but in a nutshell, LLVM will take the instructions in the SelectionDAG we have (which are target-independent instructions that just happen to be executable on the target machine) and translate them into target-specific instructions, while also using pattern-matching to combine instructions where possible.</p>
</li>
</ol>
<div class="gblog-post__anchorwrap">
<h2 id="52-scheduling-and-formation">
5.2. Scheduling and Formation
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#52-scheduling-and-formation" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.2. Scheduling and Formation" href="#52-scheduling-and-formation">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now we have a SelectionDAG of machine instructions.
However, CPUs don’t run DAGs.
So we need to linearize the SelectionDAG, that is, form it into a list.
There are many ways to do that; LLVM will just use some heuristics so that we, for instance, always have enough registers available, but you could also take instruction latencies into account, etc.
You can print the linearized SelectionDAG for some LLVM IR using <code>llc -print-machineinstrs ...</code>.</p>
<p>After this, there’s a machine code (actually MIR) based optimization phase.</p>
<div class="gblog-post__anchorwrap">
<h2 id="53-register-allocation">
5.3. Register Allocation
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#53-register-allocation" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.3. Register Allocation" href="#53-register-allocation">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Up until this point, even if you might not have realized it, we acted like the target machine has an infinite amount of one-time assignable registers.</p>
<p>In other words, the IR and SelectionDAG were in the so-called <strong>SSA</strong> (static single assignment) form.
The SSA form simplifies many analyses of the control flow graph of the IR.
At this point, we want to select the actual target registers we will use for the previously virtual SSA registers.
Most targets only have 16, maybe 32 registers and many of those are reserved for special purposes.
It’s possible we don’t have enough physical registers to accommodate all the virtual registers.
That means we have to put some of them into main memory instead, which is called <strong>spilling</strong>.</p>
<p>Now that we’ve selected the physical registers, we’re adding some prologue and epilogue instructions to the function, that is, push some registers on the stack and pop them again later.</p>
<p>After that comes a machine code based optimization phase.</p>
<div class="gblog-post__anchorwrap">
<h2 id="54-code-emission">
5.4. Code Emission
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#54-code-emission" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 5.4. Code Emission" href="#54-code-emission">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Now we can finally emit the optimized machine code, in whatever format the user desires.
Some targets support writing <code>.o</code> files directly, for others assembly will be written and assembled into an <code>.o</code> file as an intermediate step.
Note that to be able to run this file, we also need to link it, which Clang can do for you as well.
(Not Clang itself, Clang will just call a linker.)</p>
<div class="flex justify-center">
<figure
class="gblog-post__figure"
>
<a class="gblog-markdown__link--raw" href="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix.png">
<picture>
<source
srcset="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_600x0_resize_box_3.png 600w, https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_1200x0_resize_box_3.png 1200w" sizes="100vw"
/>
<img loading="lazy"
src="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/codegen-feature-matrix_hu5ef6a93dc4609498d73a98f1ccec3ec0_49653_1800x0_resize_box_3.png"
alt="Feature matrix of the different target code generators"
/>
</picture>
</a>
<figcaption>
Feature matrix of the different target code generators
(From <a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#target-feature-matrix"
>https://llvm.org/docs/CodeGenerator.html#target-feature-matrix</a>)
</figcaption>
</figure>
</div>
<div class="gblog-post__anchorwrap">
<h1 id="6-optimizations">
6. Optimizations
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#6-optimizations" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 6. Optimizations" href="#6-optimizations">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<div class="gblog-post__anchorwrap">
<h2 id="loop-unrolling">
Loop unrolling
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#loop-unrolling" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Loop unrolling" href="#loop-unrolling">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As an example function, take this piece of code which will just copy over 16 integers from <code>s</code> to <code>d</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">16</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>When we execute this, we basically do:</p>
<pre tabindex="0"><code>i := 0,
check:
is i < 16? if no goto end
d[i] = s[i]
i++
goto check
end:
return
</code></pre><p>So we’ll check 16 times if <code>i < 16</code> and we’ll increment <code>i</code> 16 times, which is quite a bit of overhead, given that we know we want <em>exactly</em> 16 iterations.</p>
<p>Clang/LLVM will use loop unrolling to transform it into this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span>
</code></pre></div><p>So basically we just repeat the loop body 16 times and save the overhead.</p>
<div class="gblog-post__anchorwrap">
<h2 id="vectorizing">
Vectorizing
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#vectorizing" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Vectorizing" href="#vectorizing">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Even if we don’t have a fixed upper bound for the loop, LLVM can still do something about it.
In this example we do the same thing but iterate up to <code>n</code>, which is a parameter, so not known at compile time.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>Many CPUs have instructions that allow copying many more bytes at once, which is faster than copying only 4 bytes at a time.
So the compiler will try to use those instructions for the bulk of the copying and do the rest one-by-one again.
That’s called vectorization.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="o">-</span><span class="mi">127</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">128</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// copy 128 ints at once
</span><span class="c1"></span> <span class="p">}</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h2 id="function-inlining-and-loop-unrolling">
Function inlining and loop unrolling
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#function-inlining-and-loop-unrolling" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Function inlining and loop unrolling" href="#function-inlining-and-loop-unrolling">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>In this example, we have a mixture of the above.
There’s <code>copy_n</code>, which still takes the upper loop limit as a parameter, and it is used in <code>copy_16_v2</code>, which unconditionally calls copy_n with <code>n=16</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_n</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">int</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">i</span><span class="o">=</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">copy_16_v2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">copy_n</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="mi">16</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>The compiler will now basically copy and paste the implementation of <code>copy_n</code> into the other function, which is called function inlining.
Then it can deduce that the for loop limit is 16, so it’ll make use of loop unrolling again.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="kt">void</span> <span class="nf">copy_16_v2</span><span class="p">(</span><span class="kt">int</span> <span class="o">*</span><span class="n">d</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
<span class="n">d</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">d</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="c1">// ...
</span><span class="c1"></span><span class="p">}</span>
</code></pre></div><div class="gblog-post__anchorwrap">
<h1 id="7-sources">
7. Sources
<a data-clipboard-text="https://blog.parcio.de/posts/2022/07/clang-llvm-overview/#7-sources" class="gblog-post__anchor clip flex align-center" aria-label="Anchor 7. Sources" href="#7-sources">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h1>
</div>
<ul>
<li>[Ray Toal, Intro to Compilers] Toal, R. Intro to Compilers. <a
class="gblog-markdown__link"
href="https://www.cs.cornell.edu/~asampson/blog/llvm.html"
>https://www.cs.cornell.edu/~asampson/blog/llvm.html</a></li>
<li>[Finkel, 2017] Finkel, H. and Horváth, G. (2017). Code Transformation and analysis using Clang and LLVM. <a
class="gblog-markdown__link"
href="https://llvm.org/devmtg/2017-06/2-Hal-Finkel-LLVM-2017.pdf"
>https://llvm.org/devmtg/2017-06/2-Hal-Finkel-LLVM-2017.pdf</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/6319086/are-gcc-and-clang-parsers-really-handwritten"
>https://stackoverflow.com/questions/6319086/are-gcc-and-clang-parsers-really-handwritten</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/11510792/is-the-semantic-analysis-step-in-clang-an-essential-part-of-the-compiler"
>https://stackoverflow.com/questions/11510792/is-the-semantic-analysis-step-in-clang-an-essential-part-of-the-compiler</a></li>
<li><a
class="gblog-markdown__link"
href="https://cppdepend.com/blog/?p=321"
>https://cppdepend.com/blog/?p=321</a></li>
<li><a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#legalize-operations"
>https://llvm.org/docs/CodeGenerator.html#legalize-operations</a></li>
<li><a
class="gblog-markdown__link"
href="https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1"
>https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1</a></li>
<li><a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/845355/do-programming-language-compilers-first-translate-to-assembly-or-directly-to-mac"
>https://stackoverflow.com/questions/845355/do-programming-language-compilers-first-translate-to-assembly-or-directly-to-mac</a></li>
<li><a
class="gblog-markdown__link"
href="https://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/"
>https://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/</a></li>
<li><a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#target-feature-matrix"
>https://llvm.org/docs/CodeGenerator.html#target-feature-matrix</a></li>
<li><a
class="gblog-markdown__link"
href="https://blog.regehr.org/archives/1603"
>https://blog.regehr.org/archives/1603</a></li>
<li><a
class="gblog-markdown__link"
href="https://github.com/llvm/llvm-project/blob/7175886a0f612aded1430ae240ca7ffd53d260dd/llvm/lib/Passes/PassBuilderPipelines.cpp#L717"
>https://github.com/llvm/llvm-project/blob/7175886a0f612aded1430ae240ca7ffd53d260dd/llvm/lib/Passes/PassBuilderPipelines.cpp#L717</a></li>
<li><a
class="gblog-markdown__link"
href="https://clang.llvm.org/docs/CommandGuide/clang.html"
>https://clang.llvm.org/docs/CommandGuide/clang.html</a></li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Some compilers translate into machine code directly (LLVM, mostly), other translate into assembly and use an assembler to compile it into machine code (GCC). <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Not necessarily, there’s something called <a
class="gblog-markdown__link"
href="https://hannes.hauswedell.net/post/2017/12/09/fmv/"
>Function Multiversioning</a>, but that’s not automatic. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Profile-guided_optimization"
>Profile guided optimization</a> might help in that case, though. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>There’s one other type of edge called <code>glue</code>, that’ll make the instructions stick together through scheduling (see <a
class="gblog-markdown__link"
href="https://stackoverflow.com/questions/33005061/what-are-glue-and-chain-dependencies-in-an-llvm-dag"
>https://stackoverflow.com/questions/33005061/what-are-glue-and-chain-dependencies-in-an-llvm-dag</a>). <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>This mapping is not entirely static. Target-specific interfaces are used to map things like returns, calls, varargs, etc. (see <a
class="gblog-markdown__link"
href="https://llvm.org/docs/CodeGenerator.html#initial-selectiondag-construction"
>https://llvm.org/docs/CodeGenerator.html#initial-selectiondag-construction</a>). <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
An introduction to performance analysis and understanding profilers, by Kevin Kulot (2022-06-14): https://blog.parcio.de/posts/2022/06/performance-analysis/
<p>Where performance matters, we want to make sure we know what to look for and what to optimize.
For that we need to measure our code by analyzing its performance with tools either provided by the language or external ones called “profilers”.
This post intends to give an overview of both of these methods while introducing some tools – like <code>Google Benchmark</code> and <code>perf</code> – and explaining their functionality.</p>
<blockquote>
<p>“<em>Premature optimization is the root of all evil.</em>”</p>
</blockquote>
<p>A famous quote by Tony Hoare, later popularised by Donald Knuth, shows why measuring performance is so important.
If we try to optimize our code before knowing <em>where</em> it may be needed, it might end up hindering us in the long run.
Trying to be unnecessarily clever by, for example, replacing divisions and multiplications by powers of two with bit shifts, might just end up hurting readability while not providing any gains in performance as most compilers nowadays are able to do such trivial optimizations.</p>
<p>We want to do <strong>benchmarks</strong>, more precisely <strong>micro benchmarks</strong>.
Micro benchmarks measure small parts of our code, like single functions or routines, inspecting hot loops and investigating details such as cache misses or assembly code generation.
Before we perform these micro benchmarks, let’s introduce a dichotomy<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> of tools.</p>
<p><strong>In-Code Benchmarking:</strong> Measuring/performing benchmarks within the language by leveraging existing functions and libraries.</p>
<p><strong>Profilers:</strong> External, language-agnostic tools which measure/perform benchmarks while utilizing the compiled binary and system calls.</p>
<div class="gblog-post__anchorwrap">
<h2 id="in-code-benchmarking">
In-Code Benchmarking
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#in-code-benchmarking" class="gblog-post__anchor clip flex align-center" aria-label="Anchor In-Code Benchmarking" href="#in-code-benchmarking">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Let’s look at and compare three different languages and what they natively provide for benchmarking and measuring performance.
C, C++ and Rust – compiled, systems programming languages – all provide at least some tools necessary for capturing the current time (with varying precision), something we definitely need to begin measuring our code.
By going from an older, relatively low-level language<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>, like C, to a newer one like Rust with more high-level abstractions, we will quickly notice a difference in ease of use and amount of options available to us while coding (micro) benchmarks.</p>
<div class="gblog-post__anchorwrap">
<h3 id="as-old-as-timeh-c-and-posix">
As old as <code><time.h></code>: C and POSIX
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#as-old-as-timeh-c-and-posix" class="gblog-post__anchor clip flex align-center" aria-label="Anchor As old as <time.h>: C and POSIX" href="#as-old-as-timeh-c-and-posix">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Let’s begin with C and its <a
class="gblog-markdown__link--code"
href="https://en.cppreference.com/w/c/chrono"
><code>time.h</code></a> facilities:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="hl"><span class="lnt"> 5
</span></span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf"><stdio.h></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><time.h></span><span class="cp">
</span><span class="cp"></span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">time_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="n">time_t</span> <span class="n">end</span> <span class="o">=</span> <span class="n">time</span><span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.2f seconds</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p><code>time(...)</code> returns the current calendar time, which is almost always represented as the number of seconds since <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Unix_time"
>the Epoch (00:00:00 UTC 01/01/1970)</a>, as a <code>time_t</code> object.
The full function signature is <code>time_t time(time_t* arg)</code> where <code>arg</code> acts as an out-parameter storing the same information as the return value, which is why we pass in <code>NULL</code>.
Since <code>time_t</code> is just a typedef for an unspecified (implementation-defined) real type, we can compute the difference between <code>start</code> and <code>end</code> to get the elapsed time in seconds.
This essentially measures so-called “wall clock time”, something we will come back to later.</p>
<p>One alternative time measuring facility is the <code>clock_t clock(void)</code> function.
Unlike <code>time()</code>, it returns the approximate processor time, or “CPU time”, of the current process.
Similar to <code>time_t</code>, the returned value is also an implementation-defined real type<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> from which we can calculate a difference.
This “CPU time” may differ from “wall clock time” as it may advance faster or slower depending on the resources allocated by the operating system.</p>
<p>If we want a little more precision while still using C, the C POSIX library offers additional functionality.
For example, <a
class="gblog-markdown__link--code"
href="https://man7.org/linux/man-pages/man2/gettimeofday.2.html"
><code>gettimeofday()</code></a>, provided by the <code>sys/time.h</code> header, lets us measure with microsecond accuracy:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="cp">#include</span> <span class="cpf"><stdio.h></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><sys/time.h></span><span class="cp">
</span><span class="cp"></span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv_start</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">tv_end</span><span class="p">;</span>
<span class="hl"> <span class="n">gettimeofday</span><span class="p">(</span><span class="o">&</span><span class="n">tv_start</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="n">gettimeofday</span><span class="p">(</span><span class="o">&</span><span class="n">tv_end</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%.2f µs</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="n">tv_end</span><span class="p">.</span><span class="n">tv_usec</span> <span class="o">-</span> <span class="n">tv_start</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>We declare two structs of type <code>timeval</code> as out-parameters for <code>gettimeofday()</code> which takes two arguments: <code>struct timeval* tv</code> and <code>struct timezone* tz</code>.
The <code>timeval</code> struct holds the following information:</p>
<div class="gblog-columns gblog-columns--regular flex flex-gap flex-mobile-column">
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">struct</span> <span class="n">timeval</span> <span class="p">{</span>
<span class="n">time_t</span> <span class="n">tv_sec</span><span class="p">;</span> <span class="cm">/* seconds */</span>
<span class="n">suseconds_t</span> <span class="n">tv_usec</span><span class="p">;</span> <span class="cm">/* microseconds */</span>
<span class="p">};</span>
</code></pre></div>
</div>
<div class="gblog-columns__content gblog-markdown--nested flex-even">
The <code>tv_usec</code> field holds the microseconds elapsed within the current second (not a full microsecond count since the Epoch), which, combined with <code>tv_sec</code>, gives us much greater measuring precision.
The type used to represent the microseconds, <code>suseconds_t</code>, is usually defined as a <code>long</code>, which can hold at least 32 bits.
</div>
</div>
<p>The second argument of <code>gettimeofday()</code> can be used to obtain timezone information, though this mechanism is obsolete and flawed, which is why passing in <code>NULL</code> is recommended<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>.</p>
<div class="gblog-post__anchorwrap">
<h3 id="a-new-dawn-c-and-chrono">
A new dawn: C++ and <code><chrono></code>
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#a-new-dawn-c-and-chrono" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A new dawn: C++ and <chrono>" href="#a-new-dawn-c-and-chrono">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>With the arrival of C++11, a more flexible collection of types for time tracking was added to the standard.
Including the most recent changes to <a
class="gblog-markdown__link--code"
href="https://en.cppreference.com/w/cpp/chrono"
><code><chrono></code></a> as part of C++20, the following clock types are available:</p>
<div class="gblog-columns gblog-columns--regular flex flex-gap flex-mobile-column">
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<p><strong>C++11</strong></p>
<ul>
<li><code>std::chrono::system_clock</code></li>
<li><code>std::chrono::steady_clock</code></li>
<li><code>std::chrono::high_resolution_clock</code></li>
</ul>
</div>
<div class="gblog-columns__content gblog-markdown--nested flex-even">
<p><strong>C++20</strong></p>
<ul>
<li><code>std::chrono::utc_clock</code></li>
<li><code>std::chrono::tai_clock</code></li>
<li><code>std::chrono::gps_clock</code></li>
<li><code>std::chrono::file_clock</code></li>
<li><code>std::chrono::local_t</code></li>
</ul>
</div>
</div>
<p>Now, this might seem overwhelming at first, but when it comes to benchmarking, the only clock types we need to look at are the ones added in C++11.
Out of these, <code>std::chrono::steady_clock</code> is the most suitable for measuring intervals.
To understand why, let’s take a quick detour and talk about what a <code>Clock</code> is according to the C++ standard and the differences between the aforementioned three types of clocks.
In its most basic form, a clock type needs to have a starting point and a tick rate.
A more precise definition of the requirements needed to satisfy being a <code>Clock</code> type can be found <a
class="gblog-markdown__link"
href="https://en.cppreference.com/w/cpp/named_req/Clock"
>here</a>.</p>
<p>Now, let’s compare the different clocks:</p>
<table>
<thead>
<tr>
<th><code>std::chrono::system_clock </code></th>
<th><code>std::chrono::steady_clock </code></th>
<th><code>std::chrono::high_resolution_clock </code></th>
</tr>
</thead>
<tbody>
<tr>
<td>- system wide <em>wall clock time</em></td>
<td>- <strong>monotonic</strong> clock</td>
<td>- clock with the smallest tick period</td>
</tr>
<tr>
<td>- system time can be adjusted</td>
<td>- tick frequency constant</td>
<td>- alias of one of the other two</td>
</tr>
<tr>
<td>- maps to C-style time</td>
<td>- not related to wall clock time</td>
<td>- should be avoided (implementation-defined)</td>
</tr>
</tbody>
</table>
<p><strong>Wall clock time</strong> is the actual, real time a physical clock (be it a watch or an actual wall clock) would measure.
A wall clock may be subject to unexpected changes which would invalidate any measurements taken with it.
It has the ability to jump backward or forward in time through manual adjustments or automatic synchronization with NTP (Network Time Protocol).
This makes <code>std::chrono::system_clock</code> unreliable for interval measurements and unfit for anything but giving us the current time.</p>
<p><strong>Monotonic clocks</strong>, like <code>std::chrono::steady_clock</code>, on the other hand cannot jump forward or backward in time and their tick rate is constant.
<code>std::chrono::steady_clock</code> typically uses the system startup time as its epoch (the exact epoch is implementation-defined) and will never be adjusted.
It acts like a stopwatch, perfect for measuring intervals but not for telling time.</p>
<p>Let’s look at an example:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="hl"><span class="lnt">12
</span></span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="cp">#include</span> <span class="cpf"><chrono></span><span class="cp">
</span><span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp">
</span><span class="cp"></span>
<span class="c1">// to save us from typing std::chrono everytime
</span><span class="c1"></span><span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="hl"> <span class="k">auto</span> <span class="n">start</span> <span class="o">=</span> <span class="n">steady_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
</span> <span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span> <span class="k">auto</span> <span class="n">end</span> <span class="o">=</span> <span class="n">steady_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
</span>
<span class="hl"> <span class="k">auto</span> <span class="n">duration</span> <span class="o">=</span> <span class="n">duration_cast</span><span class="o"><</span><span class="n">milliseconds</span><span class="o">></span><span class="p">(</span><span class="n">end</span> <span class="o">-</span> <span class="n">start</span><span class="p">).</span><span class="n">count</span><span class="p">();</span>
</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o"><<</span> <span class="n">duration</span> <span class="o"><<</span> <span class="s">"ms</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="k">return</span> <span class="n">EXIT_SUCCESS</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Just like before, we define a start and end time point with the <code>now()</code> function of the clock.
This static member function returns a <code>std::chrono::time_point</code> with the current time.
In line 12, we first compute the difference between these two, returning a <code>std::chrono::duration</code> type, which we can then cast to actual time units with <code>std::chrono::duration_cast()</code>.
The available units range from nanoseconds to years and are passed in as a template parameter.
Finally, <code>count()</code> converts the chosen time unit to the underlying arithmetic type which we can then output.</p>
<div class="gblog-post__anchorwrap">
<h3 id="a-tale-of-abstractions-rust-et-al">
A tale of abstractions: Rust et al.
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#a-tale-of-abstractions-rust-et-al" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A tale of abstractions: Rust et al." href="#a-tale-of-abstractions-rust-et-al">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Now that we know how C and C++ handle time, how clocks work and the differences between wall clock time and monotonic clocks, let’s look at one final systems programming language with an even higher level of abstraction.
Unlike C++, Rust hides most of the implementation details (and spares us from typing <code>std::chrono</code> or verbose casts) while still providing the same level of precision:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="hl"><span class="lnt">6
</span></span><span class="lnt">7
</span><span class="lnt">8
</span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-rust" data-lang="rust"><span class="k">use</span><span class="w"> </span><span class="n">std</span>::<span class="n">time</span>::<span class="n">Instant</span><span class="p">;</span><span class="w">
</span><span class="w">
</span><span class="w"></span><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="hl"><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Instant</span>::<span class="n">now</span><span class="p">();</span><span class="w">
</span></span><span class="w"> </span><span class="c1">// expensive operation
</span><span class="hl"><span class="c1"></span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start</span><span class="p">.</span><span class="n">elapsed</span><span class="p">();</span><span class="w">
</span></span><span class="w">
</span><span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"{}ms"</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">.</span><span class="n">as_millis</span><span class="p">());</span><span class="w">
</span><span class="w"></span><span class="p">}</span><span class="w">
</span></code></pre></td></tr></table>
</div>
</div><p>All that is needed is to take the current time with <code>Instant::now()</code> as a start point and then define an end point.
An <a
class="gblog-markdown__link--code"
href="https://doc.rust-lang.org/std/time/struct.Instant.html"
><code>Instant</code></a> type in Rust always represents a non-decreasing monotonic clock.
This end point could be – just like in C or C++ – defined as another <code>Instant::now()</code> from which we could compute the difference, but Rust also allows us to just call the <code>elapsed()</code> method on the start point.
This returns a <a
class="gblog-markdown__link--code"
href="https://doc.rust-lang.org/std/time/struct.Duration.html"
><code>Duration</code></a> and is arguably more readable and declarative than a minus sign between two non-arithmetic types while also being shorter.</p>
<p>Finally, we can convert the <code>Duration</code> to time units with the corresponding methods similar to <code>duration_cast<>()</code> in C++.</p>
<div class="gblog-post__anchorwrap">
<h3 id="going-from-measuring-to-benchmarking-google-benchmark">
Going from measuring to benchmarking: Google Benchmark
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#going-from-measuring-to-benchmarking-google-benchmark" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Going from measuring to benchmarking: Google Benchmark" href="#going-from-measuring-to-benchmarking-google-benchmark">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>So far, we only looked at what C, C++ and Rust offer us in terms of measuring time.
Of course, most languages offer similar features and functions, each with a slightly different syntax.
The goal is not to present all of these different ways of calling such functions – that’s what the docs are for – but rather to show different levels of abstraction while also highlighting similarities in the methodology.</p>
<p>Things like defining start and end points, computing their difference and being aware of clock types are worth keeping in mind regardless of the language used.
But what we did so far was not <em>benchmarking</em>.
Before we can claim to have performed a proper benchmark, we need to measure not just once but many times.
We need to calculate the mean or median of these multiple measurements.
We want to rule out any random or statistical errors.
Ideally, we also want to not have to define start and end points manually like we did before for every single measurement.</p>
<p>This is where (micro) benchmarking libraries come in handy.
While many exist for every language, we are going to stick with C++ for now and take a look at <a
class="gblog-markdown__link"
href="https://github.com/google/benchmark"
><strong>Google Benchmark</strong></a>.
This open-source micro benchmarking library from Google makes timing of small code snippets much easier and allows us to get good statistical averages through repeated sampling of said snippets.</p>
<p>The example in their <code>README.md</code> shows the basic idea:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span><span class="lnt">19
</span><span class="lnt">20
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="cp">#include</span> <span class="cpf"><benchmark/benchmark.h></span><span class="cp">
</span><span class="cp"></span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCreation</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">empty_string</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Register the function as a benchmark
</span><span class="c1"></span><span class="n">BENCHMARK</span><span class="p">(</span><span class="n">BM_StringCreation</span><span class="p">);</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCopy</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">x</span> <span class="o">=</span> <span class="s">"hello"</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">copy</span><span class="p">(</span><span class="n">x</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Register another function as a benchmark
</span><span class="c1"></span><span class="n">BENCHMARK</span><span class="p">(</span><span class="n">BM_StringCopy</span><span class="p">);</span>
<span class="n">BENCHMARK_MAIN</span><span class="p">();</span>
</code></pre></td></tr></table>
</div>
</div><p>Any function that we wish to benchmark is conventionally marked as <code>static</code> (giving it internal linkage).
Furthermore, it needs to take a mutable reference to a <code>benchmark::State</code> as its argument.
Each iteration of the loop over the state object runs the code we wish to benchmark once and contributes a sample to the measurement.
The <code>BENCHMARK</code> macro registers the functions as a benchmark while the <code>BENCHMARK_MAIN()</code> macro generates an appropriate <code>main()</code> function.</p>
<p>If we compile the code as follows:</p>
<pre tabindex="0"><code>$ g++ main.cpp -std=c++11 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -o main
</code></pre><p>We get this output:</p>
<pre tabindex="0"><code>2022-01-06T00:26:34+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 4.27 ns 4.33 ns 165925926
BM_StringCopy 7.84 ns 7.85 ns 89600000
</code></pre><p>Google Benchmark creates a table displaying wall clock time, CPU time and how often each function was sampled for us.
It is able to sample a function up to a billion times.
Now, to make this code faster, the first step would be to turn on optimizations.
Currently, we compile with no optimizations (<code>-O0</code>), so let’s use <code>-O3</code> and see what happens:</p>
<pre tabindex="0"><code>2022-01-07T21:27:42+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 0.000 ns 0.000 ns 1000000000
BM_StringCopy 4.14 ns 4.17 ns 172307692
</code></pre><p>At first glance, we seem to have created the world’s fastest string creation function, though sadly, that is not what happened.
Taking a look at line 5 of the example code reveals the problem.
<code>std::string empty_string</code> is declared but never used anywhere else in the code, so the compiler sees that removing it has no side effects on the program and does exactly that.
Most of the time, this is the behavior we expect and want from our compiler but in this case we actually do want to keep this unused variable around.</p>
<p>Luckily, Google Benchmark has functions that can pretend to use a variable so the compiler can’t just remove it anymore:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="lnt">6
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c++" data-lang="c++"><span class="k">static</span> <span class="kt">void</span> <span class="nf">BM_StringCreation</span><span class="p">(</span><span class="n">benchmark</span><span class="o">::</span><span class="n">State</span><span class="o">&</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="nl">_</span> <span class="p">:</span> <span class="n">state</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">empty_string</span><span class="p">;</span>
<span class="hl"> <span class="n">benchmark</span><span class="o">::</span><span class="n">DoNotOptimize</span><span class="p">(</span><span class="n">empty_string</span><span class="p">);</span>
</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This allows us to still benchmark the creation of an empty string while using <code>-O3</code>.
There are many other ways to prevent certain optimizations from happening that would invalidate a benchmark, though <code>DoNotOptimize()</code> and <code>benchmark::ClobberMemory()</code> are the ones Google Benchmark provides.<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>
Now, our table looks much more sensible:</p>
<pre tabindex="0"><code>2022-01-07T21:26:18+01:00
Running ./main
Run on (6 X 3696 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_StringCreation 0.681 ns 0.688 ns 1000000000
BM_StringCopy 4.04 ns 3.99 ns 172307692
</code></pre><p>Unwanted optimizations are just one thing to be aware of when doing benchmarks.
Depending on what is tested, we might want to clear our caches before a run, do warmup runs if I/O is involved or compare the same function with differing parameters – so-called “parameterized benchmarks”.
Many benchmarking libraries will have all of these advanced features and more but they are out of scope for this post.
For quick tests and comparisons, <a
class="gblog-markdown__link"
href="https://quick-bench.com/#"
><strong>Quick Bench</strong></a>, an online compiler using Google Benchmark, is a great alternative.</p>
<div class="gblog-post__anchorwrap">
<h2 id="profilers----more-than-just-time">
Profilers – more than just time
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#profilers----more-than-just-time" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Profilers – more than just time" href="#profilers----more-than-just-time">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>So far, we only measured the time it took for a program or a certain function to run.
Going back to the quote presented at the beginning: in order for us to optimize our code, we need to know <em>what</em> to optimize.
Just benchmarking some functions will not tell us where a potential bottleneck might be – we need more information about our program.</p>
<p>This is where <strong>profilers</strong> shine brightest.
Profilers are usually external, language-agnostic tools that operate on a binary and not on source code.
They usually offer:</p>
<ul>
<li>(relative) timing of every function call</li>
<li>generating call graphs (who called what) and flamegraphs</li>
<li>frequency of instruction calls</li>
<li>frequency of generated assembly function calls</li>
<li>in-depth performance counter stats, e.g., branch misses, CPU cycles and many more</li>
</ul>
<p>There are many different profilers out there, each specialized for their own use-case<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> and it can help to know how to classify different types of profilers.
The most common types are:</p>
<ul>
<li>Flat profilers – compute average call times</li>
<li>Call-graph profilers – show call times and call frequencies, and create a call-chain graph</li>
<li>Input-sensitive profilers – generate profiles for different inputs, showing how a function scales with its input</li>
<li>Event-based profilers – only collect statistics when certain pre-defined events happen</li>
<li>Statistical profilers – operate via sampling by probing the call stack through interrupts</li>
</ul>
<p>This classification is not mutually exclusive – a profiler can offer any one or all of these features.
We are going to take a look at one tool in particular: <code>perf</code>.</p>
<div class="gblog-post__anchorwrap">
<h3 id="perf-jack-of-all-trades">
<code>perf</code>: jack of all trades
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#perf-jack-of-all-trades" class="gblog-post__anchor clip flex align-center" aria-label="Anchor perf: jack of all trades" href="#perf-jack-of-all-trades">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>The reason why <code>perf</code> is a good first choice is that almost everyone (that uses Linux<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup>) already has it, as it is part of the kernel.
It also does many of the aforementioned things out of the box.
Let’s start by creating a statistical profile:</p>
<pre tabindex="0"><code>$ perf stat ./peekng encode 123.png teSt secret
</code></pre><pre tabindex="0"><code> Performance counter stats for './peekng encode 123.png teSt secret':
          2,54 msec task-clock:u              #    0,287 CPUs utilized
             0      context-switches:u        #    0,000 /sec
             0      cpu-migrations:u          #    0,000 /sec
           187      page-faults:u             #   73,490 K/sec
     4.724.914      cycles:u                  #    1,857 GHz
     5.977.214      instructions:u            #    1,27  insn per cycle
     1.103.539      branches:u                #  433,686 M/sec
        18.837      branch-misses:u           #    1,71% of all branches

   0,008852233 seconds time elapsed

   0,003009000 seconds user
   0,000000000 seconds sys
</code></pre><p><a
class="gblog-markdown__link--code"
href="https://github.com/xkevio/peekng"
><code>peekng</code></a> here just acts as an example program; all it does is encode the word <code>secret</code> into <code>123.png</code>.
As we can see, we received a lot of additional information about our program which may help with identifying optimization opportunities.
For example, we observe 187 page faults meaning <code>peekng</code> might have tried to access a memory page which was not loaded into RAM and had to be loaded from a disk 187 times (this is just one reason why a page fault might occur).
This may lead us to look at how we handle file reads and writes in our program.
Additionally, <code>perf stat</code> shows us the time it took to execute the program as wall clock time and CPU time (user + sys).</p>
<p>Another thing <code>perf</code> offers us is creating an interactive call graph.
This happens with a combination of <code>perf record</code> and <code>perf report</code>, though we need to be careful not to eliminate certain debug symbols.
Earlier, we looked at how we might want to prevent certain optimizations from happening while still using <code>-O3</code> (or your preferred language’s equivalent) in source code – now we need to fiddle with some compiler flags.</p>
<p>GCC, for example, has <code>-Og</code> as an additional optimization level which optimizes for debugging experience.
It enables some compiler passes for collecting debug information while only optimizing at a level close to <code>-O1</code>.
Having this additional information still in the binary will make reading and following the call graph much easier.
Another important thing is to keep the frame pointer register.
The frame pointer is a reserved register that holds the address of the current function’s stack frame.
It allows us to get additional information about how the stack was used during runtime.
By default, most compilers omit the frame pointer to free up an additional register, but this can be disabled via <code>-fno-omit-frame-pointer</code> in GCC.</p>
<p>Languages that do not use the GCC backend may have similar options though under different names.
Keeping the frame pointer under Rust, for example, actually requires us to modify the <code>perf record</code> command slightly.
Let’s look at how it works:</p>
<pre tabindex="0"><code>$ perf record [-g / --call-graph=dwarf] &lt;COMMAND&gt;
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0,135 MB perf.data (16 samples) ]
</code></pre><p>Usually, the <code>-g</code> flag will suffice to create the graph but to keep the frame pointer intact for languages like Rust, the second flag is required instead.
To generate the call graph from the <code>perf.data</code> file, <code>perf report [-g / -G]</code> is used.
Depending on whether we want the function hierarchy to go from callee to caller or vice versa, either <code>-g</code> or <code>-G</code> is required.
The call graph, going from caller to callee, looks like this:</p>
<center>
<img src="perf-example.png" width="75%" height="75%">
</center>
<p>We can now see relative timings for every function call made, see which functions call other functions and get a more general idea of where a bottleneck might be.
Plus, <code>perf</code> also allows us to look at the assembly of a chosen function by pressing the <kbd>A</kbd> key, with each instruction annotated with its execution frequency.
Had we not done the previous steps of preventing certain optimizations and keeping debug symbols, this graph would be full of mangled function names and call hierarchies going deep into system calls.</p>
<p><code>perf</code> provides many other options like setting tracepoints and even doing kernel microbenchmarks, which we aren’t going to look at in this post, though it is certainly worth looking at what else it has to offer.</p>
<div class="gblog-post__anchorwrap">
<h3 id="the-underlying-kernel-interface-perf_events">
The underlying kernel interface: <code>perf_events</code>
<a data-clipboard-text="https://blog.parcio.de/posts/2022/06/performance-analysis/#the-underlying-kernel-interface-perf_events" class="gblog-post__anchor clip flex align-center" aria-label="Anchor The underlying kernel interface: perf_events" href="#the-underlying-kernel-interface-perf_events">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>Since the profiler tool is part of the Linux kernel, it has mostly direct access to any events the kernel picks up on.
This is done via something called <code>perf_events</code> – an interface exported by the Linux kernel.
It can measure events from different sources depending on the subcommand that was run.</p>
<p><strong>Software events</strong> are pure kernel counters, utilized in part for <code>perf stat</code>.
They include such things as context-switches, page faults, etc.</p>
<p><strong>Hardware events</strong> are events stemming from the processor itself and its PMU (Performance Monitoring Unit).
The PMU provides a list of micro-architectural events like CPU cycles, cache misses and others.</p>
<p><strong>Tracepoint events</strong>, implemented via the <code>ftrace</code> kernel infrastructure, provide a way to interface with certain syscalls when tracing is required.</p>
<p>For a full list of possible events, see the <a
class="gblog-markdown__link"
href="https://perf.wiki.kernel.org/index.php/Tutorial"
>perf wiki</a>.
The statistical profile we generated earlier came together through <code>perf</code> keeping a running count of these supported events during execution.
<code>perf_events</code> uses, as the name suggests, <em>event-based sampling</em>.
This means that every time a certain event happens, the sampling counter is increased.
Which event is chosen depends on how we intend to use <code>perf</code>.
The <code>record</code> subcommand, for example, uses something called the <code>cycles</code> event as its sampling event.
The kernel maps this event to a hardware event on the PMU which depends on the manufacturer of the processor.</p>
<p>Once the sampling counter overflows, a sample is recorded.
The instruction pointer then stores where the program was interrupted.
Unfortunately, the instruction pointer may not point at where the overflow happened but rather at where the PMU was interrupted, making it possible that the wrong instructions get counted.
This is why one always needs to be cautious when looking at graphs such as generated assembly annotated with the frequency of its execution as it might just be one or two instructions off.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>“A division or contrast between two things that are or are represented as being opposed or entirely different.” <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://queue.acm.org/detail.cfm?id=3212479"
>https://queue.acm.org/detail.cfm?id=3212479</a> <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Represented as clock ticks, not seconds – convert via division by <code>CLOCKS_PER_SEC</code> <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p><a
class="gblog-markdown__link"
href="https://man7.org/linux/man-pages/man2/gettimeofday.2.html#NOTES"
>https://man7.org/linux/man-pages/man2/gettimeofday.2.html#NOTES</a> <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>For more details on that, see this great talk: <a
class="gblog-markdown__link"
href="https://www.youtube.com/watch?v=nXaxk27zwlk"
>https://www.youtube.com/watch?v=nXaxk27zwlk</a> <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>See <a
class="gblog-markdown__link"
href="http://pramodkumbhar.com/2017/04/summary-of-profiling-tools/"
>here</a> for a list of over a hundred different profilers and when to use which one <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><code>perf</code> is not available under Windows and also does not work with WSL (Windows Subsystem for Linux) <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Lossless data compressionhttps://blog.parcio.de/posts/2022/05/lossless-data-compression/Yolanda Thiel2022-05-24T00:00:00+00:002022-05-24T00:00:00+00:00
<p>This post is an introduction to lossless data compression in which we will explore the approaches of entropy-based/statistical as well as dictionary-based compression and explain some of the most common algorithms.</p>
<div class="gblog-post__anchorwrap">
<h2 id="but-first-of-all-why-do-we-even-need-data-compression">
But first of all, why do we even need data compression?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#but-first-of-all-why-do-we-even-need-data-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor But first of all, why do we even need data compression?" href="#but-first-of-all-why-do-we-even-need-data-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Audio and video communication as well as large multimedia platforms as we know them today are only possible because of data compression.
Usually, every single photo or video posted on social media or video/streaming platforms has to be compressed.
Otherwise, the size of the data would be too large to deal with effectively.
Of course, in scientific research, there are also fields of application where we generate or measure large amounts of data.
To store all this data, we need data compression.</p>
<div class="gblog-post__anchorwrap">
<h2 id="basics">
Basics
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#basics" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Basics" href="#basics">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>A data compression technique usually contains two algorithms:</p>
<ol>
<li>A compression algorithm, which takes the original input A and generates a representation A' of it which (ideally) requires fewer bits than A.</li>
<li>A reconstruction/decoding algorithm, which operates on the compressed representation A' and generates the reconstruction B.</li>
</ol>
<p>If B is identical to A, the compression is called lossless.
If B differs from A, the compression is called lossy.
To compare different compression algorithms, we can use the data compression ratio, which is calculated by dividing the uncompressed size of the data by its compressed size.</p>
<link
rel="stylesheet"
href="/katex-e4de31b5.min.css"
/>
<script defer src="/js/katex-3c86c25a.bundle.min.js"></script>
<span class="gblog-katex ">
\(\text{data compression ratio} = \frac{\text{uncompressed size of data}}{\text{compressed size of data}} = \frac{\text{size of A}}{\text{size of A'}}\)</span>
<p>Of course, this is only one of several useful metrics, and the performance of compression algorithms is highly dependent on the input data.
But if there are several algorithms that are suitable for the data which is to be compressed, comparing the compression ratio could be sensible.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="a-first-compression-algorithm-run-length-encoding-rle">
A first compression algorithm: Run-Length Encoding (RLE)
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#a-first-compression-algorithm-run-length-encoding-rle" class="gblog-post__anchor clip flex align-center" aria-label="Anchor A first compression algorithm: Run-Length Encoding (RLE)" href="#a-first-compression-algorithm-run-length-encoding-rle">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Let us have a look at this rather easy compression algorithm, called <strong>run-length encoding</strong>:
It stores <strong>runs</strong> of data as single data value and count.
A <strong>run</strong> is a sequence in which the same data value occurs in consecutive data elements.</p>
<p>Let’s consider a line of 10 pixels, where the pixels can either be white or black.
If W stands for a white pixel and B for a black pixel, we could have data which looks like this: <code>BBBBBWWWWW</code>.
A run-length encoding algorithm could compress this input as follows: <code>5B5W</code>, because there are 5 black pixels followed by 5 white pixels.
So instead of saving 10 characters, the output of RLE would only need 4 characters.</p>
<p><em>Of course, we are not restricted to chars; other data types can be used to store and compress data as well. We use chars in this example to make the concept easier to understand.</em></p>
<p>This approach works best if there are many longer runs in the data.
Therefore, the best case scenario of the input for our example would be <code>WWWWWWWWWW</code> or <code>BBBBBBBBBB</code>, because this input can be compressed as <code>10W</code> or <code>10B</code>, which is the shortest possible output for this example.
In this case we would have a compression ratio of
<span class="gblog-katex ">
\(\frac{10}{3} = 3.\overline{3}\)</span>(also sometimes displayed as 10:3).</p>
<p>But if there aren’t many runs in the file to be compressed, the output of the algorithm might be larger than the input.
In our example, the worst case would be one of these inputs: <code>WBWBWBWBWB</code> or <code>BWBWBWBWBW</code>.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<hr>
<details style="cursor: pointer;">
<summary>What would be the compression rate in this worst case? <i>Click to show the answer.</i></summary>
<p>0.5, because the uncompressed size is 10 chars and the 'compressed' size is 20 chars.</p>
</details>
<hr>
<div class="gblog-post__anchorwrap">
<h2 id="entropy-basedstatistical-compression">
Entropy-based/statistical compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#entropy-basedstatistical-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Entropy-based/statistical compression" href="#entropy-basedstatistical-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The next approaches we want to get to know are called entropy-based because they make use of the entropy of the given data.
The entropy of data depends on the probabilities of certain symbols to occur in the given data.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
While Run-Length Encoding assigns a fixed-size code to the symbols it operates on, entropy-based approaches have variable-sized codes.
Entropy-based approaches work by replacing unique symbols within the input data with a unique and shorter prefix code.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
To ensure a good compression ratio, the prefix code assigned to a symbol should be shorter the more often that symbol occurs.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>
Examples of this approach are arithmetic coding, Shannon-Fano coding and Huffman coding.</p>
<div class="gblog-post__anchorwrap">
<h3 id="huffman-coding">
Huffman coding
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#huffman-coding" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Huffman coding" href="#huffman-coding">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>We will explore some of the previously mentioned properties of entropy-based compression algorithms using the example of Huffman coding.</p>
<div align="center">
<figure>
<img src="huffman-coding-toennies.png" alt="Huffman Coding example Tönnies">
<figcaption>This example is taken from page 136 of Tönnies' book "Grundlagen der Bildverarbeitung".<sup>6</sup></figcaption>
</figure>
</div>
<p>In this example we want to compress the image shown on the top.
To do so, we first create a normalized histogram of all values.
Both in the image itself and in its histogram, we can see that the darkest possible greyscale value occurs quite often while the lighter greyscale values have lower frequencies.
The algorithm now merges the symbols according to their frequency until there are only 2 symbols left.
So in the case of the example, the two least frequent greyscale values are merged in every step.
Then the original symbols are given new prefix codes.
Symbols which were previously merged are broken down into segments, and for every segment the code is extended.
Therefore, the most frequent symbol gets the shortest prefix code and the least frequent symbol gets the longest one.
The prefix code assignment is represented in a binary tree, which is also traversed to decode the information.<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup>
This approach produces the best code when the probabilities of symbols are negative powers of 2.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="dictionary-based-compression">
Dictionary-based compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#dictionary-based-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Dictionary-based compression" href="#dictionary-based-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Dictionary-based approaches are the last group of lossless data compression algorithms we will cover in this article.
Unlike entropy-based approaches, dictionary-based ones do <strong>not</strong> use a statistical model or a variable-sized code.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>Dictionary-based algorithms partition the data into phrases which are non-overlapping subsets of the original data.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Each phrase is then encoded as a token using a dictionary.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
Accordingly, there are two stages in a dictionary-based compression algorithm:</p>
<ol>
<li>The dictionary construction stage: In this stage the algorithm finds phrases and codewords.</li>
<li>The parsing stage: In this stage the phrases are replaced by codewords.<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></li>
</ol>
<p>There are <strong>static dictionary codes</strong> and <strong>dynamic/adaptive dictionary codes</strong>.
<strong>Static</strong> dictionaries are created before input processing and stay the same for the complete run, while <strong>dynamic</strong> dictionaries are updated during parsing, which means that in this case the two stages (dictionary construction and parsing) are interleaved.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
After these rather theoretical basics about dictionary-based compression, we will now dive into different LZ-family algorithms to explain these things on a few examples.</p>
<div class="gblog-post__anchorwrap">
<h3 id="lz-family">
LZ family
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz-family" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ family" href="#lz-family">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>These algorithms are named after their creators Abraham Lempel and Jacob Ziv and are some of the most known dictionary compression methods.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
The algorithms we will go into detail about are LZ77 and LZ78, which are the two original algorithms developed by Lempel and Ziv, as well as a few of their variants.</p>
<div class="gblog-post__anchorwrap">
<h3 id="lz77-and-its-variants">
LZ77 and its variants
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz77-and-its-variants" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ77 and its variants" href="#lz77-and-its-variants">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>LZ77 assumes and exploits that data is most likely to be repeated.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
The principle is to use a part of the previously-seen input stream as the dictionary.
Thus the input is analyzed through a sliding window:</p>
<div align="center">
<figure>
<img src="lz77-sliding-window-salomon.png" alt="LZ77 Sliding Window Salomon">
<figcaption>This example is taken from page 176 of Salomon's book "Data Compression: The Complete Reference".<sup>3</sup></figcaption>
</figure>
</div>
<p>As seen in the figure above, the window is divided in two parts:</p>
<ul>
<li>The search buffer: The current dictionary which includes symbols that have previously been input and encoded.</li>
<li>The look-ahead buffer which contains data yet to be encoded.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> When a word is repeated, it can be replaced by a pointer to the last occurrence accompanied by the number of matched characters.<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup></li>
</ul>
<p>I will explain this further using the example shown in the image above:</p>
<ul>
<li>The encoder scans the search buffer from <strong>right to left</strong>.</li>
<li>It looks for a match in the dictionary (search buffer) for the first symbol <strong>e</strong> which is in the front of the look-ahead buffer.</li>
<li>It finds an <strong>e</strong> in “<strong>easily</strong>” at a distance of <strong>8</strong> from the end of the search buffer (you have to count from right to left, distance of 1 would be the symbol left of the currently selected symbol).</li>
<li>The encoder then matches as many symbols as possible starting at those 2 e’s; in this case the match consists of the 3 symbols “<strong>eas</strong>”.</li>
<li>The length of the match is therefore <strong>3</strong>.</li>
<li>The encoder then continues its backward scan to find a longer match.</li>
<li>In this case there is no longer match, but a same length match in “<strong>eastman</strong>”.</li>
</ul>
<p>Generally, the encoder selects the longest match – or, if several matches have the same length, the last one found – and prepares the token.
Why does it use the last one found?
The answer is quite simple: the algorithm then doesn’t have to keep track of all matches found and can save memory.
In practical implementations the search buffer is some thousands of bytes long whereas the look-ahead buffer is some tens of bytes long.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p><strong>Here you can see what the first 5 steps and tokens look like for the example in the image above:</strong><sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<table>
<thead>
<tr>
<th style="text-align:right">Search Buffer</th>
<th style="text-align:left">Look-Ahead Buffer</th>
<th style="text-align:center">Token</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right"></td>
<td style="text-align:left">sir_sid_eastman_</td>
<td style="text-align:center">(0,0,“s”)</td>
</tr>
<tr>
<td style="text-align:right">s</td>
<td style="text-align:left">ir_sid_eastman_e</td>
<td style="text-align:center">(0,0,“i”)</td>
</tr>
<tr>
<td style="text-align:right">si</td>
<td style="text-align:left">r_sid_eastman_ea</td>
<td style="text-align:center">(0,0,“r”)</td>
</tr>
<tr>
<td style="text-align:right">sir</td>
<td style="text-align:left">_sid_eastman_eas</td>
<td style="text-align:center">(0,0,"_")</td>
</tr>
<tr>
<td style="text-align:right">sir_</td>
<td style="text-align:left">sid_eastman_easi</td>
<td style="text-align:center">(4,2,“d”)</td>
</tr>
</tbody>
</table>
<p>The token always consists of 3 elements:</p>
<ul>
<li>The first element is the distance of the found match. If there is no match, this element is 0.</li>
<li>The second element is the length of the found match. If there is no match, this is again 0.</li>
<li>The third and last element is the new symbol which is to be appended.</li>
</ul>
<p>This approach is suffix-complete, meaning that any suffix of a phrase is a phrase itself. So if the phrase “cold” is in the dictionary, so are “old”, “ld” and “d”.
The performance of this algorithm is limited by the number of comparisons needed for finding a matching pattern.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
While the encoder is a bit more complicated, the decoder is rather simple, meaning that LZ77 and its variants are useful in cases where data has to be compressed once but decompressed very often.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<hr>
<p>Let’s try to decode some data compressed by LZ77:
The 3 tokens (from left <em>[first]</em> to right <em>[last]</em>) are: (0,0,“y”), (0,0,“a”) and
(2,1,"!").</p>
<details style="cursor: pointer;">
<summary><i>Click to show a tip.</i></summary>
<p>You have to "fill up" a buffer from right to left, using one token at a time and "pushing" the entries one space to the left in every step.</p>
</details>
<details style="cursor: pointer;">
<summary>Did you find out the decoded text? <i>Click to show the solution.</i></summary>
<p>yay!</p>
</details>
<hr>
<p>Let’s have a short look at a few <strong>LZ77 variants</strong>:</p>
<p>(Please note that this part is just a small overview and does not fully explain
how these variants work since that would go beyond the scope of this article.)</p>
<div class="gblog-post__anchorwrap">
<h4 id="lzss">
LZSS
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzss" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZSS" href="#lzss">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This derivative algorithm was developed by Storer and Szymanski.
It improves on LZ77 by holding the look-ahead buffer in a circular queue and the search buffer in a binary search tree.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
In addition, its tokens have only 2 fields instead of 3.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h4 id="deflate">
DEFLATE
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#deflate" class="gblog-post__anchor clip flex align-center" aria-label="Anchor DEFLATE" href="#deflate">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This algorithm – developed by Phil Katz – was originally used in the Zip and Gzip software and has since been adopted by many applications and formats, including HTTP compression and PNG.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
It is based on LZSS but uses a chained hash table to find duplicates.
The matched lengths and distances are further compressed with two Huffman trees.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
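<p>DEFLATE is easy to experiment with, since Python's standard-library <code>zlib</code> module implements it; the following quick demonstration (not part of the original article) round-trips a repetitive string.</p>

```python
import zlib

# DEFLATE in practice: zlib wraps the DEFLATE algorithm described above.
data = b"sir_sid_eastman_easily_teases_sea_sick_seals" * 100

compressed = zlib.compress(data, 9)   # level 9 = best compression
restored = zlib.decompress(compressed)

assert restored == data               # lossless round trip
assert len(compressed) < len(data)    # repetitive input shrinks considerably
```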
<div class="gblog-post__anchorwrap">
<h4 id="lzma">
LZMA
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzma" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZMA" href="#lzma">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>The last LZ77 variant of this short overview is the Lempel–Ziv–Markov chain algorithm, which is the default compression algorithm of 7-Zip.
Its principle is similar to that of DEFLATE, but instead of Huffman coding it uses range encoding, an integer-based variant of arithmetic coding (an entropy-based compression method).
This complicates the encoder but also results in better compression.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
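<p>LZMA can likewise be tried directly from Python via the standard-library <code>lzma</code> module (which by default produces the .xz container format):</p>

```python
import lzma

# LZMA in practice: Python's lzma module produces .xz output by default.
data = b"sir_sid_eastman_easily_teases_sea_sick_seals" * 100

compressed = lzma.compress(data)
assert lzma.decompress(compressed) == data  # lossless round trip
assert len(compressed) < len(data)          # repetitive input compresses well
```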
<div class="gblog-post__anchorwrap">
<h3 id="lz78-and-its-variants">
LZ78 and its variants
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lz78-and-its-variants" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZ78 and its variants" href="#lz78-and-its-variants">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h3>
</div>
<p>LZ78 constructs its dictionary differently than LZ77 and therefore uses no search buffer, look-ahead buffer or sliding window.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Compared to LZ77’s three-field tokens, the LZ78 encoder outputs two-field tokens, each consisting of a pointer into the dictionary and the code of a symbol.
Since the length of a phrase is implied by its dictionary entry, it does not need to be part of the token.</p>
<p>Each token corresponds to a phrase of input symbols.
That phrase is added to the dictionary after the token is written on the compressed stream.
The size of LZ78’s dictionary is only limited by the amount of available memory, because unlike in LZ77 nothing is ever deleted from the dictionary in LZ78.
On the one hand, this can be an advantage, since future phrases can be compressed by dictionary phrases which occurred a lot earlier.
On the other hand, this can also be a disadvantage because the dictionary tends to grow fast and can fill up the entire available memory.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>The LZ78 algorithm begins with a single entry in its dictionary: the null string at position zero.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
It then concatenates the first symbol of the following input after every parsing step.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>Let’s try to further understand the algorithm by going through an example:
we want to compress the input <code>a_b_a</code>.
The initial state can be displayed like this:</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
</table>
<p>As previously mentioned, the algorithm starts with the null string as the dictionary’s entry at position 0.</p>
<p>First, the dictionary is searched for “a”.
Since “a” is not found, the algorithm adds “a” to the dictionary at position 1 and outputs the token (0, “a”), because “a” is the concatenation of the null string and the symbol “a”.</p>
<p>a<code>_b_a</code></p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
</table>
<p>Now the dictionary is searched for “_”. Since this symbol is also not yet part of the dictionary, it is added analogously at position 2 and the output token is (0, “_”).
This then happens again for “b” at position 3.</p>
<p>a_b<code>_a</code></p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>2</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>3</td>
<td>"b"</td>
<td>(0, "b")</td>
</tr>
</table>
<p>Now that the first 3 symbols of our input <code>a_b_a</code> are in the dictionary, the encoder finds a dictionary entry for the next symbol “_”, but not for “_a”. It therefore adds “_a” to the dictionary at position 4 and outputs the token (2, “a”), because 2 is the position of “_”.</p>
<p>a_b_a</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>2</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>3</td>
<td>"b"</td>
<td>(0, "b")</td>
</tr>
<tr>
<td>4</td>
<td>"_a"</td>
<td>(2, "a")</td>
</tr>
</table>
<p>Here is another, longer example taken from Salomon’s “Data Compression: The Complete Reference”<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>:
it shows the first 14 steps of compressing the string <code>sir_sid_eastman_easily_teases_sea_sick_seals</code>.</p>
<table>
<tr>
<th></th>
<th scope="col">Dictionary</th>
<th scope="col">Token</th>
</tr>
<tr>
<td>0</td>
<td>null</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>"s"</td>
<td>(0, "s")</td>
</tr>
<tr>
<td>2</td>
<td>"i"</td>
<td>(0, "i")</td>
</tr>
<tr>
<td>3</td>
<td>"r"</td>
<td>(0, "r")</td>
</tr>
<tr>
<td>4</td>
<td>"_"</td>
<td>(0, "_")</td>
</tr>
<tr>
<td>5</td>
<td>"si"</td>
<td>(1, "i")</td>
</tr>
<tr>
<td>6</td>
<td>"d"</td>
<td>(0, "d")</td>
</tr>
<tr>
<td>7</td>
<td>"_e"</td>
<td>(4, "e")</td>
</tr>
<tr>
<td>8</td>
<td>"a"</td>
<td>(0, "a")</td>
</tr>
<tr>
<td>9</td>
<td>"st"</td>
<td>(1, "t")</td>
</tr>
<tr>
<td>10</td>
<td>"m"</td>
<td>(0, "m")</td>
</tr>
<tr>
<td>11</td>
<td>"an"</td>
<td>(8, "n")</td>
</tr>
<tr>
<td>12</td>
<td>"_ea"</td>
<td>(7, "a")</td>
</tr>
<tr>
<td>13</td>
<td>"sil"</td>
<td>(5, "l")</td>
</tr>
<tr>
<td>14</td>
<td>"y"</td>
<td>(0, "y")</td>
</tr>
</table>
<p>So let us quickly go through the procedure of LZ78 again:
Generally, the current symbol is read and becomes a one-symbol phrase.
Then the encoder tries to find it in the dictionary.
If the symbol is found in the dictionary, the next symbol is read and concatenated to the first, and this two-symbol phrase is searched for in the dictionary.
As long as those phrases are found, the process repeats.
At some point the phrase is not found in the dictionary; it is then added to it, and the output is a token consisting of the last dictionary match and the last symbol of the phrase that could not be found.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
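<p>The procedure just described fits in a few lines of Python. The sketch below is illustrative (a trailing, already-known phrase is flushed as a token with an empty symbol, a detail not covered above) but reproduces the tokens from both examples:</p>

```python
def lz78_encode(text):
    dictionary = {}   # phrase -> position; position 0 is the null string
    tokens = []
    phrase = ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol          # keep extending the current match
        else:
            # output (position of the longest match, mismatching symbol) ...
            tokens.append((dictionary.get(phrase, 0), symbol))
            # ... and add the new phrase to the dictionary
            dictionary[phrase + symbol] = len(dictionary) + 1
            phrase = ""
    if phrase:  # input ended inside a known phrase: flush with an empty symbol
        tokens.append((dictionary[phrase], ""))
    return tokens

def lz78_decode(tokens):
    phrases = [""]                    # position 0 holds the null string
    out = []
    for position, symbol in tokens:
        phrase = phrases[position] + symbol
        phrases.append(phrase)        # rebuild the dictionary on the fly
        out.append(phrase)
    return "".join(out)
```

<p>Running <code>lz78_encode("a_b_a")</code> yields [(0, “a”), (0, “_”), (0, “b”), (2, “a”)], matching the tables above.</p>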
<p>This approach is called greedy parsing because the longest phrase with a prefix match is replaced by a codeword.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>
Therefore, LZ78 is prefix-complete, meaning any prefix of a phrase is a phrase itself.
So if “hello” is part of the dictionary, so are “hell”, “hel”, “he” and “h”.</p>
<p>Since we’ve now gone through the base algorithm, let us have a look at variants of LZ78.</p>
<div class="gblog-post__anchorwrap">
<h4 id="lzw">
LZW
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzw" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZW" href="#lzw">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>This variant was developed by Terry Welch.
Its main feature is that it eliminates the second field of a token.
The LZW token consists only of a pointer to the dictionary.
This is possible because the dictionary is initialized with all the symbols in the alphabet.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>
The GIF encoding algorithm is based on LZW.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
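<p>A minimal LZW encoder/decoder pair could look as follows in Python (an illustrative sketch; the dictionary is pre-filled with the alphabet, so the output consists of bare indices). The decoder needs one special case: a code may refer to the phrase being built in that very step, which then equals the previous phrase plus its own first symbol.</p>

```python
def lzw_encode(text, alphabet):
    # the dictionary starts with every single symbol, so no symbol field
    # is needed in the output: tokens are bare dictionary indices
    dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
    codes = []
    phrase = text[0]
    for symbol in text[1:]:
        if phrase + symbol in dictionary:
            phrase += symbol
        else:
            codes.append(dictionary[phrase])
            dictionary[phrase + symbol] = len(dictionary)
            phrase = symbol  # unlike LZ78, restart from the mismatching symbol
    codes.append(dictionary[phrase])
    return codes

def lzw_decode(codes, alphabet):
    phrases = list(alphabet)
    previous = phrases[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(phrases):
            current = phrases[code]
        else:                # code refers to the phrase built in this step
            current = previous + previous[0]
        phrases.append(previous + current[0])
        out.append(current)
        previous = current
    return "".join(out)
```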
<div class="gblog-post__anchorwrap">
<h4 id="lzmw">
LZMW
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#lzmw" class="gblog-post__anchor clip flex align-center" aria-label="Anchor LZMW" href="#lzmw">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h4>
</div>
<p>The second variant we will briefly mention is LZMW, which was developed by V. Miller and M. Wegman.
It is based on two principles:</p>
<ul>
<li>When the dictionary is full, the least-recently-used dictionary phrase is deleted.</li>
<li>Each phrase which is added to the dictionary is a concatenation of two phrases. This means that a dictionary phrase can grow by more than one symbol at a time (unlike in the base LZ78 algorithm).</li>
</ul>
<p>A drawback of this approach is that it complicates the choice of data structure for the dictionary: the principles of LZMW lead to a dictionary that is not prefix-complete, and a phrase may be added twice, because the least-recently-used phrase is deleted once the dictionary is full.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<div class="gblog-post__anchorwrap">
<h2 id="limitations-of-lossless-data-compression">
Limitations of lossless data compression
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#limitations-of-lossless-data-compression" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Limitations of lossless data compression" href="#limitations-of-lossless-data-compression">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>After all these different algorithms, there is one last topic to cover:
the limitations of lossless data compression.
There are, of course, many more lossless compression algorithms, some of which are highly specialised for a specific area like image or audio compression.
These algorithms can perform badly when used outside of their designated area.
This brings us to the question of whether a perfect compression algorithm could exist.
Perfect in this case means that the compressed file will <em><strong>always</strong></em> be smaller than the original file.</p>
<p>We can find this out by using a counting argument, the pigeonhole principle.<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup>
The pigeonhole principle states that if <strong>n items</strong> are put into <strong>m containers</strong> while <strong>n is greater than m</strong>, then <strong>at least one container</strong> must contain <strong>more than one item</strong> (n and m are natural numbers).<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup>
So if we consider that there are 10 pigeons but only 9 holes, at least one hole must contain more than one pigeon.</p>
<div align="center">
<figure style="max-width: 50%;">
<img src="TooManyPigeons.jpg" alt="Pigeon_Hole_Principle_Image">
<figcaption>
<a href="https://commons.wikimedia.org/wiki/File:TooManyPigeons.jpg">
Pigeons-in-holes.jpg by en:User:BenFrantzDale; this image by en:User:McKay</a>,
<a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a>,
via Wikimedia Commons
</figcaption>
</figure>
</div>
<p>Let’s go through this proof from a Stanford University lecture<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup>:
We already know that in lossless data compression we have a compression function C and a decompression function D.
To ensure that we can uniquely encode and decode a bitstring, these functions must be inverses of each other:
<span class="gblog-katex ">
\(D(C(x)) = x\)</span>.
This means that C must be injective.</p>
<div align= "center">
<figure style="max-width: 200px; text-align: center;">
<img src="injection.png" alt="Injective Function" style="mix-blend-mode: difference;">
<figcaption>An injective function is a function where distinct inputs map to distinct outputs.</figcaption>
</figure>
</div>
<p>Ideally, the compressed version of a bitstring would always be shorter than the input bitstring.</p>
<p>Let
<span class="gblog-katex ">
\(B^n\)</span> be the set of bitstrings of length n and
<span class="gblog-katex ">
\(B^{<n}\)</span> be the set of bitstrings of length less than n.
There are
<span class="gblog-katex ">
\(2^n\)</span> bitstrings of length n and there are
<span class="gblog-katex ">
\(2^0 + 2^1 + ... + 2^{n-1} = 2^n - 1\)</span> bitstrings of length less than n.
Since
<span class="gblog-katex ">
\(B^{<n}\)</span> has fewer elements than
<span class="gblog-katex ">
\(B^n\)</span>, there cannot be an injection from
<span class="gblog-katex ">
\(B^n\)</span> to
<span class="gblog-katex ">
\(B^{<n}\)</span>.</p>
<p>And because a perfect compression function would have to be an injection from
<span class="gblog-katex ">
\(B^n\)</span> to
<span class="gblog-katex ">
\(B^{<n}\)</span>, there is no perfect compression function: to keep every output unique, any lossless compression function must produce a larger output file for certain input data.
Otherwise, two different inputs would share the same output, and the compression would be lossy.<sup id="fnref:11"><a href="#fn:11" class="footnote-ref" role="doc-noteref">11</a></sup></p>
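<p>The counting argument is easy to check mechanically; the snippet below (a quick sanity check, not from the lecture) computes both set sizes and confirms them by enumeration for a small n.</p>

```python
from itertools import product

# |B^n| versus |B^{<n}| for n = 10
n = 10
longer = 2 ** n                           # bitstrings of exactly length n
shorter = sum(2 ** k for k in range(n))   # 2^0 + 2^1 + ... + 2^(n-1)

assert longer == 1024
assert shorter == longer - 1  # one element short: no injection can exist

# brute-force confirmation for a tiny n
m = 4
b_m = {"".join(bits) for bits in product("01", repeat=m)}
b_lt_m = {"".join(bits) for k in range(m) for bits in product("01", repeat=k)}
assert len(b_m) == 2 ** m
assert len(b_lt_m) == 2 ** m - 1
```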
<p>This means that for every lossless data compression algorithm there is input data which cannot be compressed.
Therefore, a check whether the compressed file is in fact smaller than the input file is necessary.
Furthermore, it is always useful to know what kind of data is to be compressed, so that the algorithm can be chosen based on this information.</p>
<div class="gblog-post__anchorwrap">
<h2 id="further-reading">
Further reading
<a data-clipboard-text="https://blog.parcio.de/posts/2022/05/lossless-data-compression/#further-reading" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Further reading" href="#further-reading">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>I hope this article provided a good overview of some of the most common algorithms in lossless data compression and maybe even sparked your interest in data compression.
This article did not nearly exhaust the topics covered in its sources.
In particular, “Data Compression: The Complete Reference”<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup> is a book which explains a lot of different data compression algorithms quite thoroughly.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Sayood, K. (2006). Introduction to Data Compression. Third Edition. p. 1-5. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Run-length encoding. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Run-length_encoding"
>https://en.wikipedia.org/wiki/Run-length_encoding</a>. Accessed on: 2021-11-02. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Salomon, D. (2007) Data Compression: The Complete Reference. Fourth Edition. p. 47-51, 74, 174-179, 189-190, 199, 209-210, 230, 242-243. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Duwe, K., Lüttgau, J., Mania, G., Squar, J., Fuchs, A., Kuhn, M., Betke, E., & Ludwig, T. (2020). State of the Art and Future Trends in Data Reduction for High-Performance Computing. Supercomputing Frontiers and Innovations, 7(1), p. 4–36. <a
class="gblog-markdown__link"
href="https://doi.org/10.14529/jsfi200101"
>https://doi.org/10.14529/jsfi200101</a> <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Lu, Z.M., Guo, S.Z.: Chapter 1 - Introduction. In: Lu, Z.M., Guo, S.Z. (eds.) Lossless Information Hiding in Images, pp. 1–68. Syngress (2017), DOI: 10.1016/B978-0-12-812006-4.00001-2 <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Tönnies, K. (2005). Grundlagen der Bildverarbeitung. Chapter 6 - Bildkompression. p. 136. <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Sahinalp, S.C., Rajpoot, N.M.: Chapter 6 - Dictionary-Based Data Compression: An Algorithmic Perspective. In: Sayood, K. (ed.) Lossless Compression Handbook, pp. 153–167. Communications, Networking and Multimedia, Academic Press, San Diego (2003),DOI: 10.1016/B978-012620861-0/50007-3 <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Shanmugasundaram, S. , Lourdusamy, R. (2011). A Comparative Study Of Text Compression Algorithms. ICTACT Journal on Communication Technology 1(3), p. 68–76. <a href="#fnref:8" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>Lossless Compression. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Lossless_compression"
>https://en.wikipedia.org/wiki/Lossless_compression</a>. Accessed on 2021-11-04. <a href="#fnref:9" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>Pigeonhole Principle. <a
class="gblog-markdown__link"
href="https://en.wikipedia.org/wiki/Pigeonhole_principle"
>https://en.wikipedia.org/wiki/Pigeonhole_principle</a>. Accessed on 2021-11-04. <a href="#fnref:10" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:11" role="doc-endnote">
<p>The Pigeonhole Principle. <a
class="gblog-markdown__link"
href="https://web.stanford.edu/class/archive/cs/cs103/cs103.1132/lectures/08/Small08.pdf"
>https://web.stanford.edu/class/archive/cs/cs103/cs103.1132/lectures/08/Small08.pdf</a>. Accessed on: 2021-11-04 <a href="#fnref:11" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
Libfabric: A generalized way for fabric communicationhttps://blog.parcio.de/posts/2022/04/libfabric/Julian Benda2022-04-25T00:00:00+00:002022-04-25T00:00:00+00:00
<p>In this post, we will look at the challenges of efficient communication between processes and how Libfabric abstracts them.
We will see how OFI (Open Fabrics Interfaces) enables fast and generalized communication.</p>
<style>
@media(prefers-color-scheme: dark) {
html.color-toggle-auto .light-only {
display: none;
}
}
@media(prefers-color-scheme: light) {
html.color-toggle-auto .dark-only {
display: none;
}
}
html.color-toggle-dark .light-only {
display: none;
}
html.color-toggle-light .dark-only {
display: none;
}
</style>
<div class="gblog-post__anchorwrap">
<h2 id="what-is-a-fabric-and-how-to-communicate-in-it">
What is a fabric and how to communicate in it?
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#what-is-a-fabric-and-how-to-communicate-in-it" class="gblog-post__anchor clip flex align-center" aria-label="Anchor What is a fabric and how to communicate in it?" href="#what-is-a-fabric-and-how-to-communicate-in-it">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>A fabric is nothing more or less than several, more or less uniform, nodes connected via links or, in other words, the typical HPC or cloud computing landscape.</p>
<p>Nodes can be linked via different physical media (e.g., copper or optical fiber) and various communication protocols.
While the physical medium is hidden behind the network cards, the communication protocol is something we still need to manage in user space, because different protocols require different interactions with the network.</p>
<p>A unified interface for typical message-based data transfers would be nice, though not necessarily a game changer.
For RDMA, however, the picture is different.</p>
<div class="gblog-post__anchorwrap">
<h2 id="rdma">
RDMA
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#rdma" class="gblog-post__anchor clip flex align-center" aria-label="Anchor RDMA" href="#rdma">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Remote direct memory access (RDMA) sounds counterintuitive at first: how would you access remote memory directly?
Directly in this context means without involving the operating system and CPU.
Instead, the data transfer is entirely managed by the NIC.
Therefore, we only need to signal we want to read data X from source Y to the memory segment Z, and the NIC does the rest.</p>
<p>In contrast, with normal kernel-mode networking, the buffer is copied multiple times and runs through various layers of code (e.g., the socket layer, the TCP protocol implementation, and the driver).
This causes load on the CPU and bus, while RDMA, thanks to the kernel bypass to the NIC, can offload a huge part of the network stack.</p>
<p>This opens many questions, to name a few:</p>
<ul>
<li>When is the memory transfer finished?</li>
<li>How to avoid inconsistency due to invalidated caches?</li>
<li>Is RDMA even possible with this NIC?</li>
<li>How to queue RDMA requests?</li>
</ul>
<p>The answers to these questions depend strongly on the implementation and the network protocol.
Therefore, a unified solution is quite welcome if you want the flexibility to change your link type.</p>
<p>A short reminder: RDMA still uses the same network as typical network messages, so bandwidth and latency will not change much. However, it reduces the work done by the CPU, which leads to fewer interrupts and more processing time for your computation.</p>
<div class="gblog-post__anchorwrap">
<h2 id="libfabric-abstraction">
Libfabric abstraction
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#libfabric-abstraction" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Libfabric abstraction" href="#libfabric-abstraction">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Libfabric offers a unified interface for using different communication types over different communication protocols, while trying to minimize the overhead in each case.</p>
<p>The supported communication types are:</p>
<ul>
<li>Message Queue: Message-based FIFO queue</li>
<li>Tagged Message Queue: Similar to Message Queue but enables operations based on a 64-bit tag attached to each message</li>
<li>RMA (remote memory access): Abstraction of RDMA to enable it also on systems that are not RDMA-capable</li>
<li>Atomic: Allow atomic operations at the network level</li>
</ul>
<p><a
class="gblog-markdown__link"
href="https://github.com/parcio/julea"
>JULEA</a> is a flexible storage framework for clusters that allows offering arbitrary I/O interfaces to applications.
It runs completely in user space, which eases development and debugging.
Because it runs on a cluster, a lot of network communication must be handled.
Until now, it used TCP (via <code>GSocket</code>).
While TCP connections normally work everywhere, the cluster may provide better fabrics, which we were unable to use.
Now, with Libfabric, we can use a huge variety of other fabrics like InfiniBand.</p>
<p>For JULEA, Message Queue and RMA are the most interesting.
Message Queue fits the communication structure currently used in JULEA.
RMA enables processing many data transfers in parallel.
With RMA, we can, for example, process a message with multiple read access and tell the link that the data have no specific order.</p>
<p>To achieve this, Libfabric uses several abstracted modules, each of which takes optional arguments that can restrict it to a single protocol, or simply let Libfabric decide what is best.</p>
<p>Each module enables us to create the next one in the chain until we achieve the connection we want.
The modules of interest are:</p>
<ul>
<li>Fabric information: List of available networks, which can be filtered and is sorted by performance</li>
<li>Fabric: All resources needed to use a network</li>
<li>Domain: Represents a connection in a fabric (e.g., a port or a NIC)</li>
<li>Endpoint: Communication portal to a domain</li>
<li>Event queue: Reports asynchronous meta events for an endpoint, like connection established/shutdown</li>
<li>Completion queue/counter: High-performance queue reports completed data transfers or just a counter</li>
</ul>
<p>If we want, for example, to build a connection to a server (with a known address), we can use <code>fi_getinfo</code> to request all available fabrics which are capable of connecting to the server.</p>
<p>Then we pick the first of them (because it is likely the most performant) and construct a fabric.
After this, because we do not have special requirements (and have already defined our communication destination), we simply create a domain on that fabric, and then an endpoint with an event queue and completion counter attached.</p>
<p>With the endpoint, we issue a connection request that needs to be accepted by the server and is confirmed via a <code>FI_CONNECTED</code> event in the event queue.</p>
<p>Now each time the completion counter increases, we know something has happened; for simple communication, this is enough.
We can bind different counters or queues if we want to distinguish between incoming and outgoing completions.
Queues also enable us to keep track of an action based on a context we may freely choose (it is basically an ID).</p>
<p>If you want a more detailed explanation, the official introduction to the interface can be found <a
class="gblog-markdown__link"
href="https://ofiwg.github.io/libfabric/v1.13.2/man/fabric.7.html"
>here</a>.</p>
<div class="gblog-post__anchorwrap">
<h2 id="conclusion-and-first-measurements">
Conclusion and first measurements
<a data-clipboard-text="https://blog.parcio.de/posts/2022/04/libfabric/#conclusion-and-first-measurements" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Conclusion and first measurements" href="#conclusion-and-first-measurements">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Libfabric allows using different fabrics with the same interface.
This way, you can write RDMA-compatible code, and Libfabric also makes it work on systems that do not support RDMA.</p>
<p><figure class="light-only"><img src="julea-gsocket-vs-libfabric-operations.png"
alt="Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket."/><figcaption>
<p>Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket.</p>
</figcaption>
</figure>
<figure class="light-only"><img src="julea-gsocket-vs-libfabric-throughput.png"
alt="Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket."/><figcaption>
<p>Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket.</p>
</figcaption>
</figure>
</p>
<p><figure class="dark-only"><img src="julea-gsocket-vs-libfabric-operations-dark.png"
alt="Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket."/><figcaption>
<p>Comparing the performance of JULEA with GSocket using the operations per second for object creation and deletion. This shows that the performance via TCP is slightly in favor of Libfabric and that InfiniBand is multiple orders of magnitude faster than TCP, but impossible to use with GSocket.</p>
</figcaption>
</figure>
<figure class="dark-only"><img src="julea-gsocket-vs-libfabric-throughput-dark.png"
alt="Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket."/><figcaption>
<p>Comparing performance of JULEA with GSocket and Libfabric network code using the throughput of read and write operations. Shows that performance via TCP is similar, while performance via InfiniBand with Libfabric is multiple orders of magnitude faster, while impossible to use with GSocket.</p>
</figcaption>
</figure>
</p>
<p>We already tested it in JULEA.
We rewrote the <code>GSocket</code> network code with Libfabric.
This resulted in working InfiniBand and RDMA support.
But even without RDMA, its performance is still similar to the <code>GSocket</code> implementation.</p>
<p>Therefore, Libfabric enables us to use the most efficient fabric available without having to modify the code.</p>
heimdallr: Compile time correctness checking for message passing in Rusthttps://blog.parcio.de/posts/2021/11/heimdallr/Michael Blesel2021-11-18T00:00:00+00:002021-11-18T00:00:00+00:00
<p>In this post we will look at how the Rust programming language and its built-in correctness features can be applied to the message passing parallelization method.
We will see how Rust’s memory safety features can be leveraged to design a message passing library which we call heimdallr.
It is able to detect parallelization errors at compile time that would go unnoticed by the compiler when using the prevalent message passing interface MPI.</p>
<p>For readers who are new to this topic we will start with a very brief synopsis of message passing.
In the field of high performance computing (HPC), parallel programs are executed on large computing clusters with often hundreds of computing nodes.
Running an application in parallel on more than one computing node requires different parallelization techniques than multi-threading because the computing nodes do not have shared memory.
Therefore, a mechanism for sharing data between processes running on different nodes is needed.
In HPC, the standard method of achieving this is called message passing.
The applications have to explicitly send and receive the data that needs to be shared over a network.
The most commonly used library for this is called MPI which stands for Message Passing Interface.</p>
<p>At the start of an MPI application every participating process is given an ID (often called rank) that can be used to differentiate between them in the code.
MPI then provides many different send and receive functions with varying semantics such as blocking/non-blocking and synchronous/asynchronous.
Additionally collective operations such as barriers for synchronization or broadcast/gather operations are provided.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="hl"><span class="lnt"> 1
</span></span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="hl"><span class="lnt"> 4
</span></span><span class="hl"><span class="lnt"> 5
</span></span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="lnt">11
</span><span class="lnt">12
</span><span class="hl"><span class="lnt">13
</span></span><span class="lnt">14
</span><span class="lnt">15
</span><span class="hl"><span class="lnt">16
</span></span><span class="lnt">17
</span><span class="lnt">18
</span><span class="hl"><span class="lnt">19
</span></span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="hl"><span class="n">MPI_Init</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span>
<span class="kt">int</span> <span class="n">rank</span><span class="p">,</span><span class="n">size</span><span class="p">;</span>
<span class="hl"><span class="n">MPI_Comm_rank</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">rank</span><span class="p">);</span>
</span><span class="hl"><span class="n">MPI_Comm_size</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">size</span><span class="p">);</span>
</span>
<span class="kt">double</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">double</span><span class="p">)</span> <span class="o">*</span> <span class="n">BUF_SIZE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">BUF_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="hl"> <span class="n">MPI_Send</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_FLOAT</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">);</span>
</span><span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Recv</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_FLOAT</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
</span><span class="p">}</span>
<span class="hl"><span class="n">MPI_Finalize</span><span class="p">();</span>
</span></code></pre></td></tr></table>
</div>
</div><p>Here we can see a simple MPI program.
After MPI’s initialization in line 1, each process asks for its own <code>rank</code> and the total number of participating processes (here called <code>size</code>) in lines 4-5.
The goal of the program is to send a message containing the contents of the <code>buf</code> array from process 0 to process 1.
This message exchange happens in lines 13 and 16, where process 0 uses the <code>MPI_Send</code> function to send the message and process 1 receives it with the <code>MPI_Recv</code> function.</p>
<p>As we can see, the MPI functions take a lot of arguments but only the first four are important to follow this example.
First comes a pointer to the buffer that is being sent from and received into.
The next two arguments specify the number of elements that are sent and their data type, which is needed to calculate the correct number of bytes that will be sent.
Lastly, the target or source process rank for the operation is specified.
In this example, process 0 targets process 1 with its send operation, and process 1 tries to receive the data from process 0.</p>
<p>An avid reader might already have spotted a problem in the example code.
The data type of the <code>buf</code> array is <code>double</code>, but <code>MPI_FLOAT</code> is specified in the MPI function calls.
This is in fact a bug: since a <code>float</code> is only half the size of a <code>double</code> on typical platforms, only half of the array’s data is transmitted.</p>
<p>These kinds of parallelization errors can be hard to track down in real programs because no crash occurs; the results of the program are simply wrong.
Furthermore, neither the C compiler nor the MPI library is able to detect this error and warn the user.
Programming with MPI has many such pitfalls, which are often due to MPI’s low-level nature combined with the dangers of C memory management with <code>void</code> pointers.</p>
<div class="gblog-post__anchorwrap">
<h2 id="compile-time-correctness-through-rust">
Compile time correctness through Rust
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#compile-time-correctness-through-rust" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Compile time correctness through Rust" href="#compile-time-correctness-through-rust">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>Rust is a modern system programming language that focuses on memory and concurrency safety with strong compile time correctness checks.
In recent times, Rust has garnered more and more attention in circles where C is currently the predominant language but a safer solution is desired.
In the field of HPC, C/C++ and Fortran are by far the most widely used languages.
They provide great performance, have been around for a long time and there exists a lot of infrastructure in the form of libraries and tools for them.
However, these languages do come with their drawbacks which can often be found in aspects like usability, programmability and a general lack of modern features.</p>
<p>Developing massively parallel programs for HPC is a complicated task, and in our opinion the languages and libraries used should provide developers with as much help as possible.
Therefore we asked ourselves whether a language like Rust could provide an easier programming experience for message passing applications by avoiding and detecting as many errors in parallel code as possible at compile time.</p>
<p>Out of this research a Rust message passing library called <a
class="gblog-markdown__link"
href="https://github.com/parcio/heimdallr"
>heimdallr</a> was developed.
heimdallr should currently be seen as a prototype implementation, but it already demonstrates correctness checks that are nonexistent for MPI.</p>
<div class="gblog-post__anchorwrap">
<h2 id="eliminating-type-safety-errors-with-generics">
Eliminating type safety errors with generics
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#eliminating-type-safety-errors-with-generics" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Eliminating type safety errors with generics" href="#eliminating-type-safety-errors-with-generics">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>In the previous example, one might ask why it is necessary for the user to manually specify the concrete data type of a buffer when this is information that the compiler should absolutely be able to derive by itself.
The type safety problems with MPI stem from the fact that the whole API works on untyped memory addresses for data buffers, using C’s <code>void</code> pointers so that the MPI functions can accept any type of data.
The type information is therefore explicitly discarded and must be manually passed to an MPI function call by the user.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="hl"><span class="lnt"> 1
</span></span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="hl"><span class="lnt"> 8
</span></span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="hl"><span class="n">let</span> <span class="n">client</span> <span class="o">=</span> <span class="n">HeimdallrClient</span><span class="o">::</span><span class="n">init</span><span class="p">(</span><span class="n">env</span><span class="o">::</span><span class="n">args</span><span class="p">()).</span><span class="n">unwrap</span><span class="p">();</span>
</span><span class="n">let</span> <span class="n">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="n">vec</span><span class="o">!</span><span class="p">[</span><span class="mf">0.0</span><span class="p">;</span><span class="n">BUF_SIZE</span><span class="p">];</span>
<span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mf">0.</span><span class="p">.</span><span class="n">BUF_SIZE</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="hl"> <span class="n">client</span><span class="p">.</span><span class="n">send</span><span class="p">(</span><span class="o">&</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">1</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">receive</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Here we see an equivalent program written in Rust with our heimdallr message passing library.
First of all, it is apparent that the message passing code is less verbose when compared to its MPI counterpart.
Our design principles with heimdallr are safety and usability.
From the usability perspective we can see that some of the boilerplate code that is necessary in MPI, like for example manually asking for and storing a process’s rank variable, is not required with heimdallr.</p>
<p>More importantly, the previously discussed type safety issue for sending a data buffer does not come up with heimdallr.
We are making use of the language’s generic programming features to let the compiler handle the type deduction of a transmitted variable.
This not only makes it safer but also easier to use for a developer.</p>
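<p>To make this concrete, here is a minimal Rust sketch (not heimdallr’s actual API; the <code>send</code> function shown is hypothetical) of such a generic signature: the element type is fixed by the buffer argument itself, so the caller has no way to claim a wrong type.</p>

```rust
// Hypothetical generic send signature: the compiler infers T from the buffer,
// so the declared element type can never disagree with the data, unlike
// MPI's untyped void* buffers.
fn send<T>(buf: &[T], dest: u32, tag: u32) -> usize {
    // A real implementation would serialize and transmit `buf`; here we only
    // compute the transfer size to show it is derived, not user-supplied.
    let bytes = buf.len() * std::mem::size_of::<T>();
    println!("sending {} bytes to {} (tag {})", bytes, dest, tag);
    bytes
}

fn main() {
    let buf = vec![42.0_f64; 4];
    // T = f64 is inferred from `buf`; an MPI_FLOAT-style mismatch cannot happen.
    let bytes = send(&buf, 1, 0);
    assert_eq!(bytes, 4 * 8);
}
```

<p>Note that the byte count is computed inside the library from <code>size_of::&lt;T&gt;()</code>, so the whole class of count/type mismatches from the earlier C example disappears.</p>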
<p>Of course, Rust is by no means the only modern language to provide generic programming features, and this interface change to the <code>send</code> and <code>receive</code> functions could have been done in a myriad of languages.
Therefore we should go on to an example where some of Rust’s unique features allow us to provide a safer message passing interface to the users.</p>
<div class="gblog-post__anchorwrap">
<h2 id="ensuring-buffer-safety-for-non-blocking-communication">
Ensuring buffer safety for non-blocking communication
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#ensuring-buffer-safety-for-non-blocking-communication" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Ensuring buffer safety for non-blocking communication" href="#ensuring-buffer-safety-for-non-blocking-communication">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>As previously mentioned, MPI provides multiple send and receive functions with varying semantics.
The most basic form of message passing is called <em>blocking</em>.
When a message passing function is called in this context the sender process is blocked until the data buffer that is being sent is guaranteed to have been processed by the message passing library.
The receiving process is also blocked until the contents of the incoming message have been safely copied into the receiving data buffer.
This form of message passing is the most intuitive from a user’s perspective but it can also be subpar from a performance perspective due to the resulting idle times for both processes.</p>
<p>A solution that is often better suited from the performance perspective is the use of so called <em>non-blocking</em> communication.
Here the process of passing the message is handled in the background and the program can continue with its execution almost immediately.
This type of message passing however does not come without dangers, as we will see in the following code snippet.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="hl"><span class="lnt">2
</span></span><span class="lnt">3
</span><span class="hl"><span class="lnt">4
</span></span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="hl"><span class="lnt">8
</span></span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Isend</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_DOUBLE</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">req</span><span class="p">);</span>
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">BUF_SIZE</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="hl"> <span class="n">MPI_Recv</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">BUF_SIZE</span><span class="p">,</span> <span class="n">MPI_DOUBLE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">status</span><span class="p">);</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>In this example process 0 tries to send a buffer to process 1 using MPI’s non-blocking send function <code>MPI_Isend</code>.
The non-blocking send operation in line 2 allows process 0 to continue its execution before the sending of the message has concluded.
The problem arises in lines 3-4 where process 0 also modifies the contents of the data buffer that is being sent.
Since the message transfer might still be in progress, this may also modify the contents of the sent message and thereby cause erroneous behavior that the programmer did not intend.</p>
<p>This is a known safety issue with the use of non-blocking communication in MPI.
A data buffer that is used in a non-blocking operation is in an <em>unsafe</em> state until it has been made sure that the message passing operation on it has concluded.
To check the status of a non-blocking operation and thereby the safety status of its data buffer, MPI provides functions like <code>MPI_Wait</code> that block the current process until the referenced message passing operation is confirmed to be finished.
The MPI standard requires such a function to be called before a data buffer that has been used in non-blocking communication is accessed again.
Adding an <code>MPI_Wait</code> call between lines 2 and 3 of the example code would make this program work correctly.</p>
<p>The problem with all of this is that MPI requires the programmer to always remember this behavior: neither the library nor the compiler is able to detect buffer safety errors in non-blocking communication and warn the user.</p>
<div class="gblog-post__anchorwrap">
<h2 id="leveraging-rusts-ownership-for-buffer-safety">
Leveraging Rust’s ownership for buffer safety
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#leveraging-rusts-ownership-for-buffer-safety" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Leveraging Rust’s ownership for buffer safety" href="#leveraging-rusts-ownership-for-buffer-safety">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>The core concept of Rust’s memory management is the so called <em>ownership</em> feature.
Every data object in Rust has exactly one owner.
Once the owning variable goes out of scope, the data is automatically deallocated.
There can be references to an object but only within a limited rule-set.
A variable can either have an unlimited number of immutable (read-only) references or exactly <strong>one</strong> mutable reference.
These limitations allow the Rust compiler to reason about correct memory usage.</p>
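<p>The ownership move that heimdallr relies on can be demonstrated in plain Rust without any actual message passing. In the following sketch, <code>fake_send_nb</code> is a hypothetical stand-in for a non-blocking send: once the buffer is passed by value, the caller loses access to it until the handle gives it back.</p>

```rust
// Stand-in for a non-blocking send: it takes ownership of the buffer and
// returns a handle from which ownership can later be reclaimed.
struct Handle {
    buf: Vec<f64>,
}

impl Handle {
    // Equivalent in spirit to MPI_Wait: blocks until the transfer is done
    // (trivially so in this sketch) and returns ownership of the buffer.
    fn data(self) -> Vec<f64> {
        self.buf
    }
}

fn fake_send_nb(buf: Vec<f64>) -> Handle {
    Handle { buf }
}

fn main() {
    let mut buf = vec![42.0; 4];
    let handle = fake_send_nb(buf);
    // buf[0] = 0.0; // would not compile: `buf` was moved into fake_send_nb
    buf = handle.data(); // ownership returned; access is safe again
    buf[0] = 0.0;
    println!("{:?}", buf);
}
```

<p>Uncommenting the marked line makes the program fail to compile, which is precisely the compile-time protection that the MPI version of this code lacks.</p>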
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="hl"><span class="lnt">2
</span></span><span class="hl"><span class="lnt">3
</span></span><span class="lnt">4
</span><span class="lnt">5
</span><span class="lnt">6
</span><span class="lnt">7
</span><span class="hl"><span class="lnt">8
</span></span><span class="lnt">9
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="hl"> <span class="n">let</span> <span class="n">handle</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">send_nb</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">handle</span><span class="p">.</span><span class="n">data</span><span class="p">()</span><span class="o">?</span><span class="p">;</span>
</span> <span class="k">for</span> <span class="n">i</span> <span class="n">in</span> <span class="mf">0.</span><span class="p">.</span><span class="n">BUF_SIZE</span> <span class="p">{</span>
<span class="n">buf</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mf">42.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="n">client</span><span class="p">.</span><span class="n">id</span> <span class="o">==</span> <span class="mi">1</span> <span class="p">{</span>
<span class="hl"> <span class="n">buf</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">receive</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
</span><span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This is the heimdallr equivalent of the non-blocking MPI code that we have seen previously.
The send operation in line 2 makes use of Rust’s ownership concept to protect the data buffer that is being sent.
Since there can be only one owner of the <code>buf</code> variable, passing it directly to a function call means that the ownership is moved into the function.
This has the side effect that <code>buf</code> is no longer accessible from outside the function.
Therefore it is impossible to modify the data buffer while the message passing operation is running.
Trying to do so would lead to a compilation error.
For a user to access the data again they need to request ownership back from the message passing operation, which happens in line 3.
The <code>data</code> function called there on the <code>handle</code> that was returned by the non-blocking send function is an equivalent to <code>MPI_Wait</code>.
It blocks until the used data buffer is safe to be accessed again and then returns the ownership to the caller.</p>
<p>So in essence it is the same workflow as for an MPI application, but Rust’s ownership rules allow the library to be designed in a way where correct and safe usage of non-blocking communication can be enforced at compile time.
This is a big step up in usability and correctness: it is no longer the user’s task to remember the implicit rules of non-blocking communication; instead, violating them is a compile-time error.</p>
<p>This is of course just one small example of how Rust’s safety features can be used to design safer interfaces, but in our opinion it showcases the possibilities very well.</p>
<div class="gblog-post__anchorwrap">
<h2 id="conclusion-and-further-reading">
Conclusion and further reading
<a data-clipboard-text="https://blog.parcio.de/posts/2021/11/heimdallr/#conclusion-and-further-reading" class="gblog-post__anchor clip flex align-center" aria-label="Anchor Conclusion and further reading" href="#conclusion-and-further-reading">
<svg class="gblog-icon gblog_link"><use xlink:href="#gblog_link"></use></svg>
</a>
</h2>
</div>
<p>This blog post is meant to give a brief overview of the challenges of message passing parallelization and how the programming interfaces used for it could be designed in a safer way.
Parallel programming is a complex topic and introduces a variety of new error classes.
Therefore we find it very important that the libraries and tools used for it offer as much help as possible to developers by enforcing correctness and detecting possible errors.</p>
<p>The heimdallr library introduced in this post is a prototype implementation of a message passing library that concentrates on the compile time correctness aspects.
It is not yet feature-complete and is mainly meant to show some of the possibilities for better usability and safety in message passing.</p>
<p>To keep this post brief, we have not gone into much detail about the implementation or the open problems that remain with this solution.
We also did not talk about performance, which is an important topic in the context of HPC.</p>
<p>If your interest was piqued, a more detailed discussion about the pros and cons of heimdallr can be found in our <a
class="gblog-markdown__link"
href="https://doi.org/10.1007/978-3-030-90539-2_13"
>heimdallr paper</a>.
There, we also discuss some of the problems with the current implementation and show benchmark results where heimdallr’s performance is compared to MPI.</p>
<p>If you would like to try out heimdallr or have a look at the code, you can visit our <a
class="gblog-markdown__link"
href="https://github.com/parcio/heimdallr"
>GitHub</a> repository.</p>
Performance of conditional operator vs. fabshttps://blog.parcio.de/posts/2021/09/conditional-vs-fabs/Michael Kuhn2021-09-21T00:00:00+00:002021-09-21T00:00:00+00:00
<p>Today, we will take a look at potential performance problems when using the conditional operator <code>?:</code>.
Specifically, we will use it to calculate a variable’s absolute value and compare its performance with that of the function <code>fabs</code>.</p>
<p>Assume the following numerical code written in C, where we need to calculate the absolute value of a <code>double</code> variable called <code>residuum</code>.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>
Since we want to perform this operation within the inner loop, we will have to keep performance overhead as low as possible.
To reduce dependencies on math libraries and avoid function call overhead, we manually get the absolute value by first checking whether <code>residuum</code> is less than <code>0</code> and, if it is, negating it using the <code>-</code> operator.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="hl"> <span class="n">residuum</span> <span class="o">=</span> <span class="p">(</span><span class="n">residuum</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="o">?</span> <span class="o">-</span><span class="nl">residuum</span> <span class="p">:</span> <span class="n">residuum</span><span class="p">;</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>This looks easy enough and, in theory, should provide satisfactory performance.
Just to be sure, let’s do the same using the <code>fabs</code> function from the math library, which returns the absolute value of a floating-point number.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o"><</span> <span class="mi">1000</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span>
<span class="p">{</span>
<span class="hl"> <span class="n">residuum</span> <span class="o">=</span> <span class="n">fabs</span><span class="p">(</span><span class="n">residuum</span><span class="p">);</span>
</span> <span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></td></tr></table>
</div>
</div><p>Let’s compare the two implementations using <a
class="gblog-markdown__link"
href="https://github.com/sharkdp/hyperfine"
>hyperfine</a>.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="lnt"> 7
</span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="lnt">10
</span><span class="hl"><span class="lnt">11
</span></span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Benchmark #1: ./conditional
Time (mean ± σ): 476.3 ms ± 0.4 ms [User: 474.5 ms, System: 0.7 ms]
Range (min … max): 475.6 ms … 476.8 ms 10 runs
Benchmark #2: ./fabs
Time (mean ± σ): 243.8 ms ± 2.0 ms [User: 242.2 ms, System: 0.8 ms]
Range (min … max): 242.1 ms … 249.0 ms 12 runs
Summary
'./fabs' ran
<span class="hl"> 1.95 ± 0.02 times faster than './conditional'
</span></code></pre></td></tr></table>
</div>
</div><p>As we can see, the <code>fabs</code> implementation ran more than 1.9 times faster!
Where does this massive performance difference come from?
Let’s use <code>perf stat</code> to analyze the two implementations in a bit more detail.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Performance counter stats for './conditional':
478,51 msec task-clock:u # 0,998 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
55 page-faults:u # 114,940 /sec
<span class="hl"> 2.035.211.626 cycles:u # 4,253 GHz (83,28%)
</span> 1.592.587 stalled-cycles-frontend:u # 0,08% frontend cycles idle (83,28%)
223.899 stalled-cycles-backend:u # 0,01% backend cycles idle (83,28%)
<span class="hl"> 4.009.332.175 instructions:u # 1,97 insn per cycle
</span> # 0,00 stalled cycles per insn (83,32%)
2.001.712.079 branches:u # 4,183 G/sec (83,49%)
1.503.325 branch-misses:u # 0,08% of all branches (83,34%)
0,479296441 seconds time elapsed
0,474423000 seconds user
0,001996000 seconds sys
</code></pre></td></tr></table>
</div>
</div><p>The most important metrics here are the numbers of instructions and cycles.
At 1.97 instructions per cycle, the roughly 4,000,000,000 instructions require about 2,000,000,000 cycles; since our processor runs at around 4,250,000,000 cycles per second, this results in the measured runtime of 0.48 seconds.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt"> 1
</span><span class="lnt"> 2
</span><span class="lnt"> 3
</span><span class="lnt"> 4
</span><span class="lnt"> 5
</span><span class="lnt"> 6
</span><span class="hl"><span class="lnt"> 7
</span></span><span class="lnt"> 8
</span><span class="lnt"> 9
</span><span class="hl"><span class="lnt">10
</span></span><span class="lnt">11
</span><span class="lnt">12
</span><span class="lnt">13
</span><span class="lnt">14
</span><span class="lnt">15
</span><span class="lnt">16
</span><span class="lnt">17
</span><span class="lnt">18
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-plain" data-lang="plain">Performance counter stats for './fabs':
245,48 msec task-clock:u # 0,997 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
51 page-faults:u # 207,757 /sec
<span class="hl"> 1.039.265.407 cycles:u # 4,234 GHz (83,31%)
</span> 1.720.716 stalled-cycles-frontend:u # 0,17% frontend cycles idle (83,30%)
356.067 stalled-cycles-backend:u # 0,03% backend cycles idle (83,30%)
<span class="hl"> 3.007.112.338 instructions:u # 2,89 insn per cycle
</span> # 0,00 stalled cycles per insn (83,29%)
1.003.303.373 branches:u # 4,087 G/sec (83,46%)
1.662.984 branch-misses:u # 0,17% of all branches (83,34%)
0,246272015 seconds time elapsed
0,243024000 seconds user
0,000977000 seconds sys
</code></pre></td></tr></table>
</div>
</div><p>The reduction from 2,000,000,000 to 1,000,000,000 cycles corresponds to the performance improvement of 1.95.
Using the <code>fabs</code> function reduced the number of instructions by roughly 25% and, at the same time, increased the number of instructions per cycle to 2.89 (a factor of 1.47).
Getting rid of the conditional operator reduced the number of branches by half, allowing the processor to process more instructions per cycle.
The conditional operator is more or less shorthand for an <code>if</code> statement and introduced a significant number of branches into our inner loop.</p>
<p>Running three nested loops with 1,000 iterations each resulted in 1,000,000,000 inner loop iterations, that is, we saved one instruction per inner loop iteration.
These branch and instruction differences can be checked in even more detail using <code>objdump -S</code>; this is left as an exercise for the reader.</p>
<p>The magnitude of these performance differences is rather surprising and shows that it makes sense to check even seemingly simple code for potential performance problems.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>The code shown is only an excerpt, the full code is available <a
class="gblog-markdown__link"
href="conditional-vs-fabs.c"
>here</a>. It was compiled with GCC 11.2 using the <code>-O2 -Wall -Wextra -Wpedantic</code> flags and linked against the math library with <code>-lm</code>. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>hyperfine performs a statistical performance analysis. It runs the provided commands multiple times to reduce the influence of random errors and calculates derived metrics such as the mean and standard deviation. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>