Packaging for Arch Linux

In Arch, a recap I elaborated a bit on my reasons for getting involved with Arch Linux. In this post I would like to highlight a few technical details and give a "behind the scenes" when it comes to packaging on and for Arch Linux. This post is written from the viewpoint of a distribution packager, but it is likely to contain information also useful to people packaging on different distributions or for private purposes.

Arch Linux is a Linux distribution that offers binary packages in software repositories (aka repos). To achieve this, packages are built from source files using tooling that is developed by the distribution and various volunteers. The resulting binary packages are then provided to users on mirrors of the distribution (i.e. package files and their cryptographic signatures are provided by web servers) and are downloaded, verified, validated and installed using a package manager.

The upsides of a central software package system facilitating binary repos are

  • users do not have to build the software on their systems themselves, which e.g. for web browsers can take a very long time and eat a lot of energy

  • software for the entire system can be updated with one command and only takes as long as download and extraction of a given set of packages

Packages

When looking at the concept of binary software packages it probably helps to consider the point of view from e.g. Windows and macOS, which both provide software to users in different ways and give a good case for comparison. In case you already know how binary packages function and compare, skip this section.

For brevity I will skip the proprietary app stores in the below examples as they abstract the concept of software installation to the point where this is opaque to the user and delivers no direct comparison in the context of packages (while under the hood most app stores use the below mentioned technologies).

Windows

On Windows, software is usually provided by means of an installer (e.g. shipped as a .exe or .msi file). An installer usually needs to be downloaded from third-party websites (often without verification) and each one is then executed individually. The installer often already contains the (prebuilt) files to be installed (sometimes files are also downloaded by the installer application on-the-fly), offers some form of modification (e.g. the installation location), installs the bundled or downloaded files and modifies the system's registry (e.g. for auto-start or other features). Although Microsoft has attempted to consolidate its installation backends, the user experience is usually still a mixed bag. System updates (those modifying the operating system) are handled by the OS itself and the user usually does not have much of a say in when or how that happens (this can be modified to some extent). Additionally, some hardware may not be able to run the latest version of Windows due to software-based planned obsolescence.

macOS

On macOS software can be installed using images or installers (shipped as e.g. .dmg and .pkg respectively). The download of the files in question usually works the same way as it does on Windows (unverified downloads from third-party websites). With images the user experience is usually "drag and drop" from a mounted image to the list of applications, whereas installers offer functionality similar to installers on Windows (e.g. setting up auto-start). System updates, similar to those on Windows, are handled by macOS itself, and here too software-enforced planned obsolescence is a thing.

Looking at the above examples it becomes clear that automation on both platforms is quite terrible: The distinction between OS updates and "other software" leads to a mix-and-match approach towards updates that is (if at all) only partly remedied by externally developed and provided package managers for some of the "other software" (e.g. Homebrew or Chocolatey), but at best remains a fragmented experience for the user.

Linux

On distributions that offer binary package repositories, users use a package manager to install packages and to upgrade all software [1] on their system. Packages are essentially archive files that are downloaded, verified and extracted by the package manager. As the files contained in (distribution) packages follow a well-defined location schema (e.g. filesystem hierarchy standard or file-hierarchy), the system can check for file conflicts and users can usually make reasonable assumptions about where the files of a package are located (package managers usually also track the files of all packages). Additional functionality, such as post-install scripts (e.g. to create users or to change ownership of files), is usually contained in package files and executed after installation. However, on systemd-based distributions, many of the post-installation tasks have been streamlined with the help of sysusers.d and tmpfiles.d (more on that later). Some distributions also make use of non-standardized hooks (see alpm-hooks for how this is implemented for pacman) that are used by the package manager for certain tasks on files that are not owned by one specific package (e.g. updating the font cache).

Build tooling

The most basic build tooling for Arch Linux - makepkg - is bundled with Pacman (the package manager used to install all software packages on the distribution). It is used in conjunction with a PKGBUILD, which as a package source file describes where/ how to get a package's source files (and in which version), how to build and test it (if applicable) and how/ where to install it.

In case you have experience with Bash: Both makepkg and PKGBUILDs are written in it.

When building packages with plain makepkg, the package is built in the context of the user's system and as such makes use of the software available on that system. While this works (given all dependencies are met), it is not recommended, as the user's system may use custom packages or makepkg settings (see makepkg.conf) that can alter the outcome of the build, which may make the package unusable on another system.

Clean chroots

To enable builds that are done in a clean environment (i.e. one that only has official distribution packages installed and does not depend on configuration or custom packages on a local system), Arch Linux and various contributors have created special build tooling, which is contained in the devtools project.

With the help of makechrootpkg one can run makepkg in a systemd-nspawn based chroot, which will only have those packages installed that are required for building, testing and running a given package.

Using makechrootpkg and its various repository-specific symlinks is how Arch Linux packagers build all packages in the official repositories.

An implicit upside of using makechrootpkg to build packages is that checkpkg and namcap are run on the resulting package, which can give valuable hints at possible improvements of the package.
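As a rough sketch of what a manual clean chroot build looks like, the following prints the commands instead of running them unless DRY_RUN=0 is set (running them for real requires an Arch Linux system with devtools installed). The chroot location and the small run helper are assumptions made for this illustration.

```shell
# Sketch of a manual clean chroot build with devtools.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
CHROOT=${CHROOT:-$HOME/chroot}   # assumed location, any directory works

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# One-time setup: create the chroot and install the base-devel group into it.
run mkdir -p "$CHROOT"
run mkarchroot "$CHROOT/root" base-devel

# Per build, from the directory containing the PKGBUILD:
# -c cleans the working copy of the chroot, -r names the chroot directory.
run makechrootpkg -c -r "$CHROOT"
```

The repository-specific wrappers mentioned above (e.g. extra-x86_64-build) take care of these steps in one go, which is why packagers rarely call makechrootpkg by hand.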

Building packages

First off: There are sometimes a lot of subtleties involved with packaging and especially producing packages that are of good quality. In the following sections I will discuss a few tools and packaging specifics, that may seem quite overwhelming or complicated at first. Luckily, a lot of the tooling is fairly well documented and it is probably always good to remember, that everyone is a learner and that as the tooling and the best practices evolve, this is an open-ended topic.

A good starting point is always to use makechrootpkg and to adhere to the Arch package guidelines.

Getting package build sources

The act of getting the sources for a binary package is described in the context of the Arch Build System (ABS). While users without write access to the Arch Linux source repositories can rely upon asp to get to the package build sources, the official packagers rely on a rather organically grown packaging workflow, that is described in HOWTO be a packager.

At the time of writing, Arch Linux still relies on two monolithic svn repositories for the package build sources (one for the [core] and [extra] repositories and one for the [community] and [multilib] repositories), which are exported to git via git-svn on the official Arch Linux GitHub organization (svntogit-packages and svntogit-community, respectively).

PKGBUILDs

As mentioned earlier, PKGBUILD files are really just Bash scripts that are evaluated by makepkg. As such they define a few variables and functions (some of which are required, others optional).
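To make the "just a Bash script" point tangible, here is a minimal sketch that writes a made-up PKGBUILD fragment to a temporary file and reads its variables from a shell. This is not how makepkg is implemented; it only illustrates that the file is ordinary shell syntax.

```shell
# Sketch: a PKGBUILD is a shell script that sets variables.
# The content below is a made-up fragment, not a real package.
tmpdir=$(mktemp -d)
cat > "$tmpdir/PKGBUILD" <<'EOF'
pkgname=dummy-package
pkgver=0.1.0
pkgrel=1
EOF

# Read the file in a subshell so its variables do not leak into this shell.
# Only do this with files you trust: sourcing executes arbitrary code.
full_version=$(
  . "$tmpdir/PKGBUILD"
  printf '%s-%s-%s\n' "$pkgname" "$pkgver" "$pkgrel"
)

echo "$full_version"
rm -r "$tmpdir"
```

The printed string is the stem that, together with the architecture, ends up in the file name of the built package.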

The example below shows a bare minimum PKGBUILD, derived from the prototype files that can be found in /usr/share/pacman/:

# Maintainer: Your Name <youremail@domain.com>
pkgname=dummy-package
pkgver=0.1.0
pkgrel=1
pkgdesc="A dummy package"
arch=(any)
url="https://my-upstream.link/to/dummy-package"
license=(GPL3)
depends=(another-package)
optdepends=('some-additional: for additional feature X')
source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
b2sums=('THISISADUMMYCHECKSUM')

package() {
  make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
}

To go through the essentials of this very minimalistic example, which assumes that we have a project using make to install a few files:

  • While the Maintainer comment is technically not required, it is always helpful for others trying to contact the author of a given package build source

  • pkgname: The name of the package. Refer to the wiki section PKGBUILD#pkgname for further info (e.g. restrictions)

  • pkgver: The (upstream) version of the package. Refer to the wiki section PKGBUILD#pkgver for further info (e.g. restrictions)

  • pkgrel: The release version of the package, which identifies the build of the particular package in version pkgver. This is a string specific to Arch Linux (see PKGBUILD#pkgrel) and is not related to the upstream version of the software.

  • pkgdesc: A short description of what this package provides

  • arch: The architecture of the resulting package. As this is an array, it can contain several entries (makepkg will invoke a build for each architecture). At the time of writing Arch Linux only supports the x86_64 and any architectures.

  • url: The URL of the upstream project (e.g. a website or a link to the version control sources)

  • license: The licenses that apply to the project. This again is an array and may contain several licenses. In case licenses that are not covered by the licenses package are encountered, their license files must be installed in the package() function (refer to the wiki section PKGBUILD#licenses for further information).

  • depends: An array of runtime dependencies for the package. They will be installed automatically during build when building with makechrootpkg or makepkg -s.

  • optdepends: An array of optional dependencies and a short description about their purpose. These packages will not be installed during build time (for this makedepends needs to be used).

  • source: An array of resources for makepkg to retrieve. As makepkg is able to handle various version control systems, local and remote files, as well as to rename files, it is advisable to read the relevant man page section for makepkg.

  • b2sums: An array of checksums for all resources in the source array. It is advisable to use either (or all of) sha256sums, sha512sums or b2sums, as older hashing mechanisms are by now unsafe (see PKGBUILD#Integrity). The checksums are used to guard against changing (and potentially malicious) upstream resources. The resources and checksums for a new version of a given package may be retrieved and updated using updpkgsums (contained in the pacman-contrib package).

  • package(): This function defines all steps necessary to install the files of the upstream project to an empty location (represented by the magic variable "$pkgdir"), that will contain all installable files of the package. This function is called using fakeroot, which means that to the installing processes it looks like they are being executed by root.
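The purpose of the checksum arrays described in the list above can be sketched with plain coreutils. The example below uses sha256sum and a made-up file; it only mimics the idea (record a hash once, refuse anything that no longer matches) and is not the code path makepkg uses.

```shell
# Sketch: what a checksum entry guards against, using sha256sum from coreutils.
# File name and contents are made up for illustration.
workdir=$(mktemp -d)
tarball="$workdir/dummy-package-0.1.0.tar.gz"
printf 'original upstream release\n' > "$tarball"

# Record the checksum once, as a packager would when writing the PKGBUILD.
expected=$(sha256sum "$tarball" | cut -d ' ' -f 1)

# Hypothetical helper: succeeds only if the file still hashes to the recorded value.
verify() {
  local actual
  actual=$(sha256sum "$1" | cut -d ' ' -f 1)
  [ "$actual" = "$2" ]
}

verify "$tarball" "$expected" && first=ok || first=mismatch

# Simulate upstream silently replacing the tarball under the same name.
printf 'tampered release\n' > "$tarball"
verify "$tarball" "$expected" && second=ok || second=mismatch

echo "before: $first, after: $second"
rm -r "$workdir"
```

A checksum only proves that the file did not change after the packager looked at it. It says nothing about who produced the file, which is where signature validation comes in.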

PGP validation

Upstream project resources (e.g. signed source tarballs or git tags/ commits) can be validated using PGP.

Technically all that is required for this is that the validpgpkeys array in the PKGBUILD contains at least one retrievable PGP key ID and that the source array contains either a .sig or .asc file valid for one of the resources, or that a git object to be checked is targeted using the ?signed identifier (see makepkg#signature_checking and PKGBUILD#USING_VCS_SOURCES).
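A minimal sketch of the tarball case could look like the following PKGBUILD fragment. The URL is the dummy one used throughout this post and the fingerprint is a placeholder, so this would not validate anything as written.

```shell
# Hypothetical PKGBUILD fragment; URL and fingerprint are placeholders.
pkgname=dummy-package
pkgver=0.1.0

# Brace expansion yields two entries: the tarball and its detached signature.
source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz{,.sig})

# One checksum entry per source entry; the signature file itself is skipped.
b2sums=('THISISADUMMYCHECKSUM'
        'SKIP')

# Full fingerprint of the key upstream is known to sign its releases with.
validpgpkeys=('0123456789ABCDEF0123456789ABCDEF01234567')
```

The key also has to be present in the building user's keyring, otherwise makepkg cannot verify the signature.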

Although it is advisable to have cryptographic signature validation (e.g. using PGP) for releases, this should only be considered under the following circumstances in regards to an upstream project:

  • there is a track record of signing releases with the same key ID and the project specifically provides the expectable key ID publicly (e.g. on the website)

  • the project keeps a chain of trust between multiple and/or successive key IDs

  • no key easily used by multiple users is used (e.g. GitHub's PGP key, which can be used by multiple users of a given GitHub project and is not handled by the users themselves)

The first point is usually easy to check up on, while the 2nd might require getting in touch with the project developers if it happens (or happened in the past) - this is the case more often than you would think and does block package updates, as a new key ID must not be trusted without investigating the cause for a missing chain of trust to prevent a potential supply chain attack!

The 3rd point practically provides a false sense of security: A PGP key signed a release of a project, but in actuality multiple members of a project may have access to this functionality. From the outside it is impossible to tell who triggered a release and signed off on it (it could easily be malicious because someone's GitHub account has been hacked).

Reproducibility

Arch Linux as a distribution is committed to packages becoming bit-for-bit reproducible (have a look at the overarching Reproducible Builds project for more background information on the general topic). The status of the current packages in the official repositories is tracked on https://reproducible.archlinux.org, which is backed by rebuilderd.

After building a package it can be rebuilt using makerepropkg, which may use diffoscope on the resulting package in case it is not reproducible.

As the use of makerepropkg requires the PKGBUILD used to build the initial package, it can not be used when only a package file is available. However, for that use-case repro may be used.

Dealing with the strange

In building packages we have looked at some of the more basic use-cases. The following sub-sections will deal with more uncommon or very specific ones as well as problems at the intersection of build tooling and binary repository management.

Split packages

There are situations, in which one wants to build several packages from a single PKGBUILD. Those are usually:

  • the documentation of the project is very large

  • certain features (e.g. language bindings) are not required by the main application or use-case of the project

  • specific functionality would require a large tree of dependencies but is not required for the main application or use-case

In all three cases this can be handled using a split package setup in which the extra functionality (as a package) is declared an optional dependency of the main package.

To create a split package, the pkgname variable of the PKGBUILD is turned into an array, containing multiple package names, while the pkgbase variable (see PKGBUILD#pkgbase) should be set. Additionally, the generic package() function needs to be split up into specific functions for each package (prepare(), build() and check() are shared).

Using the example from PKGBUILDs, this is what it would look like when e.g. splitting out documentation (assuming that the upstream project provides separate install targets for the components).

# Maintainer: Your Name <youremail@domain.com>
pkgbase=dummy-package
pkgname=(dummy-package dummy-package-docs)
pkgver=0.1.0
pkgrel=1
pkgdesc="A dummy package"
arch=(any)
url="https://my-upstream.link/to/dummy-package"
license=(GPL3)
makedepends=(another-package)
source=(https://my-upstream.link/to/$pkgbase-$pkgver.tar.gz)
b2sums=('THISISADUMMYCHECKSUM')

package_dummy-package() {
  depends=(another-package)
  optdepends=(
    'dummy-package-docs: for documentation'
    'some-additional: for additional feature X'
  )

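  # use pkgbase for the shared source directory: inside package_*() functions
  # makepkg sets pkgname to the name of the package currently being created
  make DESTDIR="$pkgdir" install-scripts -C $pkgbase-$pkgver
}

package_dummy-package-docs() {
  make DESTDIR="$pkgdir" install-docs -C $pkgbase-$pkgver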
}
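Because pkgname becomes an array in a split package, it helps to be aware of how Bash expands it. The following sketch uses plain Bash without makepkg and the made-up names from the example above.

```shell
# Sketch in plain Bash (no makepkg involved) of how an array-valued pkgname behaves.
pkgbase=dummy-package
pkgname=(dummy-package dummy-package-docs)
pkgver=0.1.0

# A bare $pkgname is shorthand for ${pkgname[0]}, i.e. only the first entry.
first_only="$pkgname-$pkgver"

# pkgbase is a plain string and names the shared source directory unambiguously.
source_dir="$pkgbase-$pkgver"

echo "$first_only"
echo "$source_dir"

# Each entry corresponds to one package_<name>() function in the PKGBUILD.
printf 'package_%s\n' "${pkgname[@]}"
```

While makepkg runs the individual package functions, it sets pkgname to the name of the package currently being created, so paths that are shared between the split packages are best derived from pkgbase.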

Binary repository management

The resulting packages of a build process can be installed on a local machine, but are of course often more useful if other systems can install them as well. For this purpose repository sync databases exist, which pacman uses (see libalpm_databases) to determine the difference between a remote package repository and a local machine's state and to figure out which packages to upgrade.

The most rudimentary actions (adding and removing packages, optionally signing a database) on a binary repository can be done using repo-add and repo-remove, which are shipped with pacman. As the tooling is very basic, it does not offer any form of state tracking (i.e. a log of actions, such as additions or removals done to a sync database by a specific user).
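A sketch of these two basic operations is shown below. The commands are only printed unless DRY_RUN=0 is set; the repository directory, the database name "custom" and the run helper are assumptions made for this illustration.

```shell
# Sketch of basic repository management with repo-add and repo-remove.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
repo_dir=${repo_dir:-/srv/repo/custom}   # hypothetical repository location
db="$repo_dir/custom.db.tar.gz"          # hypothetical sync database name

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Add (or update) a package in the sync database.
# Passing --sign would additionally create a signature for the database.
run repo-add "$db" "$repo_dir/dummy-package-0.1.0-1-any.pkg.tar.zst"

# Remove a package from the sync database by its name.
run repo-remove "$db" dummy-package
```

A pacman.conf section named after the database and pointing its Server entry at the directory is then enough for other machines to consume the repository.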

At the time of writing Arch Linux packagers make use of dbscripts for the binary repository management, which also does (a form of) state tracking by interacting with and using the two svn-based monorepos for package build sources for this purpose. The tooling consists of a set of shell scripts (making use of repo-add and repo-remove internally) that are called by authorized users on a specific host over ssh. User authentication is therefore done using ssh, while user authorization is implemented using plain unix groups (different sets of packagers have access to [core] and [extra] vs. [community] and [multilib], often only for historical reasons). However, this setup is showing its age and comes with its own set of pitfalls:

  • changes to repositories are not externally auditable

  • package data is only checked rudimentarily

  • integrity of repository sync databases can not be guaranteed

  • repository sync databases can not be rebuilt to a specific state

  • setting the target binary repository for a package is a manual operation

  • due to the blocking nature of dbscripts, it is possible to brick the state of a repository if e.g. connection to the host running dbscripts is lost during the move of packages between two repositories

  • it is not possible to set up rebuild-specific staging repositories on the fly

  • many users need ssh access to a machine

All this being said, work is underway with arch-repo-management to provide a more manageable and easy to configure solution that runs as a service and does not rely on multiple users having direct access to a target system. One of the project's main focuses is to be able to verify incoming package data and to fully decouple the state from the repository sync databases (to be able to rebuild them whenever needed). Going forward it should become easier to set up ephemeral staging repositories to build against and safer to move data due to more atomic repository operations, while allowing externals to audit each repository's history. Currently the project is still far from being usable though and there are quite a few things left to be implemented. Switching from the current setup, in which both package build sources and binary repository state are handled by one version control system, to one where these concerns are separated is a hard problem, especially when one wants to get this right. I hope that going forward we will end up with a solution that can be easily contributed to and reused also outside of Arch Linux. I will write another post in the coming months that highlights work and concepts of arch-repo-management.

Distributing trust

Packagers use the makepkg.conf variables PACKAGER and GPGKEY to set the packager user ID (i.e. name and e-mail address) and the PGP key ID used for signing created packages.
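As a small illustration, a user-level configuration (e.g. ~/.makepkg.conf, which uses Bash syntax) might contain the following. Name, address and key fingerprint are placeholders.

```shell
# Hypothetical excerpt of a user's makepkg configuration.
# The packager identity is recorded in the metadata of every built package.
PACKAGER="Your Name <youremail@domain.com>"

# Fingerprint of the PGP key used when makepkg is asked to sign packages.
GPGKEY="0123456789ABCDEF0123456789ABCDEF01234567"
```

With these set, running makepkg with --sign creates a detached signature next to the built package file.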

Other users that wish to use packages signed by someone else need to import that other user's PGP public key using pacman-key.

Arch Linux maintains a web of trust between a set of main signing keys and all packagers and between all packagers amongst themselves (see the main keys page for an extensive overview). This setup allows for user systems to evaluate whether a given package signature done by a packager is considered trusted (see pacman.conf#PACKAGE_AND_DATABASE_SIGNATURE_CHECKING for further info). These constructs are system-wide PGP keyrings for the use with pacman and can be handled with pacman-key.

In the archlinux-keyring project the distribution trust of Arch Linux is maintained as a set of decomposed PGP public keys and the signatures on them. The custom tooling keyringctl (which uses sequoia's sq under the hood) is used to maintain (e.g. import public keys and signatures) a PGP keyring that is packaged in the archlinux-keyring package and which is automatically added and updated upon install.

At least three main signing key holders are required to uphold the web of trust. At least three valid main key signatures are required for a packager key (if it is itself still valid) to be allowed for distributing packages in the official Arch Linux repositories.

Sonames

Linux distributions mostly build C and C++ libraries and executables using dynamic linking. This implies that shared libraries usually provide a soname (e.g. libexample.so.1), which is in turn used (i.e. linked against) by one or more other libraries or executables. If the application binary interface (ABI) of the library in question changes, its soname should be increased as well (e.g. libexample.so.2). If a package with an updated soname is released and installed without rebuilding the packages depending on it, those will fail to load the (now non-existent) libexample.so.1 shared object.

A common task as a packager is therefore to do rebuilds for libraries and executables when a soname change is introduced. Depending on the library introducing the soname change or the library/executable being affected by it, this is sometimes a bit of a painful and time consuming experience. While it is not unheard of that projects either forget to introduce a soname change (silently breaking consumers) or accidentally downgrade their soname, consumers are more likely to run into trouble because of not yet implementing changes introduced by the ABI change (requiring patches not yet included in a stable release).

To safeguard against cases in which soname changes went unnoticed and packages are pushed to the repositories, it is possible to make use of makepkg's builtin dependency resolution. Extending upon the example in PKGBUILDs and assuming that libexample is the package providing libexample.so:

# Maintainer: Your Name <youremail@domain.com>
pkgname=libexample
pkgver=1.0.0
pkgrel=1
pkgdesc="A dummy library"
arch=(x86_64)
url="https://my-upstream.link/to/libexample"
license=(GPL3)
depends=(glibc)
provides=(libexample.so)
source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
b2sums=('THISISADUMMYCHECKSUM')

build() {
  make -C $pkgname-$pkgver
}

package() {
  make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
}

# Maintainer: Your Name <youremail@domain.com>
pkgname=dummy-package
pkgver=0.1.0
pkgrel=1
pkgdesc="A dummy package"
arch=(x86_64)
url="https://my-upstream.link/to/dummy-package"
license=(GPL3)
depends=(another-package libexample libexample.so)
optdepends=('some-additional: for additional feature X')
source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
b2sums=('THISISADUMMYCHECKSUM')

build() {
  make -C $pkgname-$pkgver
}

package() {
  make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
}

If during build time libexample provided libexample.so.1, the resulting dummy-package will now depend on libexample and libexample.so=1-64, which libexample provides.
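The versioned notation is made up of the unversioned library name, the soname version and the ELF class of the file. The helper below is hypothetical (it is not part of makepkg) and merely reproduces that notation from a given soname.

```shell
# Hypothetical helper: turn a soname and an ELF class into the notation above.
# The soname of an installed library can be read with e.g.
#   objdump -p /usr/lib/libexample.so | grep SONAME
soname_to_dep() {
  local soname=$1    # e.g. libexample.so.1
  local bits=$2      # 64 for x86_64 ELF files
  local base version
  base="${soname%.so.*}.so"        # libexample.so
  version="${soname#"$base".}"     # 1
  printf '%s=%s-%s\n' "$base" "$version" "$bits"
}

soname_to_dep libexample.so.1 64
soname_to_dep libexample.so.2 64
```

The two calls show why a soname bump matters for dependency resolution: the second string no longer satisfies a dependency on the first.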

If the libexample package is then updated while accidentally including a soname bump to libexample.so.2, pacman will prevent this package from being upgraded on a user's system, because it can no longer provide libexample.so.1, which is required by its consumers (i.e. dummy-package). This only helps against immediate breakage on already installed systems. On systems that are about to be installed it would lead to pacman not being able to resolve the dependencies and bailing out. It is therefore to be considered a stop-gap solution which allows for fixing the package(s) in question, while not immediately breaking consumers of libexample.

In the future this feature will be directly built into makepkg, removing the manual process of identifying shared libraries (and their sonames) which are provided by packages.

Debug packages

The ability to debug software using e.g. gdb is very powerful, as it allows users to provide vital information about failing software to the packagers and upstream projects. For this to work, a package's debug symbols need to be provided to the debugger. In February 2022 Arch Linux started using debug packages and debuginfod, which allow just that.

Creating a package and additionally also building its debug symbols has now become as easy as adding debug to the options array in a PKGBUILD (until this option eventually is added to the default for packagers of the distribution).

Creating users

Historically, system users and groups for packages have been created using .install scripts (see PKGBUILD#install). This had the downside of requiring a specific user identifier (UID) and/or group identifier (GID) (see UID/ GID database for specific assignments) if file ownerships also needed to be handled in the context of a package. Additionally, the user and group creation was not standardized and required a script (run as root), which was only run after creating and installing the package and therefore not easily testable.

With the adoption of systemd and specifically sysusers.d the workflow has changed to installing a single file in the context of a package to the vendor location /usr/lib/sysusers.d/. Based on the systemd alpm-hooks setup the configuration is applied using systemd-sysusers.

Changing files after package installation

Similar to how system users and groups have been created in the past, file modifications (e.g. ownership, extended file attributes or setuid) have been done using .install scripts or directly in PKGBUILDs. The problem with this approach was that it required the specific assignment and pinning of UIDs and GIDs when creating the required users and groups, before doing the file modifications (e.g. using chown).

This task has been made less complex with tmpfiles.d, which allows for packaging a single file in a package to the vendor location /usr/lib/tmpfiles.d/. Due to the ordering of alpm-hooks, users and groups are created first and only afterwards the file configuration is applied using systemd-tmpfiles, allowing for diverse scenarios.
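Both mechanisms boil down to installing one small file each. The sketch below simulates the packaging side in temporary directories so that it can be tried without makepkg. The user name "dummy", the state directory and the file names are made up.

```shell
# Sketch: shipping a sysusers.d and a tmpfiles.d file in a package.
# srcdir and pkgdir are temporary stand-ins for what makepkg would provide.
srcdir=$(mktemp -d)
pkgdir=$(mktemp -d)
pkgname=dummy-package

# sysusers.d fields: type, name, ID, description, home directory.
# "-" as ID lets systemd-sysusers pick a free system UID/GID.
cat > "$srcdir/$pkgname.sysusers" <<'EOF'
u dummy - "Dummy service user" /var/lib/dummy
EOF

# tmpfiles.d fields: type, path, mode, user, group, age.
# "d" creates the directory if needed and applies mode and ownership.
cat > "$srcdir/$pkgname.tmpfiles" <<'EOF'
d /var/lib/dummy 0750 dummy dummy -
EOF

# What the package() function of a PKGBUILD would do with both files.
package() {
  install -Dm644 "$srcdir/$pkgname.sysusers" "$pkgdir/usr/lib/sysusers.d/$pkgname.conf"
  install -Dm644 "$srcdir/$pkgname.tmpfiles" "$pkgdir/usr/lib/tmpfiles.d/$pkgname.conf"
}

package
(cd "$pkgdir" && find . -type f | sort)
```

As the directory is owned by a user referenced by name, no UID or GID has to be pinned in the package: the user exists by the time the tmpfiles.d configuration is applied.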

Packaging

Working on packages for software written in different languages (e.g. PHP, Python, Ruby, C, C++ or Rust) using various build systems surely makes for a very interesting GitHub profile eventually (due to providing issue reports and fixes to many projects).

You can find the list of my packages amongst the official repositories. Moreover I currently maintain two unofficial user repositories: [realtime] and [pro-audio-legacy]

Packaging can be a fun but also a very time-consuming and frustrating pastime. As such there are many more examples and specifics I could list (but this article is already quite dense, I am afraid).

At any rate, I hope I could spark your curiosity! If you are interested in finding out more about packaging for specific languages or best practices, the following are some good starting points: