Packaging for Arch Linux
In "Arch, a recap" I elaborated a bit on my reasons for getting involved with Arch Linux. In this post I would like to highlight a few technical details and give a look behind the scenes of packaging on and for Arch Linux. This post is written from the viewpoint of a distribution packager, but it is likely to contain information that is also useful to people packaging on different distributions or for private purposes.
Arch Linux is a Linux distribution that offers binary packages in software repositories (a.k.a. repos). To achieve this, packages are built from source files using tooling that is developed by the distribution and various volunteers. The resulting binary packages are then provided to users on mirrors of the distribution (i.e. package files and their cryptographic signatures are provided by web servers) and are downloaded, verified, validated and installed using a package manager.
The upsides of a central software package system that provides binary repos are:
users do not have to build the software on their systems themselves, which e.g. for web browsers can take a very long time and eat a lot of energy
software for the entire system can be updated with one command and only takes as long as download and extraction of a given set of packages
Packages
When looking at the concept of binary software packages it probably helps to consider the point of view from e.g. Windows and macOS, which both provide software to users in different ways and give a good case for comparison. In case you already know how binary packages function and compare, skip this section.
For brevity I will skip the proprietary app stores in the below examples as they abstract the concept of software installation to the point where this is opaque to the user and delivers no direct comparison in the context of packages (while under the hood most app stores use the below mentioned technologies).
Windows
On Windows, software is usually provided by means of an installer (e.g. shipped as a .exe or .msi file). An installer usually needs to be downloaded from third-party websites (often without verification) and then executed one-by-one. The installer often already contains the (prebuilt) files to be installed (sometimes files are also downloaded by the installer application on-the-fly), offers some form of customization (e.g. the installation location), installs the bundled or downloaded files and modifies the system's registry (e.g. for auto-start or other features). Although Microsoft has attempted to consolidate its installation backends, the user experience is usually still a mixed bag.
System updates (those modifying the operating system) are handled by the OS itself and the user usually has little say in when or how that happens (this can be modified to some extent). Additionally, some hardware may not be able to run the latest version of Windows due to software-based planned obsolescence.
macOS
On macOS, software can be installed using images or installers (shipped as e.g. .dmg and .pkg respectively). The download of the files in question usually functions in the same way as it does on Windows (unverified downloads from third-party websites). With images the user experience is usually "drag and drop" from a mounted image to the list of applications, whereas installers offer functionality similar to installers on Windows (e.g. setting up auto-start).
System updates, similar to those on Windows, are handled by macOS itself, and here too software-enforced planned obsolescence is a thing.
Looking at the above examples it becomes clear, that automation on both platforms is quite terrible: The distinction between OS updates and "other software" leads to a mix and match approach towards updates, that is (if at all) only partly remedied by externally developed and provided package managers for some of the "other software" (e.g. Homebrew or Chocolatey), but at best remains a fragmented experience for the user.
Linux
On distributions that offer binary package repositories, users use a package manager to install packages and to upgrade all software [1] on their system. Packages are essentially archive files, that are downloaded, verified and extracted by the package manager. As the files contained in (distribution) packages follow a well-defined location schema (e.g. filesystem hierarchy standard or file-hierarchy), the system can check for file conflicts and users can usually have reasonable assumptions about where files of a package are located (package managers usually also track the files of all packages). Additional functionality, such as post install scripts (e.g. to create users or to change ownership on files) are usually contained in package files and executed after installation. However, on systemd based distributions, much of the post installation tasks have been streamlined with the help of sysusers.d and tmpfiles.d (more on that later). Some distributions also make use of non-standardized hooks (see alpm-hooks for how this is implemented for pacman), that are used by the package manager for certain tasks on files that are not owned by one specific package (e.g. update font cache).
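To make the hook mechanism a little more concrete, here is a minimal sketch of how a package could ship an alpm hook that refreshes the font cache after a transaction. The hook file name and the temporary staging directory are hypothetical; the [Trigger]/[Action] layout follows the format described in alpm-hooks, and packages place their hooks in /usr/share/libalpm/hooks/.

```bash
#!/usr/bin/env bash
# Sketch: stage a (hypothetical) alpm hook the way a package() function might.
# In a real PKGBUILD makepkg sets "$pkgdir"; here a temporary directory stands in.
pkgdir="${pkgdir:-$(mktemp -d)}"
hook="$pkgdir/usr/share/libalpm/hooks/40-dummy-fonts.hook"

# Write the hook definition from the here-document into the staging directory.
install -Dm644 /dev/stdin "$hook" <<'EOF'
[Trigger]
Type = Path
Operation = Install
Operation = Upgrade
Operation = Remove
Target = usr/share/fonts/*

[Action]
Description = Updating font cache...
When = PostTransaction
Exec = /usr/bin/fc-cache -s
EOF
```

The hook is triggered by paths rather than by package names, which is why it can act on files that no single package owns.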
Build tooling
The most basic build tooling for Arch Linux - makepkg - is bundled with Pacman (the package manager used to install all software packages on the distribution). It is used in conjunction with a PKGBUILD, which as a package source file describes where/ how to get a package's source files (and in which version), how to build and test it (if applicable) and how/ where to install it.
In case you have experience with Bash: Both makepkg and PKGBUILDs are written in it.
When building packages with plain makepkg, the package will be built in the context of the user's system and as such will make use of the software available on that system. While this works (given all dependencies are met), it is not recommended, as the user's system may use custom packages or custom settings for makepkg (see makepkg.conf) that can alter the outcome of the build, which may make the package unusable on another system.
Clean chroots
To enable builds, that are done in a clean environment (i.e. one that only has official distribution packages installed and does not depend on configuration or custom packages on a local system), Arch Linux and various contributors have created special build tooling, which is contained in the devtools project.
With the help of makechrootpkg one can run makepkg in a systemd-nspawn based chroot, which will only have the packages installed that are required for building, testing and running a given package.
Using makechrootpkg and its various repository-specific symlinks is how Arch Linux packagers build all packages in the official repositories.
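As a rough sketch of what such a clean chroot build looks like when done by hand: the chroot location ($HOME/chroot) and the small run() helper, which only prints commands when DRY_RUN=1 is set, are assumptions made for illustration. mkarchroot and makechrootpkg are part of devtools.

```bash
#!/usr/bin/env bash
# Sketch: build the PKGBUILD in the current directory inside a clean chroot.
# CHROOT is an assumed location; adjust it to taste.
CHROOT="${CHROOT:-$HOME/chroot}"

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_in_clean_chroot() {
    # Create the pristine root copy once, containing only base-devel.
    if [[ ! -d "$CHROOT/root" ]]; then
        run mkdir -p "$CHROOT"
        run mkarchroot "$CHROOT/root" base-devel
    fi
    # -c: start from a clean working copy, -r: base directory of the chroot.
    run makechrootpkg -c -r "$CHROOT"
}
```

Calling DRY_RUN=1 build_in_clean_chroot shows the commands without touching the system. The repository-specific wrappers (e.g. extra-x86_64-build) take care of these steps automatically.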
An implicit upside of using makechrootpkg to build packages is that checkpkg and namcap are run on the resulting package, which can give valuable hints at possible improvements of the package.
Building packages
First off: There are sometimes a lot of subtleties involved with packaging and especially producing packages that are of good quality. In the following sections I will discuss a few tools and packaging specifics, that may seem quite overwhelming or complicated at first. Luckily, a lot of the tooling is fairly well documented and it is probably always good to remember, that everyone is a learner and that as the tooling and the best practices evolve, this is an open-ended topic.
A good starting point is always to use makechrootpkg and to adhere to the Arch package guidelines.
Getting package build sources
The act of getting the sources for a binary package is described in the context of the Arch Build System (ABS). While users without write access to the Arch Linux source repositories can rely upon asp to get to the package build sources, the official packagers rely on a rather organically grown packaging workflow, that is described in HOWTO be a packager.
At the time of writing, Arch Linux still relies on two monolithic svn repositories for the package build sources (one for the [core] and [extra] repositories and one for the [community] and [multilib] repositories), which are exported to git via git-svn on the official Arch Linux GitHub organization (svntogit-packages and svntogit-community, respectively).
PKGBUILDs
As mentioned earlier, PKGBUILD files are really just Bash scripts, that are being evaluated by makepkg. As such they define a few variables and functions (some of which are required, others only being optional).
The example below shows a bare minimum PKGBUILD, derived from the prototype files that can be found in /usr/share/pacman/:
    # Maintainer: Your Name <youremail@domain.com>

    pkgname=dummy-package
    pkgver=0.1.0
    pkgrel=1
    pkgdesc="A dummy package"
    arch=(any)
    url="https://my-upstream.link/to/dummy-package"
    license=(GPL3)
    depends=(another-package)
    optdepends=('some-additional: for additional feature X')
    source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
    b2sums=('THISISADUMMYCHECKSUM')

    package() {
      make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
    }
To go through the essentials of this very minimalistic example, which assumes that we have a project using make to install a few files:
While the Maintainer comment is technically not required, it is always helpful for others trying to contact the author of a given package build source.
pkgname: The name of the package. Refer to the wiki section PKGBUILD#pkgname for further info (e.g. restrictions).
pkgver: The (upstream) version of the package. Refer to the wiki section PKGBUILD#pkgver for further info (e.g. restrictions).
pkgrel: The release version of the package, which identifies the build of the particular package in version pkgver. This is a string specific to Arch Linux (see PKGBUILD#pkgrel) and is not related to the upstream version of the software.
pkgdesc: A short description of what this package provides.
arch: The architecture of the resulting package. As this is an array, it can contain several entries (makepkg builds for the architecture configured as CARCH in makepkg.conf and refuses to build if that architecture is not listed). At the time of writing Arch Linux only supports the x86_64 and any architectures.
url: The URL of the upstream project (e.g. a website or a link to the version control sources).
license: The licenses that apply to the project. This again is an array and may contain several licenses. In case licenses that are not covered by the licenses package are encountered, their license files must be installed in the package() function (refer to the wiki section PKGBUILD#licenses for further information).
depends: An array of runtime dependencies for the package. They will be installed automatically during build when building with makechrootpkg or makepkg -s.
optdepends: An array of optional dependencies and a short description of their purpose. These packages will not be installed during build time (for this makedepends needs to be used).
source: An array of resources for makepkg to retrieve. As makepkg is able to handle various version control systems, local and remote files, as well as to rename files, it is advisable to read the relevant man page section for makepkg.
b2sums: An array of checksums for all resources in the source array. It is advisable to use either (or all of) sha256sums, sha512sums or b2sums, as older hashing mechanisms are by now unsafe (see PKGBUILD#Integrity). The checksums are used to guard against changing (and potentially malicious) upstream resources. The resources and checksums for a new version of a given package may be retrieved and updated using updpkgsums (contained in the pacman-contrib package).
package(): This function defines all steps necessary to install the files of the upstream project to an empty location (represented by the magic variable "$pkgdir") that will contain all installable files of the package. This function is called using fakeroot, which means that to the installing processes it looks like they are being executed by root.
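Since the source array is the most flexible of these fields, here is a small sketch of the entry forms it accepts. All URLs and the patch file name are placeholders in the style of the example above, and a real package would rarely combine a tarball and a git checkout of the same project; they are only listed side by side for illustration.

```bash
# PKGBUILD fragment (sketch): common forms of source=() entries.
# All URLs and file names are placeholders, not real upstream resources.
pkgname=dummy-package
pkgver=0.1.0

source=(
    # Plain download; the local file name is taken from the URL.
    "https://my-upstream.link/to/$pkgname-$pkgver.tar.gz"
    # "name::url" renames the download, useful for generic names like v0.1.0.tar.gz.
    "$pkgname-$pkgver-src.tar.gz::https://my-upstream.link/archive/v$pkgver.tar.gz"
    # VCS source: clone via git into a directory named $pkgname and check out a tag.
    "$pkgname::git+https://my-upstream.link/to/$pkgname.git#tag=v$pkgver"
    # Local file that lives next to the PKGBUILD.
    "fix-install-path.patch"
)

# One entry per source; VCS sources typically use SKIP instead of a checksum.
b2sums=('THISISADUMMYCHECKSUM'
        'THISISADUMMYCHECKSUM'
        'SKIP'
        'THISISADUMMYCHECKSUM')
```

After changing pkgver or the sources, running updpkgsums in the directory of the PKGBUILD refreshes the checksum array.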
PGP validation
Upstream project resources (e.g. signed source tarballs or git tags/ commits) can be validated using PGP.
Technically all that is required for this is that the validpgpkeys array in the PKGBUILD contains at least one retrievable PGP key ID and that the source array contains either a .sig or .asc file valid for one of the resources, or that a git object to be checked is targeted using the ?signed identifier (see makepkg#signature_checking and PKGBUILD#USING_VCS_SOURCES).
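In PKGBUILD terms this could look roughly like the following sketch. The fingerprint is a dummy value and the URLs are placeholders; a real package would list the fingerprint published by the upstream project.

```bash
# PKGBUILD fragment (sketch): PGP validation of upstream sources.
# The fingerprint and URLs are dummies, not real upstream data.
pkgname=dummy-package
pkgver=0.1.0

# Variant 1: release tarball plus its detached signature.
source=(
    "https://my-upstream.link/to/$pkgname-$pkgver.tar.gz"
    "https://my-upstream.link/to/$pkgname-$pkgver.tar.gz.sig"
)
b2sums=('THISISADUMMYCHECKSUM'
        'SKIP')  # the signature file itself is conventionally not checksummed

# Variant 2 (alternative to variant 1): a signed git tag.
# source=("git+https://my-upstream.link/to/$pkgname.git#tag=v$pkgver?signed")
# b2sums=('SKIP')

# Full fingerprint(s) of the key(s) trusted to sign upstream releases.
validpgpkeys=('0123456789ABCDEF0123456789ABCDEF01234567')  # dummy fingerprint
```

makepkg refuses to continue if the signature does not verify against one of the listed fingerprints, so the key has to be present in the building user's keyring.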
Although it is advisable to have cryptographic signature validation (e.g. using PGP) for releases, this should only be considered under the following circumstances in regards to an upstream project:
there is a track record of signing releases with the same key ID and the project publishes the key ID to expect (e.g. on its website)
the project keeps a chain of trust between multiple and/or successive key IDs
no key that is easily shared by multiple users is used (e.g. GitHub's PGP key, which can be used by multiple users of a given GitHub project and is not handled by the users themselves)
The first point is usually easy to check up on, while the 2nd might require getting in touch with the project developers if it happens (or happened in the past) - this is the case more often than you would think and does block package updates, as a new key ID must not be trusted without investigating the cause for a missing chain of trust to prevent a potential supply chain attack!
The 3rd point practically provides a false sense of security: A PGP key signed a release of a project, but in actuality multiple members of a project may have access to this functionality. From the outside it is impossible to tell who triggered a release and signed off on it (it could easily be malicious because someone's Github account has been hacked).
Reproducibility
Arch Linux as a distribution is committed to packages becoming bit-for-bit reproducible (have a look at the overarching Reproducible Builds project for more background information on the general topic). The status of the current packages in the official repositories is tracked on https://reproducible.archlinux.org, which is backed by rebuilderd.
After building a package it can be rebuilt using makerepropkg, which may use diffoscope on the resulting package in case it is not reproducible.
As the use of makerepropkg requires the PKGBUILD used to build the initial package, it cannot be used when only a package file is available. However, for that use-case repro may be used.
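The choice between the two tools can be sketched as a small helper. The function name and the run() dry-run helper are made up for this example, and the -d flag (run diffoscope on a mismatch) is taken from my reading of the makerepropkg man page, so please double-check it against the installed version.

```bash
#!/usr/bin/env bash
# Sketch: pick a rebuild tool depending on what is available locally.

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# check_reproducible <package-file>
check_reproducible() {
    local pkgfile="$1"
    if [[ -f PKGBUILD ]]; then
        # makerepropkg (devtools) needs the PKGBUILD in the current directory;
        # -d is assumed to request a diffoscope run if the rebuild differs.
        run makerepropkg -d "$pkgfile"
    else
        # repro works from the package file alone.
        run repro "$pkgfile"
    fi
}
```

Running DRY_RUN=1 check_reproducible dummy-package-0.1.0-1-any.pkg.tar.zst prints the command that would be used in the current directory.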
Dealing with the strange
In building packages we have looked at some of the more basic use-cases. The following sub-sections will deal with more uncommon or very specific ones as well as problems at the intersection of build tooling and binary repository management.
Split packages
There are situations in which one wants to build several packages from a single PKGBUILD. Those are usually:
the documentation of the project is very large
certain features (e.g. language bindings) are not required by the main application or use-case of the project
specific functionality would require a large tree of dependencies but is not required for the main application or use-case
In all three cases this can be handled using a split package setup in which the extra functionality (as a package) is declared an optional dependency of the main package.
To create a split package, the pkgname variable of the PKGBUILD is turned into an array containing multiple package names, while the pkgbase variable (see PKGBUILD#pkgbase) should be set. Additionally, the generic package() function needs to be split up into specific functions for each package (prepare(), build() and check() are shared).
Using the example from PKGBUILDs, this is what it would look like when e.g. splitting out documentation (assuming that the upstream project provides separate install targets for the components).
    # Maintainer: Your Name <youremail@domain.com>

    pkgbase=dummy-package
    pkgname=(dummy-package dummy-package-docs)
    pkgver=0.1.0
    pkgrel=1
    pkgdesc="A dummy package"
    arch=(any)
    url="https://my-upstream.link/to/dummy-package"
    license=(GPL3)
    makedepends=(another-package)
    # $pkgbase is used for paths, as $pkgname differs per package function
    source=(https://my-upstream.link/to/$pkgbase-$pkgver.tar.gz)
    b2sums=('THISISADUMMYCHECKSUM')

    package_dummy-package() {
      depends=(another-package)
      optdepends=(
        'dummy-package-docs: for documentation'
        'some-additional: for additional feature X'
      )
      make DESTDIR="$pkgdir" install-scripts -C $pkgbase-$pkgver
    }

    package_dummy-package-docs() {
      make DESTDIR="$pkgdir" install-docs -C $pkgbase-$pkgver
    }
Binary repository management
The resulting packages of a build process can be installed on a local machine, but are often of course more useful, if more systems can install them as well. For this purpose the repository sync databases exist, which pacman uses (see libalpm_databases) to retrieve the difference between a remote package repository and a local machine's state and to figure out which packages to upgrade.
The most rudimentary actions (adding and removing packages, optionally signing a database) on a binary repository can be done using repo-add and repo-remove, which are shipped with pacman. As the tooling is very basic, it does not offer any form of state tracking (i.e. a log of actions, such as additions or removals done to a sync database by a specific user).
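A minimal sketch of managing a small local repository with these two tools follows. The function names, the repository name and directory, and the run() dry-run helper are assumptions for illustration; repo-add and repo-remove are called with the database file first, followed by package files or package names respectively.

```bash
#!/usr/bin/env bash
# Sketch: maintain a small local repository with repo-add and repo-remove.

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# add_to_repo <repo-dir> <repo-name> <package-file>...
add_to_repo() {
    local repo_dir="$1" repo_name="$2"
    shift 2
    run repo-add "$repo_dir/$repo_name.db.tar.gz" "$@"
}

# remove_from_repo <repo-dir> <repo-name> <package-name>...
remove_from_repo() {
    local repo_dir="$1" repo_name="$2"
    shift 2
    run repo-remove "$repo_dir/$repo_name.db.tar.gz" "$@"
}
```

For pacman to use such a repository, the package files are expected to sit next to the database, and pacman.conf needs a section named after the repository (e.g. [custom]) with a Server entry pointing at that directory.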
At the time of writing Arch Linux packagers make use of dbscripts for the binary repository management, which also does (a form of) state tracking by interacting with and using the two svn-based monorepos for package build sources for this purpose.
The tooling consists of a set of shell scripts (making use of repo-add and repo-remove internally) that are called by authorized users on a specific host over ssh. User authentication is therefore done using ssh, while user authorization is implemented using plain unix groups (different sets of packagers have access to [core] and [extra] vs. [community] and [multilib] - often only for historical reasons).
However, this setup is showing its age and comes with its own set of pitfalls:
changes to repositories are not externally auditable
package data is only checked rudimentarily
integrity of repository sync databases can not be guaranteed
repository sync databases can not be rebuilt to a specific state
setting the target binary repository for a package is a manual operation
due to the blocking nature of dbscripts, it is possible to brick the state of a repository if e.g. connection to the host running dbscripts is lost during the move of packages between two repositories
it is not possible to set up rebuild-specific staging repositories on the fly
many users need ssh access to a machine
That being said, work is underway with arch-repo-management to provide a more manageable and easier to configure solution that runs as a service and does not rely on multiple users having direct access to a target system. One of the project's main focuses is to be able to verify incoming package data and to fully decouple the state from the repository sync databases (to be able to rebuild them whenever needed). Going forward it should become easier to set up ephemeral staging repositories to build against and safer to move data due to more atomic repository operations, while allowing outsiders to audit each repository's history. Currently the project is still far from being usable though and there are quite a few things left to be implemented. Switching from the current setup, in which both package build sources and binary repository state are handled by one version control system, to one where these concerns are separated is a hard problem, especially when one wants to get this right. I hope that going forward we will end up with a solution that can be easily contributed to and reused outside of Arch Linux as well. I will write another post in the coming months that highlights the work on and concepts of arch-repo-management.
Distributing trust
Packagers use the makepkg.conf variables PACKAGER and GPGKEY to set the packager user ID (i.e. name and e-mail address) and the PGP key ID used for signing created packages.
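As makepkg.conf is sourced as Bash, setting these two variables in a per-user configuration file (e.g. ~/.makepkg.conf) could look like the following sketch. Name, address and key ID are placeholders.

```bash
# ~/.makepkg.conf fragment (sketch); all values below are placeholders.

# Name and e-mail address recorded in the metadata of every built package.
PACKAGER="Jane Doe <jane@example.org>"

# Key used when signing is requested, e.g. via `makepkg --sign`.
GPGKEY="0123456789ABCDEF0123456789ABCDEF01234567"
```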
Other users that wish to use packages signed by someone else need to import that other user's PGP public key using pacman-key.
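A sketch of such an import, assuming the key is available on a keyserver: the function name and the run() dry-run helper are made up for this example, and the commands need to be run as root since they modify pacman's system-wide keyring.

```bash
#!/usr/bin/env bash
# Sketch: import and locally trust another packager's public key for pacman.

# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# trust_packager_key <key-id>
trust_packager_key() {
    local keyid="$1"
    run pacman-key --recv-keys "$keyid"   # fetch the public key
    run pacman-key --finger "$keyid"      # show the fingerprint for manual comparison
    run pacman-key --lsign-key "$keyid"   # locally sign the key to mark it as trusted
}
```

The fingerprint should be compared against one obtained through an independent channel before the key is locally signed.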
Arch Linux maintains a web of trust between a set of main signing keys and all packagers and between all packagers amongst themselves (see the main keys page for an extensive overview). This setup allows for user systems to evaluate whether a given package signature done by a packager is considered trusted (see pacman.conf#PACKAGE_AND_DATABASE_SIGNATURE_CHECKING for further info). These constructs are system-wide PGP keyrings for the use with pacman and can be handled with pacman-key.
In the archlinux-keyring project the distribution trust of Arch Linux is maintained as a set of decomposed PGP public keys and the signatures on them. The custom tooling keyringctl (which uses sequoia's sq under the hood) is used to maintain (e.g. import public keys and signatures) a PGP keyring that is packaged in the archlinux-keyring package and which is automatically added and updated upon install.
At least three main signing key holders are required to uphold the web of trust. At least three valid main key signatures are required for a packager key (if it is itself still valid) to be accepted for distributing packages in the official Arch Linux repositories.
Sonames
Linux distributions mostly build C and C++ libraries and executables using dynamic linking. This implies that shared libraries usually provide a soname (e.g. libexample.so.1), which is in turn used (i.e. linked against) by one or more other libraries or executables.
If the application binary interface (ABI) of the library in question changes, its soname should be increased as well (e.g. libexample.so.2). If a package with an updated soname is released and installed without rebuilding the packages depending on it, those will fail to load the (now non-existent) libexample.so.1 shared object.
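To see which soname a library advertises and which sonames a consumer links against, the dynamic section of the ELF file can be inspected. The following sketch uses objdump from binutils; the function names are made up for this example.

```bash
#!/usr/bin/env bash
# Sketch: read sonames from the dynamic section of ELF files.

# parse_soname: extract the SONAME entry from `objdump -p` output on stdin.
parse_soname() {
    awk '$1 == "SONAME" { print $2 }'
}

# soname_of <file>: print the soname a shared library advertises.
soname_of() {
    objdump -p "$1" | parse_soname
}

# needed_by <file>: print the sonames a binary or library links against.
needed_by() {
    objdump -p "$1" | awk '$1 == "NEEDED" { print $2 }'
}
```

Comparing the output of soname_of for the old and new build of a library is a quick way to spot a soname change before a package is released.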
A common task as a packager is therefore to do rebuilds for libraries and executables when a soname change is introduced. Depending on the library introducing the soname change or the library/executable being affected by it, this is sometimes a bit of a painful and time consuming experience. While it is not unheard of that projects either forget to introduce a soname change (silently breaking consumers) or accidentally downgrade their soname, consumers are more likely to run into trouble because of not yet implementing changes introduced by the ABI change (requiring patches not yet included in a stable release).
To safeguard against cases in which soname changes went unnoticed and packages are pushed to the repositories, it is possible to make use of makepkg's builtin dependency resolution. Extending upon the example in PKGBUILDs and assuming that libexample is the package providing libexample.so:
    # Maintainer: Your Name <youremail@domain.com>

    pkgname=libexample
    pkgver=1.0.0
    pkgrel=1
    pkgdesc="A dummy library"
    arch=(x86_64)
    url="https://my-upstream.link/to/libexample"
    license=(GPL3)
    depends=(glibc)
    provides=(libexample.so)
    source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
    b2sums=('THISISADUMMYCHECKSUM')

    build() {
      make -C $pkgname-$pkgver
    }

    package() {
      make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
    }

    # Maintainer: Your Name <youremail@domain.com>

    pkgname=dummy-package
    pkgver=0.1.0
    pkgrel=1
    pkgdesc="A dummy package"
    arch=(x86_64)
    url="https://my-upstream.link/to/dummy-package"
    license=(GPL3)
    depends=(another-package libexample libexample.so)
    optdepends=('some-additional: for additional feature X')
    source=(https://my-upstream.link/to/$pkgname-$pkgver.tar.gz)
    b2sums=('THISISADUMMYCHECKSUM')

    build() {
      make -C $pkgname-$pkgver
    }

    package() {
      make DESTDIR="$pkgdir" install -C $pkgname-$pkgver
    }
If during build time libexample provided libexample.so.1, the resulting dummy-package will now depend on libexample and libexample.so=1-64, which libexample provides.
If the libexample package is then updated while accidentally including a soname bump to libexample.so.2, pacman will prevent this package from being upgraded on a user's system, because it can no longer provide libexample.so.1, which is required by its consumers (i.e. dummy-package).
This only helps against immediate breakage on already installed systems. On systems that are about to be installed it would lead to pacman not being able to resolve the dependencies and bailing out. It is therefore to be considered a stop-gap solution which allows for fixing the package(s) in question, while not immediately breaking consumers of libexample.
In the future this feature will be directly built into makepkg, removing the manual process of identifying shared libraries (and their sonames) which are provided by packages.
Debug packages
The ability to debug software using e.g. gdb is very powerful, as it allows users to provide vital information about failing software to the packagers and upstream projects. For this to work, a package's debug symbols need to be provided to the debugger. In February 2022 Arch Linux started using debug packages and debuginfod, which allow just that.
Creating a package and additionally building its debug symbols has now become as easy as adding debug to the options array in a PKGBUILD (until this option is eventually added to the defaults for packagers of the distribution).
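In a PKGBUILD this amounts to a single line, as in the following sketch that reuses the dummy package from the earlier examples (the architecture is set to x86_64 here on the assumption that the package contains compiled binaries):

```bash
# PKGBUILD fragment (sketch): request a separate package with debug symbols.
pkgname=dummy-package
pkgver=0.1.0
pkgrel=1
arch=(x86_64)

# With "debug" enabled, makepkg additionally creates a dummy-package-debug
# package that contains the symbols stripped from the binaries.
options=(debug)
```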
Creating users
Historically, system users and groups for packages have been created using .install scripts (see PKGBUILD#install). This had the downside of requiring a specific user identifier (UID) and/or group identifier (GID) (see UID/GID database for specific assignments) if file ownerships also needed to be handled in the context of a package.
Additionally, the user and group creation was not standardized and required a script (run as root), which was only run after creating and installing the package and was therefore not easily testable.
With the adoption of systemd and specifically sysusers.d the workflow has changed to installing a single file in the context of a package to the vendor location /usr/lib/sysusers.d/. Based on the systemd alpm-hooks setup the configuration is applied using systemd-sysusers.
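A minimal sketch of such a file and how a package() function might install it is shown below. The user name dummy and its home directory are hypothetical; usually the file would be listed in the source array instead of being written from a here-document, which is only done here to keep the example in one piece.

```bash
# PKGBUILD fragment (sketch): ship a sysusers.d file instead of an .install script.
# The user name "dummy" and its home directory are hypothetical.
pkgname=dummy-package

package() {
    # "u" creates a system user (and a matching group); "-" lets systemd-sysusers
    # allocate a free UID, followed by a description and the home directory.
    install -Dm644 /dev/stdin "$pkgdir/usr/lib/sysusers.d/$pkgname.conf" <<'EOF'
u dummy - "Dummy Package daemon" /var/lib/dummy
EOF
}
```

As no UID is pinned, the same package works on systems where the next free system UID differs.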
Changing files after package installation
Similar to how system users and groups have been created in the past, file modifications (e.g. ownership, extended file attributes or setuid) have been done using .install scripts or directly in PKGBUILDs. The problem with this approach was that it required the specific assignment and pinning of UIDs and GIDs when creating the required users and groups, before doing the file modifications (e.g. using chown).
This task has been made less complex with tmpfiles.d, which allows for packaging a single file in a package to the vendor location /usr/lib/tmpfiles.d/. Due to the ordering of alpm-hooks, users and groups are created first and only afterwards the file configuration is applied using systemd-tmpfiles, allowing for diverse scenarios.
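A sketch of what this could look like for the dummy package: the paths, modes and the dummy user and group are hypothetical, and as with sysusers.d the file would normally be part of the source array rather than a here-document.

```bash
# PKGBUILD fragment (sketch): ship a tmpfiles.d file for ownership and modes.
# Paths, modes and the "dummy" user/group are hypothetical.
pkgname=dummy-package

package() {
    # "d" creates a directory with the given mode and ownership if it is missing,
    # "z" adjusts mode and ownership of an already existing path.
    install -Dm644 /dev/stdin "$pkgdir/usr/lib/tmpfiles.d/$pkgname.conf" <<'EOF'
d /var/lib/dummy 0750 dummy dummy -
z /etc/dummy/secret.conf 0640 root dummy -
EOF
}
```

Because ownership is expressed by name and resolved on the target system, no UID or GID needs to be known at build time.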
Packaging
Working on packages for software written in different languages (e.g. PHP, Python, Ruby, C, C++ or Rust) using various build systems surely makes for a very interesting Github profile eventually (due to providing issue reports and fixes to many projects).
You can find the list of my packages amongst the official repositories. Moreover I currently maintain two unofficial user repositories: [realtime] and [pro-audio-legacy]
Packaging can be a fun but also a very time consuming and frustrating pastime. As such there are many many more examples and specifics I could list (but this article is already quite dense I am afraid).
At any rate, I hope I could spark your curiosity! If you are interested in finding out more about packaging for specific languages or best practices, the following are some good starting points: