Blogging « codeblog

October 26, 2023

Enable MTE on Pixel 8

Filed under: Blogging,Kernel,Security — kees @ 12:19 pm

The Pixel 8 hardware (Tensor G3) supports the ARM Memory Tagging Extension (MTE), and software support is available both in Android userspace and the Linux kernel. This feature is a powerful defense against linear buffer overflows and many types of use-after-free flaws. I’m extremely happy to see this hardware finally available in the real world.

Turning it on for userspace is already wired up the Android UI: Settings / System / Developer options / Memory Tagging Extension / Enable MTE until you turn if off. Once enabled it will internally change an Android “system property” named “arm64.memtag.bootctl” by adding the option “memtag“.

Turning it on for the kernel is slightly more involved, but not difficult at all. This requires manually setting the “arm64.memtag.bootctl” property mentioned above to include “memtag-kernel” as well:

Plug your phone into a system that can run the adb tool
If not already installed, install adb. For example on Debian/Ubuntu: sudo apt install adb
Turn on “USB Debugging” in the phone’s “Developer options” menu, and accept the debugging session confirmation that will pop up when you first run adb
Verify the current setting: adb shell getprop | grep memtag.bootctl

[arm64.memtag.bootctl]: [memtag]

Enable kernel MTE: adb shell setprop arm64.memtag.bootctl memtag,memtag-kernel
Check the results: adb shell getprop | grep memtag.bootctl

[arm64.memtag.bootctl]: [memtag,memtag-kernel]

Reboot your phone

To check that MTE is enabled for the kernel (which is implemented using Kernel Address Sanitizer’s Hardware Tagging mode), you can check the kernel command line after rebooting:

$ mkdir foo && cd foo
$ adb bugreport
...
$ mkdir unpacked && cd unpacked
$ unzip ../bugreport*.zip
...
$ grep kasan= bugreport*.txt
...: Command line: ... kasan=off ... kasan=on ...

The latter “kasan=on” overrides the earlier “kasan=off“.

Enjoy!

Comments Off

June 24, 2022

finding binary differences

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:11 pm

As part of the continuing work to replace 1-element arrays in the Linux kernel, it’s very handy to show that a source change has had no executable code difference. For example, if you started with this:

struct foo {
    unsigned long flags;
    u32 length;
    u32 data[1];
};

void foo_init(int count)
{
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
};

And you changed only the struct definition:

-    u32 data[1];
+    u32 data[];

The bytes calculation is going to be incorrect, since it is still subtracting 1 element’s worth of space from the desired count. (And let’s ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.)

The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it’s much less obvious how structure sizes might be woven into the code. I’ve been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the “known to disrupt code layout” options disabled, but with debug info enabled:

$ KBF="KBUILD_BUILD_TIMESTAMP=1980-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig

Then I build a stock target, saving the output in “before”. In this case, I’m examining drivers/scsi/megaraid/:

$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/

Then I patch and build a modified target, saving the output in “after”:

$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/

And then run diffoscope:

$ diffoscope $OUT/before/ $OUT/after/

If diffoscope output reports nothing, then we’re done. 🥳

Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, but with possibly shifted line numbers not reported. i.e. running objdump without --line-numbers:

$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i | sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i  | sed "0,/^Disassembly/d")
done

If I see an unexpected difference, for example:

-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)

Then I'll search for the pattern with line numbers added to the objdump output:

$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)

I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:

$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329

Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;

Though, as hinted earlier, better yet would be:

-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);

But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.

Comments Off

April 4, 2022

security things in Linux v5.10

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:01 pm

Previously: v5.9

Linux v5.10 was released in December, 2020. Here’s my summary of various security things that I found interesting:

AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it’s even harder for a VM host to reconstruct a guest VM’s state.

x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the “static branch” infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the “branch”). All these principles apply to static calls as well, but they’re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time.

network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux’s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP’s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks.

SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn’t complete yet, though, since handling setgroups() is still needed.)

improve kernel’s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they’re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn’t very good coverage for cases where the contents were coming from things that weren’t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations.

set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of “kernel address limit” attacks that only needed to corrupt a single value in struct thead_info.

sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers.

nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is “useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts.” (i.e. for when /proc/sys/fs/protected_symlinks isn’t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal as been a common attack-persistence vector.

ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for “free”, and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target’s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware.

fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up.

flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

Comments (4)

April 5, 2021

security things in Linux v5.9

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:24 pm

Previously: v5.8

Linux v5.9 was released in October, 2020. Here’s my summary of various security things that I found interesting:

seccomp user_notif file descriptor injection
Sargun Dhillon added the ability for SECCOMP_RET_USER_NOTIF filters to inject file descriptors into the target process using SECCOMP_IOCTL_NOTIF_ADDFD. This lets container managers fully emulate syscalls like open() and connect(), where an actual file descriptor is expected to be available after a successful syscall. In the process I fixed a couple bugs and refactored the file descriptor receiving code.

zero-initialize stack variables with Clang
When Alexander Potapenko landed support for Clang’s automatic variable initialization, it did so with a byte pattern designed to really stand out in kernel crashes. Now he’s added support for doing zero initialization via CONFIG_INIT_STACK_ALL_ZERO, which besides actually being faster, has a few behavior benefits as well. “Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes.” Like the pattern initialization, this feature stops entire classes of uninitialized stack variable flaws.

common syscall entry/exit routines
Thomas Gleixner created architecture-independent code to do syscall entry/exit, since much of the kernel’s work during a syscall entry and exit is the same. There was no need to repeat this in each architecture, and having it implemented separately meant bugs (or features) might only get fixed (or implemented) in a handful of architectures. It means that features like seccomp become much easier to build since it wouldn’t need per-architecture implementations any more. Presently only x86 has switched over to the common routines.

SLAB kfree() hardening
To reach CONFIG_SLAB_FREELIST_HARDENED feature-parity with the SLUB heap allocator, I added naive double-free detection and the ability to detect cross-cache freeing in the SLAB allocator. This should keep a class of type-confusion bugs from biting kernels using SLAB. (Most distro kernels use SLUB, but some smaller devices prefer the slightly more compact SLAB, so this hardening is mostly aimed at those systems.)

new CAP_CHECKPOINT_RESTORE capability
Adrian Reber added the new CAP_CHECKPOINT_RESTORE capability, splitting this functionality off of CAP_SYS_ADMIN. The needs for the kernel to correctly checkpoint and restore a process (e.g. used to move processes between containers) continues to grow, and it became clear that the security implications were lower than those of CAP_SYS_ADMIN yet distinct from other capabilities. Using this capability is now the preferred method for doing things like changing /proc/self/exe.

debugfs boot-time visibility restriction
Peter Enderborg added the debugfs boot parameter to control the visibility of the kernel’s debug filesystem. The contents of debugfs continue to be a common area of sensitive information being exposed to attackers. While this was effectively possible by unsetting CONFIG_DEBUG_FS, that wasn’t a great approach for system builders needing a single set of kernel configs (e.g. a distro kernel), so now it can be disabled at boot time.

more seccomp architecture support
Michael Karcher implemented the SuperH seccomp hooks, Guo Ren implemented the C-SKY seccomp hooks, and Max Filippov implemented the xtensa seccomp hooks. Each of these included the ever-important updates to the seccomp regression testing suite in the kernel selftests.

stack protector support for RISC-V
Guo Ren implemented -fstack-protector (and -fstack-protector-strong) support for RISC-V. This is the initial global-canary support while the patches to GCC to support per-task canaries is getting finished (similar to the per-task canaries done for arm64). This will mean nearly all stack frame write overflows are no longer useful to attackers on this architecture. It’s nice to see this finally land for RISC-V, which is quickly approaching architecture feature parity with the other major architectures in the kernel.

new tasklet API
Romain Perier and Allen Pais introduced a new tasklet API to make their use safer. Much like the timer_list refactoring work done earlier, the tasklet API is also a potential source of simple function-pointer-and-first-argument controlled exploits via linear heap overwrites. It’s a smaller attack surface since it’s used much less in the kernel, but it is the same weak design, making it a sensible thing to replace. While the use of the tasklet API is considered deprecated (replaced by threaded IRQs), it’s not always a simple mechanical refactoring, so the old API still needs refactoring (since that CAN be done mechanically is most cases).

x86 FSGSBASE implementation
Sasha Levin, Andy Lutomirski, Chang S. Bae, Andi Kleen, Tony Luck, Thomas Gleixner, and others landed the long-awaited FSGSBASE series. This provides task switching performance improvements while keeping the kernel safe from modules accidentally (or maliciously) trying to use the features directly (which exposed an unprivileged direct kernel access hole).

filter x86 MSR writes
While it’s been long understood that writing to CPU Model-Specific Registers (MSRs) from userspace was a bad idea, it has been left enabled for things like MSR_IA32_ENERGY_PERF_BIAS. Boris Petkov has decided enough is enough and has now enabled logging and kernel tainting (TAINT_CPU_OUT_OF_SPEC) by default and a way to disable MSR writes at runtime. (However, since this is controlled by a normal module parameter and the root user can just turn writes back on, I continue to recommend that people build with CONFIG_X86_MSR=n.) The expectation is that userspace MSR writes will be entirely removed in future kernels.

uninitialized_var() macro removed
I made treewide changes to remove the uninitialized_var() macro, which had been used to silence compiler warnings. The rationale for this macro was weak to begin with (“the compiler is reporting an uninitialized variable that is clearly initialized”) since it was mainly papering over compiler bugs. However, it creates a much more fragile situation in the kernel since now such uses can actually disable automatic stack variable initialization, as well as mask legitimate “unused variable” warnings. The proper solution is to just initialize variables the compiler warns about.

function pointer cast removals
Oscar Carter has started removing function pointer casts from the kernel, in an effort to allow the kernel to build with -Wcast-function-type. The future use of Control Flow Integrity checking (which does validation of function prototypes matching between the caller and the target) tends not to work well with function casts, so it’d be nice to get rid of these before CFI lands.

flexible array conversions
As part of Gustavo A. R. Silva’s on-going work to replace zero-length and one-element arrays with flexible arrays, he has documented the details of the flexible array conversions, and the various helpers to be used in kernel code. Every commit gets the kernel closer to building with -Warray-bounds, which catches a lot of potential buffer overflows at compile time.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.10.

Comments (3)

February 8, 2021

security things in Linux v5.8

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:47 pm

Previously: v5.7

Linux v5.8 was released in August, 2020. Here’s my summary of various security things that caught my attention:

arm64 Branch Target Identification
Dave Martin added support for ARMv8.5’s Branch Target Instructions (BTI), which are enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code).

With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker’s code must make direct function calls. This basically reduces the “usable” code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a “low granularity” forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It’s a good first step to strong CFI, but (as we’ve seen with things like CFG) it isn’t usually strong enough to stop a motivated attacker. “High granularity” CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang’s CFI implementation.

arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang’s Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18) and keeping a copy of the return addresses stored in a separate “shadow stack”. In this way, manipulating the regular stack’s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)

It’s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18) staying secret, since the memory could be written to directly. Intel’s hardware ROP defense (CET) uses a hardware shadow stack that isn’t directly writable. ARM’s hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense — it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.

Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation.

new capabilities
Alexey Budankov added CAP_PERFMON, which is designed to allow access to perf(). The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).

Alexei Starovoitov added CAP_BPF, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN. It is designed to be used in combination with CAP_PERFMON for tracing-like activities and CAP_NET_ADMIN for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN is still required.

network random number generator improvements
Willy Tarreau made the network code’s random number generator less predictable. This will further frustrate any attacker’s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).

fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys and /proc, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!)

RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.

execve() refactoring continues
Eric W. Biederman continued working on execve() refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script regression tests and get them into the kernel selftests.

multiple /proc instances
Alexey Gladkov modernized /proc internals and provided a way to have multiple /proc instances mounted in the same PID namespace. This allows for having multiple views of /proc, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.)

set_fs() removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel’s set_fs() interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too.

READ_IMPLIES_EXEC is no more for native 64-bit
The READ_IMPLIES_EXEC flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as “well, since we don’t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too”. It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap() allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn’t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC from the ELF PT_GNU_STACK marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn’t ever implement READ_IMPLIES_EXEC in the first place.

array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size() helper (as a cousin to struct_size()). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds.

scnprintf() replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf() with scnprintf(). Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.

That’s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.

Comments Off

October 30, 2020

combining “apt install” and “get dist-upgrade”?

Filed under: Blogging,Debian,Ubuntu,Ubuntu-Server — kees @ 12:07 pm

I frequently see a pattern in image build/refresh scripts where a set of packages is installed, and then all packages are updated:

apt update
apt install -y pkg1 pkg2 pkg2
apt dist-upgrade -y

While it’s not much, this results in redundant work. For example reading/writing package database, potentially running triggers (man-page refresh, ldconfig, etc). The internal package dependency resolution stuff isn’t actually different: “install” will also do upgrades of needed packages, etc. Combining them should be entirely possible, but I haven’t found a clean way to do this yet.

The best I’ve got so far is:

apt update
apt-cache dumpavail | dpkg --merge-avail -
(for i in pkg1 pkg2 pkg3; do echo "$i install") | dpkg --set-selections
apt-get dselect-upgrade

This gets me the effect of running “install” and “upgrade” at the same time, but not “dist-upgrade” (which has slightly different resolution logic that’d I’d prefer to use). Also, it includes the overhead of what should be an unnecessary update of dpkg’s database. Anyone know a better way to do this?

Update: Julian Andres Klode pointed out that dist-upgrade actually takes package arguments too just like install. *face palm* I didn’t even try it — I believed the man-page and the -h output. It works perfectly!

Comments Off

September 21, 2020

security things in Linux v5.7

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:32 pm

Previously: v5.6

Linux v5.7 was released at the end of May. Here’s my summary of various security things that caught my attention:

arm64 kernel pointer authentication
While the ARMv8.3 CPU “Pointer Authentication” (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP which protects the saved stack pointer, similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well.

BPF LSM
The kernel’s Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With CONFIG_BPF_LSM=y, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting).

execve() deadlock refactoring
There have been a number of long-standing races in the kernel’s process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Ernd Edlinger decided to dive in, and successfully landed the a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn’t be able to.

slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED‘s freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer’s location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up.

RISCV strict kernel memory protections
Zong Li implemented STRICT_KERNEL_RWX support for RISCV. And to validate the results, following v5.6’s generic page table dumping work, he landed the RISCV page dumping code. This means it’s much easier to examine the kernel’s page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables.

array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don’t do this:

int foo[5];
...
foo[8] = bar;

The long version gets complicated by the evolution of “flexible array” structure members, so we’ll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn’t catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to “fake” flexible array members. Before flexible arrays were standardized, GNU C supported “zero sized” array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the “header” in front of some data blob that could be addressable through the last structure member:

/* 1-element array */
struct foo {
    ...
    char contents[1];
};

/* GNU C extension: 0-element array */
struct foo {
    ...
    char contents[0];
};

/* C standard: flexible array */
struct foo {
    ...
    char contents[];
};

instance = kmalloc(sizeof(struct foo) + content_size);

Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva’s goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel’s deprecation docs.

However, that will only catch the “visible at compile time” cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:

int foo[5];
...
for (i = 0; i < some_argument; i++) {
    ...
    foo[i] = bar;
    ...
}

It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings.

Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:

/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);

The risk is that if these calls eventually walk off the end of the string buffer, it will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:

/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and availabe - wrote wraps
 * around to a giant positive value, making the write effectively 
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);

So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array and the size calculation will have wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch scnprintf() fixes.

That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

Comments Off

September 2, 2020

security things in Linux v5.6

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:22 pm

Previously: v5.5.

Linux v5.6 was released back in March. Here’s my quick summary of various features that caught my attention:

WireGuard
The widely used WireGuard VPN has been out-of-tree for a very long time. After 3 1/2 years since its initial upstream RFC, Ard Biesheuvel and Jason Donenfeld finished the work getting all the crypto prerequisites sorted out for the v5.5 kernel. For this release, Jason has gotten WireGuard itself landed. It was a twisty road, and I’m grateful to everyone involved for sticking it out and navigating the compromises and alternative solutions.

openat2() syscall and RESOLVE_* flags
Aleksa Sarai has added a number of important path resolution “scoping” options to the kernel’s open() handling, covering things like not walking above a specific point in a path hierarchy (RESOLVE_BENEATH), disabling the resolution of various “magic links” (RESOLVE_NO_MAGICLINKS) in procfs (e.g. /proc/$pid/exe) and other pseudo-filesystems, and treating a given lookup as happening relative to a different root directory (as if it were in a chroot, RESOLVE_IN_ROOT). As part of this, it became clear that there wasn’t a way to correctly extend the existing openat() syscall, so he added openat2() (which is a good example of the efforts being made to codify “Extensible Syscall” arguments). The RESOLVE_* set of flags also cover prior behaviors like RESOLVE_NO_XDEV and RESOLVE_NO_SYMLINKS.

pidfd_getfd() syscall
In the continuing growth of the much-needed pidfd APIs, Sargun Dhillon has added the pidfd_getfd() syscall which is a way to gain access to file descriptors of a process in a race-less way (or when /proc is not mounted). Before, it wasn’t always possible make sure that opening file descriptors via /proc/$pid/fd/$N was actually going to be associated with the correct PID. Much more detail about this has been written up at LWN.

openat() via io_uring
With my “attack surface reduction” hat on, I remain personally suspicious of the io_uring() family of APIs, but I can’t deny their utility for certain kinds of workloads. Being able to pipeline reads and writes without the overhead of actually making syscalls is pretty great for performance. Jens Axboe has added the IORING_OP_OPENAT command so that existing io_urings can open files to be added on the fly to the mapping of available read/write targets of a given io_uring. While LSMs are still happily able to intercept these actions, I remain wary of the growing “syscall multiplexer” that io_uring is becoming. I am, of course, glad to see that it has a comprehensive (if “out of tree”) test suite as part of liburing.

removal of blocking random pool
After making algorithmic changes to obviate separate entropy pools for random numbers, Andy Lutomirski removed the blocking random pool. This simplifies the kernel pRNG code significantly without compromising the userspace interfaces designed to fetch “cryptographically secure” random numbers. To quote Andy, “This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to.” See LWN for more details on the history and discussion of the series.

arm64 support for on-chip RNG
Mark Brown added support for the future ARMv8.5’s RNG (SYS_RNDR_EL0), which is, from the kernel’s perspective, similar to x86’s RDRAND instruction. This will provide a bootloader-independent way to add entropy to the kernel’s pRNG for early boot randomness (e.g. stack canary values, memory ASLR offsets, etc). Until folks are running on ARMv8.5 systems, they can continue to depend on the bootloader for randomness (via the UEFI RNG interface) on arm64.

arm64 E0PD
Mark Brown added support for the future ARMv8.5’s E0PD feature (TCR_E0PD1), which causes all memory accesses from userspace into kernel space to fault in constant time. This is an attempt to remove any possible timing side-channel signals when probing kernel memory layout from userspace, as an alternative way to protect against Meltdown-style attacks. The expectation is that E0PD would be used instead of the more expensive Kernel Page Table Isolation (KPTI) features on arm64.

powerpc32 VMAP_STACK
Christophe Leroy added VMAP_STACK support to powerpc32, joining x86, arm64, and s390. This helps protect against the various classes of attacks that depend on exhausting the kernel stack in order to collide with neighboring kernel stacks. (Another common target, the sensitive thread_info, had already been moved away from the bottom of the stack by Christophe Leroy in Linux v5.1.)

generic Page Table dumping
Related to RISCV’s work to add page table dumping (via /sys/fs/debug/kernel_page_tables), Steven Price extracted the existing implementations from multiple architectures and created a common page table dumping framework (and then refactored all the other architectures to use it). I’m delighted to have this because I still remember when not having a working page table dumper for ARM delayed me for a while when trying to implement upstream kernel memory protections there. Anything that makes it easier for architectures to get their kernel memory protection working correctly makes me happy.

That’s in for now; let me know if there’s anything you think I missed. Next up: Linux v5.7.

Comments (1)

May 27, 2020

security things in Linux v5.5

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 1:04 pm

Previously: v5.4.

I got a bit behind on this blog post series! Let’s get caught up. Here are a bunch of security things I found interesting in the Linux kernel v5.5 release:

restrict perf_event_open() from LSM
Given the recurring flaws in the perf subsystem, there has been a strong desire to be able to entirely disable the interface. While the kernel.perf_event_paranoid sysctl knob has existed for a while, attempts to extend its control to “block all perf_event_open() calls” have failed in the past. Distribution kernels have carried the rejected sysctl patch for many years, but now Joel Fernandes has implemented a solution that was deemed acceptable: instead of extending the sysctl, add LSM hooks so that LSMs (e.g. SELinux, Apparmor, etc) can make these choices as part of their overall system policy.

generic fast full refcount_t
Will Deacon took the recent refcount_t hardening work for both x86 and arm64 and distilled the implementations into a single architecture-agnostic C version. The result was almost as fast as the x86 assembly version, but it covered more cases (e.g. increment-from-zero), and is now available by default for all architectures. (There is no longer any Kconfig associated with refcount_t; the use of the primitive provides full coverage.)

linker script cleanup for exception tables
When Rick Edgecombe presented his work on building Execute-Only memory under a hypervisor, he noted a region of memory that the kernel was attempting to read directly (instead of execute). He rearranged things for his x86-only patch series to work around the issue. Since I’d just been working in this area, I realized the root cause of this problem was the location of the exception table (which is strictly a lookup table and is never executed) and built a fix for the issue and applied it to all architectures, since it turns out the exception tables for almost all architectures are just a data table. Hopefully this will help clear the path for more Execute-Only memory work on all architectures. In the process of this, I also updated the section fill bytes on x86 to be a trap (0xCC, int3), instead of a NOP instruction so functions would need to be targeted more precisely by attacks.

KASLR for 32-bit PowerPC
Joining many other architectures, Jason Yan added kernel text base-address offset randomization (KASLR) to 32-bit PowerPC.

seccomp for RISC-V
After a bit of long road, David Abdurachmanov has added seccomp support to the RISC-V architecture. The series uncovered some more corner cases in the seccomp self tests code, which is always nice since then we get to make it more robust for the future!

seccomp USER_NOTIF continuation
When the seccomp SECCOMP_RET_USER_NOTIF interface was added, it seemed like it would only be used in very limited conditions, so the idea of needing to handle “normal” requests didn’t seem very onerous. However, since then, it has become clear that the overhead of a monitor process needing to perform lots of “normal” open() calls on behalf of the monitored process started to look more and more slow and fragile. To deal with this, it became clear that there needed to be a way for the USER_NOTIF interface to indicate that seccomp should just continue as normal and allow the syscall without any special handling. Christian Brauner implemented SECCOMP_USER_NOTIF_FLAG_CONTINUE to get this done. It comes with a bit of a disclaimer due to the chance that monitors may use it in places where ToCToU is a risk, and for possible conflicts with SECCOMP_RET_TRACE. But overall, this is a net win for container monitoring tools.

EFI_RNG_PROTOCOL for x86
Some EFI systems provide a Random Number Generator interface, which is useful for gaining some entropy in the kernel during very early boot. The arm64 boot stub has been using this for a while now, but Dominik Brodowski has now added support for x86 to do the same. This entropy is useful for kernel subsystems performing very earlier initialization whre random numbers are needed (like randomizing aspects of the SLUB memory allocator).

FORTIFY_SOURCE for MIPS
As has been enabled on many other architectures, Dmitry Korotin got MIPS building with CONFIG_FORTIFY_SOURCE, so compile-time (and some run-time) buffer overflows during calls to the memcpy() and strcpy() families of functions will be detected.

limit copy_{to,from}_user() size to INT_MAX
As done for VFS, vsnprintf(), and strscpy(), I went ahead and limited the size of copy_to_user() and copy_from_user() calls to INT_MAX in order to catch any weird overflows in size calculations.

Other things
Alexander Popov pointed out some more v5.5 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

KASan support for vmap memory expands KASan’s features to include analysis of memory allocated with vmalloc(). More specifically, this means KASan can examine the stack again, since it can be in that region since CONFIG_VMAP_STACK was introduced.
MIPS can build with GCC plugins and KCOV now, providing those systems with the associated features like stack wiping, coverage analysis, etc.
userfaultfd requires CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK. This was done to block unprivileged users from abusing the userfaultfd feature, as it is implemented in a way that provides the ability to inject file descriptors into unsuspecting processes.

Edit: added Alexander Popov’s notes

That’s it for v5.5! Let me know if there’s anything else that I should call out here. Next up: Linux v5.6.

Comments (1)

February 18, 2020

security things in Linux v5.4

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:37 pm

Previously: v5.3.

Linux kernel v5.4 was released in late November. The holidays got the best of me, but better late than never! ;) Here are some security-related things I found interesting:

waitid() gains P_PIDFD
Christian Brauner has continued his pidfd work by adding a critical mode to waitid(): P_PIDFD. This makes it possible to reap child processes via a pidfd, and completes the interfaces needed for the bulk of programs performing process lifecycle management. (i.e. a pidfd can come from /proc or clone(), and can be waited on with waitid().)

kernel lockdown
After something on the order of 8 years, Linux can now draw a bright line between “ring 0” (kernel memory) and “uid 0” (highest privilege level in userspace). The “kernel lockdown” feature, which has been an out-of-tree patch series in most Linux distros for almost as many years, attempts to enumerate all the intentional ways (i.e. interfaces not flaws) userspace might be able to read or modify kernel memory (or execute in kernel space), and disable them. While Matthew Garrett made the internal details fine-grained controllable, the basic lockdown LSM can be set to either disabled, “integrity” (kernel memory can be read but not written), or “confidentiality” (no kernel memory reads or writes). Beyond closing the many holes between userspace and the kernel, if new interfaces are added to the kernel that might violate kernel integrity or confidentiality, now there is a place to put the access control to make everyone happy and there doesn’t need to be a rehashing of the age old fight between “but root has full kernel access” vs “not in some system configurations”.

tagged memory relaxed syscall ABI
Andrey Konovalov (with Catalin Marinas and others) introduced a way to enable a “relaxed” tagged memory syscall ABI in the kernel. This means programs running on hardware that supports memory tags (or “versioning”, or “coloring”) in the upper (non-VMA) bits of a pointer address can use these addresses with the kernel without things going crazy. This is effectively teaching the kernel to ignore these high bits in places where they make no sense (i.e. mathematical comparisons) and keeping them in place where they have meaning (i.e. pointer dereferences).

As an example, if a userspace memory allocator had returned the address 0x0f00000010000000 (VMA address 0x10000000, with, say, a “high bits” tag of 0x0f), and a program used this range during a syscall that ultimately called copy_from_user() on it, the initial range check would fail if the tag bits were left in place: “that’s not a userspace address; it is greater than TASK_SIZE (0x0000800000000000)!”, so they are stripped for that check. During the actual copy into kernel memory, the tag is left in place so that when the hardware dereferences the pointer, the pointer tag can be checked against the expected tag assigned to referenced memory region. If there is a mismatch, the hardware will trigger the memory tagging protection.

Right now programs running on Sparc M7 CPUs with ADI (Application Data Integrity) can use this for hardware tagged memory, ARMv8 CPUs can use TBI (Top Byte Ignore) for software memory tagging, and eventually there will be ARMv8.5-A CPUs with MTE (Memory Tagging Extension).

boot entropy improvement
Thomas Gleixner got fed up with poor boot-time entropy and trolled Linus into coming up with reasonable way to add entropy on modern CPUs, taking advantage of timing noise, cycle counter jitter, and perhaps even the variability of speculative execution. This means that there shouldn’t be mysterious multi-second (or multi-minute!) hangs at boot when some systems don’t have enough entropy to service getrandom() syscalls from systemd or the like.

userspace writes to swap files blocked
From the department of “how did this go unnoticed for so long?”, Darrick J. Wong fixed the kernel to not allow writes from userspace to active swap files. Without this, it was possible for a user (usually root) with write access to a swap file to modify its contents, thereby changing memory contents of a process once it got paged back in. While root normally could just use CAP_PTRACE to modify a running process directly, this was a loophole that allowed lesser-privileged users (e.g. anyone in the “disk” group) without the needed capabilities to still bypass ptrace restrictions.

limit strscpy() sizes to INT_MAX
Generally speaking, if a size variable ends up larger than INT_MAX, some calculation somewhere has overflowed. And even if not, it’s probably going to hit code somewhere nearby that won’t deal well with the result. As already done in the VFS core, and vsprintf(), I added a check to strscpy() to reject sizes larger than INT_MAX.

ld.gold support removed
Thomas Gleixner removed support for the gold linker. While this isn’t providing a direct security benefit, ld.gold has been a constant source of weird bugs. Specifically where I’ve noticed, it had been pain while developing KASLR, and has more recently been causing problems while stabilizing building the kernel with Clang. Having this linker support removed makes things much easier going forward. There are enough weird bugs to fix in Clang and ld.lld. ;)

Intel TSX disabled
Given the use of Intel’s Transactional Synchronization Extensions (TSX) CPU feature by attackers to exploit speculation flaws, Pawan Gupta disabled the feature by default on CPUs that support disabling TSX.

That’s all I have for this version. Let me know if I missed anything. :) Next up is Linux v5.5!

Comments (2)

November 20, 2019

experimenting with Clang CFI on upstream Linux

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 9:09 pm

While much of the work on kernel Control Flow Integrity (CFI) is focused on arm64 (since kernel CFI is available on Android), a significant portion is in the core kernel itself (and especially the build system). Recently I got a sane build and boot on x86 with everything enabled, and I’ve been picking through some of the remaining pieces. I figured now would be a good time to document everything I do to get a build working in case other people want to play with it and find stuff that needs fixing.

First, everything is based on Sami Tolvanen’s upstream port of Clang’s forward-edge CFI, which includes his Link Time Optimization (LTO) work, which CFI requires. This tree also includes his backward-edge CFI work on arm64 with Clang’s Shadow Call Stack (SCS).

On top of that, I’ve got a few x86-specific patches that get me far enough to boot a kernel without warnings pouring across the console. Along with that are general linker script cleanups, CFI cast fixes, and x86 crypto fixes, all in various states of getting upstreamed. The resulting tree is here.

On the compiler side, you need a very recent Clang and LLD (i.e. “Clang 10”, or what I do is build from the latest git). For example, here’s how to get started. First, checkout, configure, and build Clang (leave out “--depth=1” if you want the full git history):

# Check out latest LLVM
mkdir -p $HOME/src
cd $HOME/src
git clone --depth=1 https://github.com/llvm/llvm-project.git
mkdir -p llvm-build
cd llvm-build
# Configure
mkdir -p $HOME/bin/clang-release
cmake -G Ninja \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS='clang;lld;compiler-rt' \
      -DCMAKE_INSTALL_PREFIX="$HOME/bin/clang-release" \
      ../llvm-project/llvm
# Build!
ninja install

Then checkout, configure, and build the CFI tree. (This assumes you’ve already got a checkout of Linus’s tree.)

# Check out my branch
cd ../linux
git remote add kees https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
git fetch kees
git checkout kees/kspp/cfi/x86 -b test/cfi
# Use the above built Clang path first for the needed binaries: clang, ld.lld, and llvm-ar.
PATH="$HOME/bin/clang-release/bin:$PATH"
# Configure (this uses "defconfig" but you could use "menuconfig"), but you must
# include CC and LD in the make args or your .config won't know about Clang.
make defconfig CC=clang LD=ld.lld
# Enable LTO and CFI.
scripts/config \
     -e CONFIG_LTO \
     -e CONFIG_THINLTO \
     -d CONFIG_LTO_NONE \
     -e CONFIG_LTO_CLANG \
     -e CONFIG_CFI_CLANG \
     -e CONFIG_CFI_PERMISSIVE \
     -e CONFIG_CFI_CLANG_SHADOW
# Enable LKDTM if you want runtime fault testing:
scripts/config -e CONFIG_LKDTM
# Build!
make -j$(getconf _NPROCESSORS_ONLN) CC=clang LD=ld.lld

Do not be alarmed by various warnings, such as:

ld.lld: warning: cannot find entry symbol _start; defaulting to 0x1000
llvm-ar: error: unable to load 'arch/x86/kernel/head_64.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/head64.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/ebda.o': file too small to be an archive
llvm-ar: error: unable to load 'arch/x86/kernel/platform-quirks.o': file too small to be an archive
WARNING: EXPORT symbol "page_offset_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: EXPORT symbol "vmalloc_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: EXPORT symbol "vmemmap_base" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: "__memcat_p" [vmlinux] is a static (unknown)
no symbols

Adjust your .config as you want (but, again, make sure the CC and LD args are pointed at Clang and LLD respectively). This should(!) result in a happy bootable x86 CFI-enabled kernel. If you want to see what a CFI failure looks like, you can poke LKDTM:

# Log into the booted system as root, then:
cat <(echo CFI_FORWARD_PROTO) >/sys/kernel/debug/provoke-crash/DIRECT
dmesg

Here’s the CFI splat I see on the console:

[   16.288372] lkdtm: Performing direct entry CFI_FORWARD_PROTO
[   16.290563] lkdtm: Calling matched prototype ...
[   16.292367] lkdtm: Calling mismatched prototype ...
[   16.293696] ------------[ cut here ]------------
[   16.294581] CFI failure (target: lkdtm_increment_int$53641d38e2dc4a151b75cbe816cbb86b.cfi_jt+0x0/0x10):
[   16.296288] WARNING: CPU: 3 PID: 2612 at kernel/cfi.c:29 __cfi_check_fail+0x38/0x40
...
[   16.346873] ---[ end trace 386b3874d294d2f7 ]---
[   16.347669] lkdtm: Fail: survived mismatched prototype function call!

The claim of “Fail: survived …” is due to CONFIG_CFI_PERMISSIVE=y. This allows the kernel to warn but continue with the bad call anyway. This is handy for debugging. In a production kernel that would be removed and the offending kernel thread would be killed. If you run this again with the config disabled, there will be no continuation from LKDTM. :)

Enjoy! And if you can figure out before me why there is still CFI instrumentation in the KPTI entry handler, please let me know and help us fix it. ;)

Comments (2)

November 14, 2019

security things in Linux v5.3

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:36 pm

Previously: v5.2.

Linux kernel v5.3 was released! I let this blog post get away from me, but it’s up now! :) Here are some security-related things I found interesting:

heap variable initialization
In the continuing work to remove “uninitialized” variables from the kernel, Alexander Potapenko added new “init_on_alloc” and “init_on_free” boot parameters (with associated Kconfig defaults) to perform zeroing of heap memory either at allocation time (i.e. all kmalloc()s effectively become kzalloc()s), at free time (i.e. all kfree()s effectively become kzfree()s), or both. The performance impact of the former under most workloads appears to be under 1%, if it’s measurable at all. The “init_on_free” option, however, is more costly but adds the benefit of reducing the lifetime of heap contents after they have been freed (which might be useful for some use-after-free attacks or side-channel attacks). Everyone should enable CONFIG_INIT_ON_ALLOC_DEFAULT_ON=1 (or boot with “init_on_alloc=1“), and the more paranoid system builders should add CONFIG_INIT_ON_FREE_DEFAULT_ON=1 (or “init_on_free=1” at boot). As workloads are found that cause performance concerns, tweaks to the initialization coverage can be added.

pidfd_open() added
Christian Brauner has continued his pidfd work by creating the next needed syscall: pidfd_open(), which takes a pid and returns a pidfd. This is useful for cases where process creation isn’t yet using CLONE_PIDFD, and where /proc may not be mounted.

-Wimplicit-fallthrough enabled globally
Gustavo A.R. Silva landed the last handful of implicit fallthrough fixes left in the kernel, which allows for -Wimplicit-fallthrough to be globally enabled for all kernel builds. This will keep any new instances of this bad code pattern from entering the kernel again. With several hundred implicit fallthroughs identified and fixed, something like 1 in 10 were missing breaks, which is way higher than I was expecting, making this work even more well justified.

x86 CR4 & CR0 pinning
In recent exploits, one of the steps for making the attacker’s life easier is to disable CPU protections like Supervisor Mode Access (and Execute) Prevention (SMAP and SMEP) by finding a way to write to CPU control registers to disable these features. For example, CR4 controls SMAP and SMEP, where disabling those would let an attacker access and execute userspace memory from kernel code again, opening up the attack to much greater flexibility. CR0 controls Write Protect (WP), which when disabled would allow an attacker to write to read-only memory like the kernel code itself. Attacks have been using the kernel’s CR4 and CR0 writing functions to make these changes (since it’s easier to gain that level of execute control), but now the kernel will attempt to “pin” sensitive bits in CR4 and CR0 to avoid them getting disabled. This forces attacks to do more work to enact such register changes going forward. (I’d like to see KVM enforce this too, which would actually protect guest kernels from all attempts to change protected register bits.)

additional kfree() sanity checking
In order to avoid corrupted pointers doing crazy things when they’re freed (as seen in recent exploits), I added additional sanity checks to verify kmem cache membership and to make sure that objects actually belong to the kernel slab heap. As a reminder, everyone should be building with CONFIG_SLAB_FREELIST_HARDENED=1.

KASLR enabled by default on arm64
Just as Kernel Address Space Layout Randomization (KASLR) was enabled by default on x86, now KASLR has been enabled by default on arm64 too. It’s worth noting, though, that in order to benefit from this setting, the bootloader used for such arm64 systems needs to either support the UEFI RNG function or provide entropy via the “/chosen/kaslr-seed” Device Tree property.

hardware security embargo documentation
As there continues to be a long tail of hardware flaws that need to be reported to the Linux kernel community under embargo, a well-defined process has been documented. This will let vendors unfamiliar with how to handle things follow the established best practices for interacting with the Linux kernel community in a way that lets mitigations get developed before embargoes are lifted. The latest (and HTML rendered) version of this process should always be available here.

Those are the things I had on my radar. Please let me know if there are other things I should add! Linux v5.4 is almost here…

Comments (6)

July 17, 2019

security things in Linux v5.2

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 5:07 pm

Previously: v5.1.

Linux kernel v5.2 was released last week! Here are some security-related things I found interesting:

page allocator freelist randomization
While the SLUB and SLAB allocator freelists have been randomized for a while now, the overarching page allocator itself wasn’t. This meant that anything doing allocation outside of the kmem_cache/kmalloc() would have deterministic placement in memory. This is bad both for security and for some cache management cases. Dan Williams implemented this randomization under CONFIG_SHUFFLE_PAGE_ALLOCATOR now, which provides additional uncertainty to memory layouts, though at a rather low granularity of 4MB (see SHUFFLE_ORDER). Also note that this feature needs to be enabled at boot time with page_alloc.shuffle=1 unless you have direct-mapped memory-side-cache (you can check the state at /sys/module/page_alloc/parameters/shuffle).

stack variable initialization with Clang
Alexander Potapenko added support via CONFIG_INIT_STACK_ALL for Clang’s -ftrivial-auto-var-init=pattern option that enables automatic initialization of stack variables. This provides even greater coverage than the prior GCC plugin for stack variable initialization, as Clang’s implementation also covers variables not passed by reference. (In theory, the kernel build should still warn about these instances, but even if they exist, Clang will initialize them.) Another notable difference between the GCC plugins and Clang’s implementation is that Clang initializes with a repeating 0xAA byte pattern, rather than zero. (Though this changes under certain situations, like for 32-bit pointers which are initialized with 0x000000AA.) As with the GCC plugin, the benefit is that the entire class of uninitialized stack variable flaws goes away.

Kernel Userspace Access Prevention on powerpc
Like SMAP on x86 and PAN on ARM, Michael Ellerman and Russell Currey have landed support for disallowing access to userspace without explicit markings in the kernel (KUAP) on Power9 and later PPC CPUs under CONFIG_PPC_RADIX_MMU=y (which is the default). This is the continuation of the execute protection (KUEP) in v4.10. Now if an attacker tries to trick the kernel into any kind of unexpected access from userspace (not just executing code), the kernel will fault.

Microarchitectural Data Sampling mitigations on x86
Another set of cache memory side-channel attacks came to light, and were consolidated together under the name Microarchitectural Data Sampling (MDS). MDS is weaker than other cache side-channels (less control over target address), but memory contents can still be exposed. Much like L1TF, when one’s threat model includes untrusted code running under Symmetric Multi Threading (SMT: more logical cores than physical cores), the only full mitigation is to disable hyperthreading (boot with “nosmt“). For all the other variations of the MDS family, Andi Kleen (and others) implemented various flushing mechanisms to avoid cache leakage.

unprivileged userfaultfd sysctl knob
Both FUSE and userfaultfd provide attackers with a way to stall a kernel thread in the middle of memory accesses from userspace by initiating an access on an unmapped page. While FUSE is usually behind some kind of access controls, userfaultfd hadn’t been. To avoid various heap grooming and heap spraying techniques for exploiting Use-after-Free flaws, Peter Xu added the new “vm.unprivileged_userfaultfd” sysctl knob to disallow unprivileged access to the userfaultfd syscall.

temporary mm for text poking on x86
The kernel regularly performs self-modification with things like text_poke() (during stuff like alternatives, ftrace, etc). Before, this was done with fixed mappings (“fixmap”) where a specific fixed address at the high end of memory was used to map physical pages as needed. However, this resulted in some temporal risks: other CPUs could write to the fixmap, or there might be stale TLB entries on removal that other CPUs might still be able to write through to change the target contents. Instead, Nadav Amit has created a separate memory map for kernel text writes, as if the kernel is trying to make writes to userspace. This mapping ends up staying local to the current CPU, and the poking address is randomized, unlike the old fixmap.

ongoing: implicit fall-through removal
Gustavo A. R. Silva is nearly done with marking (and fixing) all the implicit fall-through cases in the kernel. Based on the pull request from Gustavo, it looks very much like v5.3 will see -Wimplicit-fallthrough added to the global build flags and then this class of bug should stay extinct in the kernel.

CLONE_PIDFD added
Christian Brauner added the new CLONE_PIDFD flag to the clone() system call, which complements the pidfd work in v5.1 so that programs can now gain a handle for a process ID right at fork() (actually clone()) time, instead of needing to get the handle from /proc after process creation. With signals and forking now enabled, the next major piece (already in linux-next) will be adding P_PIDFD to the waitid() system call, and common process management can be done entirely with pidfd.

Other things
Alexander Popov pointed out some more v5.2 features that I missed in this blog post. I’m repeating them here, with some minor edits/clarifications. Thank you Alexander!

x86-64 IRQ/exception/debug stack now has guard pages to detect stack overflows immediately and deterministically from those contexts.
mincore(2) is made more conservative to avoid information exposures about memory cache states and similar.
x86’s objtool can now validate callers of uaccess helpers to make sure SMAP is not left disabled.
powerpc/32 now supports KASAN.
powerpc now warns if W+X pages are found on boot.

Edit: added CLONE_PIDFD notes, as reminded by Christian Brauner. :)
Edit: added Alexander Popov’s notes

That’s it for now; let me know if you think I should add anything here. We’re almost to -rc1 for v5.3!

Comments (4)

June 27, 2019

package hardening asymptote

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 3:35 pm

Forever ago I set up tooling to generate graphs representing the adoption of various hardening features in Ubuntu packaging. These were very interesting in 2006 when stack protector was making its way into the package archive. Similarly in 2008 and 2009 as FORTIFY_SOURCE and read-only relocations made their way through the archive. It took a while to really catch hold, but finally PIE-by-default started to take off in 2016 through 2018:

Graph of Ubuntu hardening feature adoption over 20 years

Around 2012 when Debian started in earnest to enable hardening features for their archive, I realized this was going to be a long road. I added the above “20 year view” for Ubuntu and then started similarly graphing hardening features in Debian packages too (the blip on PIE here was a tooling glitch, IIRC):

Graph of Debian hardening feature adoption over 10 years

Today I realized that my Ubuntu tooling broke back in January and no one noticed, including me. And really, why should anyone notice? The “near term” (weekly, monthly) graphs have been basically flat for years:

The last month of Debian hardening stats

In the long-term view the measurements have a distinctly asymptotic appearance and the graphs are maybe only good for their historical curves now. But then I wonder, what’s next? What new compiler feature adoption could be measured? I think there are still a few good candidates…

How about enabling -fstack-clash-protection (only in GCC, Clang still hasn’t implemented it).

Or how about getting serious and using forward-edge Control Flow Integrity? (Clang has -fsanitize=cfi for general purpose function prototype based enforcement, and GCC has the more limited -fvtable-verify for C++ objects.)

Where is backward-edge CFI? (Is everyone waiting for CET?)

Does anyone see something meaningful that needs adoption tracking?

Comments (1)

May 27, 2019

security things in Linux v5.1

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 8:49 pm

Previously: v5.0.

Linux kernel v5.1 has been released! Here are some security-related things that stood out to me:

introduction of pidfd
Christian Brauner landed the first portion of his work to remove pid races from the kernel: using a file descriptor to reference a process (“pidfd”). Now /proc/$pid can be opened and used as an argument for sending signals with the new pidfd_send_signal() syscall. This handle will only refer to the original process at the time the open() happened, and not to any later “reused” pid if the process dies and a new process is assigned the same pid. Using this method, it’s now possible to racelessly send signals to exactly the intended process without having to worry about pid reuse. (BTW, this commit wins the 2019 award for Most Well Documented Commit Log Justification.)

explicitly test for userspace mappings of heap memory
During Linux Conf AU 2019 Kernel Hardening BoF, Matthew Wilcox noted that there wasn’t anything in the kernel actually sanity-checking when userspace mappings were being applied to kernel heap memory (which would allow attackers to bypass the copy_{to,from}_user() infrastructure). Driver bugs or attackers able to confuse mappings wouldn’t get caught, so he added checks. To quote the commit logs: “It’s never appropriate to map a page allocated by SLAB into userspace” and “Pages which use page_type must never be mapped to userspace as it would destroy their page type”. The latter check almost immediately caught a bad case, which was quickly fixed to avoid page type corruption.

LSM stacking: shared security blobs
Casey Schaufler has landed one of the major pieces of getting multiple Linux Security Modules (LSMs) running at the same time (called “stacking”). It is now possible for LSMs to share the security-specific storage “blobs” associated with various core structures (e.g. inodes, tasks, etc) that LSMs can use for saving their state (e.g. storing which profile a given task confined under). The kernel originally gave only the single active “major” LSM (e.g. SELinux, Apprmor, etc) full control over the entire blob of storage. With “shared” security blobs, the LSM infrastructure does the allocation and management of the memory, and LSMs use an offset for reading/writing their portion of it. This unblocks the way for “medium sized” LSMs (like SARA and Landlock) to get stacked with a “major” LSM as they need to store much more state than the “minor” LSMs (e.g. Yama, LoadPin) which could already stack because they didn’t need blob storage.

SafeSetID LSM
Micah Morton added the new SafeSetID LSM, which provides a way to narrow the power associated with the CAP_SETUID capability. Normally a process with CAP_SETUID can become any user on the system, including root, which makes it a meaningless capability to hand out to non-root users in order for them to “drop privileges” to some less powerful user. There are trees of processes under Chrome OS that need to operate under different user IDs and other methods of accomplishing these transitions safely weren’t sufficient. Instead, this provides a way to create a system-wide policy for user ID transitions via setuid() (and group transitions via setgid()) when a process has the CAP_SETUID capability, making it a much more useful capability to hand out to non-root processes that need to make uid or gid transitions.

ongoing: refcount_t conversions
Elena Reshetova continued landing more refcount_t conversions in core kernel code (e.g. scheduler, futex, perf), with an additional conversion in btrfs from Anand Jain. The existing conversions, mainly when combined with syzkaller, continue to show their utility at finding bugs all over the kernel.

ongoing: implicit fall-through removal
Gustavo A. R. Silva continued to make progress on marking more implicit fall-through cases. What’s so impressive to me about this work, like refcount_t, is how many bugs it has been finding (see all the “missing break” patches). It really shows how quickly the kernel benefits from adding -Wimplicit-fallthrough to keep this class of bug from ever returning.

stack variable initialization includes scalars
The structleak gcc plugin (originally ported from PaX) had its “by reference” coverage improved to initialize scalar types as well (making “structleak” a bit of a misnomer: it now stops leaks from more than structs). Barring compiler bugs, this means that all stack variables in the kernel can be initialized before use at function entry. For variables not passed to functions by reference, the -Wuninitialized compiler flag (enabled via -Wall) already makes sure the kernel isn’t building with local-only uninitialized stack variables. And now with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled, all variables passed by reference will be initialized as well. This should eliminate most, if not all, uninitialized stack flaws with very minimal performance cost (for most workloads it is lost in the noise), though it does not have the stack data lifetime reduction benefits of GCC_PLUGIN_STACKLEAK, which wipes the stack at syscall exit. Clang has recently gained similar automatic stack initialization support, and I’d love to this feature in native gcc. To evaluate the coverage of the various stack auto-initialization features, I also wrote regression tests in lib/test_stackinit.c.

That’s it for now; please let me know if I missed anything. The v5.2 kernel development cycle is off and running already. :)

Comments Off

March 12, 2019

security things in Linux v5.0

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:04 pm

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that everywhere in the kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
...
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
...
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>

with:

8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
...
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
...
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-tree and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

SECCOMP_RET_USER_NOTIF
Tycho Andersen added the new SECCOMP_RET_USER_NOTIF return type to seccomp which allows process monitors to receive notifications over a file descriptor when a seccomp filter has been hit. This is a much more light-weight method than ptrace for performing tasks on behalf of a process. It is especially useful for performing various admin-only tasks for unprivileged containers (like mounting filesystems). The process monitor can perform the requested task (after doing some ToCToU dances to read process memory for syscall arguments), or reject the syscall.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE
Edit: added details on SECCOMP_RET_USER_NOTIF

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

Comments Off

December 24, 2018

security things in Linux v4.20

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 3:59 pm

Previously: v4.19.

Linux kernel v4.20 has been released today! Looking through the changes, here are some security-related things I found interesting:

stackleak plugin

Alexander Popov’s work to port the grsecurity STACKLEAK plugin to the upstream kernel came to fruition. While it had received Acks from x86 (and arm64) maintainers, it has been rejected a few times by Linus. With everything matching Linus’s expectations now, it and the x86 glue have landed. (The arch-specific portions for arm64 from Laura Abbott actually landed in v4.19.) The plugin tracks function calls (with a sufficiently large stack usage) to mark the maximum depth of the stack used during a syscall. With this information, at the end of a syscall, the stack can be efficiently poisoned (i.e. instead of clearing the entire stack, only the portion that was actually used during the syscall needs to be written). There are two main benefits from the stack getting wiped after every syscall. First, there are no longer “uninitialized” values left over on the stack that an attacker might be able to use in the next syscall. Next, the lifetime of any sensitive data on the stack is reduced to only being live during the syscall itself. This is mainly interesting because any information exposures or side-channel attacks from other kernel threads need to be much more carefully timed to catch the stack data before it gets wiped.

Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use “uninitialized” bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.

raise faults for kernel addresses in copy_*_user()

Jann Horn reworked x86 memory exception handling to loudly notice when copy_{to,from}_user() tries to access unmapped kernel memory. Prior this, those accesses would result in a silent error (usually visible to callers as EFAULT), making it indistinguishable from a “regular” userspace memory exception. The purpose of this is to catch cases where, for example, the unchecked __copy_to_user() is called against a kernel address. Fuzzers like syzcaller weren’t able to notice very nasty bugs because writes to kernel addresses would either corrupt memory (which may or may not get detected at a later time) or return an EFAULT that looked like things were operating normally. With this change, it’s now possible to much more easily notice missing access_ok() checks. This has already caught two other corner cases even during v4.20 in HID and Xen.

spectre v2 userspace mitigation

The support for Single Thread Indirect Branch Predictors (STIBP) has been merged. This allowed CPUs that support STIBP to effectively disable Hyper-Threading to avoid indirect branch prediction side-channels to expose information between userspace threads on the same physical CPU. Since this was a very expensive solution, this protection was made opt-in (via explicit prctl() or implicitly under seccomp()). LWN has a nice write-up of the details.

jump labels read-only after init

Ard Biesheuvel noticed that jump labels don’t need to be writable after initialization, so their data structures were made read-only. Since they point to kernel code, they might be used by attackers to manipulate the jump targets as a way to change kernel code that wasn’t intended to be changed. Better to just move everything into the read-only memory region to remove it from the possible kernel targets for attackers.

VLA removal finished

As detailed earlier for v4.17, v4.18, and v4.19, a whole bunch of people answered my call to remove Variable Length Arrays (VLAs) from the kernel. I count at least 153 commits having been added to the kernel since v4.16 to remove VLAs, with a big thanks to Gustavo A. R. Silva, Laura Abbott, Salvatore Mesoraca, Kyle Spiers, Tobin C. Harding, Stephen Kitt, Geert Uytterhoeven, Arnd Bergmann, Takashi Iwai, Suraj Jitindar Singh, Tycho Andersen, Thomas Gleixner, Stefan Wahren, Prashant Bhole, Nikolay Borisov, Nicolas Pitre, Martin Schwidefsky, Martin KaFai Lau, Lorenzo Bianconi, Himanshu Jha, Chris Wilson, Christian Lamparter, Boris Brezillon, Ard Biesheuvel, and Antoine Tenart. With all that done, “-Wvla” has been added to the top-level Makefile so we don’t get any more added back in the future.

per-task stack canaries, powerpc
For a long time, only x86 has had per-task kernel stack canaries. Other architectures would generate a single canary for the life of the boot and use it in every task. This meant that exposing a canary from one task would give an attacker everything they needed to spoof a canary in a separate attack in a different task. Christophe Leroy has solved this on powerpc now, integrating the new GCC support for the -mstack-protector-guard-reg and -mstack-protector-guard-offset options.

Given the holidays, Linus opened the merge window before v4.20 was released, letting everyone send in pull requests in the week leading up to the release. v4.21 is in the making. :) Happy New Year everyone!

Edit: clarified stackleak details, thanks to Alexander Popov. Added per-task canaries note too.

Comments (1)

October 22, 2018

security things in Linux v4.19

Filed under: Blogging,Chrome OS,Debian,Kernel,Security,Ubuntu,Ubuntu-Server — kees @ 4:17 pm

Previously: v4.18.

Linux kernel v4.19 was released today. Here are some security-related things I found interesting:

L1 Terminal Fault (L1TF)

While it seems like ages ago, the fixes for L1TF actually landed at the start of the v4.19 merge window. As with the other speculation flaw fixes, lots of people were involved, and the scope was pretty wide: bare metal machines, virtualized machines, etc. LWN has a great write-up on the L1TF flaw and the kernel’s documentation on L1TF defenses is equally detailed. I like how clean the solution is for bare-metal machines: when a page table entry should be marked invalid, instead of only changing the “Present” flag, it also inverts the address portion so even a speculative lookup ignoring the “Present” flag will land in an unmapped area.

protected regular and fifo files

Salvatore Mesoraca implemented an O_CREAT restriction in /tmp directories for FIFOs and regular files. This is similar to the existing symlink restrictions, which take effect in sticky world-writable directories (e.g. /tmp) when the opening user does not match the owner of the existing file (or directory). When a program opens a FIFO or regular file with O_CREAT and this kind of user mismatch, it is treated like it was also opened with O_EXCL: it gets rejected because there is already a file there, and the kernel wants to protect the program from writing possibly sensitive contents to a file owned by a different user. This has become a more common attack vector now that symlink and hardlink races have been eliminated.

syscall register clearing, arm64

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Similar to what was done on x86, Mark Rutland implemented a full register-clearing syscall wrapper on arm64.

Variable Length Array removals, part 3

As mentioned in part 1 and part 2, VLAs continue to be removed from the kernel. While CONFIG_THREAD_INFO_IN_TASK and CONFIG_VMAP_STACK cover most issues with stack exhaustion attacks, not all architectures have those features, so getting rid of VLAs makes sure we keep a few classes of flaws out of all kernel architectures and configurations. It’s been a long road, and it’s shaping up to be a 4-part saga with the remaining VLA removals landing in the next kernel. For v4.19, several folks continued to help grind away at the problem: Arnd Bergmann, Kyle Spiers, Laura Abbott, Martin Schwidefsky, Salvatore Mesoraca, and myself.

shift overflow helper
Jason Gunthorpe noticed that while the kernel recently gained add/sub/mul/div helpers to check for arithmetic overflow, we didn’t have anything for shift-left. He added check_shl_overflow() to round out the toolbox and Leon Romanovsky immediately put it to use to solve an overflow in RDMA.

Edit: I forgot to mention this next feature when I first posted:

trusted architecture-supported RNG initialization

The Random Number Generator in the kernel seeds its pools from many entropy sources, including any architecture-specific sources (e.g. x86’s RDRAND). Due to many people not wanting to trust the architecture-specific source due to the inability to audit its operation, entropy from those sources was not credited to RNG initialization, which wants to gather “enough” entropy before claiming to be initialized. However, because some systems don’t generate enough entropy at boot time, it was taking a while to gather enough system entropy (e.g. from interrupts) before the RNG became usable, which might block userspace from starting (e.g. systemd wants to get early entropy). To help these cases, Ted T’so introduced a toggle to trust the architecture-specific entropy completely (i.e. RNG is considered fully initialized as soon as it gets the architecture-specific entropy). To use this, the kernel can be built with CONFIG_RANDOM_TRUST_CPU=y (or booted with “random.trust_cpu=on“).

That’s it for now; thanks for reading. The merge window is open for v4.20! Wish us luck. :)

Comments Off

August 30, 2017

GRUB and LUKS

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 10:27 am

I got myself stuck yesterday with GRUB running from an ext4 /boot/grub, but with /boot inside my LUKS LVM root partition, which meant GRUB couldn’t load the initramfs and kernel.

Luckily, it turns out that GRUB does know how to mount LUKS volumes (and LVM volumes), but all the instructions I could find talk about setting this up ahead of time (“Add GRUB_ENABLE_CRYPTODISK=y to /etc/default/grub“), rather than what the correct manual GRUB commands are to get things running on a failed boot.

These are my notes on that, in case I ever need to do this again, since there was one specific gotcha with using GRUB’s cryptomount command (noted below).

Available devices were the raw disk (hd0), the /boot/grub partition (hd0,msdos1), and the LUKS volume (hd0,msdos5):

grub> ls
(hd0) (hd0,msdos1) (hd0,msdos5)

Used cryptomount to open the LUKS volume (but without ()s! It says it works if you use parens, but then you can’t use the resulting (crypto0)):

grub> insmod luks
grub> cryptomount hd0,msdos5
Enter password...
Slot 0 opened.

Then you can load LVM and it’ll see inside the LUKS volume:

grub> insmod lvm
grub> ls
(crypto0) (hd0) (hd0,msdos1) (hd0,msdos5) (lvm/rootvg-rootlv)

And then I could boot normally:

grub> configfile $prefix/grub.cfg

After booting, I added GRUB_ENABLE_CRYPTODISK=y to /etc/default/grub and ran update-grub. I could boot normally after that, though I’d be prompted twice for the LUKS passphrase (once by GRUB, then again by the initramfs).

To avoid this, it’s possible to add a second LUKS passphrase, contained in a file in the initramfs, as described here and works for Ubuntu and Debian too. The quick summary is:

Create the keyfile and add it to LUKS:

# dd bs=512 count=4 if=/dev/urandom of=/crypto_keyfile.bin
# chmod 0400 /crypto_keyfile.bin
# cryptsetup luksAddKey /dev/sda5 /crypto_keyfile.bin
*enter original password*

Adjust the /etc/crypttab to include passing the file via /bin/cat:

sda5_crypt UUID=4aa5da72-8da6-11e7-8ac9-001cc008534d /crypto_keyfile.bin luks,keyscript=/bin/cat

Add an initramfs hook to copy the key file into the initramfs, keep non-root users from being able to read your initramfs, and trigger a rebuild:

# cat > /etc/initramfs-tools/hooks/crypto_keyfile <<EOF
#!/bin/bash
if [ "$1" = "prereqs" ] ; then
    cp /crypto_keyfile.bin "${DESTDIR}"
fi
EOF
# chmod a+x /etc/initramfs-tools/hooks/crypto_keyfile
# chmod 0700 /boot
# update-initramfs -u

This has the downside of leaving a LUKS passphrase “in the clear” while you’re booted, but if someone has root, they can just get your dm-crypt encryption key directly anyway:

# dmsetup table --showkeys sda5_crypt
0 155797496 crypt aes-cbc-essiv:sha256 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 0 8:5 2056

And of course if you’re worried about Evil Maid attacks, you’ll need a real static root of trust instead of doing full disk encryption passphrase prompting from an unverified /boot partition. :)

Comments (6)

July 27, 2015

3D printing Poe

Filed under: Blogging,Debian,Security,Ubuntu — kees @ 3:08 pm

I helped print this statue of Edgar Allan Poe, through “We the Builders“, who coordinate large-scale crowd-sourced 3D print jobs:

Poe's Face

You can see one of my parts here on top, with “-Kees” on the piece with the funky hair strand:

Poe's Hair

The MakerWare I run on Ubuntu works well. I wish they were correctly signing their repositories. Even if I use non-SSL to fetch their key, as their Ubuntu/Debian instructions recommend, it still doesn’t match the packages:

W: GPG error: http://downloads.makerbot.com trusty Release: The following signatures were invalid: BADSIG 3D019B838FB1487F MakerBot Industries dev team <dev@makerbot.com>

And it’s not just my APT configuration:

$ wget http://downloads.makerbot.com/makerware/ubuntu/dists/trusty/Release.gpg
$ wget http://downloads.makerbot.com/makerware/ubuntu/dists/trusty/Release
$ gpg --verify Release.gpg Release
gpg: Signature made Wed 11 Mar 2015 12:43:07 PM PDT using RSA key ID 8FB1487F
gpg: requesting key 8FB1487F from hkp server pgp.mit.edu
gpg: key 8FB1487F: public key "MakerBot Industries LLC (Software development team) <dev@makerbot.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
gpg: BAD signature from "MakerBot Industries LLC (Software development team) <dev@makerbot.com>"
$ grep ^Date Release
Date: Tue, 09 Jun 2015 19:41:02 UTC

Looks like they’re updating their Release file without updating the signature file. (The signature is from March, but the Release file is from June. Oops!)

Comments Off

January 13, 2015

barcode consolidation

Filed under: Blogging,Debian,General,Security,Ubuntu — kees @ 5:33 pm

I had a mess of loyalty cards filling my wallet. It kind of looked like this:

Loyalty cards, from Flickr, joelogon

They took up too much room, and I used them infrequently. The only thing of value on them are the barcodes they carry that identify my account with whatever organization they’re tied to. Other folks have talked about doing consolidation in various ways like just scanning images of the cards and printing them all together. There was a site where you typed in card details and they generated barcodes for you, too. I didn’t want to hand my identifiers to a third party, and image scanning wasn’t flexible enough. I wanted to actually have the raw numbers, so I ended up using barcode. I didn’t use the Debian nor Ubuntu package, though, since it lacked SVG support, which was added in the latest (cough March 2013) version.

I used the Android Barcode Scanner app, and just saved all the barcodes and their encoding details to a text file, noting which was which. For example:

Albertsons "035576322436","UPC_A"
Multnomah County Library "01237035218482","CODABAR"
Supportland "!0000005341632030145420","CODE_128"
...

I measured the barcode area, since some scanners can’t handle their expected barcodes being resized, (that’s another project: find out which CAN handle it), and then spat out SVG files. I compared the results to my actual cards, since some times encodings have different options (like dropping checksum characters, “-c” below):

barcode-svg -S -u in -g 1.5x0.5 -e upc-a      -b '035576322436' > albertsons.svg
barcode-svg -S -u in -g   2x0.5 -e codabar -c -b '01237035218482' > library.svg
barcode-svg -S -u cm -g 4.5x1   -e code128    -b '!0000005341632030145420' > supportland.svg
...

With Inkscape, I opened them all and organized them onto a wallet-card-sized area, printed it, and laminated it. Now my wallet is 7 cards lighter. More room for HID cards or other stuff:

Emergency Pick Card

Comments Off

December 10, 2014

Open EVSE

Filed under: Blogging,Networking,Ubuntu,Vehicles — kees @ 12:53 pm

Needing a new car, and wanting to never purchase another gas-fuelled motor, I bought a Nissan LEAF. (Trivia: LEAF appears to be a backronym for “Leading, Environmentally friendly, Affordable, Family car”.)

LEAF

This vehicle has effectively a 22kWh battery and goes maybe 80 miles on a charge, which is more than enough for running errands.

The 15-Amp (“Level 1”) EVSE (Electric Vehicle Supply Equipment) that comes with the LEAF plugs into a standard wall socket. It tells the car’s charger to draw only 12A, so it doesn’t pop the house’s circuit. This will fully charge the batteries in about 20 hours, which is not sufficient for daily use.

15A EVSE

The thing on the end that plugs into the LEAF is the J1772 connector. You’ll note it is not just an extension cord, even though it feeds the car AC power. The EVSE communicates with the vehicle to negotiate a charging current. Then the LEAF’s on-board charger takes the AC the EVSE makes available, converts it to DC, and stores it in the batteries. (My model LEAF also has a second plug for DC fast charging, which uses an entirely different connector: CHAdeMO.)

Most folks with an Electric Vehicle end up installing a “Home Charger”. (Though calling it a “charger” is a misnomer since, as mentioned, the charger is in the car.) This is actually a “Level 2” wall-mounted EVSE, which uses a dedicated high-amperage circuit. Like the Level 1, it has communication logic to do the protocol negotiation with the vehicle. However, the price of Level 2 EVSEs is crazy high (low-end $1000, high-end $3000), and that doesn’t even include the cost of getting the dedicated circuit run by an electrician.

To make matters worse, none of the standard EVSEs are Open Source, and only the high-end ones have remote access, etc. Luckily, I am hardly the first person to notice this, and since I’m so “late” to the EV world, there are several Open projects for building one’s own EVSE.

I went with OpenEVSE, since I can build it (so I understand all the components in the system), and I can change/improve the firmware (like adding a wifi board). Under Ubuntu, I was up and running after installing the arduino and avrdude packages.

Now I can charge from empty in under 8 hours, since the 30A unit charges at 24A nominal.

30A EVSE

While I was at it, I 3D printed a socket to hold the J1772 plug when it’s not in the car.

J1772 Holder

Comments Off

June 13, 2014

glibc select weakness fixed

Filed under: Blogging,Chrome OS,Debian,General,Security,Ubuntu,Ubuntu-Server — kees @ 11:21 am

In 2009, I reported this bug to glibc, describing the problem that exists when a program is using select, and has its open file descriptor resource limit raised above 1024 (FD_SETSIZE). If a network daemon starts using the FD_SET/FD_CLR glibc macros on fdset variables for descriptors larger than 1024, glibc will happily write beyond the end of the fdset variable, producing a buffer overflow condition. (This problem had existed since the introduction of the macros, so, for decades? I figured it was long over-due to have a report opened about it.)

At the time, I was told this wasn’t going to be fixed and “every program using [select] must be considered buggy.” 2 years later still more people kept asking for this feature and continued to be told “no”.

But, as it turns out, a few months later after the most recent “no”, it got silently fixed anyway, with the bug left open as “Won’t Fix”! I’m glad Florian did some house-cleaning on the glibc bug tracker, since I’d otherwise never have noticed that this protection had been added to the ever-growing list of -D_FORTIFY_SOURCE=2 protections.

I’ll still recommend everyone use poll instead of select, but now I won’t be so worried when I see requests to raise the open descriptor limit above 1024.

Comments Off

May 7, 2014

Linux Security Summit 2014

Filed under: Blogging,Chrome OS,Debian,General,Security,Ubuntu,Ubuntu-Server — kees @ 10:31 am

The Linux Security Summit is happening in Chicago August 18th and 19th, just before LinuxCon. Send us some presentation and topic proposals, and join the conversation with other like-minded people. :)

I’d love to see what people have been working on, and what they’d like to work on. Our general topics will hopefully include:

System hardening
Access control
Cryptography
Integrity control
Hardware security
Networking
Storage
Virtualization
Desktop
Tools
Management
Case studies
Emerging technologies, threats & techniques

The Call For Participation closes June 6th, so you’ve got about a month, but earlier is better.

Comments (1)

February 3, 2014

compiler hardening in Ubuntu and Debian

Filed under: Blogging,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 8:42 am

Back in 2006, the compiler in Ubuntu was patched to enable most build-time security-hardening features (relro, stack protector, fortify source). I wasn’t able to convince Debian to do the same, so Debian went the route of other distributions, adding security hardening flags during package builds only. I remain disappointed in this approach, because it means that someone who builds software without using the packaging tools on a non-Ubuntu system won’t get those hardening features. Think of a sysadmin trying the latest nginx, or a vendor like Valve building games for distribution. On Ubuntu, when you do that “./configure && make” you’ll get the features automatically.

Debian, at the time, didn’t have a good way forward even for package builds since it lacked a concept of “global package build flags”. Happily, a solution (via dh) was developed about 2 years ago, and Debian package maintainers have been working to adopt it ever since.

So, while I don’t think any distro can match Ubuntu’s method of security hardening compiler defaults, it is valuable to see the results of global package build flags in Debian on the package archive. I’ve had an on-going graph of the state of build hardening on both Ubuntu and Debian for a while, but only recently did I put together a comparison of a default install. Very few people have all the packages in the archive installed, so it’s a bit silly to only look at the archive statistics. But let’s start there, just to describe what’s being measured.

Here’s today’s snapshot of Ubuntu’s development archive for the past year (you can see development “opening” after a release every 6 months with an influx of new packages):

Here’s today’s snapshot of Debian’s unstable archive for the past year (at the start of May you can see the archive “unfreezing” after the Wheezy release; the gaps were my analysis tool failing):

Ubuntu’s lines are relatively flat because everything that can be built with hardening already is. Debian’s graph is on a slow upward trend as more packages get migrated to dh to gain knowledge of the global flags.

Each line in the graphs represents the count of source packages that contain binary packages that have at least 1 “hit” for a given category. “ELF” is just that: a source package that ultimately produces at least 1 binary package with at least 1 ELF binary in it (i.e. produces a compiled output). The “Read-only Relocations” (“relro”) hardening feature is almost always done for an ELF, excepting uncommon situations. As a result, the count of ELF and relro are close on Ubuntu. In fact, examining relro is a good indication of whether or not a source package got built with hardening of any kind. So, in Ubuntu, 91.5% of the archive is built with hardening, with Debian at 55.2%.

The “stack protector” and “fortify source” features depend on characteristics of the source itself, and may not always be present in package’s binaries even when hardening is enabled for the build (e.g. no functions got selected for stack protection, or no fortified glibc functions were used). Really these lines mostly indicate the count of packages that have a sufficiently high level of complexity that would trigger such protections.

The “PIE” and “immediate binding” (“bind_now”) features are specifically enabled by a package maintainer. PIE can have a noticeable performance impact on CPU-register-starved architectures like i386 (ia32), so it is neither patched on in Ubuntu, nor part of the default flags in Debian. (And bind_now doesn’t make much sense without PIE, so they usually go together.) It’s worth noting, however, that it probably should be the default on amd64 (x86_64), which has plenty of available registers.

Here is a comparison of default installed packages between the most recent stable releases of Ubuntu (13.10) and Debian (Wheezy). It’s clear that what the average user gets with a default fresh install is better than what the archive-to-archive comparison shows. Debian’s showing is better (74% built with hardening), though it is still clearly lagging behind Ubuntu (99%):

Comments (2)

January 27, 2014

-fstack-protector-strong

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 2:28 pm

There will be a new option in gcc 4.9 named “-fstack-protector-strong“, which offers an improved version of “-fstack-protector” without going all the way to “-fstack-protector-all“. The stack protector feature itself adds a known canary to the stack during function preamble, and checks it when the function returns. If it changed, there was a stack overflow, and the program aborts. This is fine, but figuring out when to include it is the reason behind the various options.

Since traditionally stack overflows happen with string-based manipulations, the default (-fstack-protector), only includes the canary code when a function defines an 8 (--param=ssp-buffer-size=N, N=8 by default) or more byte local character array. This means just a few functions get the checking, but they’re probably the most likely to need it, so it’s an okay balance. Various distributions ended up lowering their default --param=ssp-buffer-size option down to 4, since there were still cases of functions that should have been protected but the conservative gcc upstream default of 8 wasn’t covering them.

However, even with the increased function coverage, there are rare cases when a stack overflow happens on other kinds of stack variables. To handle this more paranoid concern, -fstack-protector-all was defined to add the canary to all functions. This results in substantial use of stack space for saving the canary on deep stack users, and measurable (though surprisingly still relatively low) performance hit due to all the saving/checking. For a long time, Chrome OS used this, since we’re paranoid. :)

In the interest of gaining back some of the lost performance and not hitting our Chrome OS build images with such a giant stack-protector hammer, Han Shen from the Chrome OS compiler team created the new option -fstack-protector-strong, which enables the canary in many more conditions:

local variable’s address used as part of the right hand side of an assignment or function argument
local variable is an array (or union containing an array), regardless of array type or length
uses register local variables

This meant we were covering all the more paranoid conditions that might lead to a stack overflow. Chrome OS has been using this option instead of -fstack-protector-all for about 10 months now.

As a quick demonstration of the options, you can see this example program under various conditions. It tries to show off an example of shoving serialized data into a non-character variable, like might happen in some network address manipulations or streaming data parsing. Since I’m using memcpy here for clarity, the builds will need to turn off FORTIFY_SOURCE, which would also notice the overflow.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct no_chars {
    unsigned int len;
    unsigned int data;
};

int main(int argc, char * argv[])
{
    struct no_chars info = { };

    if (argc < 3) {
        fprintf(stderr, "Usage: %s LENGTH DATA...\n", argv[0]);
        return 1;
    }

    info.len = atoi(argv[1]);
    memcpy(&info.data, argv[2], info.len);

    return 0;
}

Built with everything disabled, this faults trying to return to an invalid VMA:

$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fno-stack-protector /tmp/boom.c -o /tmp/boom
$ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Segmentation fault (core dumped)

Built with FORTIFY_SOURCE enabled, we see the expected catch of the overflow in memcpy:

$ gcc -Wall -O2 -D_FORTIFY_SOURCE=2 -fno-stack-protector /tmp/boom.c -o /tmp/boom
$ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** buffer overflow detected ***: /tmp/boom terminated
...

So, we’ll leave FORTIFY_SOURCE disabled for our comparisons. With pre-4.9 gcc, we can see that -fstack-protector does not get triggered to protect this function:

$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector /tmp/boom.c -o /tmp/boom
$ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Segmentation fault (core dumped)

However, using -fstack-protector-all does trigger the protection, as expected:

$ gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-all /tmp/boom.c -o /tmp/boom
$ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** stack smashing detected ***: /tmp/boom terminated
Aborted (core dumped)

And finally, using the gcc snapshot of 4.9, here is -fstack-protector-strong doing its job:

$ /usr/lib/gcc-snapshot/bin/gcc -Wall -O2 -U_FORTIFY_SOURCE -fstack-protector-strong /tmp/boom.c -o /tmp/boom
$ /tmp/boom 64 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** stack smashing detected ***: /tmp/boom terminated
Aborted (core dumped)

For Linux 3.14, I’ve added support for -fstack-protector-strong via the new CONFIG_CC_STACKPROTECTOR_STRONG option. The old CONFIG_CC_STACKPROTECTOR will be available as CONFIG_CC_STACKPROTECTOR_REGULAR. When comparing the results on builds via size and objdump -d analysis, here’s what I found with gcc 4.9:

A normal x86_64 “defconfig” build, without stack protector had a kernel text size of 11430641 bytes with 36110 function bodies. Adding CONFIG_CC_STACKPROTECTOR_REGULAR increased the kernel text size to 11468490 (a +0.33% change), with 1015 of 36110 functions stack-protected (2.81%). Using CONFIG_CC_STACKPROTECTOR_STRONG increased the kernel text size to 11692790 (+2.24%), with 7401 of 36110 functions stack-protected (20.5%). And 20% is a far-cry from 100% if support for -fstack-protector-all was added back to the kernel.

The next bit of work will be figuring out the best way to detect the version of gcc in use when doing Debian package builds, and using -fstack-protector-strong instead of -fstack-protector. For Ubuntu, it’s much simpler because it’ll just be the compiler default.

Comments (33)

December 20, 2013

DOM scraping

Filed under: Blogging,Debian,General,Ubuntu,Ubuntu-Server,Web — kees @ 11:16 pm

For a long time now I’ve used mechanize (via either Perl or Python) for doing website interaction automation. Stuff like playing web games, checking the weather, or reviewing my balance at the bank. However, as the use of javascript continues to increase, it’s getting harder and harder to screen-scrape without actually processing DOM events. To do that, really only browsers are doing the right thing, so getting attached to an actual browser DOM is generally the only way to do any kind of web interaction automation.

It seems the thing furthest along this path is Selenium. Initially, I spent some time trying to make it work with Firefox, but gave up. Instead, this seems to work nicely with Chrome via the Chrome WebDriver. And even better, all of this works out of the box on Ubuntu 13.10 via python-selenium and chromium-chromedriver.

Running /usr/lib/chromium-browser/chromedriver2_server from chromium-chromedriver starts a network listener on port 9515. This is the WebDriver API that Selenium can talk to. When requests are made, chromedriver2_server spawns Chrome, and all the interactions happen against that browser.

Since I prefer Python, I avoided the Java interfaces and focused on the Python bindings:

#!/usr/bin/env python
import sys
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

caps = webdriver.DesiredCapabilities.CHROME

browser = webdriver.Remote("http://localhost:9515", caps)

browser.get("https://bank.example.com/")
assert "My Bank" in browser.title

try:
    elem = browser.find_element_by_name("userid")
    elem.send_keys("username")

    elem = browser.find_element_by_name("password")
    elem.send_keys("wheee my password" + Keys.RETURN)
except NoSuchElementException:
    print "Could not find login elements"
    sys.exit(1)

assert "Account Balances" in browser.title

xpath = "//div[text()='Balance']/../../td[2]/div[contains(text(),'$')]"
balance = browser.find_element_by_xpath(xpath).text

print balance

browser.close()

This would work pretty great, but if you need to save any state between sessions, you’ll want to be able to change where Chrome stores data (since by default in this configuration, it uses an empty temporary directory via --user-data-dir=). Happily, various things about the browser environment can be controlled, including the command line arguments. This is configurable by expanding the “desired capabilities” variable:

caps = webdriver.DesiredCapabilities.CHROME
caps["chromeOptions"] = {
        "args": ["--user-data-dir=/home/user/somewhere/to/store/your/session"],
    }

A great thing about this is that you get to actually watch the browser do its work. However, in cases where this interaction is going to be fully automated, you likely won’t have a Xorg session running, so you’ll need to wrap the WebDriver in one (since it launches Chrome). I used Xvfb for this:

#!/bin/bash
# Start WebDriver under fake X and wait for it to be listening
xvfb-run /usr/lib/chromium-browser/chromedriver2_server &
pid=$!
while ! nc -q0 -w0 localhost 9515; do
    sleep 1
done

the-chrome-script
rc=$?

# Shut down WebDriver
kill $pid

exit $rc

Alternatively, all of this could be done in the python script too, but I figured it’s easier to keep the support infrastructure separate from the actual test script itself. I actually leave the xvfb-run call external too, so it’s easier to debug the browser in my own X session.

One bug I encountered was that the WebDriver’s cache of the browser’s DOM can sometimes get out of sync with the actual browser’s DOM. I didn’t find a solution to this, but managed to work around it. I’m hoping later versions fix this. :)

Comments (2)

December 10, 2013

live patching the kernel

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 3:40 pm

A nice set of recent posts have done a great job detailing the remaining ways that a root user can get at kernel memory. Part of this is driven by the ideas behind UEFI Secure Boot, but they come from the same goal: making sure that the root user cannot directly subvert the running kernel. My perspective on this is toward making sure that an attacker who has gained access and then gained root privileges can’t continue to elevate their access and install invisible kernel rootkits.

An outline for possible attack vectors is spelled out by Matthew Gerrett’s continuing “useful kernel lockdown” patch series. The set of attacks was examined by Tyler Borland in “Bypassing modules_disabled security”. His post describes each vector in detail, and he ultimately chooses MSR writing as the way to write kernel memory (and shows an example of how to re-enable module loading). One thing not mentioned is that many distros have MSR access as a module, and it’s rarely loaded. If modules_disabled is already set, an attacker won’t be able to load the MSR module to begin with. However, the other general-purpose vector, kexec, is still available. To prove out this method, Matthew wrote a proof-of-concept for changing kernel memory via kexec.

Chrome OS is several steps ahead here, since it has hibernation disabled, MSR writing disabled, kexec disabled, modules verified, root filesystem read-only and verified, kernel verified, and firmware verified. But since not all my machines are Chrome OS, I wanted to look at some additional protections against kexec on general-purpose distro kernels that have CONFIG_KEXEC enabled, especially those without UEFI Secure Boot and Matthew’s lockdown patch series.

My goal was to disable kexec without needing to rebuild my entire kernel. For future kernels, I have proposed adding /proc/sys/kernel/kexec_disabled, a partner to the existing modules_disabled, that will one-way toggle kexec off. For existing kernels, things got more ugly.

What options do I have for patching a running kernel?

First I looked back at what I’d done in the past with fixing vulnerabilities with systemtap. This ends up being a rather heavy-duty way to go about things, since you need all the distro kernel debug symbols, etc. It does work, but has a significant problem: since it uses kprobes, a root user can just turn off the probes, reverting the changes. So that’s not going to work.

Next I looked at ksplice. The original upstream has gone away, but there is still some work being done by Jiri Slaby. However, even with his updates which fixed various build problems, there were still more, even when building a 3.2 kernel (Ubuntu 12.04 LTS). So that’s out too, which is too bad, since ksplice does exactly what I want: modifies the running kernel’s functions via a module.

So, finally, I decided to just do it by hand, and wrote a friendly kernel rootkit. Instead of dealing with flipping page table permissions on the normally-unwritable kernel code memory, I borrowed from PaX’s KERNEXEC feature, and just turn off write protect checking on the CPU briefly to make the changes. The return values for functions on x86_64 are stored in RAX, so I just need to stuff the kexec_load syscall with “mov -1, %rax; ret” (-1 is EPERM):

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

static unsigned long long_target;
static char *target;
module_param_named(syscall, long_target, ulong, 0644);
MODULE_PARM_DESC(syscall, "Address of syscall");

/* mov $-1, %rax; ret */
unsigned const char bytes[] = { 0x48, 0xc7, 0xc0, 0xff, 0xff, 0xff, 0xff,
                                0xc3 };
unsigned char *orig;

/* Borrowed from PaX KERNEXEC */
static inline void disable_wp(void)
{
        unsigned long cr0;

        preempt_disable();
        barrier();
        cr0 = read_cr0();
        cr0 &= ~X86_CR0_WP;
        write_cr0(cr0);
}

static inline void enable_wp(void)
{
        unsigned long cr0;

        cr0 = read_cr0();
        cr0 |= X86_CR0_WP;
        write_cr0(cr0);
        barrier();
        preempt_enable_no_resched();
}

static int __init syscall_eperm_init(void)
{
        int i;
        target = (char *)long_target;

        if (target == NULL)
                return -EINVAL;

        /* save original */
        orig = kmalloc(sizeof(bytes), GFP_KERNEL);
        if (!orig)
                return -ENOMEM;
        for (i = 0; i < sizeof(bytes); i++) {
                orig[i] = target[i];
        }

        pr_info("writing %lu bytes at %p\n", sizeof(bytes), target);

        disable_wp();
        for (i = 0; i < sizeof(bytes); i++) {
                target[i] = bytes[i];
        }
        enable_wp();

        return 0;
}
module_init(syscall_eperm_init);

static void __exit syscall_eperm_exit(void)
{
        int i;

        pr_info("restoring %lu bytes at %p\n", sizeof(bytes), target);

        disable_wp();
        for (i = 0; i < sizeof(bytes); i++) {
                target[i] = orig[i];
        }
        enable_wp();

        kfree(orig);
}
module_exit(syscall_eperm_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Kees Cook <kees@outflux.net>");
MODULE_DESCRIPTION("makes target syscall always return EPERM");

If I didn’t want to leave an obvious indication that the kernel had been manipulated, the module could be changed to:

not announce what it’s doing
remove the exit route to not restore the changes on module unload
error out at the end of the init function instead of staying resident

And with this in place, it’s just a matter of loading it with the address of sys_kexec_load (found via /proc/kallsyms) before I disable module loading via modprobe. Here’s my upstart script:

# modules-disable - disable modules after rc scripts are done
#
description "disable loading modules"

start on stopped module-init-tools and stopped rc

task
script
        cd /root/modules/syscall_eperm
        make clean
        make
        insmod ./syscall_eperm.ko \
                syscall=0x$(egrep ' T sys_kexec_load$' /proc/kallsyms | cut -d" " -f1)
        modprobe disable
end script

And now I’m safe from kexec before I have a kernel that contains /proc/sys/kernel/kexec_disabled.

Comments (7)

November 26, 2013

Thanks UPS

Filed under: Blogging,Debian,Networking,Ubuntu,Ubuntu-Server — kees @ 8:25 pm

My UPS has decided that every two weeks when it performs a self-test that my 116V mains power isn’t good enough, so it drains the battery and shuts down my home network. Only took a month and a half for me to see on the network graphs that my outages were, to the minute, 2 weeks apart. :)

APC Monitoring

In theory, reducing the sensitivity will fix this…

Comments Off

August 13, 2013

TPM providing /dev/hwrng

Filed under: Blogging,Chrome OS,Debian,Security,Ubuntu,Ubuntu-Server — kees @ 9:10 am

A while ago, I added support for the TPM’s pRNG to the rng-tools package in Ubuntu. Since then, Kent Yoder added TPM support directly into the kernel’s /dev/hwrng device. This means there’s no need to carry the patch in rng-tools any more, since I can use /dev/hwrng directly now:

# modprobe tpm-rng
# echo tpm-rng >> /etc/modules
# grep -v ^# /etc/default/rng-tools
RNGDOPTIONS="--fill-watermark=90%"
# service rng-tools restart

And as before, once it’s been running a while (or you send SIGUSR1 to rngd), you can see reporting in syslog:

# pkill -USR1 rngd
# tail -n 15 /var/log/syslog
Aug 13 09:51:01 linux rngd[39114]: stats: bits received from HRNG source: 260064
Aug 13 09:51:01 linux rngd[39114]: stats: bits sent to kernel pool: 216384
Aug 13 09:51:01 linux rngd[39114]: stats: entropy added to kernel pool: 216384
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2 successes: 13
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2 failures: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Monobit: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Poker: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Runs: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Long run: 0
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS 140-2(2001-10-10) Continuous run: 0
Aug 13 09:51:01 linux rngd[39114]: stats: HRNG source speed: (min=10.433; avg=10.442; max=10.454)Kibits/s
Aug 13 09:51:01 linux rngd[39114]: stats: FIPS tests speed: (min=73.360; avg=75.504; max=86.305)Mibits/s
Aug 13 09:51:01 linux rngd[39114]: stats: Lowest ready-buffers level: 2
Aug 13 09:51:01 linux rngd[39114]: stats: Entropy starvations: 0
Aug 13 09:51:01 linux rngd[39114]: stats: Time spent starving for entropy: (min=0; avg=0.000; max=0)us

I’m pondering getting this running in Chrome OS too, but I want to make sure it doesn’t suck too much battery.

Comments (9)

Older Posts »

codeblog code is freedom — patching my itch

October 26, 2023

June 24, 2022

April 4, 2022

April 5, 2021

February 8, 2021

October 30, 2020

September 21, 2020

September 2, 2020

May 27, 2020

February 18, 2020

November 20, 2019

November 14, 2019

July 17, 2019

June 27, 2019

May 27, 2019

March 12, 2019

December 24, 2018

October 22, 2018

August 30, 2017

July 27, 2015

January 13, 2015

December 10, 2014

June 13, 2014

May 7, 2014

February 3, 2014

January 27, 2014

December 20, 2013

December 10, 2013

November 26, 2013

August 13, 2013