This is my final report :) GSoC was great and I learned a lot about the Hurd and Mach programming in general. I am also very pleased to announce that I reached my initial goal, and almost all of my patches have already made it upstream and into Debian :)
I spent my last week fixing issues I introduced and splitting up /hurd/init into two programs. This would make it possible to integrate the patch that frees PID 1 for sysvinit into the Hurd upstream sources. I didn't quite finish the separation, but my proof of concept works and I will finish this as my next Hurd project.
Looking back at the last fourteen weeks, I accomplished the following:
I implemented /proc/mounts and umount, freed up PID 1 for sysvinit, fixed ifupdown, sysvinit and initscripts on Hurd, implemented a proof-of-concept cgroupfs and fixed many small issues along the way. Almost all of my patches are already upstream and in Debian; a Debian/Hurd booting with sysvinit is just a few uploads away.
It has been a lot of fun and I will definitely see you around :)
Justus
... at least until the cgroup interface is fixed. So, what can it do?
So, what's missing?
So, what's wrong with Linux cgroup API?
Well for one thing the whole API is underspecified. Yes, there is Documentation/cgroups/cgroups.txt, but that is not a specification, that's a howto at best. Second, the notification API is not particularly nice:
To register a new notification handler you need to:
- create a file descriptor for event notification using eventfd(2);
- open a control file to be monitored (e.g. memory.usage_in_bytes);
- write "<event_fd> <control_fd> <args>" to cgroup.event_control. Interpretation of args is defined by control file implementation;
Seriously? There is a POSIX way to pass file descriptors around, and smashing the decimal representation of one into a string is not it. Linux gets away with this hack because the kernel knows which process wrote(2) that string in the first place, so it can parse the string into an integer and look it up in that process's table of file descriptors.
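For comparison, here is a minimal sketch of the descriptor-passing mechanism POSIX provides: SCM_RIGHTS ancillary data on a Unix domain socket. It is only an illustration and has nothing to do with the cgroup code itself. The helper names send_fd, recv_fd and demo are made up for this example, and both ends live in one process to keep it short.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one open file descriptor as SCM_RIGHTS ancillary data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    # A dummy payload byte carries the ancillary data along.
    sock.sendmsg([b"F"], ancillary)


def recv_fd(sock):
    """Receive one descriptor; the kernel installs it in the receiver's table."""
    fds = array.array("i")
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, payload in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            usable = len(payload) - (len(payload) % fds.itemsize)
            fds.frombytes(payload[:usable])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo():
    """Pass the read end of a pipe across a socket pair and read through it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    received = None
    try:
        send_fd(left, read_end)
        received = recv_fd(right)
        os.write(write_end, b"hello")
        data = os.read(received, 5)
        return read_end, received, data
    finally:
        for fd in (read_end, write_end, received):
            if fd is not None:
                os.close(fd)
        left.close()
        right.close()
```

The received descriptor has a different number than the one that was sent, yet it refers to the same pipe. The kernel does that translation because it sees a descriptor, not a string of digits.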
Now the trouble for cgroupfs is that it is not the kernel, and even if it were, that wouldn't solve the problem, because on Hurd there are no file descriptors (well, there are, but only to appease all the POSIX programs out there). Instead Hurd has ports, and you can send messages to ports, and that is pretty much everything you can do on a Mach system. Reading a file works roughly like this:
Ports look pretty much like file descriptors: they are (usually small) integers, and you can make them, destroy them and pass them around easily (yes, ports are first-class objects in the Mach messaging system). Everything is implemented on top of this mechanism. It is transport-agnostic: the other end could be on another machine and you wouldn't even know. You can create proxies or filters (in fact, that is exactly how the firewall eth-filter is implemented). It's beautiful and extensible at its heart, like Lego bricks.
So if X were a port to e.g. memory.usage_in_bytes and the cgroups interface were less braindead^W^Wmore carefully designed, so that on Hurd it could be transported like ports usually are, then cgroupfs could in fact use port X' to look up which file the caller is interested in (this is possible because cgroupfs was the one handing out the port in the first place) and generate notifications for that file. This is not possible when X is "serialized for transport" using sprintf, because port names are specific to each process, so X != X'. The kernel would do the translation while sending the message, but it obviously cannot do that if the number is carried in a character array.
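The X versus X' argument can be illustrated with a toy model. This is plain Python, not Mach: the classes Port and Task, the functions send_right and send_as_text, and the concrete name numbers are all invented for the illustration. The only property it borrows from the text is that every task has its own private mapping from integer names to ports.

```python
class Port:
    """Stands in for a kernel-side port, e.g. one open control file."""

    def __init__(self, label):
        self.label = label


class Task:
    """A task with its own private table of port names."""

    def __init__(self, first_name):
        self.names = {}              # local name (int) -> Port
        self.next_name = first_name  # arbitrary starting point for new names

    def insert(self, port):
        """Name `port` in this task, reusing the existing name if there is one."""
        for name, known in self.names.items():
            if known is port:
                return name
        name = self.next_name
        self.next_name += 1
        self.names[name] = port
        return name


def send_right(sender, receiver, name):
    """Kernel-mediated transfer: the name is translated on the way."""
    return receiver.insert(sender.names[name])


def send_as_text(name):
    """sprintf-style transfer: only the digits travel, nothing is translated."""
    return int("%d" % name)


def demo():
    cgroupfs = Task(first_name=40)
    client = Task(first_name=7)
    usage = Port("memory.usage_in_bytes")

    served = cgroupfs.insert(usage)            # cgroupfs hands out the port
    x = send_right(cgroupfs, client, served)   # the client knows it as X
    x_prime = send_right(client, cgroupfs, x)  # sent back as a port: X'
    x_text = send_as_text(x)                   # sent back as digits: still X

    return {
        "x": x,
        "x_prime": x_prime,
        "found_via_port": cgroupfs.names.get(x_prime),
        "found_via_text": cgroupfs.names.get(x_text),
        "usage": usage,
    }
```

In this model the port that was sent properly arrives under a name cgroupfs can resolve to the file it handed out, while the number that travelled as text means nothing in cgroupfs's own table.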
I'm not sure what I'm going to do next week. The GSoC timeline suggests a soft pencils-down, a time to scrub code and write documentation, but I'm not sure this applies to me, as I have pushed most of my work upstream as early as possible. I guess I will nag Samuel so that he merges the outstanding patches and continue working on my gnumach patch.
Tl;dr!!elfel1 Screenshot (slightly edited and annotated shell trace):
+ settrans -ca /cgroup /hurd/cgroupfs
+ mkdir /cgroup/init /cgroup/rootfs
+ echo $$ >> /cgroup/init/tasks # $$ is 6
+ echo 3 >> /cgroup/rootfs/tasks # pid 3 is the root filesystem
+ sleep 1m & echo sleep has pid $!
sleep has pid 16
+ cat /proc/cmdline > /dev/null
+ tail /cgroup/init/tasks /cgroup/rootfs/tasks
==> /cgroup/init/tasks <==
6
16
20
==> /cgroup/rootfs/tasks <==
3
19
17
+ pstree -p
init(1)-+-auth(5)
        |-cgroupfs(14)
        |-ext2fs(3)-+-exec(4)
        |           |-null(17)
        |           |-pflocal(8)
        |           |-procfs(19)
        |           `-term(7)
        |-mach-defpager(10)
        |-root=device:hd0s1(2)
        `-sh(6)-+-pstree(21)
                `-sleep(16)
Isn't she a beauty?
So we bind the cgroupfs translator to /cgroup, create two cgroups, init and rootfs, move the currently executing shell script (that later execs sysvinit) into the former and the root filesystem translator into the latter cgroup. We then spawn a sleep process and cat the content of /proc/cmdline into /dev/null which will make the root filesystem start the /hurd/procfs and the /hurd/null translator. We then inspect /cgroup/{init,rootfs}/tasks and find indeed all the newly spawned processes in the cgroup their parent process was in.
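The behaviour visible in the trace can be summarised in a small model. It is a sketch of the observable semantics only, not of how the translator is implemented; the class CgroupModel and the function replay_trace are hypothetical names, and the order of entries in a tasks file is not modelled.

```python
class CgroupModel:
    """Toy model: writing to a tasks file moves a pid, children inherit."""

    def __init__(self):
        self.groups = {"/"}  # known cgroups; "/" is the root group
        self.group_of = {}   # pid -> cgroup name

    def mkdir(self, group):
        """Model of `mkdir /cgroup/GROUP`."""
        self.groups.add(group)

    def write_tasks(self, group, pid):
        """Model of `echo PID >> /cgroup/GROUP/tasks`: move pid into group."""
        if group not in self.groups:
            raise KeyError(group)
        self.group_of[pid] = group

    def spawn(self, parent, child):
        """A new process starts out in the cgroup of its parent."""
        self.group_of[child] = self.group_of.get(parent, "/")

    def tasks(self, group):
        """Model of reading /cgroup/GROUP/tasks (sorted for convenience)."""
        return sorted(pid for pid, g in self.group_of.items() if g == group)


def replay_trace():
    """Replay the pids from the shell trace above."""
    model = CgroupModel()
    model.mkdir("init")
    model.mkdir("rootfs")
    model.write_tasks("init", 6)    # the shell
    model.write_tasks("rootfs", 3)  # the root filesystem (ext2fs)
    model.spawn(6, 16)              # sleep, started by the shell
    model.spawn(3, 17)              # /hurd/null, started by ext2fs
    model.spawn(3, 19)              # /hurd/procfs, started by ext2fs
    model.spawn(6, 20)              # tail, started by the shell
    return model
```

Replaying the trace puts 6, 16 and 20 into init and 3, 17 and 19 into rootfs, which is the same membership that tail printed.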
This is accomplished by:
I also filed a bug report containing my patches for the sysvinit package (#721917). This is the second bug report I filed during my gsoc, the first one was for the ifupdown package (#720531) which Andrew Shadura improved and merged on the very next day, thanks Andrew!
Next week I'll continue to improve the cgroupfs translator, work on the notification prototype (hopefully fixing non-root subhurds in the process; this requires a similar notification mechanism for newly created tasks and making /hurd/proc just a little subhurd-aware) and try to get my gnumach patch into working shape. Currently the parental relation of processes is a Hurd-only concept and relies upon processes telling the /hurd/proc server that a newly created process is their child. This is done automatically if the process uses fork(2), of course, but not if it uses task_create to start a new Mach task.
With the ifupdown fixes that I published last week I actually reached my initial goal, that is to make Debian/Hurd boot using sysvinit and the initscripts provided by Debian. So on Monday we were discussing in #hurd what I could do next. Michael Banck suggested that I should port Upstart, but we agreed to do something different instead for two reasons:
So we decided that no matter how the struggle between Upstart and systemd turns out, the Hurd would eventually need to support cgroups. So I started to write a cgroupfs translator. It is in its early stages, but it already looks and acts a lot like Linux's cgroups:
% settrans -ac cg ./cgroupfs --release-agent=foobar
% ls cg
release_agent tasks
% tail -n3 cg/tasks
11395
12869
1266
% mkdir cg/foo
% echo 1266 >> cg/foo/tasks
% tail -n3 cg/tasks cg/foo/tasks
==> cg/tasks <==
215
11395
12869
==> cg/foo/tasks <==
1266
To make this fully functional I will have to modify /hurd/proc and most likely also GNU Mach, but on the bright side this will help make subhurds (the Hurd's native, by-design-for-free-and-without-overhead container-like functionality) work better and more securely (among other things, this could enable non-root users to start subhurds). I will also look into porting libcg (I have a hacky patch series ready) so that we can actually test the cgroupfs translator. All current users of the cgroup interface are very Linux-specific (surprise!), and libcg looks like the easiest one to port. It also has a test suite that could help me improve the cgroupfs translator.
... which is probably a good thing. But here is the boot log you all have been waiting for:
start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] exec init proc auth
INIT: version 2.88 booting
Using makefile-style concurrent boot in runlevel S.
Activating swap...done.
Checking root file system...fsck from util-linux 2.20.1
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
end_request: I/O error, dev 02:00, sector 0
/dev/hd0s1: clean, 44693/181056 files, 291766/723200 blocks
done.
Activating lvm and md swap...(default pager): Already paging to partition hd0s5!
done.
Checking file systems...fsck from util-linux 2.20.1
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
end_request: I/O error, dev 02:00, sector 0
done.
Cleaning up temporary files... /tmp.
Mounting local filesystems...done.
Activating swapfile swap...(default pager): Already paging to partition hd0s5!
done.
df: Warning: cannot read table of mounted file systems: No such file or directory
Cleaning up temporary files....
Configuring network interfaces...Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on Socket//dev/eth0
Sending on Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
Failed to bring up /dev/eth0.
done.
Cleaning up temporary files....
Setting up X socket directories... /tmp/.X11-unix /tmp/.ICE-unix.
INIT: Entering runlevel: 2
Using makefile-style concurrent boot in runlevel 2.
Starting enhanced syslogd: rsyslogd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.
Starting system message bus: dbusFailed to set socket option"/var/run/dbus/system_bus_socket": Protocol not available.
Starting OpenBSD Secure Shell server: sshd.
unexpected ACK from keyboard
GNU 0.3 (debian) (console)
login: root
[...]
root@debian:~# ifup /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on Socket//dev/eth0
Sending on Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
Failed to bring up /dev/eth0.
root@debian:~# dhclient -v -pf /run/dhclient.-dev-eth0.pid -lf /var/lib/dhcp/dhclient.-dev-eth0.leases /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on Socket//dev/eth0
Sending on Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
root@debian:~# dhclient -pf /run/dhclient.-dev-eth0.pid -lf /var/lib/dhcp/dhclient.-dev-eth0.leases /dev/eth0
root@debian:~# ifup /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on Socket//dev/eth0
Sending on Socket//dev/eth0
DHCPREQUEST on /dev/eth0 to 255.255.255.255 port 67
DHCPACK from 10.0.2.2
bound to 10.0.2.15 -- renewal in 34108 seconds.
ps: comm: Unknown format spec
root@debian:~# halt
Broadcast message from root@debian (console) (Fri Aug 23 19:42:19 2013):
The system is going down for system halt NOW!
INIT: Switching to runlevel: 0root@debian:~#
INIT: Sending processes the TERM signal
INIT: Sending processes the KILL signal
Using makefile-style concurrent boot in runlevel 0.
Stopping deferred execution scheduler: atd.
task c10f53f8 deallocating an invalid port 2098928, most probably a bug.
Asking all remaining processes to terminate...done.
All processes ended within 1 seconds...done.
Stopping enhanced syslogd: rsyslogd.
Deconfiguring network interfaces...Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
Listening on Socket//dev/eth0
Sending on Socket//dev/eth0
DHCPRELEASE on /dev/eth0 to 10.0.2.2 port 67
/dev/eth0 (2): inet address 0.0.0.0 netmask 255.255.255.0 broadcast 10.0.2.255 flags BROADCAST ALLMULTI MULTICAST mtu 1500
done.
Deactivating swap...swapoff: /dev/hd0s5: 177152k swap space
done.
Unmounting weak filesystems...umount: /etc/mtab: Warning: duplicate entry for device /dev/hd0s1 (/servers/socket/26)
umount: /etc/mtab: Warning: duplicate entry for device /dev/hd0s1 (/dev/cons)
umount: could not find entry for: /dev/cons
umount: could not find entry for: /servers/socket/26
done.
mount: cannot remount /: Device or resource busy
Will now halt.
store a new irq 11init: notifying pfinet of shutdown...init: notifying tmpfs swap of shutdown...init: notifying tmpfs swap of shutdown...init: notifying tmpfs swap of shutdown...init: notifying ext2fs device:hd0s1 of shutdown...init: halting Mach (flags 0x8)...
In tight loop: hit ctl-alt-del to reboot
With some tiny patches for ifupdown I've been able to resolve network-related issues. All of them? Of course not. The funny thing about developing for the Hurd is that once you fix one thing, some other thing or code path gets executed that has never been run on Hurd before, and therefore something else breaks. In this case I fixed ifupdown to generate valid names for the pid file and leases file, and all of a sudden dhclient starts dying.
The funny thing about that is, if one drops the -v flag from the dhclient invocation as I did above, the crash isn't triggered, and once the lease file has been successfully written, it is safe to add the -v flag again. I'm not yet sure what goes on there; then again, looking at the source of isc-dhcp-client, it is not so surprising that it crashes :/
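Judging from the dhclient invocations in the log, the pid and leases file names are derived from the interface name /dev/eth0 by replacing each slash with a dash. A sketch of that mapping follows. The rule is inferred from the log only and the function names are hypothetical; the real fix lives in ifupdown's C sources and may differ.

```python
def sanitize_ifname(ifname):
    """Turn a Hurd interface name such as /dev/eth0 into a file name part.

    Assumption inferred from the log: each slash becomes a dash.
    """
    return ifname.replace("/", "-")


def dhclient_files(ifname):
    """Return the (pid file, leases file) paths as they appear in the log."""
    safe = sanitize_ifname(ifname)
    return (
        "/run/dhclient.%s.pid" % safe,
        "/var/lib/dhcp/dhclient.%s.leases" % safe,
    )
```

Without such a step the slashes of /dev/eth0 would end up inside the file name and point into directories that do not exist.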
When I first looked at ifupdown it was written in noweb, a literate programming tool. It is an interesting idea, even more so since (classic) C can be very verbose and cryptic. But it decouples the control flow from the structure of the program, which makes patching it quite a challenge, since it is not obvious where the changes have to go. This is how ifupdown looked some weeks ago:
% wc --lines ifupdown.nw
6123 ifupdown.nw
% pdftk ifupdown.pdf dump_data | grep NumberOfPages
NumberOfPages: 113
ifupdown.nw is the noweb source, from which seven .c, four .h, two .pl and one Makefile are generated. It also contains a ridiculous amount of documentation, to the point that the authors at several points did not know what to write and just dropped some nonsensical lines into the file. The source also compiles to a 113-page PDF file that contains all of the documentation and all of the code, not at all in the order in which one would expect a program to be written, but in the order the authors chose to structure the documentation. Fortunately for me, the maintainer decided to drop the noweb source and to add the generated files to the source control system. This made my job much easier :)
So here are the patches I published this week:
I must admit that I do not know exactly what I will do next week. Obviously fixing the dhclient crash would be nice, so I'll look into that. But I'll surely find something useful to do.