Final GSoC report


This is my final report :) GSoC was great, and I learned a lot about the Hurd and Mach programming in general. I am also very pleased to announce that I reached my initial goal, and almost all of my patches have already made it upstream and into Debian :)

I spent my last week fixing issues I introduced and splitting up /hurd/init into two programs. This would make it possible to integrate the patch that frees PID 1 for sysvinit into the Hurd upstream sources. I didn't quite finish the separation, but my proof of concept works and I will finish this as my next Hurd project.

Looking back at the last fourteen weeks, I accomplished the following:

I implemented /proc/mounts and umount, freed up PID 1 for sysvinit, fixed ifupdown, sysvinit and initscripts on Hurd, implemented a proof-of-concept cgroupfs, and fixed many small issues along the way. Almost all of my patches are already upstream and in Debian; a Debian/Hurd system booting with sysvinit is just a few uploads away.

It has been a lot of fun and I will definitely see you around :)

Justus

cgroupfs is as cgroupy as it gets...


... at least until the cgroup interface is fixed. So, what can it do?

  • There are tasks and cgroup.procs. There are no thread IDs on Hurd, so cgroupfs works only on a per-process basis, not per-thread. Consequently tasks has the same semantics as cgroup.procs. Seeing that PIDs and TIDs can be used (mostly) interchangeably on Linux, I think this is okay to do.
  • You can create and destroy cgroups, and child processes are properly tracked.
  • You can register a release_agent, and it is executed whenever the last process in a cgroup dies.
  • There is notify_on_release to enable or disable the use of release_agent.
  • There is cgroup.clone_children; one can toggle this bit, but it is ignored.
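
As a rough illustration of how release_agent and notify_on_release interact, here is a toy model in Python. It is not cgroupfs code: the class, its methods and the callable standing in for the agent program are all invented for this sketch, and it assumes (as on Linux) that the agent is handed the path of the emptied cgroup.

```python
# Toy model of release_agent / notify_on_release; every name here is invented.

class Cgroup:
    def __init__(self, path, release_agent=None):
        self.path = path
        self.pids = set()
        self.notify_on_release = False
        # A callable stands in for the program that would be executed.
        self.release_agent = release_agent

    def add(self, pid):
        self.pids.add(pid)

    def process_died(self, pid):
        if pid not in self.pids:
            return
        self.pids.remove(pid)
        # Only fire when the group just became empty and notification is on.
        if not self.pids and self.notify_on_release and self.release_agent:
            self.release_agent(self.path)


calls = []
group = Cgroup("/foo", release_agent=calls.append)
group.notify_on_release = True
group.add(16)
group.add(20)
group.process_died(16)   # one process left, nothing happens
group.process_died(20)   # last one gone, the agent runs
print(calls)             # ['/foo']
```

With notify_on_release left at False the list stays empty, which is the "disable" half of the switch.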

So, what's missing?

  • There are no controllers. I haven't looked into this, and resource accounting is one of the Hurd's weakest points, but it is conceivable that one could e.g. advise the scheduler inside the Mach kernel based upon the state of the cgroups if the cgroupfs process is sufficiently privileged (did I mention that any user can use cgroupfs?).
  • The notification API, aka cgroup.event_control. The Hurd lacks eventfd(2), but even if that were implemented, this interface would still be impossible to implement. Rant below.
  • A patch for gnumach to make this bulletproof. I made some encouraging progress with that one this week, but there's nothing presentable yet.

So, what's wrong with the Linux cgroup API?

Well, for one thing, the whole API is underspecified. Yes, there is Documentation/cgroups/cgroups.txt, but that is not a specification; that's a howto at best. Second, the notification API is not particularly nice:

To register a new notification handler you need to:
 - create a file descriptor for event notification using eventfd(2);
 - open a control file to be monitored (e.g. memory.usage_in_bytes);
 - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
   Interpretation of args is defined by control file implementation;

Seriously? There is a POSIXly way to pass file descriptors around (SCM_RIGHTS ancillary data over a Unix domain socket), but smashing the decimal representation of one into a string is not the way to do that. Linux gets away with this hack because the kernel knows which process wrote(2) that string in the first place, so it can parse the string into an integer and look it up in that process's table of file descriptors.
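
To make the hack concrete, here is a small Python sketch of what the receiving side has to do with such a line. This is a toy model, not kernel code: the function names and the dictionaries standing in for per-process descriptor tables are made up for illustration.

```python
# Toy model of the cgroup.event_control registration; all names are hypothetical.

def format_registration(event_fd, control_fd, args):
    """Build the "<event_fd> <control_fd> <args>" line a client writes."""
    return f"{event_fd} {control_fd} {args}"


def parse_registration(line, fd_table):
    """Resolve the two decimal numbers against the *writer's* fd table.

    This only works if the receiver knows who wrote the line and can see
    that process's descriptor table, which is what the kernel can do.
    """
    event_fd, control_fd, *rest = line.split(" ", 2)
    args = rest[0] if rest else ""
    return fd_table[int(event_fd)], fd_table[int(control_fd)], args


# Two processes with unrelated descriptor tables (contents invented).
writer = {3: "eventfd", 4: "memory.usage_in_bytes"}
other = {3: "/etc/passwd", 4: "a socket"}

line = format_registration(3, 4, "4096")
print(parse_registration(line, writer))  # resolves to what the writer meant
print(parse_registration(line, other))   # same string, entirely different objects
```

The string itself carries no hint about whose table the numbers belong to; that knowledge has to come from somewhere else.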

Now the trouble for cgroupfs is that it is not the kernel, and even if it were, that wouldn't solve the problem, because on the Hurd there are no file descriptors (well, there are, but only to appease all the POSIX programs out there). Instead the Hurd has ports, and you can send messages to ports, and this is pretty much everything that you can do on a Mach system. Reading a file works roughly like this:

  1. You open a file and get a port X.
  2. You send a message like "I'm like really interested in the first Y bytes of that file" to X.
  3. Whoever has the receiving end of X (probably the one who gave you X in the first place) answers your request.

Ports look pretty much like file descriptors: they are (usually small) integers, and you can make them, destroy them, and pass them around easily (yes, ports are first-class objects in the Mach messaging system). Everything is implemented on top of this mechanism. It is transport-agnostic: the other end could be on another machine and you wouldn't even know. You can create proxies or filters (in fact, that is exactly how the firewall eth-filter is implemented). It's beautiful and extensible at its heart, like Lego bricks.

So if X were a port to e.g. memory.usage_in_bytes, and the cgroups interface were less braindead^W^Wmore carefully designed so that on the Hurd X could be transported like ports usually are, then cgroupfs could in fact use the port X' it receives to look up which file the caller is interested in (this is possible because cgroupfs was the one handing out the port in the first place) and generate notifications for that file. This is not possible when X is "serialized for transport" using sprintf, because port names are specific to each process, so X != X'. The kernel would do the translation while sending the message, but it obviously cannot do that if the number is carried in a character array.
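
The X versus X' problem can be sketched with a toy model in Python. Nothing here is a real Mach interface: the Kernel class, its methods and the starting name 100 are assumptions made purely to show why a name survives being sent as a port right but not as text.

```python
# Toy model of per-task port names; not the Mach API, every name is made up.

class Kernel:
    def __init__(self):
        self.names = {}      # (task, task-local name) -> underlying port
        self.next_name = {}  # task -> next free local name

    def insert_right(self, task, port):
        """Give `task` a right to `port` and return the task-local name."""
        for (owner, name), existing in self.names.items():
            if owner == task and existing is port:
                return name
        name = self.next_name.get(task, 100)
        self.next_name[task] = name + 1
        self.names[(task, name)] = port
        return name

    def send_right(self, sender, receiver, name):
        """Send a port right: the kernel translates the name on the way."""
        return self.insert_right(receiver, self.names[(sender, name)])

    def send_text(self, sender, receiver, text):
        """Send plain bytes: the kernel cannot know they contain a name."""
        return text


kernel = Kernel()
usage_file = object()                       # stands in for memory.usage_in_bytes
kernel.insert_right("cgroupfs", object())   # cgroupfs already holds another port
x_prime = kernel.insert_right("cgroupfs", usage_file)
x = kernel.insert_right("client", usage_file)

print(x, x_prime)                                   # 100 101
print(kernel.send_right("client", "cgroupfs", x))   # 101, i.e. X'
n = int(kernel.send_text("client", "cgroupfs", str(x)))
print(kernel.names.get(("cgroupfs", n)) is usage_file)  # False
```

In this model the stringified number does not merely fail to resolve; it happens to name a different port inside cgroupfs, which is arguably worse.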

I'm not sure what I'm going to do next week. The GSoC timeline suggests a soft pencils-down, time to scrub code and write documentation, but I'm not sure that this applies to me, as I have pushed most of my work upstream as early as possible. I guess I will nag Samuel so that he merges the outstanding patches, and continue working on my gnumach patch.

cgroupfs keeps track of processes


Tl;dr!!elfel1 Screenshot (slightly edited and annotated shell trace):

+ settrans -ca /cgroup /hurd/cgroupfs
+ mkdir /cgroup/init /cgroup/rootfs
+ echo $$ >> /cgroup/init/tasks  # $$ is 6
+ echo 3 >> /cgroup/rootfs/tasks # pid 3 is the root filesystem
+ sleep 1m & echo sleep has pid $!
sleep has pid 16
+ cat /proc/cmdline > /dev/null
+ tail /cgroup/init/tasks /cgroup/rootfs/tasks
==> /cgroup/init/tasks <==
6
16
20

==> /cgroup/rootfs/tasks <==
3
19
17
+ pstree -p
init(1)-+-auth(5)
        |-cgroupfs(14)
        |-ext2fs(3)-+-exec(4)
        |           |-null(17)
        |           |-pflocal(8)
        |           |-procfs(19)
        |           `-term(7)
        |-mach-defpager(10)
        |-root=device:hd0s1(2)
        `-sh(6)-+-pstree(21)
                `-sleep(16)

Isn't she a beauty?

So we bind the cgroupfs translator to /cgroup, create two cgroups, init and rootfs, and move the currently executing shell script (which later execs sysvinit) into the former and the root filesystem translator into the latter. We then spawn a sleep process and cat the contents of /proc/cmdline into /dev/null, which makes the root filesystem start the /hurd/procfs and /hurd/null translators. We then inspect /cgroup/{init,rootfs}/tasks and indeed find all the newly spawned processes in the cgroup their parent process was in.

This is accomplished by:

I also filed a bug report containing my patches for the sysvinit package (#721917). This is the second bug report I filed during my GSoC; the first one was for the ifupdown package (#720531), which Andrew Shadura improved and merged the very next day. Thanks, Andrew!

Next week I'll continue to improve the cgroupfs translator, work on the notification prototype (hopefully fixing non-root subhurds in the process; this requires a similar notification mechanism for newly created tasks and making /hurd/proc just a little subhurd-aware), and try to get my gnumach patch into a working shape. Currently the parental relation of processes is a Hurd-only concept and relies upon processes telling the /hurd/proc server that a newly created process is their child. This is done automatically if the process uses fork(2), of course, but not if it uses task_create to start a new Mach task.
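
Why the missing parent information matters for cgroupfs can be shown with a toy model in Python. The classes and method names are invented and are not Hurd interfaces; in particular, leaving an unclaimed task in the root cgroup is a simplification of this sketch and says nothing about what the real proc server does.

```python
# Toy model only: class and method names are invented, not Hurd interfaces.

class ProcServer:
    """Remembers only the parent/child relations it has been told about."""

    def __init__(self):
        self.parent = {}

    def register_child(self, parent_pid, child_pid):
        # fork(2) reports this relation automatically.
        self.parent[child_pid] = parent_pid


class CgroupFS:
    def __init__(self, proc):
        self.proc = proc
        self.group_of = {}  # pid -> cgroup name

    def move(self, pid, group):
        self.group_of[pid] = group

    def task_created(self, pid):
        """Put a new process into its parent's cgroup, if a parent is known."""
        parent = self.proc.parent.get(pid)
        self.group_of[pid] = self.group_of.get(parent, "/")

    def tasks(self, group):
        return [pid for pid, g in self.group_of.items() if g == group]


proc = ProcServer()
cg = CgroupFS(proc)
cg.move(6, "init")

proc.register_child(6, 16)  # the shell forks sleep: the relation is reported
cg.task_created(16)

cg.task_created(99)         # hypothetical task from a bare task_create:
                            # nobody told proc who its parent is

print(cg.tasks("init"))     # [6, 16]
print(cg.tasks("/"))        # [99]
```

A process that skips the registration step escapes the cgroup of its creator in this model, which is the kind of hole a kernel-side change would have to close.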

What will I do next? cgroupfs \o/


With the ifupdown fixes that I published last week I actually reached my initial goal, which was to make Debian/Hurd boot using sysvinit and the initscripts provided by Debian. So on Monday we were discussing in #hurd what I could do next. Michael Banck suggested that I should port Upstart, but we agreed to do something different instead for two reasons:

  1. Upstart and systemd are somewhat competing to be the default init system for Debian, and we felt it might be inappropriate to get involved with this question as porting Upstart to Hurd would probably also enable it to be used on FreeBSD. The Upstart folks could then point out that Upstart is more portable because it runs on all kernels used by Debian.
  2. Upstart uses ptrace(2) to track child processes of servers it monitors. Obviously this is kind of a hack, and it was conjectured that Upstart would eventually use cgroups to do that. Also, the Hurd lacks support for ptrace(2) (that is most likely by choice, by the way: ptrace(2) is not a nice interface, and the Hurd (Mach, actually) has much nicer interfaces for implementing a debugger).

So we decided that no matter how the struggle between Upstart and systemd turns out, the Hurd would eventually need to support cgroups. So I started to write a cgroupfs translator. It is in its early stages, but it already looks and acts a lot like Linux' cgroups:

% settrans -ac cg ./cgroupfs --release-agent=foobar
% ls cg
release_agent  tasks
% tail -n3 cg/tasks
11395
12869
1266
% mkdir cg/foo
% echo 1266 >> cg/foo/tasks
% tail -n3 cg/tasks cg/foo/tasks
==> cg/tasks <==
215
11395
12869

==> cg/foo/tasks <==
1266

To make this fully functional I will have to modify /hurd/proc and most likely also GNU Mach, but on the bright side this will help make subhurds (the Hurd's native, by-design-for-free-and-without-overhead container-like functionality) work better and more securely (among other things, this could enable non-root users to start subhurds). I will also look into porting libcg (I have a hacky patch series ready) so that we can actually test the cgroupfs translator. All current users of the cgroup interface are very Linux-specific (surprise!), and libcg looks like the easiest one to port. And they do have a test suite that could help me improve the cgroupfs translator.

No noweb anymore...


... which is probably a good thing. But here is the boot log you all have been waiting for:

start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] exec init proc auth
INIT: version 2.88 booting
Using makefile-style concurrent boot in runlevel S.
Activating swap...done.
Checking root file system...fsck from util-linux 2.20.1
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
end_request: I/O error, dev 02:00, sector 0
/dev/hd0s1: clean, 44693/181056 files, 291766/723200 blocks
done.
Activating lvm and md swap...(default pager): Already paging to partition hd0s5!
done.
Checking file systems...fsck from util-linux 2.20.1
hd2 : tray open or drive not ready
hd2 : tray open or drive not ready
end_request: I/O error, dev 02:00, sector 0
done.
Cleaning up temporary files... /tmp.
Mounting local filesystems...done.
Activating swapfile swap...(default pager): Already paging to partition hd0s5!
done.
df: Warning: cannot read table of mounted file systems: No such file or directory
Cleaning up temporary files....
Configuring network interfaces...Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on Socket//dev/eth0
Sending on   Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
Failed to bring up /dev/eth0.
done.
Cleaning up temporary files....
Setting up X socket directories... /tmp/.X11-unix /tmp/.ICE-unix.
INIT: Entering runlevel: 2
Using makefile-style concurrent boot in runlevel 2.
Starting enhanced syslogd: rsyslogd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.
Starting system message bus: dbusFailed to set socket option"/var/run/dbus/system_bus_socket": Protocol not available.
Starting OpenBSD Secure Shell server: sshd.
unexpected ACK from keyboard


GNU 0.3 (debian) (console)

login: root
[...]
root@debian:~# ifup /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on Socket//dev/eth0
Sending on   Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
Failed to bring up /dev/eth0.
root@debian:~# dhclient -v -pf /run/dhclient.-dev-eth0.pid -lf /var/lib/dhcp/dhclient.-dev-eth0.leases /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on Socket//dev/eth0
Sending on   Socket//dev/eth0
*** stack smashing detected ***: dhclient terminated
Aborted
root@debian:~# dhclient -pf /run/dhclient.-dev-eth0.pid -lf /var/lib/dhcp/dhclient.-dev-eth0.leases /dev/eth0
root@debian:~# ifup /dev/eth0
Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on Socket//dev/eth0
Sending on   Socket//dev/eth0
DHCPREQUEST on /dev/eth0 to 255.255.255.255 port 67
DHCPACK from 10.0.2.2
bound to 10.0.2.15 -- renewal in 34108 seconds.
ps: comm: Unknown format spec
root@debian:~# halt

Broadcast message from root@debian (console) (Fri Aug 23 19:42:19 2013):

The system is going down for system halt NOW!
INIT: Switching to runlevel: 0root@debian:~#
INIT: Sending processes the TERM signal
INIT: Sending processes the KILL signal
Using makefile-style concurrent boot in runlevel 0.
Stopping deferred execution scheduler: atd.
task c10f53f8 deallocating an invalid port 2098928, most probably a bug.
Asking all remaining processes to terminate...done.
All processes ended within 1 seconds...done.
Stopping enhanced syslogd: rsyslogd.
Deconfiguring network interfaces...Internet Systems Consortium DHCP Client 4.2.2
Copyright 2004-2011 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on Socket//dev/eth0
Sending on   Socket//dev/eth0
DHCPRELEASE on /dev/eth0 to 10.0.2.2 port 67
/dev/eth0 (2):
  inet address  0.0.0.0
  netmask       255.255.255.0
  broadcast     10.0.2.255
  flags         BROADCAST ALLMULTI MULTICAST
  mtu           1500
done.
Deactivating swap...swapoff: /dev/hd0s5: 177152k swap space
done.
Unmounting weak filesystems...umount: /etc/mtab: Warning: duplicate entry for device /dev/hd0s1 (/servers/socket/26)
umount: /etc/mtab: Warning: duplicate entry for device /dev/hd0s1 (/dev/cons)
umount: could not find entry for: /dev/cons
umount: could not find entry for: /servers/socket/26
done.
mount: cannot remount /: Device or resource busy
Will now halt.
store a new irq 11init: notifying pfinet of shutdown...init: notifying tmpfs swap of shutdown...init: notifying tmpfs swap of shutdown...init: notifying tmpfs swap of shutdown...init: notifying ext2fs device:hd0s1 of shutdown...init: halting Mach (flags 0x8)...
In tight loop: hit ctl-alt-del to reboot

With some tiny patches for ifupdown I've been able to resolve network-related issues. All of them? Of course not. The funny thing about developing for the Hurd is that once you fix one thing, some other thing or code path is executed that has never been run on the Hurd before, and therefore something else breaks. In this case I fixed ifupdown to generate valid names for the pid file and leases file, and all of a sudden dhclient started dying.

The funny thing about that is, if one drops the -v flag from the dhclient invocation as I did above, the crash isn't triggered, and once the leases file has been successfully written, it is safe to add the -v flag again. I'm not yet sure what goes on there; then again, looking at the source of isc-dhcp-client, it is not so surprising that it crashes :/
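
Regarding the pid and leases file names: on the Hurd the interface is a path such as /dev/eth0, so the slashes have to be replaced before the name can be embedded in a file name. The following Python sketch reproduces the names visible in the log above; the function is hypothetical, and the real ifupdown patch is C code that may well do this differently.

```python
# Hypothetical helper; it only mirrors the file names seen in the boot log.

def dhclient_paths(iface):
    """Derive dhclient's pid and leases file names from an interface name."""
    safe = iface.replace("/", "-")  # "/dev/eth0" becomes "-dev-eth0"
    return (f"/run/dhclient.{safe}.pid",
            f"/var/lib/dhcp/dhclient.{safe}.leases")


print(dhclient_paths("/dev/eth0"))
print(dhclient_paths("eth0"))  # a Linux-style name passes through unchanged
```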

When I first looked at ifupdown it was written in noweb, a literate programming tool. It is an interesting idea, even more so since (classic) C can be very verbose and cryptic. But it decouples the control flow from the structure of the program, which makes patching it quite a challenge, since it is not as obvious where the changes have to go. This is how ifupdown looked some weeks ago:

% wc --lines ifupdown.nw
6123 ifupdown.nw
% pdftk ifupdown.pdf dump_data | grep NumberOfPages
NumberOfPages: 113

The ifupdown.nw is the noweb source, from which seven .c, four .h, two .pl and one Makefile are generated. It also contains a ridiculous amount of documentation, to the point that the authors at several points did not know what to write and just dropped some nonsensical lines into the file. The source also compiles to a 113-page PDF file that contains all of the documentation and all of the code, not at all in the order that one would expect a program to be written, but in the order the authors chose to structure the documentation. Fortunately for me, the maintainer decided to drop the noweb source and to add the generated files to the source control system. This made my job much easier :)

So here are the patches I published this week:

I must admit that I do not know exactly what I will do next week. Obviously fixing the dhclient crash would be nice, so I'll look into that. But I'll surely find something useful to do.

Contents © 2013 Justus Winter - Powered by Nikola