cgroupfs is as cgroupy as it gets...

Posted: 2013-09-13 17:59 | More posts about gsoc debian hurd |

... at least until the cgroup interface is fixed. So, what can it do?

There are tasks and cgroup.procs. There are no thread IDs on Hurd, so cgroupfs works only on a per-process basis, not per thread. Consequently tasks has the same semantics as cgroup.procs. Seeing that PIDs and TIDs can be used (mostly) interchangeably on Linux, I think this is okay to do.
You can create and destroy cgroups, and child processes are properly tracked.
You can register a release_agent, and it is executed whenever the last process in a cgroup dies.
There is notify_on_release to enable or disable the use of release_agent.
There is cgroup.clone_children; one can toggle this bit, but it is ignored.
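
From a client's point of view, all of this is ordinary file system traffic. Here is a minimal sketch in Python, assuming cgroupfs is mounted at some directory; the helper names are made up for illustration and are not part of cgroupfs, and the demo at the bottom runs against a plain temporary directory, where the writes are of course just stored and not interpreted:

```python
import os
import tempfile


def create_cgroup(root, name):
    """Create a child cgroup by making a directory below the mount point."""
    path = os.path.join(root, name)
    os.mkdir(path)
    return path


def write_control(cgroup, filename, value):
    """Write a value to one of the control files of a cgroup."""
    with open(os.path.join(cgroup, filename), "w") as f:
        f.write(f"{value}\n")


def add_process(cgroup, pid):
    """Move a process into the cgroup (hypothetical helper)."""
    # With cgroupfs on Hurd, writing to "tasks" would mean the same thing.
    write_control(cgroup, "cgroup.procs", pid)


def enable_notify_on_release(cgroup):
    """Ask for the release_agent to run once the cgroup becomes empty."""
    write_control(cgroup, "notify_on_release", 1)


if __name__ == "__main__":
    # Stand-in for a real mount point; an assumption for this demo only.
    with tempfile.TemporaryDirectory() as root:
        demo = create_cgroup(root, "demo")
        add_process(demo, os.getpid())
        enable_notify_on_release(demo)
        print(sorted(os.listdir(demo)))
```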

So, what's missing?

There are no controllers. I haven't looked into this, and resource accounting is one of the Hurd's weakest points, but it is conceivable that one could, e.g., advise the scheduler inside the Mach kernel based upon the state of the cgroups, provided the cgroupfs process is sufficiently privileged (did I mention that any user can use cgroupfs?).
The notification API, aka cgroup.event_control. The Hurd lacks eventfd(2), but even if that were implemented, this interface would still be impossible to implement. Rant below.
A patch for gnumach to make this bulletproof. I made some encouraging progress with that one this week, but there's nothing presentable yet.

So, what's wrong with the Linux cgroup API?

Well, for one thing, the whole API is underspecified. Yes, there is Documentation/cgroups/cgroups.txt, but that is not a specification; it's a howto at best. Second, the notification API is not particularly nice:

To register a new notification handler you need to:
 - create a file descriptor for event notification using eventfd(2);
 - open a control file to be monitored (e.g. memory.usage_in_bytes);
 - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
   Interpretation of args is defined by control file implementation;

Seriously? There is a POSIX way to pass file descriptors around (SCM_RIGHTS ancillary data over a Unix domain socket), and smashing the decimal representation of one into a string is not it. Linux gets away with this hack because the kernel knows which process wrote(2) that string in the first place, so it can parse the string into integers and look them up in that process's table of file descriptors.
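
To make the hack concrete, here is a toy model in Python of what the receiving side has to do with such a write. This is not kernel code; the function name and the dictionary standing in for a per-process file descriptor table are assumptions made up for illustration:

```python
def handle_event_control_write(fd_table, data):
    """Model the handling of a write to cgroup.event_control.

    fd_table stands in for the file descriptor table of the writing
    process. The kernel can consult it because it knows who issued
    the write.
    """
    event_fd, control_fd, *args = data.split()
    # The numbers only mean something relative to the writer's table.
    event_file = fd_table[int(event_fd)]
    control_file = fd_table[int(control_fd)]
    return event_file, control_file, args


# Hypothetical table of the process that registers the handler.
writer_fds = {0: "stdin", 3: "eventfd", 4: "memory.usage_in_bytes"}
print(handle_event_control_write(writer_fds, "3 4 1048576"))
```

The lookup only works with the writer's own table. Hand the same string to anyone who does not have access to that table and the numbers are meaningless.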

Now the trouble for cgroupfs is that it is not the kernel, and even if it were, that wouldn't solve the problem, because on Hurd there are no file descriptors (well, there are, but only to appease all the POSIX programs out there). Instead Hurd has ports, and you can send messages to ports, and that is pretty much everything you can do on a Mach system. Reading a file works roughly like this:

You open a file and get a port X.
You send a message like "I'm like really interested in the first Y bytes of that file" to X.
Whoever has the receiving end of X (probably the one who gave you X in the first place) answers your request.
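
The three steps above can be played through with a toy model in Python. It has nothing to do with the real Mach or Hurd interfaces; the class names and the message format are invented for illustration:

```python
class Port:
    """Toy port: messages go to whoever holds the receiving end."""

    def __init__(self, receiver):
        self._receiver = receiver

    def send(self, message):
        # Deliver the message and hand back the receiver's reply.
        return self._receiver(self, message)


class FileServer:
    """Toy server that hands out one port per opened file."""

    def __init__(self, files):
        self._files = files   # file name -> contents
        self._open = {}       # port -> file name

    def open(self, name):
        port = Port(self._handle)
        self._open[port] = name
        return port

    def _handle(self, port, message):
        # The server handed out this port, so it knows what it stands for.
        operation, amount = message
        if operation != "read":
            raise ValueError(f"unsupported request: {operation}")
        return self._files[self._open[port]][:amount]


server = FileServer({"tasks": b"1\n42\n"})
x = server.open("tasks")        # step 1: open a file, get a port X
reply = x.send(("read", 2))     # step 2: ask for the first 2 bytes
print(reply)                    # step 3: the holder of the receiving end answers
```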

Ports look pretty much like file descriptors: they are (usually small) integers, and you can make them, destroy them, and pass them around easily (yes, ports are first-class objects in the Mach messaging system). Everything is implemented atop this mechanism. It is transport-agnostic: the other end could be on another machine and you wouldn't even know. You can create proxies or filters (in fact, that is exactly how the firewall eth-filter is implemented). It's beautiful and extensible at its heart, like Lego bricks.

So if X were a port to e.g. memory.usage_in_bytes, and the cgroups interface were less braindead^W^Wmore carefully designed so that on Hurd X could be transported like ports usually are, then cgroupfs could in fact use the port X' it receives to look up which file the caller is interested in (this is possible because cgroupfs was the one handing out the port in the first place) and generate notifications for that file. This is not possible when X is "serialized for transport" using sprintf, because port names are specific to each process, so X != X'. The kernel would do the translation while sending the message, but it obviously cannot do that if the number is carried in a character array.
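
The difference between X and X' can be shown with another toy model in Python. Again, this is only a sketch: the Task class and both transfer functions are invented for illustration and merely mimic the idea of per-task port name spaces:

```python
import itertools


class Task:
    """Toy task with its own port name space (local name -> port)."""

    def __init__(self):
        self.names = {}
        self._next = itertools.count(1)

    def insert(self, port):
        # Reuse the existing name if this task already knows the port.
        for name, known in self.names.items():
            if known is port:
                return name
        name = next(self._next)
        self.names[name] = port
        return name


def send_port(sender, receiver, name):
    """Transfer inside a message: the name is translated in transit."""
    return receiver.insert(sender.names[name])


def send_as_text(name):
    """Transfer as decimal digits: nothing can translate the name."""
    return int("%d" % name)


control = object()                 # stands for memory.usage_in_bytes
cgroupfs, client = Task(), Task()
x_server = cgroupfs.insert(control)

client.insert(object())            # the client holds other ports too
client.insert(object())
x = client.insert(control)         # the client's name for the port: X

x_prime = send_port(client, cgroupfs, x)
print(x, x_prime, cgroupfs.names[x_prime] is control)

bogus = send_as_text(x)
print(bogus, cgroupfs.names.get(bogus))
```

In the first case cgroupfs ends up with its own name for the very port it handed out. In the second it gets the client's number, which in its own name space refers to nothing, or worse, to some unrelated port.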

I'm not sure what I'm going to do next week. The GSoC timeline suggests a soft pencils-down: time to scrub code and write documentation. I'm not sure that this applies to me, as I have pushed most of my work upstream as early as possible. I guess I will nag Samuel so that he merges the outstanding patches, and continue working on my gnumach patch.