codeblog code is freedom — patching my itch

5/17/2010

yay for barriers

Filed under: Blogging,Debian,Ubuntu,Ubuntu-Server — kees @ 12:13 pm

I find it surreal to have people guessing at my motivations when they could just ask me. On top of all that, I find it weird that people spend so much time with in-fighting. I just want my system not to suck.

Some time ago (during the Ubuntu Karmic development cycle, maybe in September 2009), I started having giant problems with my build system. All I/O would start to stall, wait times would surge, and usually my entire system would just go unresponsive with the disk light on solid. This scared the crap out of me, and it wasn’t entirely obvious what was triggering it. No one else seemed to be seeing it. I managed to start tracking things using “latencytop”, and saw stuff like liferea going crazy. As I eliminated more and more things, I eventually settled on it being a problem with umount, and I reported an Ubuntu bug. It seemed to look like an upstream bug that no one else but the reporter could reproduce either.

Since no one else was seeing this issue, and it seemed related to LVM snapshots, I migrated off of snapshots, and started using aufs overlays for my builds. For a while, it seemed like the problem went away. It hadn’t, and I started hitting it again. I opened a new (now famous) bug in Ubuntu, since now snapshots weren’t in the picture, and I didn’t want to confuse the earlier history. I managed to find a relatively minimal test-case too. A few other people commenting on the bug were seeing the problem now too, but it was less pronounced for them.

As an aside, this wasn’t a “just wait a few seconds longer” kind of issue on my system. A single umount would last 30-40 minutes. And when I’m doing parallel builds of security updates, this would turn into my system being unavailable for hours at a stretch.

Since none of the kernel developers I was in contact with were able to track down the root cause, I asked Ted Ts’o in email if he could just quickly peek in on this for me, since I figured he’d be in a good position to confirm or deny it. I didn’t want to start wasting upstream time with this if it wasn’t reproducible (see earlier upstream kernel bug). To my great relief, Ted found a few minutes to check it, and was able to immediately confirm it and give me a viable work-around (“sync; umount …”) for the time being. I confirmed the work-around, and went off to do other things.
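The “sync; umount …” work-around amounts to flushing dirty data to disk before asking the kernel to unmount. A minimal Python sketch of that ordering follows, assuming a Unix system and sufficient privileges to unmount; the function name `sync_then_umount` and its `run` and `sync` parameters are hypothetical, added only so the two steps can be swapped out and timed separately.

```python
import os
import subprocess
import time


def sync_then_umount(mountpoint, run=subprocess.run, sync=os.sync):
    """Flush dirty data first, then unmount (the "sync; umount" ordering).

    Returns (seconds spent in sync, seconds spent in umount).
    `run` and `sync` are injectable so the ordering can be exercised
    without touching a real mount; they are illustrative, not part of
    any existing tool.
    """
    start = time.monotonic()
    sync()                                    # equivalent of running `sync`
    synced = time.monotonic()
    run(["umount", mountpoint], check=True)   # raises if umount fails
    done = time.monotonic()
    return synced - start, done - synced
```

Timing the two steps separately is only a convenience for seeing where the wait goes; it does not diagnose the underlying bug.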

A while later, Ted came back to deliver a bit of a rant, the purpose of which was not clear to me, but I ultimately ignored it, since it didn’t seem directed at me. I just wanted my system operating normally, and he’d done me a favor by checking in on it and getting me a work-around.

More time passes, and I eventually get caught in another I/O-wait melt-down. On investigation, it seemed that the in-kernel work-around in the Ubuntu kernel totally back-fired on me in some cases, rendering even the user-space work-around useless. After investigating the Ubuntu-specific work-around, I re-read Ted’s rant in the course of researching what had happened during this bug’s triage.

It seemed that Ted was basically saying:
– this is an upstream problem
– RedHat hasn’t run into it and he didn’t know why

I figured I should confirm for myself if Fedora was affected, so I downloaded and installed Fedora to double-check there. Since I was able to reproduce it there, I opened an upstream bug, linking back to the original Ubuntu bug, and then went to open a bug in the Fedora tracker, linking back to upstream.

And it seems to be these actions that everyone has jumped on. I will now bore you with the reality of my motivations: I wanted to fix the bug so no one would end up experiencing the same pain I’d been through over the last 6 months.

The bug was, from my perspective, a serious issue. Since I’d managed to reproduce it in another distro, it was my duty as a Free Software developer to report it to them. And, in what I felt was an unambiguous gesture, I made sure to include the link to the upstream kernel bug. Reproducing it in Ubuntu, in Fedora, and with a stock kernel had me confident that it was an upstream issue. While Ted did correctly suspect the issue was upstream, I really didn’t want to just open an upstream bug and have it be ignored. I wanted some additional proof of reproduction, which I got when I tested it on Fedora.

So, I’m rather saddened that so many people spent so much time questioning my motivations, making fun of Canonical, or doing anything other than trying to just simply solve this problem. I’m totally uninterested in inter-distro fighting. Instead, I continue to assume we’re all on the same team, fighting a philosophical battle against closed-source software. And in that regard, I think it’s still true. If I ignore the rants and jeering, I come away thankful for all the people that spent time trying to reproduce the issue at Canonical, at RedHat, and in the larger community. I’m hugely thankful that Ted made some time to let me know I wasn’t crazy, and there was actually a problem. I’m thankful for having some work-arounds, and I’m thankful that the root cause was eventually ferreted out, with some possible solutions. I’m even thankful that some people on the LWN thread saw that, far from malicious, I was trying to be helpful with the bug.

I just wanted my filesystem not to eat my computer. And I was hoping other people could maybe help me, since I’m not a filesystem expert. The drama around this bug is pathetic, and now by talking about it for almost 1000 words, I’m just as guilty.

© 2010, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

21 Comments »

  1. I mostly agree. But I also think that having a FS expert at Canonical would not hurt the “enterprise support” story. At least once BtrFS is the default there should be someone at Canonical who knows BtrFS. Certainly wouldn’t hurt and would certainly be a good investment.

    But sure, most of what was written about this particular bug was just envy, because Lucid is such a great release and was even released on time.

    Comment by Tom — 5/17/2010 @ 12:55 pm

  2. The drama is definitely pathetic, and I empathize. I have faced a similar onslaught on nearly identical terms while maintaining the audio stack in Ubuntu for five years, and as a volunteer, there is little energy to face the distro-bickering. In the end, just like anyone else passionate about his/her time and energies, we want things to just work. Thanks for sticking with it. Thanks for everything you and the rest of the community have done to make things usable.

    Comment by Daniel T Chen — 5/17/2010 @ 3:29 pm

  3. I think that part of the problem is a perceived unwillingness of Ubuntu to follow commonly established best practices and to play nicely with others.

    This includes, but is not limited to, things like

    * a reluctance to contribute back [1]
    * going ahead with things like CSD [2], no matter if it appears to make sense or not
    * partially caving in to licences [3] & closed source software (think ATI/NVidia)
    * selling the user’s default search engine [4] (or not)
    * developers being abrasive and not having the decency to simply say “sorry” (I don’t have the link to the bug report handy, but chances are you know what incident I mean)

    So while none of this is your fault, I can understand why people might lash out at a visible representation of Ubuntu. Not that I condone this in any way, mind.

    Personally, I see this as a chance for Ubuntu/Canonical to learn & improve.

    And let’s face it, others had to learn the hard way, too. gcc 2.96 [5], anyone?

    Richard

    [1] http://blog.bofh.it/debian/id_375
    [2] http://blog.martin-graesslin.com/blog/2010/05/open-letter-the-issues-with-client-side-window-decorations/
    [3] http://www.h-online.com/open/news/item/Canonical-clarifies-its-H-264-licence-993182.html
    [4] http://arstechnica.com/open-source/news/2010/01/ubuntus-default-search-engine-to-change-in-deal-with-yahoo.ars
    [5] http://gcc.gnu.org/gcc-2.96.html

    Comment by RichiH — 5/17/2010 @ 4:00 pm

  4. Keep on being awesome, Kees.

    Comment by andy grover — 5/17/2010 @ 4:03 pm

  5. I think that a valid criticism of Ubuntu/Canonical got confused with an invalid criticism of you. If we consider you as an ordinary (though skilled) user, what you did was completely correct. However, it seems that people got you confused with the Ubuntu filesystem expert guru, which is unfair to you but exposes the problem that there doesn’t seem to be an Ubuntu filesystem expert guru. This raises the question of whether Ubuntu should invest more in this area, especially as Canonical’s ambitions increase and people start talking about Ubuntu LTS as somehow being enterprise-worthy.

    Comment by Joe Buck — 5/17/2010 @ 4:28 pm

  6. I cannot stop people from conflating individuals and groups, but I find this practice lazy.

    I have met some outstandingly unfriendly people that are a part of [group], but I know [group] as a whole is not this way. Equally, when I see [group] making choices I think are dumb, I know that not every individual at [group] shares those opinions.

    Replace “[group]” here with RedHat, Canonical, Fedora, Ubuntu, LKML, Mplayer, government, etc.

    Comment by kees — 5/17/2010 @ 4:29 pm

  7. Well you are back on lwn.net. It looks like it’s a fashion trend to bash anything connected with Ubuntu/Canonical – it doesn’t matter if bugs get filed or not, somehow it’s still wrong. And as always totally unrelated things get mixed in for added effect…

    Comment by Dmitrijs Ledkovs — 5/17/2010 @ 5:16 pm

  8. You cannot avoid facing the fact that Ted Tso’s point, and the common thread in these criticisms, is that Canonical claims to provide enterprise support but is unwilling to invest in upstream. Canonical has high aspirations and talks about Btrfs by default for the next release while it doesn’t have any expertise to accomplish that task. It seems like a lot of hot air just to get some PR. It seems a bit sleazy.

    Comment by Jeffrey — 5/17/2010 @ 6:14 pm

  9. Dmitrijs Ledkovs: Did you actually read the news article [1] and the comments? They were quite balanced & positive. Matter of fact, the news article consists mainly of a verbatim copy of Kees’ points.

    [1] http://lwn.net/Articles/387988/

    “bash anything connected with X” works the other way too, you know.

    Finally, if by “And as always totally unrelated things get mixed in for added effect” you were referring to my comment, I suggest you read it again, with an open mind. I pointed out facts to explain the general environment in which comments like Tso’s may be made, and stated explicitly that Kees is not at fault and that I do not condone guilt by association.

    Comment by RichiH — 5/17/2010 @ 11:13 pm

  10. My feelings for Ubuntu as an OS are non-existent at best (I tried for a week, it just wasn’t for me). But I completely agree with the fact that we (the OSS developers) should think of ourselves as one community (at least on a technical level). We want to solve problems, we want to have systems that match our needs. 90% of work can be shared, and I would say apart from that one snarky comment on Red Hat bugzilla, everything was done with a professional attitude. I mean…Eric Sandeen closed your bug as UPSTREAM and then upstream fixed the bug (or not?). I must say that I didn’t really understand this though:

    > And it seems to be these actions that everyone has jumped on.

    Where did this happen? It’s not on the RH bugzilla, so I guess elsewhere.

    Anyway…thanks for taking the time to file the bug and trying to get all of our systems better

    Comment by Stanislav Ochotnicky — 5/17/2010 @ 11:43 pm

  11. Sadly enough, I’ve experienced the exact same thing a number of times with the LKML and X.org developers. In both cases, they keep on finding it easier to dismiss the issue than admit that a regression happened at commit NNNN. It’s one of those things that keep on tempting me to completely drop Free Software and just buy myself a Mac.

    Comment by Martin-Éric — 5/18/2010 @ 1:34 am

  12. I don’t see what’s so hard to understand about the point of Ted’s rant.

    “Instead, I continue to assume we’re all on the same team, fighting a philosophical battle against close-source software.”

    What Ted was saying is that Canonical, while pretending to be the big star of the team, is actually not pulling its weight on the playing field at all, and buying Canonical to play on your major league team is a bad idea.

    Comment by Michael Goetze — 5/18/2010 @ 5:47 am

  13. Well the criticism is misplaced at best, the point could be perfectly made for the btrfs. Everyone brags about “maybe btrfs for ubuntu”
    [http://www.netsplit.com/2010/05/14/btrfs-by-default-in-maverick/]. And here is the plan from launchpad:

    “There’s a patch in legacy grub that looks like it could be coerced into grub2 or perhaps support will land in grub2 in the next few months and we pick this up for free.”

    [https://blueprints.edge.launchpad.net/ubuntu/+spec/foundations-m-btrfs-support]

    You’ll have to admit that this does not inspire the urge to help you out on this.

    Comment by . — 5/18/2010 @ 7:02 am

  14. Hey Kees ;)

    Well, quite a tempest in a teapot I guess. In retrospect, I may have been slightly over-snarky with my closing comment, but by and large it was factual… My inbox dinged with the kernel.org bug and the RH bugzilla bug at about the same time, and I went back and read the launchpad saga, including Ted’s … rant? TBH that kind of set the stage for my feeling like the bug was filed not so much as a helpful collaborative heads up, but an attempt to find help somewhere.

    But, there was no need for me to make assumptions about motivations, so I apologize for that. I do hope that Canonical gets more proactive with bringing issues like this to the upstream developer audience, though, this wasn’t the first time that a filesystem bug has percolated for quite a while on Launchpad. The linux-ext4 list is a pretty friendly place, we’d be happy to have these issues brought to our attention sooner rather than later.

    Comment by Eric Sandeen — 5/18/2010 @ 7:14 am

  15. Yeah, knowing where to ask for help for this would have certainly saved me some time and hassle. I tend to be a bit shy to report stuff that is hard to reproduce, but I’ll just have to get over that. :)

    Comment by kees — 5/18/2010 @ 8:39 am

  16. @RichiH:
    Wrong on the CSD thing. Ubuntu’s not going to use CSD. Mark has clarified that CSD was the inspiration but will not be the implementation for Windicators.

    @Jeffrey:
    Btrfs by default? My understanding was that the decision was for Btrfs to be available on 10.10’s installer, but not for it to be the default.

    Comment by Mackenzie — 5/18/2010 @ 9:02 pm

  17. @Richard, I assume your comment:

    “developers being abrasive and not having the decency to simply say “sorry” (I don’t have the link to the bug report handy, but chances are you know what incident I mean)”

    was referring to me.

    Again, I’d echo Kees’ sentiments and ask what’s with the attitude? Why do you feel like you have to attack me in comments on a different developer’s blog?

    I’ve never had to say sorry for that incident, because the original reporter of the bug was never offended by it — in fact, the original reporter and I were chatting on Jabber at the time. It was only external people who decided to get offended after misunderstanding what they saw.

    I would say that the incident made me realise things and update my working practices, after all, we’re all human and we all make mistakes and we only improve ourselves if we learn from them.

    I’m not a nasty person in real life, but I discovered that dealing with 1,000 bug mails a day turned me into a nasty person. Especially since it’s not even my job to do that, I was only triaging bugs at that volume to help out our QA teams.

    It was affecting my ability to do my job, not least due to the amount of time I was taking up to do triage, but also due to the effect it had on my mood.

    So I’ve massively scaled back doing it; after a successful trial during the late Beta period of Lucid, I found I was much more productive if I simply let QA triage the bugs and assign them to me once they were ready to be fixed.

    Since this means I should be reading no more than 10 bugs or so a day, rather than 1,000; I should be a much nicer person ;-)

    Comment by Scott James Remnant — 5/19/2010 @ 3:36 am

  18. IMHO the only regrettable thing in this story is Fedora’s policy to close upstream-flagged bugs – thankfully Ubuntu’s policy is the opposite. The real waste of time is making other people redo the same tests just to find (again) that the bug is an upstream one.

    Comment by etrusco — 5/19/2010 @ 8:03 am

  19. @Mackenzie: Cool, thanks for the link.

    @Scott: I am sorry if you saw my comment as an attack. It was nothing more, but certainly not less than one fact in a list of other facts (well, I was thankfully wrong on CSD) to support my point.

    While your comment just now definitely helps to explain your behaviour and even though you claim the initial reporter was OK with your attitude, it was still extremely abrasive.
    Even though you may have been agitated at first, after having some time to cool down, you carried on with the same behaviour. You implying that everyone who disagreed with you was a nasty, untidy child was especially “classy”.
    Just now, I see the same general theme of “but I am the victim here”. I can’t say I agreed with you back then, nor do I do so now.

    In any case, I am happy to hear that your circumstances have improved (really, I am); I know how it is to work against an ever-mounting pile of work as a volunteer, so I can sympathize.

    For what it’s worth, I went to the trouble of finding out your IRC nick, vetting your general niceness with other Ubuntu people I know and then trying to contact you to talk about the whole episode in private. That failed; I assume that you did not want to have the umpteenth person talking to you about this whole thing, back then.

    In any case, if you want, we can take this conversation into /query on freenode; my nick is RichiH.

    Richard

    PS: I am not subscribed to this thread (there does not seem to be a way to do so). I will try to remember to refresh it as I did the last few times, but I can’t promise anything. If anyone wants to make sure I answer, poke me on IRC.

    Comment by RichiH — 5/19/2010 @ 3:23 pm

  20. Obviously, I missed the RSS feed. Subscribed.

    Comment by RichiH — 5/19/2010 @ 3:28 pm

  21. In response to the anonymous author of comment 13 (who posted much the same comment, again anonymously, on Scott’s blog; I’ve posted a copy of this comment there): the phrase you quote was a somewhat inaccurate transcription of a face-to-face conversation which didn’t reflect the conclusions we arrived at. In the process of drafting that specification properly, I rewrote it to reflect reality. We have in fact assigned a developer to add btrfs support to GRUB 2 – I’ve already done a fair bit of upstream work on GRUB 2 and mentioned this assignment to the current primary maintainer on IRC, so he’s aware of it.

    The background here is that a Red Hat employee posted a fairly complete-looking patch for btrfs support against GRUB Legacy to grub-devel. It’s great that they sent this, but unfortunately GRUB Legacy no longer has an upstream to speak of and so the GRUB 2 maintainers asked if it could be ported to GRUB 2 instead. I never saw a public response to that (and of course RH haven’t switched to GRUB 2 yet, so it may not be too high on their priority list) – there may have been something privately that never reached the list. Certainly, nobody has yet done this and we expect that if we want it to happen in GRUB 2 then we’ll need to do it ourselves. Things are complicated somewhat because the patch includes a copy of btrfs.h, which is GPLv2-only while GRUB is GPLv3; GRUB upstream has asked for an exception to be made, but I gather at second hand that this is tied up in Oracle’s legal department. It’s unfortunately possible that we will need to clean-room this in order to comply with everyone’s licences, although the ideal situation would be that we could simply port the already-working patch.

    I’ve already posted a patch for upstream review which at least fixes up grub-probe to cope with the way btrfs returns a virtual device number in st_dev, so that you can run btrfs / with a /boot that GRUB supports. I expect to be able to get this reviewed pretty soon, but I would hope we’d get rather further than that.

    And, for what it’s worth, while we’ve been somewhat flippantly talking about using btrfs by default for 10.10, the odds are quite heavily against that being the right thing to do. Nevertheless, it offers some interesting possibilities and we’d like to explore them sooner rather than later, so there’s no harm in setting high goals so that we have the incentive to get the remaining bits of our userspace sorted out. I assume Red Hat ported GRUB Legacy to btrfs for similar reasons.

    Comment by Colin Watson — 5/22/2010 @ 1:35 pm
