Run Lola, run

“Let me tell you the story of a company that’s on the verge of closing because of n-tier complexity, application server requirements and all that mumbo jumbo,” said a friend.

That company’s current client is a major public service institution. That institution has a set of complex policies designed, oh, by consultants whose employing firm was of course heavily paid to customize current “best practices” to secure the operating environment, and to make it use every buzzword going around, for it had to be modern. So when said company tried to deliver the software it had a contract for, debugging it was impossible, for they could not get any kind of access to the deployment systems. These were run not by the customer but, oh, by another consulting firm that was obliged to follow the rules set by the first one.

The governance of the above scheme looks good on paper, doesn’t it? At least I cannot deny it is a job creator for the consulting firms at the expense of those who want to do actual work.

Which brings me to the elitist question that I am going to fire off the next time I am lectured about Enterprise Architectures: “Have you personally implemented such a system? You, not someone you directed, you! Show me how, NOW!” I’ve grown tired of people offering their paid opinion on IT systems that will supposedly improve everything, when in fact the only systems work they have done is restoring their laptop’s Windows installation.

I’ve grown tired of people who prove the laws of Systemantics right with their ambitious, unworkable designs, namely:

A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

whereas in fact we know that when systems do work, it is because:

A complex system that works is invariably found to have evolved from a simple system that works.

But I guess in IT we are big fans of Rube Goldberg machines.

10 minutes

[ I have the impression that I have written the comment below before. I read this and remembered it, but I searched the blog and could not find it, so here goes: ]

One of the things I learned from sage-members is that it is a good idea to take your coffee and walk around the offices for ten minutes. The users see that “you are there”, and you pick up problems that may exist but that you have not heard about yet (which means you would otherwise hear about them once they have grown and blown up). Perhaps the most valuable 10 minutes of the day. They certainly save you from a lot of telephone interruptions.

The truth is that I do not do it every day anymore.

[ I had written it after all: Ου γαρ έρχεται μόνον ]

Having monit complement ansible

Here is a weird thing:

When running /etc/init.d/milter-greylist restart via ansible (either directly or via a playbook), it hangs. I had no time to debug this, so I resorted to the next best workaround, since the machine was already running monit:

Have ansible distribute greylist.conf and then have monit restart the process. So here is a simple playbook:

- hosts: greylist
  user: root
  tasks:
  - name: copy local milter-greylist configuration to hosts
    action: copy src=/usr/local/etc/greylist.conf dest=/etc/milter-greylist/greylist.conf

and here is how monit finishes the task:

check file greylist.conf with path /etc/milter-greylist/greylist.conf
 if changed checksum then exec "/etc/init.d/milter-greylist restart"

Of course this is just a simple case of having the two cooperate. But once you get the hang of it, more elaborate schemes where ansible and monit can cooperate will pop out.
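
To make the monit half of this less of a black box, here is a minimal shell sketch of what a checksum-triggered restart amounts to. It is not how monit is implemented: checksum_changed and the state file path are hypothetical names made up for illustration, cksum merely stands in for whatever hash monit really uses, and treating the very first check as "no change" is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helper (not part of monit or ansible).
# Usage: checksum_changed FILE STATE
#   exit 0 -> FILE's checksum differs from the one recorded in STATE
#   exit 1 -> no change, or first check (nothing recorded yet; assumption)
#   exit 2 -> FILE could not be read
checksum_changed() {
    file=$1
    state=$2
    # Checksum of the current contents; bail out if the file is unreadable.
    new=$(cksum < "$file") || return 2
    if [ ! -f "$state" ]; then
        # First check: just record the checksum and report "no change".
        printf '%s\n' "$new" > "$state"
        return 1
    fi
    old=$(cat "$state")
    # Remember the new checksum for the next check.
    printf '%s\n' "$new" > "$state"
    [ "$new" != "$old" ]
}

# Example use (the state file path is an assumption):
#   if checksum_changed /etc/milter-greylist/greylist.conf /var/tmp/greylist.cksum; then
#       /etc/init.d/milter-greylist restart
#   fi
```

Run periodically, this gives the same division of labour as above: ansible only has to drop the new file in place, and whatever notices the changed checksum does the restart locally.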

What Engineers Do Not Learn

The presentation below popped up in my stream thanks to @flowchainsensei:

The Missing Basics: What Engineers Don't Learn and Why They Don't Learn It from iFoundry

The title is clearly inspired by the book “What Engineers Know and How They Know It” (it even stars in slide 21). Goldberg presents 7 basics that most engineers have trouble dealing with. I have to say that reading those 7 slides hit home. My instant thoughts after every slide follow:

Inability to ask: which reminded me of ESR’s “How to ask questions the smart way”.
Inability to label: Terminology is a problem. So many times we see that the same word means different things to different people. As Goldberg says, we’re linguistically naive and this is a problem.
Inability to model: Which reminded me of George Box’s “All models are wrong but some are useful”. All too often we take the (ill-devised) model as the reality and expect the real-life situation to behave like the model.
Inability to decompose: Which reminded me of stuff that I am reading in the first chapter of “An introduction to General Systems Thinking”.
Inability to measure: I’ll leave it without comment. I’ve had two measuring courses at NTUA. I returned to them after graduating. Thanks to a not so inspiring professor I realised the importance of measurement only after bosses asked about numbers.
Inability to draw / visualise: I can’t really say much here. I always dump my thoughts on paper now, but for years I did not. And I think the very first time I thought about that was when reading “Time Management for System Administrators”. But honestly the first thing I thought about when reading this slide was “Why a Diagram is (Sometimes) Worth Ten Thousand Words”.
Inability to communicate: So what do you prefer? Write a report for upper management on a project done, or work on the next cool project?

For the finale I keep Goldberg’s remark that:

Companies do not pay $8500 for plugging in Newton’s Laws.

That is because when I was complaining to a TA that we were not actually learning anything to do with the work we would be doing “outside”, he countered with:

– “But you know Math. The others”, meaning those coming from a technical education background, “do not”.

Yeah, right, but you know what? You can be able to solve equations and still not be able to model the problem properly. And that is the problem with most of the Math that we have been taught.

GanetiCon 2013

As I am writing these lines the first GanetiCon is running its final sessions on the design and future developments of Ganeti. It has been a wonderful three-day workshop with the great assistance of Google, Skroutz (who provided the venue of the conference) and GRNET (makers of synnefo).

While some familiarity with Ganeti was expected in order to truly gain from the workshop, even people with only a remote knowledge of how it works could follow the proceedings. This was mostly because of the way Guido Trotter ran the design discussions and @apoikos’s comments during those. Even though this was more of a developers’ workshop, it is a pity that not many “plain” system administrators attended. You do not often have the opportunity to meet people who write the software tools that you use. You do not often get the opportunity to discover how they think and what shapes their decisions. And you do not often have the opportunity to see this happening for software that demands scale.

Congratulations to all who took the stage to present their work (and IMHO especially to good friends @apoikos and @kargig) and to all those who made the effort to attend.

I’ll leave you all with an idea: Hold the second GanetiCon at the same venue!

→ All presentations from GanetiCon 2013

Management by wandering around in the cloud?

I first heard “Management by wandering around” years ago when I read this article about the Microsoft Windows 2000 team:

“Decisions in 10 minutes or less, or the next one is free.” He wandered the halls and asked people, “What is a decision? It’s a tool to remove confusion! Are you confused? If so, then make the decision and let’s move on!”

I practice this as a System Administrator, armed with coffee and strolling around colleagues’ desks in order to find out what is going wrong that they are reluctant to communicate by any means other than over coffee. It also makes your users understand that you care. And it helps them let off some pressure when you have made urgent changes that you did not have time to communicate. Most importantly, you remove obstacles from their work when they are present. So I walk around offices with a coffee mug. Which raises the question:

– Now that teams, even small ones, are becoming more and more decentralised and operate from different time zones, what is the equivalent of management by wandering around? Both for project and systems managers?

Skype calls and hangouts are good for scheduled events but do not quite help the same way as wandering around and solving problems for others.

I am open to ideas.

“It is my job to move mountains”

I used to find it hard to explain to a layman what a system administrator (or a DevOp these days) does. But recently we had to deal with some bureaucracy over kid[1]’s schooling and I responded to one of the officials:

– It is my job to move mountains

Parenting can teach you a lot about your work in the most unexpected ways.

Happy SysAdminDay

If you’re looking at this page now, it is because of your SysAdmin. Go hug him today. It is SysAdminDay.

On a system’s purpose

From this book:

If infecting the patients is one of the things the system does, the uncomfortable and unpalatable truth is that this is one of the system’s purposes. It may not be the intention of anyone working in the system, and it is certainly not something that anyone in their right mind would consider putting in a mission statement but nevertheless in reality it is a purpose of the system as a whole, because that’s one of the things it does.

Far too many people forget that once a system is set in motion its purpose is what the system does, not what it was funded to do. And quite often they mistake what they want the system to do for the system’s purpose, when in fact what the system does is its purpose. Good systems converge to the intended purpose.

[Thanks to @SystemsFunking]

ansible and yum check-update

When calling yum check-update from an ansible playbook, your play may fail because check-update can return a non-zero exit status for two different reasons:

check-update […] returns exit value of 100 if there are packages available for an update. Also returns a list of the packages to be updated in list format. Returns 0 if no packages are available for update. Returns 1 if an error occurred.

One quick and dirty way to bypass this is to use ignore_errors: yes in your task, but this will ignore both the case of pending updates and any other kind of error, and your play will continue regardless. To avoid this, one can modify the play slightly to check for the exit status:

  - name: yum check-update
    shell: 'yum check-update || test $? -eq 100'

The single quotes in the shell command above do matter.
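
The exit-status logic can be tried out without yum at all. The sketch below is only an illustration: fake_check_update is a hypothetical stand-in that exits with whatever status it is given, and check_update_ok wraps it in the same || test $? -eq 100 idiom used in the task above.

```shell
#!/bin/sh
# Hypothetical stand-in for `yum check-update`: it simply exits with the
# status passed as its first argument (0, 100 or 1 in the cases below).
fake_check_update() {
    return "$1"
}

# Succeeds when the command exits 0 (no updates) or 100 (updates pending);
# any other exit status is still reported as a failure.
check_update_ok() {
    fake_check_update "$1" || test $? -eq 100
}

for status in 0 100 1; do
    if check_update_ok "$status"; then
        echo "exit $status -> ok"
    else
        echo "exit $status -> failed"
    fi
done
```

This prints "ok" for 0 and 100 and "failed" for 1, which is the point of the idiom: pending updates no longer fail the play, while a real error still does.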