“There’s no positive consequence to taking risk in government”

Nicely put:

“Okay, I’m going to just keep my head down and do what I’ve always done, because then I can be the most productive.” So, from the outside, after it runs for many, many years, it gets really broken. And part of that, I think is because it’s not only a lack of accountability. There’s also a lack of reward system for taking any risks. There’s only a negative consequence to taking risk. There’s no positive consequence to taking risk in government.”

See? Not only in the Greek public sector.

On ansible and the script module

Ansible offers the convenience of running scripts on remote servers. But as the documentation notes:

It is usually preferable to write Ansible modules than pushing scripts. Convert your script to an Ansible module for bonus points!

There is a reason for this. Usually you have ansible run a script on your behalf when what you want to do is not achievable via a module or some combination of modules in a playbook. In extreme circumstances you will need to run a script via ansible when the receiving computer has no Python installed.

But there is a problem with running scripts this way: They are opaque.

A playbook that is applied to your machines is actually a model of that part of the machines that you want to manage. And ansible is your sensor that deals with the situation when things go sour.

It is very easy to write a script that does one thing well on one machine and does not check for failure. Now apply this to 100 or 500 machines that are similar, yet have subtle differences between them. Can you imagine how much rewriting your script needs in order to account for all the corner cases? And if you make it bullet-proof, congratulations! You are halfway to building your own incompatible version of ansible.

Having said that, I am guilty of running scripts instead of describing the work to be done in a playbook. This mostly involves stuff that needs to be executed from a login shell (hello rvm!), which means the script begins with #!/bin/bash. However, to exercise better control in such situations, I do not run more than one command, plus checks for its return code, in each script. This breaks the script down into many smaller ones, but gives me a better view when something goes wrong, because instead of one script directive my playbooks have five or six in a row.

You may not have described an accurate model of what you want to do using the playbook’s markup, but at least the name: directive of every single task is accurate enough to tell you what is executing. The alternative is one larger script where you wait to see whether it succeeded and, if it did not, try to find out from which point exactly to roll back (if rolling back is possible).

So the new rule is:

When pushing a script through ansible, it should execute one command only plus any checks needed for return status.
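
A minimal sketch of what one such script could look like. The helper name run_step and the placeholder command are assumptions made for illustration; neither is part of ansible:

```shell
#!/bin/bash
# Sketch only: run_step is a hypothetical helper, not part of ansible.
# It runs exactly one command and hands back that command's exit status,
# so the script task fails as soon as the command fails.
run_step() {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        # Report which command failed and with what status.
        echo "step failed (rc=$rc): $*" >&2
    fi
    return "$rc"
}

# The single command this script is responsible for (placeholder).
run_step true
```

Because the call to run_step is the last thing the script does, its return code becomes the exit status of the script, which is what the task named in the playbook gets to see.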

The Last Sysadmin

Nothing coherent today, just three excerpts from an article, an interview and a book that were published years apart, yet I find them somehow connected in my mind. From “Electrical Engineering — A Diminishing Role?”:

“Projecting the current trends, future computers will consist of a single chip. No one will have the foggiest idea what is on that chip. Somewhere in the basement of Intel or its successor will be a huge computer file with the listing of that chip. The last electrical engineer will sit beside the file, handcuffed to the disk drive like a scene out of “Ben Hur.” That engineer will be extremely well paid, and his or her every demand will be immediately satisfied. That engineer will be the last keeper of the secret of the universe: E = IR.”

Ever since I first read it, I always thought it was talking about The Last System Administrator.

The next piece comes from an interview that Raspberry Pi creator Eben Upton gave to the IEEE Techwise Conversations podcast:

“I think we’ve had a reduction from, say, if you think about 1995, which was when I went to college, you could typically rely on an undergraduate having done a substantial amount of real programming, often quite a deep level of technical work on one or more platforms. Many of us could program in one or more assembly languages. And yeah, within 10 years of that point, we were getting to a point where your average applicant was maybe somebody who’d done, as you say, a little bit of Web design, maybe a little bit of Web programming—you know, we saw quite a bit of people who‘d maybe done some PHP but not that kind of deep technical understanding of how machines work.”

And the last piece comes from “Flash Boys”:

“Russians had a reputation for being the best programmers on Wall Street [… because in Russia … ] they had been forced to learn to program computers without the luxury of endless computer time.”

Stuff to think about now that your data center has been reduced to a tab in your browser.

Oh how I love the Good Regulator Theorem

Every now and then I like finding links between Cybernetics (or Systems Dynamics, or Systems Thinking, pick your favorite variation) and System Administration. I am not the only one doing this. For example, Matt Simmons has written about how System Administrators act as homeostasis mechanisms for the systems they manage. And a few minutes ago this slide came my way:

Rule #4: Monitoring systems need to be more available and scalable than the systems being monitored

which reminded me of a theorem and a law as applied in the monitoring systems domain. The Good Regulator Theorem states that every good regulator of a system must be a model of that system. You provide the monitoring system with a model of what you need to monitor in its appropriate DSL or clickware. The more precise the model, the better the monitor.

The rule in the slide ties in closely with the law of requisite variety, where the variety in the control system must be equal to or larger than the variety of the perturbations in order to achieve control. Think about it: at the very least, the downtime of your monitoring system needs to be significantly less than that of the system being monitored. Otherwise what exactly are you seeing? Think of Nyquist-Shannon sampling here. Or as John Gall has put it in The Systems Bible, a system is no better than its sensory organs.

Is it practical to make these observations? For the everyday job, not really, but when I find such connections between “obscure” theory (obscure for admins) and system administration, I always smile :)

PS: @adrianco during the discussion left another piece of advice:

on a later slide I said best to use two independent monitoring systems then they can watch each other.

Weber–Fechner Law

This will be yet another unfinished post. Work has piled up, good posts need care, and I have not been giving them that for a while now. But it is better to put something half-done out there than to forget it in the cupboard. Someone may even get annoyed and force you to write more.

So, watching the end of the talk by Dimitris Achlioptas that made the rounds on the web, his diagram reminded me of the Weber-Fechner Law. Now where had I heard of it before? Ah, here.

I read quite a few comments on what he says. Most of them were out of context, in my opinion. One of the reasons is that it was presented as a talk, while it is obvious that it is the last 15 minutes of a lecture and that Achlioptas has lost his patience. Whatever disagreements I have with what he says are minor (e.g. not everyone does seven-hour interviews, and not always). I will dwell on the graph he presented, though. I read that many people say that what he did is not science and that he strayed outside his field. So what? He used a graph to show what he believes. Yes, he said “proof”, but sorry, he was not explaining whether P=NP. And after all, what he said is not far from the graph of the Weber-Fechner Law (which is also based on empirical observations).

Pain is a signal, and so is happiness.


How legacy systems die

The news server that a friend was maintaining had stopped responding for a while. Leafnode complained with:

warning: HOSTNAME: cannot resolve host name: Name or service not known

So I emailed my friend and asked. It seems that the hardware of HOSTNAME had failed, and that after a month or so I was the only person who wondered about its fate. No plans to put it back online exist.

And with that I realized that the USENET is now dead for me.

Parasites

Since Theodorakis and Varoufakis believe that the country’s problem is its parasites, I remembered the opening of a text by Schneier:

“My big idea is a big question. Every cooperative system contains parasites. How do we ensure that society’s parasites don’t destroy society’s systems?

It’s all about trust, really. Not the intimate trust we have in our close friends and relatives, but the more impersonal trust we have in the various people and systems we interact with in society. I trust airline pilots, hotel clerks, ATMs, restaurant kitchens, and the company that built the computer I’m writing this short essay on. I trust that they have acted and will act in the ways I expect them to. This type of trust is more a matter of consistency or predictability than of intimacy.”

Oddly enough, I agree with Varoufakis’s definitions of parasitism. I have ever since I stumbled, by chance, upon the definition of transactional leadership on Google Books.

“It was born outside the old system of power and is trying to build a new proposal, for a new country.”

And as it grows it will change. Necessarily. And inevitably the Iron Law of Bureaucracy will turn it into more of the same. Or it will have to dissolve itself in order to escape.

[ * Random mode on; it is late and I cannot sleep ]

ping in Ansible playbooks

The ping module documentation says that it does not make sense in playbooks and is useful only from /usr/bin/ansible. Well, I think there is a case where you can include it in a playbook, and that is when you disable fact gathering. I really want to know if there is something wrong with connecting to a server before I start executing the whole playbook, rather than be left with a half-played one to redo. So, at least for the number of hosts that I apply this to, it does not hurt to have this as the first task:

---
- hosts: whatever
  user: whoever
  gather_facts: no
  tasks:
  - name: ping all hosts
    ping:

The fact gathering phase implicitly runs the setup module. If your play does not make use of facts, you may want to disable it and use ping instead, just to check that ansible can reach the hosts over ssh before feeding it work to do.

Re: on becoming a sysadmin

Given the season, let us write a success story too.

Just before the end of 2007, a student (let us call him Nikos) from the University of Piraeus, who knew me only through my blog, sent me an email. His email boils down to one question:

– What do I need in order to become a system administrator?

I published a somewhat longer answer than the one I sent him on New Year’s Day 2008. Today Nikos works as a system administrator at a large organization, manages more machines, users and complex environments than I do, and I would say that, professionally at least, he is on solid footing.

Well done, my friend.

Are all the servers running the latest version? Ansible to the rescue

Beyond a certain number of servers, it is impossible to remember whether they are all current or not, or even to check a documentation wiki page to find out. So how can one use ansible to find the answer? The setup module enters the room. Assuming an all-Debian installation, one could run:

ansible debian-machines -m setup --tree /tmp/inventory
cd /tmp/inventory
grep ansible_distribution_version * | grep -v '7\.2'

This will list Debian machines not running 7.2 (Wheezy). You can build more complex versions of the above to match your infrastructure.
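
One possible variation, as a sketch only: instead of filtering matching lines, ask grep for the files that do not contain the wanted version at all. The function name list_outdated is hypothetical. The sketch assumes that each file written by --tree is named after its host and contains the fact as "ansible_distribution_version": "7.2" with exactly one space after the colon, and that your grep supports -L (GNU and BSD grep do).

```shell
#!/bin/bash
# Sketch only: list_outdated is a hypothetical helper, not an ansible command.
# Assumes DIR holds one JSON facts file per host, named after the host,
# as written by "ansible ... -m setup --tree DIR".
list_outdated() {
    local dir=$1 wanted=$2
    # Fixed-string pattern (-F), so the dot in "7.2" is matched literally.
    local pattern="\"ansible_distribution_version\": \"$wanted\""
    # grep -L prints the names of the files that contain no match.
    ( cd "$dir" && grep -LF -- "$pattern" * )
}

# Example: list_outdated /tmp/inventory 7.2
```

The output is one host name per line, which is easier to feed into another command than the grep lines of the pipeline above.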

PS: Many thanks to @laserllama and @jpmens.