The “oil lamp” law

[ Originally a Facebook post, copied here for posterity. ]

Some thirty years ago I was told the story of a server with an oil lamp by its side (the kind that Greek Orthodox people light to honor God and the Saints). It was put there in jest: that server was not to break under any circumstance.

Well, it has been my experience over many years, across sectors and shops of different sizes, that no matter what, there is always at least one key system in the organization that “needs” an oil lamp by its side. A system critical enough to warrant all the attention it gets, yet so critical that nobody risks upgrading, changing, or phasing it out during their tenure (the system is guaranteed to outlive them; I count three such systems that have outlived me). Untouchable systems that get replaced only when they physically die.

Seek out who needs an oil lamp. Plan accordingly.

[ There’s another “law” that follows as a result of the oil-lamp, but maybe for another time. ]


Achilles Voliotis writes:

The lamp was placed (without oil) on the VAX-750 as a joke by Spyros Potamianos, while I was the admin in softlab. Since the VAX was running ULTRIX, DEC’s version of Unix (initially; later we installed 4.2 BSD), he had also put next to it an empty bottle of the well-known shampoo ULTREX, its E changed to an I with a marker.
The VAX-750 consisted of two “boxes” (~30U). The lamp sat on the disk drive (170 MB, “removable”) and became famous through a strange story, the kind that makes you turn superstitious or start believing in Murphy’s laws.
During a service visit, the DEC technician saw the lamp on the “disk drive” and started joking (“you can’t be serious”, what does a lamp have to do with computers, and so on). Then, with a theatrical move, he gave the lamp a smack and knocked it down.
In a satanic coincidence, almost at the same moment the “disk drive” (actually a cabinet weighing several hundred kilograms) made some strange noises and stopped working!
What followed was quite an adventure, owing to a painful history between DEC and its representative office for Greece and Cyprus (DCC). In short, for about two months a team came to the softlab every day or two, never including two Greeks (French + German, Greek + German, Italian + Greek, and so on). They disassembled the device into its parts and replaced all of them (some more than once), always with the same result: the “disc” would spin, sometimes work for four or five minutes, and in the end fail with the same strange noise.
At some point, after the two months of fruitless repairs, a team arrived with two Greeks and no foreigners. They opened the lid, changed the settings on a DIP switch, closed the lid and... the disk started working normally!
After this adventure the lamp went back onto the disk pack and stayed there for a very long time.

on brilliant assholes

[ yet another meaningless micropost ]

From time to time people read the autobiography or memoirs of a highly successful individual at their peak. Fascinated by their success and seeking to improve their own status, these followers* copy the behavior they read about; interestingly, the copying comes even easier when that behavior is assholic and abusive. Someone with a background in psychology would have more to say here, I’m sure. In my “communication radius” this is very easy to observe with people who want to copy successful sports coaches, and you can see the pattern crossing over to other work domains too.

It is not easy to understand that someone can be an asshole whose brilliance makes them somewhat tolerable to their immediate workplace, while the other way round does not hold: assholic behavior does not generate brilliance. Solutions do.

If you think you’re brilliant, just pick a hard problem and solve it. I know, it’s …hard.

[*] Leadership programs and books create followers, not leaders.

Work expands to fill the time available for its completion.

This is also known as Parkinson’s Law. But earlier today I dug up from the Internet Archive this gem of a post (you should read it all, really):

Parkinson inferred this effect from two central principles governing the behavior of bureaucrats:

1. Officials want to multiply subordinates, not rivals.
2. Officials make work for one another.

Just a note when I seek to understand big organization behavior.

Sometimes n2n is good enough

You are not always in a position to use one of the big five cloud providers (BTW, I think Watson’s prediction was kind of true; serverless is making sure of that).

So when you’re working in public cloud environments other than the usual ones, you sometimes miss features that are elsewhere a given, like a VPC. A VPC is a pretty cool abstraction that gives you an isolated network of machines (and sometimes services) within your cloud provider and allows for easier management of things like security groups, routing traffic between machines, and the like.

So what do you do when you do not have a VPC available? You need some kind of overlay networking. When deploying Kubernetes, for example, you deploy an overlay network (there are many solutions to choose from) and let it deal with iptables and routing hell. But you may need to temporarily scale services that are not container orchestrated, for whatever reason (I, for example, abhor running databases on Kubernetes). Still, you may need an autoscaling solution like the one EC2 offers. IPSec would be a cool solution, but deploying it at my current workplace would be too complex. Something simpler was needed, and I found it here, despite the shortcomings reported: N2N from ntop.org.

N2N was in a development hiatus, but is now back under active development. It utilizes TUN/TAP and allows you to build a VPC over the interface that the client you run on each machine creates. It comprises two components: a supernode, which is actually a directory server that informs the members of the (let’s call it) VPC of the actual IP addresses of the other members, and a client program (called edge) that creates the interface on each VM, contacts the supernode to register with it, and queries it for the needed routing information. The supernode itself does not route any packets. It used to be a single point of failure, but current versions of N2N/edge support two supernodes, so your network is in a better position.
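A minimal sketch of such a setup (the hostname, port, community name, and key below are placeholders of my choosing; check the flags against your N2N version):

```shell
# On one VM reachable by all members: start the directory service.
# 7654 is an arbitrary UDP port.
supernode -l 7654

# On every VM that should join the overlay:
# -c names the "community" (our VPC), -k is the shared secret,
# -a assigns this node's address on the newly created interface.
edge -l supernode.example.com:7654 -c myvpc -k s3cret -a 10.1.0.10
```

Every edge registers with the supernode, learns where its peers really are, and from then on talks to them directly over the overlay interface.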

In my case I needed to autoscale a certain service for a limited amount of time, but had no prior knowledge of the IP addresses of the VMs that were to be created. So I had them spun up from an image that started edge, registered with the supernode, and then routed a certain kind of traffic through a node that was also registered on the same network and acted as a NAT server. Hence, I simplified some of the iptables hell I was dealing with until we deploy a better solution.

N2N supports encrypted traffic, and requires the equivalent of a username / password combination (common to all machines that are members of the VPC, but not known to the supernode a priori).

So where else might you want to use N2N? Maybe you need a common IP address space between two cloud providers? You may be in a cloud provider that allows for VPCs but does not make it easy to route traffic from a VPC in one region to a VPC in another? Or in cases where solutions like DMVPN are expensive and your own BGP setup is overkill? Stuff like that.

So how do the machines acquire an address in that VPC? You have two choices: (a) DHCP and, if that is not working, (b) a static address. In the second case you need to implement a poor man’s DHCP by having each machine assign an IP address to itself with a low probability of collision. To this end, let’s assign a /16 to that VPC and put an entry like the following in /etc/rc.local (yes, I am still a fan of rc.local for limited usage):

edge -l supernode:port -c community-string -k password-string -s 255.255.0.0 -a static:172.31.$(shuf -i 1-251 -n 1).$(shuf -i 1-251 -n 1)

The probability of any two given instances colliding is 1/63001, so you get to decide at how many temporary instances you need a better hack than this. Plus you get to use 172.31.1.0-255 for static machinery within the VPC (a NAT gateway, for example).
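Since each instance picks its address independently, the chance that any two out of many instances collide follows the birthday problem. A quick back-of-the-envelope calculation (plain Python, nothing N2N-specific):

```python
# Probability that at least two of n instances pick the same address
# when each picks uniformly from the 251 * 251 = 63001 possibilities.
def collision_probability(n, space=251 * 251):
    p_all_unique = 1.0
    for i in range(n):
        p_all_unique *= (space - i) / space
    return 1 - p_all_unique

for n in (2, 10, 50, 100):
    print(n, collision_probability(n))
```

For two instances this gives the 1/63001 above, but at around 50 temporary instances the probability of some collision is already about 2%, and at 100 it passes 7%. That is roughly where you would want the better hack.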

Not a perfect solution, but definitely an easy, fast one.

In sed matching \d might not be what you would expect

A friend asked me the other day whether a certain “search and replace” operation over a credit card number could be done with sed: Given a number like 5105 1051 0510 5100, replace the first three components with something and leave the last one intact.

So my first take on this was:

# echo 5105 1051 0510 5100 | sed -e 's/^\([0-9]\{4\} \)\{3\}/lala /'
lala 5100

which works, but is not very legible. So here is a version taking advantage of the -r flag, if your modern sed supports it:

# echo 5105 1051 0510 5100 | sed -re 's/^([[:digit:]]{4} ){3}/lala /' 
lala 5100

So my friend asked, why not use \d instead of [[:digit:]] (or even [0-9])?

# echo 5105 1051 0510 5100 | sed -re 's/^(\d{4} ){3}/lala /' 
5105 1051 0510 5100

Why does this not work? Because, as is pointed out in the manual:

In addition, this version of sed supports several escape characters (some of which are multi-character) to insert non-printable characters in scripts (\a, \c, \d, \o, \r, \t, \v, \x). These can cause similar problems with scripts written for other seds.

There. I guess that is why I still do not make much use of the -r flag and prefer to escape parentheses when doing matches in sed.

Happy sysadmin day

Hello and happy SysAdmin day. The baby swing below is from 2011. While at rest it looks like a safe swing, it is not: the chains latch too close to the middle, and it is very easy for the seat to revolve around a second, horizontal axis while swinging. You can understand how I know.

It is SysAdmin day today. We make sure the chains latch properly so your software runs without extra revolutions.

You’re welcome :)

baby swing

Your first steps installing Graylog

A new colleague needed some help setting up a Graylog installation; he had never done this before, so he asked for assistance. What follows is a rehash of an email I sent him on how to proceed and build knowledge on the subject:

So initially I had zero knowledge of Graylog. What I did to familiarize myself with it was to download an OVA file with a prepared virtual machine and run it via VMware Fusion. The same VM can also be imported into VirtualBox and even into AWS, although they also provide ready-made AMIs for deployment in AWS.
Keep in mind that this is a full installation of everything Graylog needs to work, and it also comes with a handy little script named “graylog-ctl” that manipulates a lot of the configuration. The big catch is that graylog-ctl is not part of any standard Graylog deployment; it only comes with the OVA and the AMI images.
So after I had some fun with it on a VM on my workstation, reading the documentation and testing stuff, I had an initial deployment of the AMI image in AWS. But this is not an installation that can scale.  Which brings us to the next steps:
For Graylog to work you need to provide it with a MongoDB and an ElasticSearch database. It is your choice whether these will be clustered for high availability or not, and whether they will run on the same machine or not; you control the complete architecture. In my case I made the following decisions:
  • I am running a MongoDB replica set using three VMs. This is a standard setup as it is described in the MongoDB online documentation. Since it is not password protected, it only accepts connections from the Graylog instance. I used AWS security groups for that.
  • I am using an ElasticSearch cluster of three VMs where the nodes are both data and master nodes. If you can, use seven nodes: three masters (smaller machines, since they do not run queries and do not index any data) and four data nodes (higher-end machines). Again, since this is not password protected, I used AWS security groups to allow access only from the Graylog instance.
  • I am running a single Graylog instance on a separate VM. Currently it only listens for syslog traffic. When the need arises, I will add two more nodes to increase availability. I think I changed as few as four or five lines in the main configuration file. Graylog uses MongoDB to store its configuration, which includes anything you configure via the web interface.
  • Pay extra attention to the versions of ElasticSearch and MongoDB that your Graylog version requires. Use exactly what is mentioned in the documentation. For example in my case I am not running ES 6.x but the latest 5.x.
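For reference, the handful of lines one typically ends up changing in Graylog’s server.conf look roughly like this (a sketch based on the Graylog 2.x configuration keys; the hostnames are placeholders, and key names may differ in your Graylog version):

```
password_secret = <long random string, e.g. from: pwgen -N 1 -s 96>
root_password_sha2 = <output of: echo -n yourpassword | sha256sum>
elasticsearch_hosts = http://es1:9200,http://es2:9200,http://es3:9200
mongodb_uri = mongodb://mongo1:27017,mongo2:27017,mongo3:27017/graylog?replicaSet=rs0
```

Everything else can usually stay at its default until you know why you are changing it.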
Now it is time to up your game. Once you see that your installation is working you have to decide whether to password protect access to MongoDB and ElasticSearch and whether to encrypt traffic between all those instances or not. I say give it a go.
I’ve not even touched issues like database management for Mongo and Elastic, backing them up, restoring, deleting indices, etc., because this post is about going from zero to your first week of testing Graylog. There is plenty of stuff out there to take you to the next level, once you get used to the complexity of the software involved.
Should you need any more help, ping me anytime.

Release It!

I finally got to read Release It!. It would have served me better had I read it ten years or so ago, when I actually bought it. But better late than never. A second edition came out this month; while the first edition clearly shows its age and carries relics of another time, I guess the second edition is more hip, fresh, and readable, with more experience to share.

Not a timeless book, but one that needs a refresh every decade or so. I tag it under “system administration” too, because a system is what you build that works and brings money in, not just what Ops runs and grumbles about.

The methods of system administration

Moses B. popped this into my timeline. Although the piece is about philosophy (about which I know nothing), I thought I could entertain myself by paraphrasing the key points a bit. I am blatantly plagiarizing:

People like me, who have been trying to do Ops (system administration by a new name) for more than twenty years, do in due course learn, if they’re lucky, how to do what they’ve been trying to do: that is, they do learn how to do Ops. But although I’ve learned how to do Ops, nobody ever told me how to do it, and, so far as I can guess, nobody will have told you how to do it either, or is likely to tell you in the future.

So back in the early 90s why was I not taught how to do Ops? In parallelism with the philosophy article, here are some ideas:

  • System administration was free of methods.
  • The methods were kept quiet, perhaps because people thought “the methods cannot be explicitly taught or said, but have to be shown via examples/exemplars”.
  • “those of us who have learned how to do it struggled so hard to get where we now are that…we think you, too, should suffer,” perhaps because “we’re now selfishly reluctant to give you some of the fruit of our struggle for free”.
  • “the methods of (early to middle) system administration were not unified”.
  • “it is a form of boundary policing. Under such a regime, outsiders are permanently mystified about the rules of the game because they don’t know what move to make to acquire standing”.

Let me see. When I started in this line of work there was no manual other than man and the technical documentation from the vendor (if you were lucky enough to have access to it). Humble posts to the USENET groups and to lists like decstation-managers also paid off.

People led by example. So their accumulated knowledge was not part of the technical documentation (I still find old vendor technical documentation far superior to today’s). Knowledge gained from experience was passed forward the same way: by experience. “Do that”, said the wizard in a stressful situation, and afterwards they would share the war story that led to it. You were then supposed to generalize from what you were shown.

But sometimes, being shown was not enough. You had to struggle. Because the wizard as an apprentice was forced to learn how to swim by themselves, they retained the same belief for the next generation, or even for a peer (if you’re an equal, you can do it). The initiation process would require you to struggle, to earn the knowledge, to be able to read the packets in the wire like being the network driver yourself. The wizard was one with the machine, why should you benefit from this and not become one too?

How would you administer your systems? Via SAM, SMIT, some other curses-based tool (if any), the shell, batch scripts doing rlogin? How? How would you do that in an era when infrastructure as code was not a thing and cfengine was a curiosity? How do you even do it now that you are hit by the paradox of choice? Did you ever hear of a book like this? If you did, did you read it?

Do you remember what a BOFH is? I know, because I was one. Before becoming one, I was told by the elder, then-thought-to-be-wise BOFH: “I do not understand why you’re asking this: either you know what to do, and need not ask, or you do not know what to do, and do not have to ask because you will not be the one doing it”.

We are priests. Either you know how to read our hieroglyphs and you’re in, or you’re out. A great sense of empowerment came out of this. We take your crap and place it in production. It breaks. We tell you why and you still do not have a clue. We’ve explained the issue clearly, but not in your language. Your fault.

I will go with the most charitable explanation: We do not know how to pass the craft’s knowledge and everything else is a defense so as not to fall in your eyes.

I’m sorry.

PS: And then, DevOps came along