Russ Cox on regular expressions

Thanks to Ozan S. Yigit, I found out about a three-article series by Russ Cox on regular expressions.

I knew about Russ Cox and his interest in regular expressions because of this link to a PDF copy of “Programming Techniques: Regular expression search algorithm” that I had found on his site. Somehow I had missed the articles. In Ozan’s words: “russ cox, like other top-notch cs people, takes a topic and nails it shut. these three papers are more valuable to me than any RE book”.

Yes, the articles are that good. And the good news does not stop there: Russ Cox has implemented RE2, a fast, safe, thread-friendly alternative to backtracking regular expression engines (like those used in PCRE, Perl, and Python), written in C++. It even comes with a POSIX (egrep) mode.
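To give a feeling for what "alternative to backtracking" means, here is a minimal sketch in Python of the general idea: instead of trying one path through the pattern and backing up on failure, keep the set of all positions in the pattern that could be active and advance that whole set one input character at a time. This is not RE2's code and does not claim to mirror it. The function names (`parse`, `closure`, `full_match`) and the tiny syntax (literals, `.`, and `*`, `+`, `?` applied to a single character) are assumptions made for the illustration.

```python
def parse(pattern):
    """Turn a pattern into a list of (char, quantifier) tokens.

    Assumed syntax for this sketch only: literal characters, '.' for any
    character, and postfix '*', '+', '?' on a single character.
    No groups, alternation, or escapes.
    """
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "*+?":
            raise ValueError("quantifier without a preceding character")
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in "*+?":
            quant = pattern[i + 1]
            i += 1
        if quant == "+":
            # x+ is the same as x followed by x*
            tokens.append((ch, ""))
            tokens.append((ch, "*"))
        else:
            tokens.append((ch, quant))
        i += 1
    return tokens


def closure(tokens, states):
    """Add every state reachable without consuming any input."""
    result = set()
    stack = list(states)
    while stack:
        s = stack.pop()
        if s in result:
            continue
        result.add(s)
        # A '*' or '?' token may be skipped entirely.
        if s < len(tokens) and tokens[s][1] in ("*", "?"):
            stack.append(s + 1)
    return result


def full_match(pattern, text):
    """Return True if the whole text matches the pattern."""
    tokens = parse(pattern)
    states = closure(tokens, {0})
    for c in text:
        following = set()
        for s in states:
            if s == len(tokens):
                continue  # already past the last token, nothing to consume
            ch, quant = tokens[s]
            if ch == "." or ch == c:
                # A '*' token stays where it is; anything else moves on.
                following.add(s if quant == "*" else s + 1)
        states = closure(tokens, following)
        if not states:
            return False
    return len(tokens) in states


if __name__ == "__main__":
    n = 25
    # Many optional a's followed by as many mandatory ones.
    print(full_match("a?" * n + "a" * n, "a" * n))  # True
    print(full_match("ab*c", "abbbc"))              # True
    print(full_match("a.+", "a"))                   # False
```

The work per input character is bounded by the number of tokens, so the first example finishes at once. A pattern of that shape is the kind of input that can push a backtracking matcher into exponential running time.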

The postmaster in me quickly thought of implementing a milter that uses RE2, just as milter-regex uses traditional regex(3), but my time is so limited by other, more pressing projects that I can only hope someone else undertakes the task.

Algorithms on Strings

I was first exposed to string matching when I was given “Algorithms for Finding Patterns in Strings” to read back in 1990, after I naively asked Prof. Stathis Zachos something like “How does grep work?”.

Time passed, I became a system administrator, and most of my exposure to string matching came through scripts and the automation of sysadmin chores. Automata are nice, but Perl and shell put food on the table.

These memories surfaced because I got to read “Algorithms on Strings” in January, thanks to Bill Gasarch. Complete, self-contained, and written in plain, easily understood English, the book covers the subject while simultaneously meeting the needs of those who just want to read the theory, those who want to see the proofs, and those who just want to write code.

The pseudocode in the book can be understood by anyone who has ever written a program in C or Java. Each piece either introduces new functions or makes use of ones defined earlier. This may make things a little difficult at first for people who need to write something described in, for example, chapter six, since they may find themselves reading from chapter one through six. In the process, though, the book manages to teach even the programmer who does not care about theory not only how certain things are done, but why they are done that way. As a plus, references to relevant Unix tools (e.g. diff) are given where appropriate.

A really impressive book, definitely worth your time! You can use it both to learn the subject and as a reference.

The purpose of SMTP-HELO

Years ago, D. J. Bernstein wrote: “I recommend that server implementors let clients skip HELO, to support a future transition to a world without HELO”. I suppose that anyone who has spent enough time “speaking” SMTP as part of debugging mail systems must have wondered why HELO needs to exist in SMTP at all.

Well, it was not always there. RFCs 772 (Sep 1980) and 780 (May 1981) do not include it. It first appears in RFC 788 (Nov 1981). But why?

Back in 2005 in comp.mail.imap Mark Crispin explained why:

The purpose of HELO (and the Received: header line) was to fix a problem that went away with the NCP->TCP transition.

He goes on to explain that in the NCP days the IMPs that relayed messages knew only the messages’ destinations, and that this could lead to loops that delivered messages to the sender’s machine instead of the recipient’s. HELO solved the loop problem. The transition from NCP to TCP/IP took place on January 1, 1983, in what is known as the Internet Flag Day. That should have effectively ended the life of HELO. But no, “people felt strongly about making this never happen again” and with the introduction of SMTP:

the SMTP client identified itself (HELO), and you were allowed to barf if the HELO claimed to be yourself since that meant that the network was in loopback.
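The loopback test Crispin describes is easy to picture in code. What follows is only a sketch in Python of the idea, not how any particular mail server does it: the function names, the host names, and the reply texts (including the choice of a 550 reply) are all made up for the illustration.

```python
def helo_claims_to_be_us(helo_argument, our_names):
    """Return True if the client introduces itself with one of our own names.

    Names are compared case-insensitively and without a trailing dot.
    """
    claimed = helo_argument.strip().rstrip(".").lower()
    ours = {name.strip().rstrip(".").lower() for name in our_names}
    return claimed in ours


def reply_to_helo(helo_argument, our_names):
    """Pick a reply line for HELO (hypothetical reply texts)."""
    if helo_claims_to_be_us(helo_argument, our_names):
        # The peer says it is us: the "network is in loopback" case.
        return "550 loop detected, you claim to be me"
    # Otherwise greet the client with our first (primary) name.
    return "250 " + our_names[0]


if __name__ == "__main__":
    names = ["mail.example.org", "mx1.example.org"]
    print(reply_to_helo("client.example.net", names))  # 250 mail.example.org
    print(reply_to_helo("MAIL.example.org.", names))   # 550 loop detected, you claim to be me
```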

HELO not only survived; a trend also emerged in which it was used as a weak authentication mechanism. People started checking whether the IP address of the connecting machine and the argument supplied with HELO had matching A and PTR RRs. This led to the RFC 1123 prohibition:

However, the receiver MUST NOT refuse to accept a message, even if the sender’s HELO command fails verification.

This prohibition still stands in the current SMTP specification (RFC 5321):

Information captured in the verification attempt is for logging and tracing purposes. Note that this prohibition applies to the matching of the parameter to its IP address only
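Put together, the A/PTR check and the "log it, do not refuse" rule might look roughly like the Python sketch below. The function names and the log format are invented for the example. The lookups use `socket.gethostbyname_ex` and `socket.gethostbyaddr`, which go through the system resolver (so not strictly DNS) and cover IPv4 only on the forward side. The resolvers are passed as parameters purely so that the logic can be tried without touching the network.

```python
import socket


def helo_matches_dns(helo_name, client_ip,
                     forward=socket.gethostbyname_ex,
                     reverse=socket.gethostbyaddr):
    """Return True if helo_name resolves to client_ip and client_ip
    resolves back to helo_name; False on any mismatch or lookup failure."""
    try:
        _, _, addresses = forward(helo_name)
        ptr_name, ptr_aliases, _ = reverse(client_ip)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    ptr_names = {n.rstrip(".").lower() for n in [ptr_name, *ptr_aliases]}
    return (client_ip in addresses
            and helo_name.rstrip(".").lower() in ptr_names)


def greet(helo_name, client_ip, **resolvers):
    """Answer HELO and produce a log line (hypothetical format).

    The verification result is only recorded; the reply is the same
    whether or not the check succeeded.
    """
    verified = helo_matches_dns(helo_name, client_ip, **resolvers)
    log_line = "helo=%s ip=%s verified=%s" % (
        helo_name, client_ip, "yes" if verified else "no")
    return "250 OK", log_line


if __name__ == "__main__":
    # Fake resolvers standing in for real lookups.
    def fake_forward(name):
        return (name, [], ["192.0.2.1"])

    def fake_reverse(ip):
        return ("mail.example.org", [], [ip])

    print(greet("mail.example.org", "192.0.2.1",
                forward=fake_forward, reverse=fake_reverse))
    print(greet("liar.example.net", "192.0.2.1",
                forward=fake_forward, reverse=fake_reverse))
```

Both calls get the same `250 OK`; only the log line differs (`verified=yes` for the first, `verified=no` for the second).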

The prohibition does not mean that a connection can never be rejected based on the argument supplied with HELO. This thread over at RFC Ignorant discusses valid cases where rejection is possible.

So there: now you not only know the history of HELO and why it was invented, you also know that it has not been needed since 1983.

SMTP servers should not require, or ascribe meaning to, HELO or EHLO.