Better implementations of diff for extremely large files

Cool thing I learned from Twitter today:

For files larger than diff(1) can handle, there is diffh.c from Bell Labs, which has been around since the PDP-11 days. But even diffh is not always good enough. So why not use idiffh, which works on any system with a C compiler and imposes no limits on file sizes or strings?

My review of “Algorithms on strings”

My review of “Algorithms on strings” (which I have blogged about before) for the ACM SIGACT News is out. There is a typographical error, though: I did not review “Algorithms on strings” by Dan Gusfield, but “Algorithms on strings” by Crochemore, Hancart and Lecroq.

Thank you Bill Gasarch for the opportunity and thank you for fixing the typo too!

PS: You can download the review PDF from Bill Gasarch’s site.

Update: The review entry has been corrected on the ACM site. As Bill Gasarch wrote to me: “There is no such thing as a final version of anything anymore!”

Russ Cox on regular expressions

Thanks to Ozan S. Yigit I found out about a three-article series by Russ Cox on regular expressions.

I knew about Russ Cox and his interest in regular expressions because of a PDF copy of “Programming Techniques: Regular expression search algorithm” that I had found on his site. Somehow I had missed the articles. In Ozan’s words: “russ cox, like other top-notch cs people, takes a topic and nails it shut. these three papers are more valuable to me than any RE book”.

Yes, the articles are that good. However, the good news does not stop here. Russ Cox implemented RE2, a fast, safe, thread-friendly alternative to backtracking regular expression engines (like those used in PCRE, Perl, and Python), written in C++. It even comes with a POSIX (egrep) mode.
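To make the contrast with backtracking concrete, here is a toy sketch in Python. It is not RE2's code and does not use RE2's API; the function names and the token format are made up for illustration. It handles only patterns built from single characters, each optionally followed by `?`, and matches by tracking the set of pattern positions that are still alive after each input character, instead of trying one alternative at a time and backing up on failure.

```python
def match_tokens(tokens, text):
    """Full-match `text` against `tokens` without backtracking.

    `tokens` is a hypothetical pattern format: a list of (char, optional)
    pairs, where optional=True means the character may be skipped (like `c?`).
    """
    def closure(states):
        # Add every position reachable by skipping optional tokens,
        # since skipping consumes no input.
        result = set()
        stack = list(states)
        while stack:
            i = stack.pop()
            if i in result:
                continue
            result.add(i)
            if i < len(tokens) and tokens[i][1]:
                stack.append(i + 1)
        return result

    current = closure({0})
    for ch in text:
        # Advance every live position that can consume this character.
        advanced = {i + 1 for i in current
                    if i < len(tokens) and tokens[i][0] == ch}
        current = closure(advanced)
        if not current:
            return False
    # Success means the position just past the last token is alive.
    return len(tokens) in current


def optional_then_required(n):
    """Build the token list for n copies of `a?` followed by n copies of `a`."""
    return [("a", True)] * n + [("a", False)] * n


if __name__ == "__main__":
    n = 30
    print(match_tokens(optional_then_required(n), "a" * n))
```

The work per input character is bounded by the number of pattern positions, so the running time grows with pattern length times text length. A matcher that explores the `a?` choices one by one may have to try a number of combinations that grows exponentially with n on this kind of input.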

The postmaster in me quickly thought of implementing a milter that makes use of RE2, just like milter-regex uses traditional regex(3), but my time is so limited by other, more pressing projects that I can only wish that someone else undertakes such a task.

Algorithms on Strings

I was first exposed to string matching when I was given “Algorithms for Finding Patterns in Strings” to read back in 1990, after I naively asked Prof. Stathis Zachos something like “How does grep work?”.

Time passed, I became a system administrator, and most of my exposure to string matching was through scripts and automating sysadmin tasks. Automata are nice, but Perl and shell put food on the table.

These memories surfaced because I got to read “Algorithms on Strings” in January, thanks to Bill Gasarch. Complete, self-contained and written in plain, easily understood English, the book covers the subject while meeting the needs of those who just want to read the theory, those who want to see the proofs and those who just want to write code.

The pseudocode in the book can be understood by anyone who has ever written a single program in C or Java. Each piece of pseudocode either introduces new functions or makes use of others defined earlier. This may make things a little difficult at first for people who need to write something described in, for example, chapter six, since they may find themselves reading chapters one through six. In the process, the book manages to teach even the programmer who does not care about theory not only how certain things are done, but also why they are done the way they are. As a plus, references to relevant Unix shell tools (e.g. diff) are given where appropriate.

A really impressive book, definitely worth your time! A book that you can use both to learn about stuff and as a reference.