dotplots – part1 – Managing organized complexity

[ This draft has been gathering dust for a few years, so I am posting it because what the heck, I’ve written it ]

I found out about dotplots while reading “Algorithms on Strings” (thanks to Bill Gasarch) an impressive book by the way, that easily makes it to the top-5 CS books I’ve read. It may well be one of the most complete works on the subject fulfilling simultaneously the needs of those who want to read the theory, see the proofs and/or write code (for it contains well understood pseudocode). But I digress.

Dotplots are tables that can be used to compare sequences (and therefore strings) and can reveal hidden patterns that may not be observable with other methods. They are in use for decades and known to people who work on computational biology (see for example “A high speed, high capacity homology matrix: zooming through SV40 and polyoma“)

Definition: Given two strings x and y with lengths m and n respectively we define a table Dot (the dotplot) of size m×n such that Dot[i][j] is true if x[i] equals y[j] and false otherwise.

Using the book example sequences ACGT and ATGCTACG we get the dotplot:

 ATGCTACG
A*....*..
C...*..*.
G..*....*
T.*..*...

To interpret a dotplot we observe that diagonals express similarities and antidiagonals reveal the existence of a substring in reverse order. In a similar fashion horizontal lines represent insertions and vertical lines deletions that lead from one (sub)string to the other. One can therefore think of a dotplot as a kind of visual grep.

Dotplots find usage outside computational biology. They can be used for text analysis and even translated plagiarism! However, the screen real estate is not big enough to use for comparisons that involve for example the source code of the X Window System and techniques mentioned in “Dotplot: a Program for Exploring Self-Similarity in Millions of Lines of Text and Code” and “Dotplot Patterns: A Literal Look at Pattern Languages” can be used. Both papers contain really interesting applications of dotplots and explanations of the visual patterns that occur. See for example:

d1-000 — 2 million lines of C code; an entire module of a telecommunications switching program. Source: Dotplot Patterns by J. Helfman

As shown by the ASCII example above, it is very easy to create proof-of-concept dotplots. But character based dotplots are limited by screen space. In fact I set out to create my own version of a dotplot program using gnuplot for the graphical stuff. Building a basic dotplotter is really easy once you read the techniques presented in Helfman’s papers but the result might not be as “good looking” as the ones that Helfman has done. To build cool dotplots, it may be needed to customize your software appropriately. For example, to dotplot source code, your program might need to have parsing capabilities for the language in question (for example two for loops might be similar to another when you look at it, but whitespace might make them look different to a simplistic dotplot builder).

[ There is not going to be a part2. Originally I wanted to experiment with dotplots and spam, but I am not a postmaster anymore, so I’ll leave it to the next interested postmaster. ]

dotplots – part1

Published by adamo

Leave a comment Cancel reply

Share this:

Published by adamo

Leave a comment Cancel reply