Concordance: a Simple Free Visual Concordance Computation for Text Corpora

Jean-Daniel Fekete (Jean-Daniel.Fekete@inria.fr), August 23rd 2003

Concordance is a program designed to compare text files visually in order to find similar sequences among them. Text files are compared as follows:

First, a list of words is extracted from each text file. A word is a sequence of consecutive alphanumeric characters longer than 2, so spaces, punctuation and control characters are simply treated as delimiters. With that definition, a Word document can be used as a text file.
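The extraction step can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the name extract_words is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and "longer than 2" is assumed to mean at least three characters.

```python
import re

# Assumptions (not confirmed by the program): ASCII letters and digits only,
# and a word must have at least 3 characters.
WORD_RE = re.compile(r"[A-Za-z0-9]{3,}")


def extract_words(text):
    """Return the words of `text` in order of appearance.

    Everything that is not alphanumeric (spaces, punctuation, control
    characters) acts as a delimiter and is dropped.
    """
    return WORD_RE.findall(text)


if __name__ == "__main__":
    print(extract_words("The cat, a dog & 2 birds sat on the mat."))
```

Short tokens such as "a", "2" and "on" are discarded, which is why the word list of a text is usually shorter than its token count.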

Second, an image is created from two word lists. The lists can be different or the same, since auto-concordance is useful. The image is a similarity table: each row is associated with a word of the first list and each column with a word of the second list. The cell at the intersection of a row and a column contains the distance between the two words. Technically, this distance is the "Normalized Edit Distance". If the words have no characters in common, the distance is one; if the words are the same, the distance is zero. In between, the value reflects the similarity of the words. Two words are similar if transforming the first into the second requires very few operations, an operation being a removal, an insertion or a substitution. The raw distance between "a" and "b" is therefore one, and the raw distance between "aa" and "bb" is two. This raw distance is then normalized to fit between 0 and 1.
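The table itself can be sketched as follows. The function names are hypothetical, and the distance is passed in as a parameter. To keep the example short it uses a stand-in distance (0 for identical words, 1 otherwise) rather than the normalized edit distance, which can be substituted without changing the table code.

```python
def similarity_table(rows, cols, distance):
    """Build the table: table[i][j] is distance(rows[i], cols[j]).

    `distance` is expected to return a value between 0 (same word)
    and 1 (nothing in common).
    """
    return [[distance(r, c) for c in cols] for r in rows]


def exact_match_distance(a, b):
    # Stand-in for illustration only: 0 for identical words, 1 otherwise.
    return 0.0 if a == b else 1.0


words = ["the", "cat", "sat", "the", "mat"]
# Auto-concordance: the same list is used for rows and columns.
table = similarity_table(words, words, exact_match_distance)

if __name__ == "__main__":
    for row in table:
        print(" ".join(f"{d:.0f}" for d in row))
```

In an auto-concordance the diagonal is always zero, and repeated words (here "the" at positions 0 and 3) show up as zeros off the diagonal.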

The normalized distance is not computed using the algorithm described in A. Marzal and E. Vidal, Computation of Normalized Edit Distance and Applications, IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-15(9), 926--932, 1993. Instead, the program uses a standard Edit Distance (or Levenshtein distance) and divides it by the length of the longer string to normalize it. In our simple case, this simpler algorithm gives the same result as the first one. The simple case consists in weighting the three "edit" operations equally: inserting one character, removing one character or changing one character each has a weight of 1 (one). The edit distance is the minimum sum of the weights of the operations required to transform the first string into the other one.
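A minimal sketch of this computation, assuming unit weights and normalization by the length of the longer string as described above. The function names are hypothetical, and the handling of two empty strings is an assumption, since the text does not cover that case.

```python
def edit_distance(a, b):
    """Levenshtein distance with a weight of 1 for each operation."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # remove ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return edit_distance(a, b) / longest


if __name__ == "__main__":
    for pair in [("a", "b"), ("aa", "bb"), ("cat", "cats"), ("cat", "cat")]:
        print(pair, edit_distance(*pair), normalized_distance(*pair))
```

With these definitions, "aa" and "bb" have a raw distance of 2 and a normalized distance of 1, while "cat" and "cats" have a raw distance of 1 and a normalized distance of 0.25.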

Concordance of several text files

If several file names are given to the concordance program, it computes all the possible pairwise comparisons and produces a web page to see them all.

A larger concordance can be found at http://www.lri.fr/~fekete/concordance/jabes-conc/.
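The set of pairwise comparisons can be enumerated as in the sketch below. Whether the program includes the comparison of a file with itself, or compares each pair in both orders, is not stated here. The sketch assumes that self-comparisons are included and that each unordered pair is computed once. The function name and the file names are hypothetical.

```python
from itertools import combinations_with_replacement


def comparison_pairs(filenames):
    """List every unordered pair of files, including (f, f) pairs.

    Assumption: auto-concordances are wanted and (a, b) is considered
    the same comparison as (b, a).
    """
    return list(combinations_with_replacement(filenames, 2))


if __name__ == "__main__":
    for first, second in comparison_pairs(["a.txt", "b.txt", "c.txt"]):
        print(first, "x", second)
```

Under these assumptions, n files give n(n+1)/2 images, so the amount of work grows quadratically with the number of files.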

Using concordance

To start concordance, you can either use a command line or drop files on the program icon.

Once the program is started, it computes the images in its connected directory and creates icons of these images in a directory called "icon", which it also creates. It also produces a file called "index.html" for looking at the results with a web browser.

While it computes the images, it displays some feedback, since the computation can be quite long. Also, the maximum size of a word list is limited to 3000 words, producing images of 3000x3000 pixels at worst.

Icons are computed by reducing the image by a factor of 8. The reduction tries to preserve interesting features by keeping the darkest features instead of averaging the pixels. The icons look darker but large features are still visible.
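The reduction described above amounts to taking the minimum over each block of pixels instead of the mean. The sketch below is an illustration under assumptions, not the program's code: the image is a list of rows of grayscale values where 0 is black and 255 is white, partial blocks at the right and bottom edges are kept, and the name make_icon is hypothetical.

```python
def make_icon(image, factor=8):
    """Shrink a grayscale image, keeping the darkest pixel of each block.

    Assumptions: `image` is a list of equal-length rows of integers,
    lower values are darker, and incomplete edge blocks are reduced
    the same way as full ones.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    icon = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ]
            row.append(min(block))  # darkest pixel wins, no averaging
        icon.append(row)
    return icon


if __name__ == "__main__":
    # A 16x16 white image with a single black pixel.
    image = [[255] * 16 for _ in range(16)]
    image[3][5] = 0
    print(make_icon(image))
```

A single dark pixel is enough to darken a whole icon pixel, which is why the icons look darker than the full images while thin diagonal features remain visible.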

Download

You can download:

Concordance is distributed under the GNU General Public License (GPL).


Last modified: Sun Sep 14 19:29:34 Romance Daylight Time 2003