On the used difference algorithms.

 The design of an algorithm for text difference evaluation can be done in many different ways with also different results.

One of them is to look at oneself. How does a human being evaluate the difference between 2 texts? I found that in the normal case one looks for similar text parts in the beginning of texts and holds them as a found match and then looks at the rest of the text and doing the same etc. The advantage is that man can take many points in the text into account at the same time thus being very fast and effective but often not being very precise on the other side.

For a computer one has to be very more specific and has to build a systematical approach. So the approach which has been found as close to the described method I have named the "standard sequential algorithm". This algorithm searches for similar text parts which are considered as "similar" if a certain number of consecutive words or lines are equal. Then looking for the next similar text parts etc. Here the method tries first to find the the next text part from the first text in the second text and then vice versa. At this point one can decide what to do: Accepting the first or accepting the second result as a found match. The method decides here in this way to get the least difference.

 

The other way to get an algorithm is to formulate the problem as a mathematical optimization problem and to solve it using numerical mathematics. I have used here the branch and bound algorithm to solve integer optimization problems. In this case the optimization problem is to minimize the difference, i.e. to find the largest number of matches. When applying to long texts this algorithm is very much slower than the standard sequential algorithm. The main difference between the algorithms is that the mathematical optimization algorithm takes the whole texts into account at the same time – the sequential algorithm only searches „locally“ in each step. This results also in the need of much more memory for the branch and bound algorithm.

 

To get a better performance it is recommended to invoke the algorithms in a recursive manner. For example this means you first take the lines of the texts and build a difference „by lines“. The difference result will be a sequence of „deleted“, „inserted“ or „similar“ lines. Then we look for subsequent lists of deleted and inserted lines. This give us new, but smaller difference problems. Each of these problems can be solved „by words“. Looking now in the results for subsequent lists of deleted and inserted words we can invoke one of the algorithms on the level of letters.

 Back to TLTextDiff...

 

Dr. Andreas Lindae, E-mail...