Sometimes you want to determine how similar a string s is to another string t. This is represented by the concept of Edit Distance, also called Levenshtein Distance. The smaller the Edit Distance, the more "closer" or "similar".
The Edit Distance is the cost to change the string s to the string t by adding the following 3 operations. Costs are ususally set for each of the 3 types of operations.
To calculate the Edit Distance, we use Dynamic Programming. In order to find the Edit Distance between strings of length $n$ and length $m$, each element of the 2-dimensional matrix $(n+1) \times (m+1)$ is calculated, so the computational complexty is $O(mn)$.
Carefully check the state of the string in each cell in "Table 1: State of the string being edited". If you understand that the state of the string consists of three elements:
Consider an edit operation to minimize the cost of generating the string "ara" from the string "saka". We want to reuse the reusable characters in the string "saka" while consuming other characters to generate the string "ara" as efficiently as possible.
To solve with Dynamic Programming, we use a table with the original (to be consumed) string in the vertical direction and the target (to be generated) string in the horizontal direction. Leave a blank space at the beginning of each string.
The state of the string being edited corresponding to each cell in this table is as follows.
a | r | a | ||
|saka | a|saka | ar|saka | ara|saka | |
s | |aka | a|aka | ar|aka | ara|aka |
a | |ka | a|ka | ar|ka | ara|ka |
k | |a | a|a | ar|a | ara|a |
a | | | a| | ar| | ara| |
If you just want to find the minimum cost, one table is fine. If you need to record the steps of your editing operations, prepare another table.
In this example, we assume the cost of each operation as follows. Of course you can set these values however you like.
Operation | Simbol | Const | Explanation |
---|---|---|---|
INSERT | $I$ | $1$ | Insert a character to the left of the cursor. |
DELETE | $D$ | $1$ | Delete a character to the right of the cursor. |
SUBSTITUTE | $S$ | $1$ | Substitute a character to the right of the cursor, and move the cursor position one character to the right. |
MATCH | $M$ | $0$ | Move the cursor position one character to the right. |
In the following, the $L$ row and $R$ column elements of the table are expressed as $(L,R)$.
![]() | Of the two tables, the left table records the minimum costs and the right table records the last operation performed. | ||||||||||||||||||
![]() |
Enter cost $0$ at ($0$, $0$) in the left table,
and enter '$-$' character to the ($0$, $0$) cell in the right table,
because no operation has been performed yet.
| ||||||||||||||||||
![]() |
Fill in the leftmost column ($L$, $0$) of the left table while adding the DELETE cost.
Since the operation is "DELETE", enter '$D$' in ($L$, $0$) cells of the right table.
| ||||||||||||||||||
![]() |
Enter the cost of INSERT in the column ($0$, $R$) of the top row of the left table
while adding the INSERT cost.
Since the operation is INSERT, enter '$I$' in ($0$, $R$) in the right table.
| ||||||||||||||||||
![]() |
$L=1$, $R=1$. The cost to be enterd in ($L$, $R$) in the left table is one of the following.
| ||||||||||||||||||
![]() |
Let's consider the minimum cost to reach the state ($1$, $1$) in the left table.
Since the character 's' at the cursor position is not equal to the character 'a' to be generated,
"MATCH (= Cursor Move)" can not be used.
Therefore, we consider 3 kinds of operations to arrive at ($1$, $1$):
Regardless of the operation order, the state of ($1$, $1$) is the state where "the leading 's' is disappeared from the string to be consumed" and "the leading 'a' has been generated from the string to be generated". The minimum cost to reach that state is $1$. | ||||||||||||||||||
![]() |
$L=2$, $R=1$. The cost to be enterd in ($L$, $R$) in the left table is one of the following.
| ||||||||||||||||||
![]() |
In ($2$, $1$) of the left table, the character to "be consumed" and the character to "be generated" are both 'a',
so MATCH (= Cursor Move) can be used.
There are other ways to "DELETE 'a'" from ($1$, $1$) and "INSERT 'a'" from ($2$, $0$),
but "MATCH (= Cursor Move)" from ($1$, $0$) is the minimum cost.
Enter $1+0=1$ in the left table ($2$, $1$) and '$M$' in the right table ($2$, $1$).
The state of ($2$, $1$) is "First, DELETE the leading 's' from the original string, then MOVE CURSOR through 'a' for the character of the orignal and the target is the same. The minimum cost of the operations is $1 \mbox{(DELETE cost)} + 0 \mbox{(MATCH cost)} = 1$. | ||||||||||||||||||
![]() |
In ($3$, $1$) of the left table, the character 'k' to "be consumed" is different from the character 'a' to "be generated",
so MATCH (= Cursor Move) cannot be used.
The cost of "SUBSTITUTE from ($2$, $0$)" is 3,
the cost of "INSERT from ($3$, $0$)" is 4,
and the cost of "DELETE from ($2$, $1$)" is 2.
| ||||||||||||||||||
![]() |
All the cell is filled. If multiple symbols are written together in the right table, it means that either operation has the lowest cost. For example, the symbol "SI" means "either SUBSTITUTE or INSERT is OK". You can see that the minimum cost is $2$ by looking at the bottom right cell of the left table ($4$, $3$). | ||||||||||||||||||
![]() |
You can see how to get to the minimum cost by going backwards starting from the bottom right cell of the right table.
The operation sequence with the lowest cost is as follows.
|
EditDistance.py |
import sys COST_DEL = 1 COST_INS = 1 COST_SUBST = 1 COST_MATCH = 0 COST = [] def solve(s,t,flag = False): global COST sl = len(s) tl = len(t) COST = [[0 for i in range(tl+1)] for j in range(sl+1)] COST[0][0] = 0 for i in range(1,sl+1): COST[i][0] = COST[i-1][0] + COST_DEL for i in range(1,tl+1): COST[0][i] = COST[0][i-1] + COST_INS for i in range(1,sl+1): for j in range(1,tl+1): c_d = COST[i][j-1] + COST_DEL c_i = COST[i-1][j] + COST_INS c_s = COST[i-1][j-1] + COST_SUBST if s[i-1]==t[j-1]: c_m = COST[i-1][j-1] + COST_MATCH else: c_m = sys.maxsize COST[i][j] = min(c_d, c_i, c_s, c_m) return COST[sl][tl] |
Log of EditDistance.py |
$ python |
EditDistancePath.py |
import sys COST_DEL = 1 COST_INS = 1 COST_SUBST = 1 COST_MATCH = 0 COST = [] OP = [] def solve(s,t): global COST, OP sl = len(s) tl = len(t) COST = [[0 for i in range(tl+1)] for j in range(sl+1)] OP = [['' for i in range(tl+1)] for j in range(sl+1)] COST[0][0] = 0 for i in range(1,sl+1): COST[i][0] = COST[i-1][0] + COST_DEL OP[i][0] += 'D' for i in range(1,tl+1): COST[0][i] = COST[0][i-1] + COST_INS OP[0][i] += 'I' for i in range(1,sl+1): for j in range(1,tl+1): c_d = COST[i][j-1] + COST_DEL c_i = COST[i-1][j] + COST_INS c_s = COST[i-1][j-1] + COST_SUBST if s[i-1]==t[j-1]: c_m = COST[i-1][j-1] + COST_MATCH else: c_m = sys.maxsize COST[i][j] = min(c_d, c_i, c_s, c_m) if (c_d == COST[i][j]): OP[i][j] += 'D' if (c_i == COST[i][j]): OP[i][j] += 'I' if (c_s == COST[i][j]): OP[i][j] += 'S' if (c_m == COST[i][j]): OP[i][j] += 'M' return COST[sl][tl], getPath(s,t) def getPath(s,t): global OP r = [] row = len(s) col = len(t) while row > 0 or col > 0: s_ch = s[row-1] t_ch = t[col-1] if 'M' in OP[row][col]: r.append('MATCH '+s_ch) row -= 1 col -= 1 elif 'S' in OP[row][col]: r.append('SUBSTITUTE '+s_ch+' with '+t_ch) row -= 1 col -= 1 elif 'I' in OP[row][col]: r.append('INSERT '+t_ch) col -= 1 elif 'D' in OP[row][col]: r.append('DELETE '+s_ch) row -= 1 r.reverse() return r |
Log of EditDistancePath.py |
$ python |
Consider a program that calculates the edit distance between two strings.
In the following program, the array cost
is
"the minimum cost to reach each state"
and the array parent
records "the most recent operation to reach that state with the minimum cost".
Since each type of operation is represented by a different bit, even multiple operations can be recorded at the same time.
For simplicity, only one operation sequence of the minimal cost is shown, even if there are multiple.
EditDistance.java |
import java.util.*; public class EditDistance { int COST_M = 0, COST_S = 1, COST_D = 1, COST_I = 1; public static final int OP_M = 1, OP_S = 2, OP_I = 4, OP_D = 8; String s,t; int[][] cost, op; public EditDistance(String s,String t) { this.s = s; this.t = t; } public void setSubstituteCost(int COST_S) { this.COST_S = COST_S; } public void setInsertCost(int COST_I) { this.COST_I = COST_I; } public void setDeleteCost(int COST_D) { this.COST_D = COST_D; } public int solve() { cost = new int[s.length()+1][t.length()+1]; cost[0][0] = 0; op = new int[s.length()+1][t.length()+1]; op[0][0] = 0; for (int i=1; i<cost.length; i++) { cost[i][0] = cost[i-1][0] + COST_D; op[i][0] = OP_D; } for (int i=1; i<cost[0].length; i++) { cost[0][i] = cost[0][i-1] + COST_I; op[0][i] = OP_I; } for (int si=1; si<cost.length; si++) { for (int ti=1; ti<cost[0].length; ti++) { int c_m = (s.charAt(si-1) == t.charAt(ti-1)) ? cost[si-1][ti-1] + COST_M : Integer.MAX_VALUE; int c_s = cost[si-1][ti-1] + COST_S; int c_i = cost[si][ti-1] + COST_I; int c_d = cost[si-1][ti] + COST_D; cost[si][ti] = Math.min(c_m,Math.min(c_s,Math.min(c_i,c_d))); op[si][ti] = 0; if (cost[si][ti] == c_m) op[si][ti] |= OP_M; if (cost[si][ti] == c_s) op[si][ti] |= OP_S; if (cost[si][ti] == c_i) op[si][ti] |= OP_I; if (cost[si][ti] == c_d) op[si][ti] |= OP_D; } } return cost[s.length()][t.length()]; } public String[] getPath() { ArrayList<String> v = new ArrayList<String>(); int si=cost.length-1; int ti=cost[0].length-1; while (si > 0 || ti > 0) { if ((op[si][ti] & OP_M) != 0) { v.add("matches '"+s.charAt(si-1)+"' and '"+t.charAt(ti-1)+"'"); si--; ti--; } else if ((op[si][ti] & OP_S) != 0) { v.add("substitute '"+s.charAt(si-1)+"' by '"+t.charAt(ti-1)+"'"); si--; ti--; } else if ((op[si][ti] & OP_I) != 0) { v.add("insert '"+t.charAt(ti-1)+"'"); ti--; } else if ((op[si][ti] & OP_D) != 0) { v.add("delete '"+s.charAt(si-1)+"'"); si--; } else throw new RuntimeException("bad op"); } Collections.reverse(v); return v.toArray(new String[0]); } public String showCostTable() { return toString(cost); } public String showOpTable() { return toString(op); } public String toString(int[][] x) { StringBuilder sb = new StringBuilder(" | "); for (int i=0; i<t.length(); i++) sb.append(" "+t.charAt(i)); int w = sb.toString().length(); sb.append("\n"); for (int i=0; i<w; i++) sb.append((i==1) ? "+" : "-"); sb.append("\n"); for (int line=0; line < x.length; line++) { if (line == 0) sb.append(" | "); else sb.append(s.charAt(line-1)+"| "); for (int col=0; col < x[line].length; col++) { int v=x[line][col]; if (v < 10 && v>=0) sb.append(" "); sb.append(v + " "); } sb.append("\n"); } return sb.toString(); } } |
RunEditDistance.java |
/* * [Usage] * $ java RunEditDistance saka ara */ import java.util.*; public class RunEditDistance { public static void main(String[] args) { Scanner sc = new Scanner(System.in); String s,t; if (args.length > 0) s=args[0]; else { System.out.print("s="); s = sc.next();} if (args.length > 1) t=args[1]; else { System.out.print("t="); t = sc.next();} EditDistance ed = new EditDistance(s,t); System.out.println(ed.solve()); System.out.println(ed.showCostTable()); System.out.println(ed.showOpTable()); String[] a = ed.getPath(); for (int i=0; i<a.length; i++) System.out.println(a[i]); } } |
RunEditDistance.javaの実行例 |
$ javac RunEditDistance.java EditDistance.java $ java RunEditDistance "saka" "ara" 2 | a r a -+------------ | 0 1 2 3 s| 1 1 2 3 a| 2 1 2 2 k| 3 2 2 3 a| 4 3 3 2 | a r a -+------------ | 0 4 4 4 s| 8 2 6 6 a| 8 1 6 1 k| 8 8 2 14 a| 8 9 10 1 delete 's' matches 'a' and 'a' substitute 'k' by 'r' matches 'a' and 'a' $ java RunEditDistance "apple is good" "applet is god" 2 | a p p l e t i s g o d -+------------------------------------------ | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 a| 1 0 1 2 3 4 5 6 7 8 9 10 11 12 p| 2 1 0 1 2 3 4 5 6 7 8 9 10 11 p| 3 2 1 0 1 2 3 4 5 6 7 8 9 10 l| 4 3 2 1 0 1 2 3 4 5 6 7 8 9 e| 5 4 3 2 1 0 1 2 3 4 5 6 7 8 | 6 5 4 3 2 1 1 1 2 3 4 5 6 7 i| 7 6 5 4 3 2 2 2 1 2 3 4 5 6 s| 8 7 6 5 4 3 3 3 2 1 2 3 4 5 | 9 8 7 6 5 4 4 3 3 2 1 2 3 4 g| 10 9 8 7 6 5 5 4 4 3 2 1 2 3 o| 11 10 9 8 7 6 6 5 5 4 3 2 1 2 o| 12 11 10 9 8 7 7 6 6 5 4 3 2 2 d| 13 12 11 10 9 8 8 7 7 6 5 4 3 2 | a p p l e t i s g o d -+------------------------------------------ | 0 4 4 4 4 4 4 4 4 4 4 4 4 4 a| 8 1 4 4 4 4 4 4 4 4 4 4 4 4 p| 8 8 1 5 4 4 4 4 4 4 4 4 4 4 p| 8 8 9 1 4 4 4 4 4 4 4 4 4 4 l| 8 8 8 8 1 4 4 4 4 4 4 4 4 4 e| 8 8 8 8 8 1 4 4 4 4 4 4 4 4 | 8 8 8 8 8 8 2 1 4 4 5 4 4 4 i| 8 8 8 8 8 8 10 10 1 4 4 4 4 4 s| 8 8 8 8 8 8 10 10 8 1 4 4 4 4 | 8 8 8 8 8 8 10 1 8 8 1 4 4 4 g| 8 8 8 8 8 8 10 8 10 8 8 1 4 4 o| 8 8 8 8 8 8 10 8 10 8 8 8 1 4 o| 8 8 8 8 8 8 10 8 10 8 8 8 9 2 d| 8 8 8 8 8 8 10 8 10 8 8 8 8 1 matches 'a' and 'a' matches 'p' and 'p' matches 'p' and 'p' matches 'l' and 'l' matches 'e' and 'e' insert 't' matches ' ' and ' ' matches 'i' and 'i' matches 's' and 's' matches ' ' and ' ' matches 'g' and 'g' delete 'o' matches 'o' and 'o' matches 'd' and 'd' $ java RunEditDistance "thou shalt not" "you should not" 5 | y o u s h o u l d n o t -+--------------------------------------------- | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 t| 1 1 2 3 4 5 6 7 8 9 10 11 12 13 13 h| 2 2 2 3 4 5 5 6 7 8 9 10 11 12 13 o| 3 3 2 3 4 5 6 5 6 7 8 9 10 11 12 u| 4 4 3 2 3 4 5 6 5 6 7 8 9 10 11 | 5 5 4 3 2 3 4 5 6 6 7 7 8 9 10 s| 6 6 5 4 3 2 3 4 5 6 7 8 8 9 10 h| 7 7 6 5 4 3 2 3 4 5 6 7 8 9 10 a| 8 8 7 6 5 4 3 3 4 5 6 7 8 9 10 l| 9 9 8 7 6 5 4 4 4 4 5 6 7 8 9 t| 10 10 9 8 7 6 5 5 5 5 5 6 7 8 8 | 11 11 10 9 8 7 6 6 6 6 6 5 6 7 8 n| 12 12 11 10 9 8 7 7 7 7 7 6 5 6 7 o| 13 13 12 11 10 9 8 7 8 8 8 7 6 5 6 t| 14 14 13 12 11 10 9 8 8 9 9 8 7 6 5 | y o u s h o u l d n o t -+--------------------------------------------- | 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 t| 8 2 6 6 6 6 6 6 6 6 6 6 6 6 1 h| 8 10 2 6 6 6 1 4 4 4 4 4 4 4 4 o| 8 10 1 6 6 6 14 1 4 4 4 4 4 5 4 u| 8 10 8 1 4 4 4 12 1 4 4 4 4 4 4 | 8 10 8 8 1 4 4 4 12 2 6 1 4 4 4 s| 8 10 8 8 8 1 4 4 4 4 6 14 2 6 6 h| 8 10 8 8 8 8 1 4 4 4 4 4 4 6 6 a| 8 10 8 8 8 8 8 2 6 6 6 6 6 6 6 l| 8 10 8 8 8 8 8 10 2 1 4 4 4 4 4 t| 8 10 8 8 8 8 8 10 10 10 2 6 6 6 1 | 8 10 8 8 9 8 8 10 10 10 10 1 4 4 4 n| 8 10 8 8 8 8 8 10 10 10 10 8 1 4 4 o| 8 10 9 8 8 8 8 1 14 10 10 8 8 1 4 t| 8 10 8 8 8 8 8 8 2 14 10 8 8 8 1 delete 't' substitute 'h' by 'y' matches 'o' and 'o' matches 'u' and 'u' matches ' ' and ' ' matches 's' and 's' matches 'h' and 'h' insert 'o' substitute 'a' by 'u' matches 'l' and 'l' substitute 't' by 'd' matches ' ' and ' ' matches 'n' and 'n' matches 'o' and 'o' matches 't' and 't' |
Submission File | EditDistance.java |
---|---|
Comment: | Minimum cost of editing the original string "sushi and wine belong to food" to the target string "sun shines and window blows, good" |
Submit to: |
Assume that the cost of SUBSTITUTE, INSERT, DELETE and MATCH operations are $1$, $1$, $1$, $0$, respectively. Find the minimum cost to edit from the original string "sushi and wine belong to food" to the target string "sun shines and window blows, good".
NearWord.java | |||||||
Comment: |
Note that the input dataset is different for each student.
The input dataset is assigned by the result of modulo 3 of the last digit of the student number.
Paste the output of the program for each input data.
(Paste the 7 execution result).
| ||||||
---|---|---|---|---|---|---|---|
Submit to: |
Write a program that points out typos in English and suggests correct words. However, it must meet the following specifications.
The contents of "mwords_74550common.txt" are take from the list of 74550 words including in several dictionaries which is available from Grady Ward's Moby Project http://icon.shef.ac.uk/Moby/mwords.html .
Part of "mwords_74550common.txt" |
...(omitted)... alow alp alpaca alpenglow alpenhorn alpenstock alpestrine alpha alpha and omega alpha decay ...(omitted)... |
NearWord.java |
import java.util.*; import java.io.*; public class NearWord { private Vector<String> dic; private int minCost; public NearWord(String fname) { try { Scanner sc = new Scanner(new File(fname)); dic = new Vector<String>(); while (sc.hasNext()) dic.add(sc.nextLine()); } catch (Exception e) { e.printStackTrace(); System.exit(-1); } } public int cost() { return minCost; } public Vector<String> find(String s) { minCost = Integer.MAX_VALUE; // [Exercise] Write the appropriate code hear. } public static void main(String[] args) { NearWord nw = new NearWord("mwords_74550common.txt"); Scanner sc = new Scanner(System.in); String s = sc.nextLine(); Vector<String> words = nw.find(s); System.out.println(nw.cost()); for (String w: words) System.out.println(w); } } |
Execution log of NearWord.java |
$ javac NearWord.java |