
From "Calculate Levenshtein distance between two strings in Python" it is possible to calculate the distance and similarity between two given strings (sentences).

From "Levenshtein Distance and Text Similarity in Python" it is possible to return the character-level matrix and the distance for two strings.

Is there a way to calculate the distance and similarity between the words of two strings (sentences), and to print the matrix at the word level?

a = "This is a dog."
b = "This is a cat."

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros ((size_x, size_y))
    for x in range(size_x):
        matrix [x, 0] = x
    for y in range(size_y):
        matrix [0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                matrix [x,y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                matrix [x,y] = min(
                    matrix[x-1,y] + 1,
                    matrix[x-1,y-1] + 1,
                    matrix[x,y-1] + 1
                )
    print (matrix)
    return (matrix[size_x - 1, size_y - 1])

levenshtein(a, b)

Outputs

>> 3

Matrix

[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
 [ 1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13.]
 [ 2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]
 [ 3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11.]
 [ 4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
 [ 6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.  8.]
 [ 7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.  7.]
 [ 8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.  6.]
 [ 9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.  5.]
 [10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  0.  1.  2.  3.  4.]
 [11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  1.  2.  3.  4.]
 [12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  2.  2.  2.  3.  4.]
 [13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  3.  3.  3.  3.  4.]
 [14. 13. 12. 11. 10.  9.  8.  7.  6.  5.  4.  4.  4.  4.  3.]]

The general character-level Levenshtein distance is shown in the figure below.

[Figure: character-level Levenshtein distance matrix]

Is it possible to calculate the Levenshtein distance at the word level?

Required Matrix

        This  is  a  cat
This
is
a
dog
Pluviophile

2 Answers


Simply add .split() to the end of your first two lines:

a = "This is a dog.".split()
b = "This is a cat.".split()

Your algorithm works on any sequence. A string is iterated character by character, so the comparison happens at the character level. After the split, a and b are lists of words, so the same algorithm works at the word level.

Output on your example:

[[0. 1. 2. 3. 4.]
 [1. 0. 1. 2. 3.]
 [2. 1. 0. 1. 2.]
 [3. 2. 1. 0. 1.]
 [4. 3. 2. 1. 1.]]

1.0
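Note that a plain .split() keeps the trailing punctuation, so the tokens compared above are "dog." and "cat.". Here is a minimal self-contained sketch of the same idea that also labels the rows and columns with the words. It assumes whitespace tokenization with trailing punctuation stripped; the names `tokenize` and `word_levenshtein` are made up for illustration and are not from the question.

```python
def tokenize(sentence):
    # Assumption: split on whitespace and strip common punctuation marks.
    return [word.strip(".,!?") for word in sentence.split()]


def word_levenshtein(words_a, words_b):
    """Return the full edit-distance matrix between two lists of words."""
    rows, cols = len(words_a) + 1, len(words_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        matrix[x][0] = x  # delete every word of words_a[:x]
    for y in range(cols):
        matrix[0][y] = y  # insert every word of words_b[:y]

    for x in range(1, rows):
        for y in range(1, cols):
            # A whole word is the unit of comparison, not a character.
            cost = 0 if words_a[x - 1] == words_b[y - 1] else 1
            matrix[x][y] = min(
                matrix[x - 1][y] + 1,         # deletion
                matrix[x][y - 1] + 1,         # insertion
                matrix[x - 1][y - 1] + cost,  # match or substitution
            )
    return matrix


words_a = tokenize("This is a dog.")
words_b = tokenize("This is a cat.")
matrix = word_levenshtein(words_a, words_b)
distance = matrix[-1][-1]

# Print the matrix with word labels; the first row and column are the
# empty-prefix cases, so their labels are blank.
width = max(len(word) for word in words_a + words_b) + 2
print("".ljust(width) + "".join(label.rjust(width) for label in [""] + words_b))
for label, row in zip([""] + words_a, matrix):
    print(label.ljust(width) + "".join(str(value).rjust(width) for value in row))
print("distance:", distance)
```

The bottom-right cell is the word-level distance, which is 1 for these two sentences since only the last word differs.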
Alireza

Maybe try this:

from functools import lru_cache
from itertools import product

@lru_cache(maxsize=4095)
def ld(s, t):
    """
    Levenshtein distance memoized implementation from Rosetta code:
    https://rosettacode.org/wiki/Levenshtein_distance#Python
    """
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Insertion.
    l2 = ld(s[1:], t)      # Deletion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)


a = "this is a sentence".split()
b = "yet another cat thing".split()

# To get the triplets.
for i, j in product(a, b):
    print((i, j, ld(i, j)))

To get a matrix:

from scipy.sparse import coo_matrix
import numpy as np

a = "this is a sentence".split()
b = "yet another cat thing , yes".split()

triplets = np.array([(i, j, ld(w1, w2)) for (i, w1), (j, w2) in product(enumerate(a), enumerate(b))])
row, col, data = [np.squeeze(splt) for splt in np.hsplit(triplets, triplets.shape[-1])]
coo_matrix((data, (row, col))).toarray()

[out]:

array([[4, 5, 4, 2, 4, 3],
       [3, 7, 3, 4, 2, 2],
       [3, 6, 2, 5, 1, 3],
       [6, 7, 7, 7, 8, 7]])
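If a plain nested list is enough, the same pairwise matrix can probably be built without going through a sparse matrix. A rough sketch, assuming you only need the dense result; `char_distance` and `pairwise_matrix` are illustrative names, and the iterative two-row distance function is swapped in for the memoized `ld` so the snippet runs on its own.

```python
def char_distance(s, t):
    """Character-level Levenshtein distance using two rolling rows."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def pairwise_matrix(words_a, words_b):
    # One row per word in words_a, one column per word in words_b.
    return [[char_distance(w1, w2) for w2 in words_b] for w1 in words_a]


a = "this is a sentence".split()
b = "yet another cat thing , yes".split()

for word, row in zip(a, pairwise_matrix(a, b)):
    print(f"{word:>10}", row)
```

Each cell here is the character-level distance between one word of a and one word of b, which is a different quantity from a single word-level distance between the two sentences.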
alvas