Given a string, s, and a set of strings, S, find a string, k that closely resembles s. For this problem, some possible answers would be Jaro Distance, Cosine Similarity, TFIDF, Levenshtein distance, fuzzy logic, etc., which produce a score between two strings to reliably approximate their atomic similarity.
In this blog post, I’d like to explore one of my favorites, Sørensen–Dice coefficient, to achieve even better similarity comparisons for some cases like FQDNs. Incidentally, this method is used widely in the industry  including New Relic!
Sørensen–Dice Coefficient
Sørensen–Dice Coefficient is “a statistic used to gauge the similarity of two samples.” (Wikipedia) A sample can be a sequence of types such as a string or an integer. It can also be described mathematically as below, where the cardinality of intersection (of X and Y) is multiplied by a constant 2 (2 because the elements common in both samples were reduced through intersection). The outcome in the numerator portion is then divided by the individual cardinality of X and Y.
I wrote a quick implementation below for reference:


Tests
I ran some test cases of SDC (using my implementation above) against a builtin string match module called difflib.SequenceMatcher and a 3rd party module named pythonstringsimilarity.


Algorithm  x  y  Score 

SDC  prod.useast01.verizon.com  prod.useast02.verizon.com  0.846 
JaroWinkler  prod.useast01.verizon.com  prod.useast02.verizon.com  0.012 
Levenshtein  prod.useast01.verizon.com  prod.useast02.verizon.com  1.0 
Metric Longest Common Subsequence  prod.useast01.verizon.com  prod.useast02.verizon.com  0.038 
NGram (2)  prod.useast01.verizon.com  prod.useast02.verizon.com  0.038 
NGram (4)  prod.useast01.verizon.com  prod.useast02.verizon.com  0.038 
SequenceMatcher  prod.useast01.verizon.com  prod.useast02.verizon.com  0.0 
SDC  apple  apple  1.0 
JaroWinkler  apple  apple  0.0 
Levenshtein  apple  apple  0.0 
Metric Longest Common Subsequence  apple  apple  0.0 
NGram (2)  apple  apple  0.0 
NGram (4)  apple  apple  0.0 
SequenceMatcher  apple  apple  0.0 
SDC  apple  apples  0.36 
JaroWinkler  apple  apples  0.02 
Levenshtein  apple  apples  1.0 
Metric Longest Common Subsequence  apple  apples  0.16 
NGram (2)  apple  apples  0.16 
NGram (4)  apple  apples  0.16 
SequenceMatcher  apple  apples  0.0 
SDC  apple  orange  0.18 
JaroWinkler  apple  orange  0.42 
Levenshtein  apple  orange  5.0 
Metric Longest Common Subsequence  apple  orange  0.66 
NGram (2)  apple  orange  0.91 
NGram (4)  apple  orange  0.95 
SequenceMatcher  apple  orange  0.0 
As you can see, using the Sørensen–Dice coefficient yields a much better accuracy especially in a case where the sample strings are composed of complex symbols. Now this doesn’t mean that using other algorithms are fundamentally weaker  as even without using SDC, you can still work around with limited impact by, for example, preprocessing the sample strings (e.g. split by symbols and comparing each element separately).