Commit 9c4e407b authored by Prof. Dr. Robert Jäschke

added a strange type of word similarity visualisation

parent c2e5b740
%% Cell type:markdown id: tags:
# Graphing a kind of "Hamming Similarity" of strings
This notebook explores a slightly weird similarity measure for strings.
## Equal characters in strings
Given two strings, the idea is to consider the positions where their characters match:
%% Cell type:code id: tags:
```
v = "Wiesbaden"
w = "Potsdam"
#       s a   <- the matching characters of the two strings
```
%% Cell type:markdown id: tags:
We can extract those characters with a loop:
%% Cell type:code id: tags:
```
m = []                                   # resulting equal characters
for i in range(min(map(len, [v, w]))):   # loop over the shorter word's length
    if v[i] == w[i]:                     # check character equality
        m.append(v[i])                   # add character
m
```
%% Cell type:markdown id: tags:
Let's create a function that, given two strings, returns their equal characters:
%% Cell type:code id: tags:
```
def equal_chars(w, v):
    m = []                                   # resulting equal characters
    for i in range(min(map(len, [v, w]))):   # loop over the shorter word's length
        if v[i] == w[i]:                     # check character equality
            m.append(v[i])                   # add character
    return m
```
%% Cell type:markdown id: tags:
By the way: thanks to Python's [list comprehensions](https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions) we can write the body of the function in one line:
%% Cell type:code id: tags:
```
def equal_chars(w, v):
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]
```
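%% Cell type:markdown id: tags:
As a possible alternative, the index arithmetic can be avoided altogether with the built-in `zip`, which stops at the end of the shorter string. This is only a sketch; the name `equal_chars_zip` is hypothetical and not part of the notebook:
%% Cell type:code id: tags:
```python
def equal_chars_zip(w, v):
    # zip() pairs up characters position by position and stops at the
    # end of the shorter string, so no explicit length check is needed
    return [a for a, b in zip(v, w) if a == b]

equal_chars_zip("Wiesbaden", "Potsdam")  # ['s', 'a']
```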
%% Cell type:markdown id: tags:
## Similarity
Now the number of equal characters between two strings defines a similarity measure. For example, the similarity of our two strings is:
%% Cell type:code id: tags:
```
len(equal_chars(v, w))
```
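%% Cell type:markdown id: tags:
The name "Hamming similarity" can be motivated as follows: for two strings of equal length, the number of matching positions and the classic Hamming distance (the number of differing positions) add up to the string length. The sketch below checks this for two six-letter capitals; `hamming_distance` is a helper introduced here for illustration only:
%% Cell type:code id: tags:
```python
def equal_chars(w, v):
    # same one-line definition as above, repeated to keep the cell self-contained
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def hamming_distance(w, v):
    # classic Hamming distance: number of differing positions,
    # defined only for strings of equal length
    if len(w) != len(v):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(w, v))

a, b = "Berlin", "Bremen"
similarity = len(equal_chars(a, b))   # matching positions: 'B' and 'n'
distance = hamming_distance(a, b)     # differing positions
# for equal-length strings, matches and mismatches add up to the length
assert similarity + distance == len(a)
```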
%% Cell type:markdown id: tags:
## Graph
Now given a set of strings, for example, the 16 capitals of all German states:
%% Cell type:code id: tags:
```
capitals_de = ["Berlin", "Bremen", "Dresden", "Düsseldorf", "Erfurt",
"Hamburg", "Hannover", "Kiel", "Magdeburg", "Mainz", "München",
"Potsdam", "Saarbrücken", "Schwerin", "Stuttgart", "Wiesbaden"]
```
%% Cell type:markdown id: tags:
we can create a graph with the strings as nodes by connecting every pair of strings whose similarity is larger than zero, that is, every pair with at least one position where the characters are equal:
%% Cell type:code id: tags:
```
import networkx as nx

def sim_graph(words):
    G = nx.Graph()                              # resulting graph
    for k, v in enumerate(words):               # first node
        for l, w in enumerate(words):           # second node
            if k > l:                           # avoid reverse duplicates
                ec = equal_chars(v, w)          # equal characters
                sim = len(ec)                   # similarity
                if sim > 0:                     # ignore dissimilar words
                    G.add_edge(v, w, label="".join(ec), weight=sim)  # add edge
    return G
```
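%% Cell type:markdown id: tags:
To see which edges result without building a networkx graph, the same pairing logic can be sketched with the standard library alone. The function name `sim_edges` and the dictionary representation are assumptions made for this illustration; `itertools.combinations` yields each unordered pair once, which plays the role of the `k > l` check above:
%% Cell type:code id: tags:
```python
from itertools import combinations

def equal_chars(w, v):
    # same one-line definition as above, repeated to keep the cell self-contained
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def sim_edges(words):
    """Return {(v, w): label} for all pairs with similarity > 0."""
    edges = {}
    for v, w in combinations(words, 2):   # each unordered pair exactly once
        ec = equal_chars(v, w)            # equal characters
        if ec:                            # ignore dissimilar words
            edges[(v, w)] = "".join(ec)
    return edges

# "Mainz" matches neither of the other two words, so it appears in no edge;
# in sim_graph such a word would not become a node either, because nodes
# are only created through add_edge
sim_edges(["Berlin", "Bremen", "Mainz"])  # {('Berlin', 'Bremen'): 'Bn'}
```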
%% Cell type:markdown id: tags:
Let's compute the graph for our set of capitals:
%% Cell type:code id: tags:
```
g = sim_graph(capitals_de)
```
%% Cell type:markdown id: tags:
A good way to understand a graph is to visualise it:
%% Cell type:code id: tags:
```
%matplotlib inline
from networkx.drawing.nx_agraph import graphviz_layout
import matplotlib.pyplot as plt
pos = graphviz_layout(g, prog='dot')
nx.draw(g, pos, with_labels=True, arrows=False)
```
%% Cell type:markdown id: tags:
This layout is not the best, but we can try using Graphviz directly:
%% Cell type:code id: tags:
```
from networkx.drawing.nx_pydot import write_dot
import pydot
from IPython.display import HTML, display
import random
write_dot(g, "graph.dot")
graph = pydot.graph_from_dot_file("graph.dot")
graph[0].write_svg("graph.svg")
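# random query string, presumably to keep the browser from showing a cached graph.svg;
# randint needs integer bounds (the float 2e9 raises a TypeError in Python 3.12+)
display(HTML('<img src="graph.svg?{0}">'.format(random.randint(0, 2 * 10**9))))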
```
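%% Cell type:markdown id: tags:
If pydot is not available, the DOT text can also be produced by hand, since an undirected DOT graph is just a list of `--` edges inside `graph { ... }`. The helper `to_dot` below is a hypothetical sketch that assumes edges are given as a `{(v, w): label}` dictionary and that names contain no double quotes. The result could be saved to a file and rendered with the Graphviz command line, for example `dot -Tsvg graph.dot -o graph.svg`:
%% Cell type:code id: tags:
```python
def to_dot(edges):
    """Render {(v, w): label} as the text of an undirected DOT graph."""
    lines = ["graph {"]
    for (v, w), label in edges.items():
        # quoting keeps names with umlauts or spaces valid DOT identifiers
        lines.append('  "{0}" -- "{1}" [label="{2}"];'.format(v, w, label))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot({("Berlin", "Bremen"): "Bn", ("Hamburg", "Hannover"): "Ha"})
print(dot)
```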
@@ -30,11 +30,12 @@ their difficulty (☆ = simple, ☆☆ = advanced, ☆☆☆ = sophisticated):
data, basic statistics and visualisation (☆☆)
- [[file:crawling_a_blog.ipynb][Crawling a blog]] :: crawling web sites, basic text mining, basic
statistics and visualisation (☆☆)
- [[file:distances.ipynb][Distances]] :: Comprehensive interactive simulation of recovering
- [[file:distances.ipynb][Distances]] :: comprehensive interactive simulation of recovering
information from noisy data (namely, point positions given their
noisy distance matrix) (☆☆☆)
- [[file:exponential_smoothing.ipynb][Exponential smoothing]] :: using [[https://ipywidgets.readthedocs.io/en/latest/examples/Widget%2520Basics.html][Jupyter's interactive widget]] to
explore [[https://en.wikipedia.org/wiki/Exponential_smoothing][exponential smoothing]] (☆)
- [[file:Hamming.ipynb][Hamming]] :: a graph visualising a strange type of word similarity (☆)
- [[file:Jupyter-Demo.ipynb][Jupyter demo]] :: demo of some Jupyter features useful for creating
learning material (☆)
- [[file:statistics_top50faculty.ipynb][Statistics top 50 faculty]] :: exploratory statistical analysis of the