Skip to content
Snippets Groups Projects
Hamming.ipynb 5.97 KiB
Newer Older
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Graphing a kind of \"Hamming Similarity\" of strings\n",
    "\n",
    "This notebook explores a slightly weird similarity measure for strings.\n",
    "\n",
    "## Equal characters in strings\n",
    "\n",
    "Given two strings, the idea is to consider the positions where their characters match:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "v = \"Wiesbaden\"\n",
    "w = \"Potsdam\"\n",
    "#       s a     – the matching characters of the two strings "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can extract those characters with a loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = []                                 # resulting equal characters\n",
    "for i in range(min(map(len, [v, w]))): # loop over the shortest word's length\n",
    "    if v[i] == w[i]:                   # check character equality \n",
    "        m.append(v[i])                 # add character\n",
    "m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create a function that, given two strings, returns their equal characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def equal_chars(w, v):\n",
    "    m = []                                 # resulting equal characters\n",
    "    for i in range(min(map(len, [v, w]))): # loop over the shortest word's length\n",
    "        if v[i] == w[i]:                   # check character equality \n",
    "            m.append(v[i])                 # add character\n",
    "    return m"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By the way: thanks to Python's [list comprehensions](https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions) we can write the body of the function in one line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def equal_chars(w, v):\n",
    "    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Similarity \n",
    "\n",
    "Now the number of equal characters between two strings defines a similarity measure. For example, the similarity of our two strings is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(equal_chars(v, w))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graph\n",
    "\n",
    "Now given a set of strings, for example, the 16 capitals of all German states:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "capitals_de = [\"Berlin\", \"Bremen\", \"Dresden\", \"Düsseldorf\", \"Erfurt\",\n",
    "         \"Hamburg\", \"Hannover\", \"Kiel\", \"Magdeburg\", \"Mainz\", \"München\",\n",
    "         \"Potsdam\", \"Saarbrücken\", \"Schwerin\", \"Stuttgart\", \"Wiesbaden\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we can create a graph with the strings as nodes by connecting strings whose similarity is larger than zero, that is, they have at least one position with equal characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "\n",
    "def sim_graph(words):\n",
    "    G = nx.Graph()                                                  # resulting graph\n",
    "\n",
    "    for k, v in enumerate(words):                                   # first node\n",
    "        for l, w in enumerate(words):                               # second node\n",
    "            if k > l:                                               # avoid reverse duplicates\n",
    "                ec = equal_chars(v, w)                              # equal characters\n",
    "                sim = len(ec)                                       # similarity\n",
    "                if sim > 0:                                         # ignore dissimilar words\n",
    "                    G.add_edge(v, w, label=\"\".join(ec), weight=sim) # add edge\n",
    "    return G"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compute the graph for our set of capitals:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "g = sim_graph(capitals_de)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A good way to understand a graph is to visualise it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "from networkx.drawing.nx_agraph import graphviz_layout\n",
    "import matplotlib.pyplot as plt\n",
    "    \n",
    "pos = graphviz_layout(g, prog='dot')\n",
    "nx.draw(g, pos, with_labels=True, arrows=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This layout is not the best but we can try to use graphviz directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from networkx.drawing.nx_pydot import write_dot\n",
    "import pydot\n",
    "from IPython.display import HTML, display\n",
    "import random\n",
    "\n",
    "write_dot(g, \"graph.dot\")\n",
    "graph = pydot.graph_from_dot_file(\"graph.dot\")\n",
    "graph[0].write_svg(\"graph.svg\")\n",
    "    \n",
    "display(HTML('<img src=\"graph.svg?{0}\">'.format(random.randint(0,2e9))))    "
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}