Commit 692e1cbd authored by Prof. Dr. Robert Jäschke

added Twitter example

parent 9c4e407b
%% Cell type:markdown id: tags:
# Graphing a kind of "Hamming similarity" for strings
This notebook explores a (probably slightly weird) similarity measure for strings.
## Equal characters in strings
Given two strings, our idea is to consider the positions where their characters match:
%% Cell type:code id: tags:
```
v = "Wiesbaden"
w = "Potsdam"
#       s a    – the matching characters of the two strings
```
%% Cell type:markdown id: tags:
We can extract those characters with a loop:
%% Cell type:code id: tags:
```
m = [] # resulting equal characters
for i in range(min(map(len, [v, w]))): # loop over the shortest word's length
    if v[i] == w[i]: # equal characters at this position?
        m.append(v[i]) # collect equal character
m
```
%% Cell type:markdown id: tags:
Let's create a function that, given two strings, returns their equal characters:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    m = [] # resulting equal characters
    for i in range(min(map(len, [v, w]))): # loop over the shortest word's length
        if v[i] == w[i]: # check character equality
            m.append(v[i]) # add character
    return m
```
%% Cell type:markdown id: tags:
By the way: thanks to Python's [list comprehensions](https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions) we can write the body of the function in one line:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]
```
%% Cell type:markdown id: tags:
Let's test our newly defined function:
%% Cell type:code id: tags:
```
equal_chars(v, w)
```
%% Cell type:markdown id: tags:
And with two different words:
%% Cell type:code id: tags:
```
equal_chars("Washington", "Massachusetts")
```
%% Cell type:markdown id: tags:
## Similarity
Now we regard the number of equal characters in two strings as a similarity measure. For example, the similarity of our two strings is:
%% Cell type:code id: tags:
```
len(equal_chars(v, w))
```
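%% Cell type:markdown id: tags:
For two strings of equal length, this measure is just the string length minus the classical [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance). A quick check (a minimal sketch; the `hamming` helper and the example words are ours, not part of the notebook):
%% Cell type:code id: tags:
```python
def equal_chars(v, w): # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def hamming(v, w): # classical Hamming distance for equal-length strings
    return sum(1 for a, b in zip(v, w) if a != b)

v2, w2 = "karolin", "kathrin" # two equal-length example words
len(equal_chars(v2, w2)), len(v2) - hamming(v2, w2) # both should be 4
```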
%% Cell type:markdown id: tags:
## Graph
Now given a set of strings, for example, the 16 capitals of all German states:
%% Cell type:code id: tags:
```
capitals_de = ["Berlin", "Bremen", "Dresden", "Düsseldorf", "Erfurt",
"Hamburg", "Hannover", "Kiel", "Magdeburg", "Mainz", "München",
"Potsdam", "Saarbrücken", "Schwerin", "Stuttgart", "Wiesbaden"]
```
%% Cell type:markdown id: tags:
or the names of the 16 German states:
%% Cell type:code id: tags:
```
states_de = ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg",
"Bremen", "Hamburg", "Hessen", "Mecklenburg-Vorpommern",
"Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz",
"Saarland", "Sachsen", "Sachsen-Anhalt",
"Schleswig-Holstein", "Thüringen"]
```
%% Cell type:markdown id: tags:
we can create a graph with the strings as nodes by connecting strings whose similarity is larger than zero, that is, they have at least one position with equal characters:
%% Cell type:code id: tags:
```
import networkx as nx
def sim_graph(words):
    G = nx.Graph() # resulting graph
    for k, v in enumerate(words): # first node
        for l, w in enumerate(words): # second node
            if k > l: # avoid reverse duplicates
                ec = equal_chars(v, w) # equal characters
                sim = len(ec) # similarity
                if sim > 0: # ignore dissimilar words
                    G.add_edge(v, w, label="".join(ec), weight=sim) # add edge
    return G
```
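%% Cell type:markdown id: tags:
The pair logic can be sanity-checked without networkx, using a plain dictionary of edges (a minimal sketch; `sim_edges` and the example words are ours, not part of the notebook):
%% Cell type:code id: tags:
```python
def equal_chars(v, w): # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def sim_edges(words): # same pair logic as sim_graph, but a plain dict
    edges = {}
    for k, v in enumerate(words): # first word
        for l, w in enumerate(words): # second word
            if k > l: # each unordered pair only once
                ec = equal_chars(v, w) # equal characters
                if ec: # only similar pairs become edges
                    edges[(v, w)] = "".join(ec)
    return edges

sim_edges(["Mainz", "Madrid", "Oslo"]) # only "Madrid"/"Mainz" share characters
```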
%% Cell type:markdown id: tags:
Let's compute the graph for our set of capitals or states:
%% Cell type:code id: tags:
```
g = sim_graph(capitals_de)
# alternatively: g = sim_graph(states_de)
```
%% Cell type:markdown id: tags:
A good way to understand a graph is to visualise it:
%% Cell type:code id: tags:
```
%matplotlib inline
from networkx.drawing.nx_agraph import graphviz_layout
import matplotlib.pyplot as plt
pos = graphviz_layout(g, prog='dot')
nx.draw(g, pos, with_labels=True, arrows=False)
nx.draw_networkx_edge_labels(g, pos, edge_labels=nx.get_edge_attributes(g, 'label'), font_color='blue')
plt.show()
```
%% Cell type:markdown id: tags:
This layout is not the best, so it's better to use graphviz directly:
%% Cell type:code id: tags:
```
from networkx.drawing.nx_pydot import write_dot
import pydot
from IPython.display import HTML, display
import random
write_dot(g, "graph.dot")
graph = pydot.graph_from_dot_file("graph.dot")
graph[0].write_svg("graph.svg")
display(HTML('<img src="graph.svg?{0}">'.format(random.randint(0,2e9))))
```
their difficulty (☆ = simple, ☆☆ = advanced, ☆☆☆ = sophisticated):
- [[file:statistics_top50faculty.ipynb][Statistics top 50 faculty]] :: exploratory statistical analysis of the
[[http://cs.brown.edu/people/apapouts/faculty_dataset.html][dataset of 2200 faculty in 50 top US computer science graduate
programs]] (☆☆)
- [[file:Twitter.ipynb][Twitter]] :: analysing Twitter data (raw JSON from Twitter's API) (☆)
- [[file:classification.ipynb][Classification]] :: basic machine learning classification example (☆)
- [[file:community_detection.ipynb][Community detection]] :: applying community detection algorithms to
network graphs (☆☆)
%% Cell type:markdown id: tags:
# Analysing Twitter Data
The input is assumed to be one tweet per line in [JSON](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) as delivered by Twitter's streaming API.
The subsequent examples read data from a compressed file containing a 1% sample of the tweets of one day.
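%% Cell type:markdown id: tags:
A single line of such a file contains one JSON object; heavily abridged (the values below are made up for illustration), its relevant parts look like this:
%% Cell type:code id: tags:
```python
import json

# a made-up, heavily abridged tweet object, as it might appear on one line
line = '{"user": {"id_str": "12345", "screen_name": "example_user"}, "entities": {"hashtags": [{"text": "python"}]}}'
js = json.loads(line) # parse the line
js["user"]["screen_name"], [h["text"] for h in js["entities"]["hashtags"]]
```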
## top tweeting users
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter

# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f: # reading the compressed file is faster
    usertweets = Counter() # let's count tweets per user
    usernames = {} # map user ids to names
    for tweet in f:
        js = json.loads(tweet) # parse JSON
        if "user" not in js: # skip non-tweet lines
            continue
        user = js["user"] # get user object
        userid = user["id_str"] # get user id as string
        usertweets[userid] += 1 # count tweet
        usernames[userid] = user["screen_name"] # map id to name
for (usr, cnt) in usertweets.most_common(20): # loop over top twenty
    print(cnt, usernames[usr], usr, sep='\t') # output tweet count, name, id
```
%% Cell type:markdown id: tags:
## top hashtags
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter

# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f: # reading the compressed file is faster
    hashtags = Counter() # let's count tweets per hashtag
    for tweet in f:
        js = json.loads(tweet) # parse JSON
        if "user" not in js: # skip non-tweet lines
            continue
        if "hashtags" not in js["entities"]: # skip tweets without hashtags
            continue
        for hashtag in js["entities"]["hashtags"]: # loop over hashtag objects
            hashtags[hashtag["text"]] += 1 # count each hashtag's text
for (ht, cnt) in hashtags.most_common(20): # loop over top twenty
    print(cnt, ht, sep='\t') # output tweet count, hashtag
```
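%% Cell type:markdown id: tags:
To see the counting logic work without the large file, we can run it over a few in-memory lines (the tweets below are invented for illustration):
%% Cell type:code id: tags:
```python
import json
from collections import Counter

lines = [ # three made-up stream lines, one JSON object each
    '{"user": {"id_str": "1"}, "entities": {"hashtags": [{"text": "nlp"}]}}',
    '{"user": {"id_str": "2"}, "entities": {"hashtags": [{"text": "nlp"}, {"text": "python"}]}}',
    '{"delete": {"status": {}}}', # a non-tweet line, as they occur in the stream
]
hashtags = Counter() # count occurrences per hashtag text
for tweet in lines:
    js = json.loads(tweet) # parse JSON
    if "user" not in js: # skip non-tweet lines
        continue
    for hashtag in js["entities"].get("hashtags", []): # hashtag objects
        hashtags[hashtag["text"]] += 1 # count each hashtag's text
hashtags.most_common(2) # [('nlp', 2), ('python', 1)]
```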