Commit 692e1cbd authored by Prof. Dr. Robert Jäschke

added Twitter example

parent 9c4e407b
%% Cell type:markdown id: tags:
# Graphing a kind of "Hamming similarity" for strings
This notebook explores a (probably slightly weird) similarity measure for strings.
## Equal characters in strings
Given two strings, our idea is to consider the positions where their characters match:
%% Cell type:code id: tags:
```
v = "Wiesbaden"
w = "Potsdam"
#       s a    – the matching characters of the two strings
```
%% Cell type:markdown id: tags:
We can extract those characters with a loop:
%% Cell type:code id: tags:
```
m = [] # resulting equal characters
for i in range(min(map(len, [v, w]))): # loop over the shortest word's length
    if v[i] == w[i]: # equal characters at this position?
        m.append(v[i]) # collect equal character
m
```
%% Cell type:markdown id: tags:
Let's create a function that, given two strings, returns their equal characters:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    m = [] # resulting equal characters
    for i in range(min(map(len, [v, w]))): # loop over the shortest word's length
        if v[i] == w[i]: # check character equality
            m.append(v[i]) # add character
    return m
```
%% Cell type:markdown id: tags:
By the way: thanks to Python's [list comprehensions](https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions) we can write the body of the function in one line:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]
```
%% Cell type:markdown id: tags:
Let's test our newly defined function:
%% Cell type:code id: tags:
```
equal_chars(v, w)
```
%% Cell type:markdown id: tags:
And with two different words:
%% Cell type:code id: tags:
```
equal_chars("Washington", "Massachusetts")
```
%% Cell type:markdown id: tags:
## Similarity
Now we regard the number of equal characters in two strings as a similarity measure. For example, the similarity of our two strings is:
%% Cell type:code id: tags:
```
len(equal_chars(v, w))
```
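%% Cell type:markdown id: tags:
For two strings of equal length, this measure is just the string length minus the classical [Hamming distance](https://en.wikipedia.org/wiki/Hamming_distance). A quick check (a minimal sketch; the `hamming` helper and the example words are ours, not part of the notebook):
%% Cell type:code id: tags:
```python
def equal_chars(v, w): # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def hamming(v, w): # classical Hamming distance for equal-length strings
    return sum(1 for a, b in zip(v, w) if a != b)

v2, w2 = "karolin", "kathrin" # two equal-length example words
len(equal_chars(v2, w2)), len(v2) - hamming(v2, w2) # both should be 4
```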
%% Cell type:markdown id: tags:
## Graph
Now given a set of strings, for example, the 16 capitals of all German states:
%% Cell type:code id: tags:
```
capitals_de = ["Berlin", "Bremen", "Dresden", "Düsseldorf", "Erfurt",
"Hamburg", "Hannover", "Kiel", "Magdeburg", "Mainz", "München",
"Potsdam", "Saarbrücken", "Schwerin", "Stuttgart", "Wiesbaden"]
```
%% Cell type:markdown id: tags:
or the names of the 16 German states:
%% Cell type:code id: tags:
```
states_de = ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg",
"Bremen", "Hamburg", "Hessen", "Mecklenburg-Vorpommern",
"Niedersachsen", "Nordrhein-Westfalen", "Rheinland-Pfalz",
"Saarland", "Sachsen", "Sachsen-Anhalt",
"Schleswig-Holstein", "Thüringen"]
```
%% Cell type:markdown id: tags:
we can create a graph with the strings as nodes by connecting strings whose similarity is larger than zero, that is, they have at least one position with equal characters:
%% Cell type:code id: tags:
```
import networkx as nx
def sim_graph(words):
    G = nx.Graph() # resulting graph
    for k, v in enumerate(words): # first node
        for l, w in enumerate(words): # second node
            if k > l: # avoid reverse duplicates
                ec = equal_chars(v, w) # equal characters
                sim = len(ec) # similarity
                if sim > 0: # ignore dissimilar words
                    G.add_edge(v, w, label="".join(ec), weight=sim) # add edge
    return G
```
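%% Cell type:markdown id: tags:
The pair logic can be sanity-checked without networkx, using a plain dictionary of edges (a minimal sketch; `sim_edges` and the example words are ours, not part of the notebook):
%% Cell type:code id: tags:
```python
def equal_chars(v, w): # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def sim_edges(words): # same pair logic as sim_graph, but a plain dict
    edges = {}
    for k, v in enumerate(words): # first word
        for l, w in enumerate(words): # second word
            if k > l: # each unordered pair only once
                ec = equal_chars(v, w) # equal characters
                if ec: # only similar pairs become edges
                    edges[(v, w)] = "".join(ec)
    return edges

sim_edges(["Mainz", "Madrid", "Oslo"]) # only "Madrid"/"Mainz" share characters
```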
%% Cell type:markdown id: tags:
Let's compute the graph for our set of capitals or states:
%% Cell type:code id: tags:
```
g = sim_graph(capitals_de)
# alternatively: g = sim_graph(states_de)
```
%% Cell type:markdown id: tags:
A good way to understand a graph is to visualise it:
%% Cell type:code id: tags:
```
%matplotlib inline
from networkx.drawing.nx_agraph import graphviz_layout
import matplotlib.pyplot as plt
pos = graphviz_layout(g, prog='dot')
nx.draw(g, pos, with_labels=True, arrows=False)
nx.draw_networkx_edge_labels(g, pos, edge_labels=nx.get_edge_attributes(g, 'label'), font_color='blue')
plt.show()
```
%% Cell type:markdown id: tags:
This layout is not the best, so it's better to use graphviz directly:
%% Cell type:code id: tags:
```
from networkx.drawing.nx_pydot import write_dot
import pydot
from IPython.display import HTML, display
import random
write_dot(g, "graph.dot")
graph = pydot.graph_from_dot_file("graph.dot")
graph[0].write_svg("graph.svg")
display(HTML('<img src="graph.svg?{0}">'.format(random.randint(0,2e9))))
```
their difficulty (☆ = simple, ☆☆ = advanced, ☆☆☆ = sophisticated):
- [[file:statistics_top50faculty.ipynb][Statistics top 50 faculty]] :: exploratory statistical analysis of the
[[http://cs.brown.edu/people/apapouts/faculty_dataset.html][dataset of 2200 faculty in 50 top US computer science graduate
programs]] (☆☆)
- [[file:Twitter.ipynb][Twitter]] :: analysing Twitter data (raw JSON from Twitter's API) (☆)
- [[file:classification.ipynb][Classification]] :: basic machine learning classification example (☆)
- [[file:community_detection.ipynb][Community detection]] :: applying community detection algorithms to
network graphs (☆☆)
%% Cell type:markdown id: tags:
# Analysing Twitter Data
The input is assumed to be one tweet per line in [JSON](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) as delivered by Twitter's streaming API.
The subsequent examples read data from a compressed file containing a 1% sample of the tweets of one day.
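%% Cell type:markdown id: tags:
A single line of such a file contains one JSON object; heavily abridged (the values below are made up for illustration), its relevant parts look like this:
%% Cell type:code id: tags:
```python
import json

# a made-up, heavily abridged tweet object, as it might appear on one line
line = '{"user": {"id_str": "12345", "screen_name": "example_user"}, "entities": {"hashtags": [{"text": "python"}]}}'
js = json.loads(line) # parse the line
js["user"]["screen_name"], [h["text"] for h in js["entities"]["hashtags"]]
```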
## top tweeting users
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter

# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f: # reading the compressed file is faster
    usertweets = Counter() # let's count tweets per user
    usernames = {} # map user ids to names
    for tweet in f:
        js = json.loads(tweet) # parse JSON
        if "user" not in js: # skip non-tweet lines
            continue
        user = js["user"] # get user object
        userid = user["id_str"] # get user id as string
        usertweets[userid] += 1 # count tweet
        usernames[userid] = user["screen_name"] # map id to name
for (usr, cnt) in usertweets.most_common(20): # loop over top twenty
    print(cnt, usernames[usr], usr, sep='\t') # output tweet count, name, id
```
%% Cell type:markdown id: tags:
## top hashtags
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter

# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f: # reading the compressed file is faster
    hashtags = Counter() # let's count tweets per hashtag
    for tweet in f:
        js = json.loads(tweet) # parse JSON
        if "user" not in js: # skip non-tweet lines
            continue
        if "hashtags" not in js["entities"]: # skip tweets without hashtags
            continue
        for hashtag in js["entities"]["hashtags"]: # loop over hashtag objects
            hashtags[hashtag["text"]] += 1 # count each hashtag's text
for (ht, cnt) in hashtags.most_common(20): # loop over top twenty
    print(cnt, ht, sep='\t') # output tweet count, hashtag
```
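%% Cell type:markdown id: tags:
To see the counting logic work without the large file, we can run it over a few in-memory lines (the tweets below are invented for illustration):
%% Cell type:code id: tags:
```python
import json
from collections import Counter

lines = [ # three made-up stream lines, one JSON object each
    '{"user": {"id_str": "1"}, "entities": {"hashtags": [{"text": "nlp"}]}}',
    '{"user": {"id_str": "2"}, "entities": {"hashtags": [{"text": "nlp"}, {"text": "python"}]}}',
    '{"delete": {"status": {}}}', # a non-tweet line, as they occur in the stream
]
hashtags = Counter() # count occurrences per hashtag text
for tweet in lines:
    js = json.loads(tweet) # parse JSON
    if "user" not in js: # skip non-tweet lines
        continue
    for hashtag in js["entities"].get("hashtags", []): # hashtag objects
        hashtags[hashtag["text"]] += 1 # count each hashtag's text
hashtags.most_common(2) # [('nlp', 2), ('python', 1)]
```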