This layout is not the best, so it's better to use graphviz directly:
%% Cell type:markdown id: tags:
# Graphing a kind of "Hamming similarity" for strings
This notebook explores a (probably slightly weird) similarity measure for strings.
## Equal characters in strings
Given two strings, the idea is to consider the positions where their characters match:
%% Cell type:code id: tags:
```
v = "Wiesbaden"
w = "Potsdam"
# "s" and "a" are the matching characters of the two strings
```
%% Cell type:markdown id: tags:
We can extract those characters with a loop:
%% Cell type:code id: tags:
```
m = []  # resulting equal characters
for i in range(min(map(len, [v, w]))):  # loop over the shortest word's length
    if v[i] == w[i]:     # equal characters at this position?
        m.append(v[i])   # collect the equal character
m
```
%% Cell type:markdown id: tags:
Let's create a function that, given two strings, returns their equal characters:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    m = []  # resulting equal characters
    for i in range(min(map(len, [v, w]))):  # loop over the shortest word's length
        if v[i] == w[i]:     # equal characters at this position?
            m.append(v[i])   # collect the equal character
    return m
```
%% Cell type:markdown id: tags:
By the way: thanks to Python's [list comprehensions](https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions) we can write the body of the function in one line:
%% Cell type:code id: tags:
```
def equal_chars(v, w):
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]
```
%% Cell type:markdown id: tags:
Let's test our newly defined function:
%% Cell type:code id: tags:
```
equal_chars(v, w)
```
%% Cell type:markdown id: tags:
And with two different words:
%% Cell type:code id: tags:
```
equal_chars("Washington", "Massachusetts")
```
%% Cell type:markdown id: tags:
## Similarity
Now we regard the number of equal characters in two strings as a similarity measure. For example, the similarity of our two strings is:
%% Cell type:code id: tags:
```
len(equal_chars(v, w))
```
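%% Cell type:markdown id: tags:
For later reuse, this measure could be wrapped in a small helper (the name `similarity` is our choice and does not appear in the notebook so far; the cell repeats the one-line `equal_chars` definition so it stands on its own):
%% Cell type:code id: tags:
```
def equal_chars(v, w):  # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

def similarity(v, w):
    # the number of positions at which both strings carry the same character
    return len(equal_chars(v, w))

similarity("Wiesbaden", "Potsdam")  # "s" and "a" match, so the similarity is 2
```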
%% Cell type:markdown id: tags:
## Graph
Now, given a set of strings, for example the capitals of the 16 German states, we can create a graph with the strings as nodes by connecting strings whose similarity is larger than zero, that is, strings that have at least one position with equal characters:
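%% Cell type:markdown id: tags:
A minimal sketch of this construction in plain Python, using only a few of the capitals as a stand-in for the full list of 16 (and repeating the `equal_chars` definition so the cell stands on its own):
%% Cell type:code id: tags:
```
def equal_chars(v, w):  # as defined above
    return [v[i] for i in range(min(map(len, [v, w]))) if v[i] == w[i]]

capitals = ["Wiesbaden", "Potsdam", "Dresden", "Erfurt"]  # a small sample, not all 16

edges = []  # pairs of strings with similarity larger than zero
for i, a in enumerate(capitals):
    for b in capitals[i + 1:]:          # consider each unordered pair once
        if len(equal_chars(a, b)) > 0:  # at least one position with equal characters?
            edges.append((a, b))
edges
```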
The input is assumed to be one tweet per line in [JSON](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) as delivered by Twitter's streaming API.
The subsequent examples read data from a compressed file containing a 1% sample of the tweets of one day.
## Top tweeting users
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter
# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f:  # reading the compressed file is faster
    usertweets = Counter()  # let's count tweets per user
    usernames = {}          # map user ids to names
    for tweet in f:
        js = json.loads(tweet)    # parse JSON
        if "user" not in js:      # skip non-tweet lines
            continue
        user = js["user"]              # get user object
        userid = user["id_str"]        # get user id as string
        usertweets[userid] += 1        # count tweet
        usernames[userid] = user["screen_name"]  # map id to name
    for (usr, cnt) in usertweets.most_common(20):  # loop over top twenty
        print(cnt, usernames[usr], usr, sep='\t')  # output tweet count, name, id
```
%% Cell type:markdown id: tags:
## Top hashtags
%% Cell type:code id: tags:
```
import gzip as gz
import json
from collections import Counter
# this likely requires several minutes
with gz.open("statuses.log.2017-09-24.gz") as f:  # reading the compressed file is faster
    hashtags = Counter()  # let's count tweets per hashtag
    for tweet in f:
        js = json.loads(tweet)    # parse JSON
        if "user" not in js:      # skip non-tweet lines
            continue
        if "hashtags" not in js["entities"]:  # skip tweets without hashtags
            continue
        for hashtag in js["entities"]["hashtags"]:  # loop over hashtag objects
            hashtags[hashtag["text"]] += 1          # count each hashtag by its text
    for (ht, cnt) in hashtags.most_common(20):  # loop over top twenty