Commit bef8275c authored by Prof. Dr. Robert Jäschke

added regex example

parent 9f031762
......@@ -32,6 +32,8 @@ So far, notebooks are listed by difficulty, indicated by stars (☆ = simple,
- [[file:notebooks/Twitter.ipynb][Twitter]] :: analysing Twitter data (raw JSON from Twitter's API) (☆)
- [[file:notebooks/wikipedia_language_editions.ipynb][Wikipedia language editions]] :: plotting the depth and number of
articles of different Wikipedia language editions (☆)
- [[file:notebooks/wikipedia_regex.ipynb][Regular expressions]] :: simple information extraction from Wikipedia
articles (☆)
- [[file:notebooks/amazon_reviews.ipynb][Amazon reviews]] :: crawling web sites with [[https://scrapy.org/][Scrapy]], processing JSON
data, basic statistics and visualisation (☆☆)
- [[file:notebooks/Art.ipynb][Art]] :: Creating computer-generated art by translation, scaling and
......
%% Cell type:markdown id: tags:
# Applying regular expressions, using Wikipedia pages as an example
%% Cell type:markdown id: tags:
## IP addresses
We want to extract all IP addresses from the [corresponding Wikipedia page](https://de.wikipedia.org/wiki/IP-Adresse). To do so, we use the regular expression `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`. It is quite general and admits many patterns that are not IP addresses (e.g. 1000.1000.1000.1000), but it is entirely sufficient for this example:
%% Cell type:code id: tags:
``` python
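import urllib.request
import re

# Download the German Wikipedia article on IP addresses.
with urllib.request.urlopen("https://de.wikipedia.org/wiki/IP-Adresse") as f:
    html = f.read().decode('utf8')

# Raw string, so that "\." reaches the regex engine as an escaped dot.
for ipaddress in sorted(re.findall(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", html)):
    print(ipaddress)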
```
%% Output
0.0.0.0
0.0.0.0
0.255.255.255
000.000.000.003
10.0.0.0
10.0.0.0
10.255.255.255
100.127.255.255
100.64.0.0
100.64.0.0
127.0.0.0
127.0.0.0
127.0.0.0
127.0.0.1
127.0.0.1
127.255.255.255
128.0.0.0
128.0.0.0
128.0.255.255
13.0.0.0
14.0.0.0
14.0.0.0
14.255.255.255
169.254.0.0
169.254.0.0
169.254.255.255
172.16.0.0
172.16.0.0
172.31.255.255
191.255.0.0
191.255.0.0
191.255.255.255
192.0.0.0
192.0.0.0
192.0.0.0
192.0.0.0
192.0.0.255
192.0.0.7
192.0.2.0
192.0.2.0
192.0.2.255
192.0.2.42
192.168.0.0
192.168.0.0
192.168.0.254
192.168.0.254
192.168.0.254
192.168.0.254
192.168.0.255
192.168.0.255
192.168.1.1
192.168.1.2
192.168.2.254
192.168.2.254
192.168.2.254
192.168.255.255
192.88.99.0
192.88.99.0
192.88.99.255
198.18.0.0
198.18.0.0
198.19.255.255
198.51.100.0
198.51.100.0
198.51.100.255
203.0.113.0
203.0.113.0
203.0.113.192
203.0.113.195
203.0.113.195
203.0.113.255
203.000.113.192
203.000.113.195
203.000.113.195
223.255.255.0
223.255.255.0
223.255.255.255
224.0.0.0
224.0.0.0
239.255.255.255
24.0.0.0
24.0.0.0
24.255.255.255
240.0.0.0
240.0.0.0
255.255.255.192
255.255.255.224
255.255.255.224
255.255.255.224
255.255.255.255
255.255.255.255
255.255.255.255
255.255.255.255
340.282.366.920
39.0.0.0
39.0.0.0
39.255.255.255
4.294.967.296
53.0.0.0
54.0.0.0
607.431.768.211
665.570.793.348
9.0.0.0
93.184.216.34
938.463.463.374
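%% Cell type:markdown id: tags:
The output above contains matches such as `340.282.366.920` and `4.294.967.296`, which are fragments of large numbers rather than IP addresses. One possible refinement, shown here only as a minimal sketch, is to keep the same pattern and filter the matches afterwards so that every part lies between 0 and 255. The names `IP_PATTERN`, `is_plausible_ipv4`, `extract_ipv4` and the `sample` string are illustrative assumptions and not part of the original notebook. The sketch still accepts leading zeros (e.g. `000.000.000.003`) and removes duplicates:
%% Cell type:code id: tags:
```python
import re

# Same pattern as in the cell above, compiled once (hypothetical name).
IP_PATTERN = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def is_plausible_ipv4(candidate):
    """Return True if all dot-separated parts are numbers in 0..255."""
    return all(0 <= int(part) <= 255 for part in candidate.split("."))


def extract_ipv4(text):
    """Return the sorted, de-duplicated plausible IPv4 addresses in text."""
    return sorted({m for m in IP_PATTERN.findall(text) if is_plausible_ipv4(m)})


# Illustrative input instead of the downloaded page, so the cell runs offline.
sample = ("Valid: 192.168.0.1 and 10.0.0.0; "
          "not valid: 1000.1000.1000.1000 and 4.294.967.296")
print(extract_ipv4(sample))
```

To apply the filter to the Wikipedia page, `extract_ipv4(html)` could be called with the `html` string downloaded earlier. Note that the result is sorted as strings, as in the original cell, not numerically.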