+lambda

Commit 01011f15 authored 2 years ago by hachmeis@hu-berlin.de
--- a/programmierspass/pandas.ipynb
+++ b/programmierspass/pandas.ipynb
@@ -2,7 +2,7 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "3ff02aca",
+   "id": "532e7e1e",
   "metadata": {},
   "source": [
    "# Reading and analyzing data with Pandas\n",
@@ -11,7 +11,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "9f62ac40",
+   "id": "164b9c0d",
   "metadata": {},
   "source": [
    "## Installation\n",
@@ -22,7 +22,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "d82392ea",
+   "id": "00c1cced",
   "metadata": {},
   "source": [
    "## Import Statement\n",
@@ -32,7 +32,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "115c3765",
+   "id": "f1048116",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -46,7 +46,7 @@
    }
   },
   "cell_type": "markdown",
-   "id": "ad369b95",
+   "id": "d2976c77",
   "metadata": {},
   "source": [
    "## DataFrames\n",
@@ -57,7 +57,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "95750ea4",
+   "id": "34e54a2a",
   "metadata": {},
   "source": [
    "### How do you create DataFrames?\n",
@@ -68,7 +68,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "92c2067a",
+   "id": "bc1bb54d",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -78,7 +78,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "8fd3f91c",
+   "id": "94ff20fa",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -87,7 +87,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "3da8dfae",
+   "id": "87eae82c",
   "metadata": {},
   "source": [
    "#### Reading an Excel file\n",
@@ -97,7 +97,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "a3784fa3",
+   "id": "21a56d1a",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -106,7 +106,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "5275f2d5",
+   "id": "ea4ae721",
   "metadata": {},
   "source": [
    "### Reading a CSV file\n",
@@ -119,7 +119,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "6b7b4c76",
+   "id": "b5735f8f",
   "metadata": {},
   "outputs": [],
   "source": [
@@ -128,7 +128,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "3a9c7760",
+   "id": "457ee3a5",
   "metadata": {},
   "source": [
    "## Reading HTML"
@@ -137,14 +137,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "26257be1",
+   "id": "82e944d3",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "7323c72e",
+   "id": "9e87daca",
   "metadata": {},
   "source": [
    "## Data cleaning\n",
@@ -155,14 +155,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "51571e56",
+   "id": "d20e1881",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "82ba7b2a",
+   "id": "addda084",
   "metadata": {},
   "source": [
    "### Replacing missing values\n",
@@ -172,14 +172,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "8accf74e",
+   "id": "34637e3b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "6a88e31f",
+   "id": "e77785ca",
   "metadata": {},
   "source": [
    "## Useful DataFrame functions\n",
@@ -189,14 +189,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "6f0980aa",
+   "id": "b3b5f090",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "77fb244e",
+   "id": "7a4ac4a4",
   "metadata": {},
   "source": [
    "### Show the last rows"
@@ -205,14 +205,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "9aaed85c",
+   "id": "a2e39b7c",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "818060c4",
+   "id": "1b479d1f",
   "metadata": {},
   "source": [
    "### Data types of the DataFrame's columns"
@@ -221,14 +221,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "b702bdb3",
+   "id": "0d3ff72a",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "31f5e502",
+   "id": "878d7811",
   "metadata": {},
   "source": [
    "### Descriptive statistics of the DataFrame"
@@ -237,14 +237,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "0568d2a0",
+   "id": "23647869",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "a20a3e39",
+   "id": "a12eb0cb",
   "metadata": {},
   "source": [
    "### Descriptive statistics per column"
@@ -253,14 +253,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "08339b12",
+   "id": "80d5d597",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "af34f55c",
+   "id": "9c3caccf",
   "metadata": {},
   "source": [
    "## Selecting data from DataFrames\n",
@@ -271,14 +271,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "05ba9409",
+   "id": "7bafdfc1",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "db70c055",
+   "id": "97aad3de",
   "metadata": {},
   "source": [
    "### Extracting individual cells\n",
@@ -291,14 +291,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "e82303e6",
+   "id": "ade5a46f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "fae19c23",
+   "id": "ab307fb7",
   "metadata": {},
   "source": [
    "Cells can also be extracted by specifying the index and column labels."
@@ -307,14 +307,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "461d1edc",
+   "id": "350cae87",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "8c61bb24",
+   "id": "fe5a3611",
   "metadata": {},
   "source": [
    "## Assigning new columns, rows, and values\n",
@@ -324,14 +324,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "98104be1",
+   "id": "2f1eaa39",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "2b7233a6",
+   "id": "131b37ea",
   "metadata": {},
   "source": [
    "### New row"
@@ -340,14 +340,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "405e7c0f",
+   "id": "599d6c3b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "754f0b5d",
+   "id": "dea1aa70",
   "metadata": {},
   "source": [
    "### New values in a cell"
@@ -356,7 +356,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "e4a1bd03",
+   "id": "7071f172",
   "metadata": {},
   "outputs": [],
   "source": []
@@ -368,7 +368,7 @@
    }
   },
   "cell_type": "markdown",
-   "id": "3d116064",
+   "id": "6c319518",
   "metadata": {},
   "source": [
    "## Grouping\n",
@@ -379,7 +379,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "15351777",
+   "id": "92d1e6be",
   "metadata": {},
   "source": [
    "Grouping without an aggregate function returns a `DataFrameGroupBy` object (see [DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#)), not a DataFrame. Only after the desired aggregate function is called do you get a DataFrame back."
@@ -388,14 +388,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "9e06a8d3",
+   "id": "1c106732",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "448eaac0",
+   "id": "af7210de",
   "metadata": {},
   "source": [
    "## Filtering DataFrames by conditions"
@@ -404,7 +404,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "0f14d0bc",
+   "id": "92438742",
   "metadata": {},
   "outputs": [],
   "source": []
@@ -416,7 +416,7 @@
    }
   },
   "cell_type": "markdown",
-   "id": "c8e086c7",
+   "id": "256527fc",
   "metadata": {},
   "source": [
    "## Combining DataFrames (merge)\n",
@@ -427,14 +427,14 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "a2d50399",
+   "id": "182f57ef",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
-   "id": "21137c01",
+   "id": "d6f9a0dd",
   "metadata": {},
   "source": [
    "## Plots"
@@ -443,7 +443,7 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "23d198b9",
+   "id": "709288ea",
   "metadata": {},
   "outputs": [],
   "source": []
@@ -455,7 +455,7 @@
    }
   },
   "cell_type": "markdown",
-   "id": "778c5013",
+   "id": "8b1d4e55",
   "metadata": {},
   "source": [
    "## Pivot\n",
@@ -466,15 +466,24 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "3a76bdb0",
+   "id": "0dbc5cee",
   "metadata": {},
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "id": "acc6f66c",
+   "metadata": {},
+   "source": [
+    "## What to do when the function you need is not available in *Pandas*?\n",
+    "With [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), functions can be applied to entire Series or DataFrames. Custom functions written with `lambda` can be applied as well."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "6b601c97",
+   "id": "c6c2876f",
   "metadata": {},
   "outputs": [],
   "source": []

-%% Cell type:markdown id:3ff02aca tags:
+%% Cell type:markdown id:532e7e1e tags:
 # Reading and analyzing data with Pandas
 This notebook covers reading and analyzing data with the Python library [*Pandas*](https://pandas.pydata.org/docs/index.html).
-%% Cell type:markdown id:9f62ac40 tags:
+%% Cell type:markdown id:164b9c0d tags:
 ## Installation
 - Terminal: `pip install pandas` or `conda install pandas`.
 - Jupyter notebook: `!pip install pandas` or `!conda install pandas`.
-%% Cell type:markdown id:d82392ea tags:
+%% Cell type:markdown id:00c1cced tags:
 ## Import Statement
 As with other Python libraries, *Pandas* has to be imported before it can be used. By convention, *Pandas* is abbreviated as `pd`, so functions called as `pd.FUNCTION_NAME()` are usually *Pandas* functions.
-%% Cell type:code id:115c3765 tags:
+%% Cell type:code id:f1048116 tags:
 ``` 
 import pandas as pd
 ```
-%% Cell type:markdown id:ad369b95 tags:
+%% Cell type:markdown id:d2976c77 tags:
 ## DataFrames
 The [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) is a two-dimensional data type in *Pandas* for working with tabular data. By convention, variables that reference DataFrames are named `df`.
 ![dataframe.webp](attachment:dataframe.webp)
 Source: https://pynative.com/python-pandas-dataframe/
-%% Cell type:markdown id:95750ea4 tags:
+%% Cell type:markdown id:34e54a2a tags:
 ### How do you create DataFrames?
 #### Manual creation
 There are several ways to do this. One direct way is a Python dictionary, which sets the column names and the values in one step: each key becomes a column name, and the list stored under that key holds that column's values.
-%% Cell type:code id:92c2067a tags:
+%% Cell type:code id:bc1bb54d tags:
 ``` 
 adict = {'a': 1, 'b': 2, 'c': 3}
 ```
-%% Cell type:code id:8fd3f91c tags:
+%% Cell type:code id:94ff20fa tags:
 ``` 
 pd.DataFrame(adict.values(), index=adict.keys())
 ```
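The dictionary-of-lists form described above can be sketched as follows. The column names and values are made up for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Hypothetical example data: each key is a column name,
# each list holds the values of that column (all lists have the same length).
data = {
    "title": ["Book A", "Book B", "Book C"],
    "year": [1999, 2005, 2012],
}

# The keys become the columns, the list positions become the rows.
df_manual = pd.DataFrame(data)
print(df_manual)
```

In contrast to the cell above, which passes the keys as the row index, the keys here end up as column names.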
-%% Cell type:markdown id:3da8dfae tags:
+%% Cell type:markdown id:87eae82c tags:
 #### Reading an Excel file
 In practice, DataFrames are usually created by reading data from external formats such as Excel or CSV files.
-%% Cell type:code id:a3784fa3 tags:
+%% Cell type:code id:21a56d1a tags:
 ``` 
 df = pd.read_excel('example.xlsx')
 ```
-%% Cell type:markdown id:5275f2d5 tags:
+%% Cell type:markdown id:ea4ae721 tags:
 ### Reading a CSV file
 This example uses the [Books dataset](https://www.kaggle.com/datasets/jalota/books-dataset).
 <div class="alert alert-info">
 <b>Note:</b> When reading CSV files, pay attention to the <i>sep</i> parameter. It specifies which character is used as the delimiter.
 </div>
-%% Cell type:code id:6b7b4c76 tags:
+%% Cell type:code id:b5735f8f tags:
 ``` 
 df = pd.read_csv('books.csv')
 ```
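A minimal sketch of why `sep` matters. It assumes a semicolon-separated input, which is common in exports from German-language tools; the text in `csv_text` is invented and read from memory via `io.StringIO` instead of a real file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk.
csv_text = "title;year\nBook A;1999\nBook B;2005\n"

# With the default sep="," every line lands in one single column.
df_wrong = pd.read_csv(io.StringIO(csv_text))

# Passing the actual delimiter splits the lines into two columns.
df_right = pd.read_csv(io.StringIO(csv_text), sep=";")

print(df_wrong.shape)
print(df_right.shape)
```

Checking `df.shape` or `df.columns` right after reading a file is a quick way to spot a wrong delimiter.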
-%% Cell type:markdown id:3a9c7760 tags:
+%% Cell type:markdown id:457ee3a5 tags:
 ## Reading HTML
-%% Cell type:code id:26257be1 tags:
+%% Cell type:code id:82e944d3 tags:
 ``` 
 ```
-%% Cell type:markdown id:7323c72e tags:
+%% Cell type:markdown id:9e87daca tags:
 ## Data cleaning
 ### Correcting data types
 When data is read in, not every column ends up with the appropriate data type in the DataFrame. Such columns have to be converted manually with the [`astype()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) method.
-%% Cell type:code id:51571e56 tags:
+%% Cell type:code id:d20e1881 tags:
 ``` 
 ```
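As a rough illustration of `astype()`, the sketch below converts two columns that were read as text into numeric types. The DataFrame `df_types` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data in which numbers arrived as strings.
df_types = pd.DataFrame({"year": ["1999", "2005"], "price": ["9.99", "14.50"]})

# A dictionary maps each column name to its target data type.
df_types = df_types.astype({"year": "int64", "price": "float64"})

print(df_types.dtypes)
```

`astype()` returns a new DataFrame, so the result has to be assigned back to a variable.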
-%% Cell type:markdown id:82ba7b2a tags:
+%% Cell type:markdown id:addda084 tags:
 ### Replacing missing values
 Missing values are represented in *Pandas* as `NaN`. Placeholders that stand for missing values in the raw data (e.g. "unknown", "-", "n.a.") should therefore be replaced with `NaN`.
-%% Cell type:code id:8accf74e tags:
+%% Cell type:code id:34637e3b tags:
 ``` 
 ```
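One possible way to do this is `replace()` together with `numpy.nan`. The data and the list of placeholders below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with three different placeholders for "missing".
df_raw = pd.DataFrame(
    {
        "author": ["A. Writer", "unknown", "-"],
        "pages": ["120", "n.a.", "300"],
    }
)

# Replace every placeholder with NaN in all columns at once.
df_clean = df_raw.replace(["unknown", "-", "n.a."], np.nan)

# Count the missing values per column.
print(df_clean.isna().sum())
```

When reading from a file, the `na_values` parameter of `pd.read_csv` can mark such placeholders as missing during import.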
-%% Cell type:markdown id:6a88e31f tags:
+%% Cell type:markdown id:e77785ca tags:
 ## Useful DataFrame functions
 ### Show the first rows
-%% Cell type:code id:6f0980aa tags:
+%% Cell type:code id:b3b5f090 tags:
 ``` 
 ```
-%% Cell type:markdown id:77fb244e tags:
+%% Cell type:markdown id:7a4ac4a4 tags:
 ### Show the last rows
-%% Cell type:code id:9aaed85c tags:
+%% Cell type:code id:a2e39b7c tags:
 ``` 
 ```
-%% Cell type:markdown id:818060c4 tags:
+%% Cell type:markdown id:1b479d1f tags:
 ### Data types of the DataFrame's columns
-%% Cell type:code id:b702bdb3 tags:
+%% Cell type:code id:0d3ff72a tags:
 ``` 
 ```
-%% Cell type:markdown id:31f5e502 tags:
+%% Cell type:markdown id:878d7811 tags:
 ### Descriptive statistics of the DataFrame
-%% Cell type:code id:0568d2a0 tags:
+%% Cell type:code id:23647869 tags:
 ``` 
 ```
-%% Cell type:markdown id:a20a3e39 tags:
+%% Cell type:markdown id:a12eb0cb tags:
 ### Descriptive statistics per column
-%% Cell type:code id:08339b12 tags:
+%% Cell type:code id:80d5d597 tags:
 ``` 
 ```
-%% Cell type:markdown id:af34f55c tags:
+%% Cell type:markdown id:9c3caccf tags:
 ## Selecting data from DataFrames
 ### Extracting columns
 When only a single column is extracted, the result is a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html?highlight=series#pandas.Series). A Series has some methods of its own and lacks some of the methods a DataFrame provides.
-%% Cell type:code id:05ba9409 tags:
+%% Cell type:code id:7bafdfc1 tags:
 ``` 
 ```
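The difference between a Series and a DataFrame result can be seen in a small sketch. The DataFrame `df_sel` is hypothetical.

```python
import pandas as pd

df_sel = pd.DataFrame({"title": ["Book A", "Book B"], "year": [1999, 2005]})

# Single brackets with one column name return a Series.
one_column = df_sel["year"]

# Double brackets (a list of column names) return a DataFrame,
# even if the list contains only one name.
as_frame = df_sel[["year"]]
two_columns = df_sel[["title", "year"]]

print(type(one_column).__name__, type(as_frame).__name__)
```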
-%% Cell type:markdown id:db70c055 tags:
+%% Cell type:markdown id:97aad3de tags:
 ### Extracting individual cells
 One option is to use the position along each dimension, e.g. "give me the value in the second column and the third row". This works with the `iloc` indexer.
 <div class="alert alert-info">
 <b>Note!</b> As usual in Python, indexing starts at 0. The first row and the first column are therefore each selected with 0, not 1; 1 selects the second column, and so on.
 </div>
-%% Cell type:code id:e82303e6 tags:
+%% Cell type:code id:ade5a46f tags:
 ``` 
 ```
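The request "second column, third row" from the text could look like this with `iloc`. The DataFrame `df_pos` is made up for the example.

```python
import pandas as pd

df_pos = pd.DataFrame(
    {"title": ["Book A", "Book B", "Book C"], "year": [1999, 2005, 2012]}
)

# iloc[row_position, column_position]: positions are zero-based,
# so the third row is 2 and the second column is 1.
value = df_pos.iloc[2, 1]

# The very first cell is at position (0, 0).
first_cell = df_pos.iloc[0, 0]

print(value, first_cell)
```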
-%% Cell type:markdown id:fae19c23 tags:
+%% Cell type:markdown id:ab307fb7 tags:
 Cells can also be extracted by specifying the index and column labels.
-%% Cell type:code id:461d1edc tags:
+%% Cell type:code id:350cae87 tags:
 ``` 
 ```
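Label-based access is usually done with the `loc` indexer. The sketch below assumes a hypothetical DataFrame `df_lab` with the row labels `"a"`, `"b"` and `"c"`.

```python
import pandas as pd

df_lab = pd.DataFrame(
    {"title": ["Book A", "Book B", "Book C"], "year": [1999, 2005, 2012]},
    index=["a", "b", "c"],
)

# loc[row_label, column_label] selects by name instead of by position.
year_b = df_lab.loc["b", "year"]

print(year_b)
```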
-%% Cell type:markdown id:8c61bb24 tags:
+%% Cell type:markdown id:fe5a3611 tags:
 ## Assigning new columns, rows, and values
 ### New column
-%% Cell type:code id:98104be1 tags:
+%% Cell type:code id:2f1eaa39 tags:
 ``` 
 ```
-%% Cell type:markdown id:2b7233a6 tags:
+%% Cell type:markdown id:131b37ea tags:
 ### New row
-%% Cell type:code id:405e7c0f tags:
+%% Cell type:code id:599d6c3b tags:
 ``` 
 ```
-%% Cell type:markdown id:754f0b5d tags:
+%% Cell type:markdown id:dea1aa70 tags:
 ### New values in a cell
-%% Cell type:code id:e4a1bd03 tags:
+%% Cell type:code id:7071f172 tags:
 ``` 
 ```
-%% Cell type:markdown id:3d116064 tags:
+%% Cell type:markdown id:6c319518 tags:
 ## Grouping
 Grouping in *Pandas* is comparable to `GROUP BY` in SQL.
 ![groupby_pandas.png](attachment:groupby_pandas.png)
 Source: https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
-%% Cell type:markdown id:15351777 tags:
+%% Cell type:markdown id:92d1e6be tags:
 Grouping without an aggregate function returns a `DataFrameGroupBy` object (see [DataFrame.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#)), not a DataFrame. Only after the desired aggregate function is called do you get a DataFrame back.
-%% Cell type:code id:9e06a8d3 tags:
+%% Cell type:code id:1c106732 tags:
 ``` 
 ```
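A short sketch of the two steps, grouping first and aggregating afterwards. The DataFrame `df_books` and its columns are invented for the example.

```python
import pandas as pd

df_books = pd.DataFrame(
    {"publisher": ["X", "Y", "X", "Y"], "pages": [100, 200, 300, 400]}
)

# Step 1: grouping alone yields a DataFrameGroupBy object, not a DataFrame.
grouped = df_books.groupby("publisher")
print(type(grouped).__name__)

# Step 2: an aggregate function such as mean() turns it back into a DataFrame
# with one row per group.
mean_pages = grouped.mean()
print(mean_pages)
```

The SQL counterpart of this sketch would be roughly `SELECT publisher, AVG(pages) FROM books GROUP BY publisher`.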
-%% Cell type:markdown id:448eaac0 tags:
+%% Cell type:markdown id:af7210de tags:
 ## Filtering DataFrames by conditions
-%% Cell type:code id:0f14d0bc tags:
+%% Cell type:code id:92438742 tags:
 ``` 
 ```
-%% Cell type:markdown id:c8e086c7 tags:
+%% Cell type:markdown id:256527fc tags:
 ## Combining DataFrames (merge)
 ![pandas_merge-2.png](attachment:pandas_merge-2.png)
 Source: https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
-%% Cell type:code id:a2d50399 tags:
+%% Cell type:code id:182f57ef tags:
 ``` 
 ```
-%% Cell type:markdown id:21137c01 tags:
+%% Cell type:markdown id:d6f9a0dd tags:
 ## Plots
-%% Cell type:code id:23d198b9 tags:
+%% Cell type:code id:709288ea tags:
 ``` 
 ```
-%% Cell type:markdown id:778c5013 tags:
+%% Cell type:markdown id:8b1d4e55 tags:
 ## Pivot
 ![reshaping_pivot.png](attachment:reshaping_pivot.png)
 Source: https://pandas.pydata.org/docs/user_guide/reshaping.html
-%% Cell type:code id:3a76bdb0 tags:
+%% Cell type:code id:0dbc5cee tags:
 ``` 
 ```
-%% Cell type:code id:6b601c97 tags:
+%% Cell type:markdown id:acc6f66c tags:
+## What to do when the function you need is not available in *Pandas*?
+With [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), functions can be applied to entire Series or DataFrames. Custom functions written with `lambda` can be applied as well.
+%% Cell type:code id:c6c2876f tags:
 ``` 
 ```
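A minimal sketch of `apply()` with a named function, a `lambda` on a Series, and a `lambda` applied row by row to a DataFrame. The DataFrame `df_apply`, the helper `page_category` and the threshold of 200 pages are assumptions for illustration.

```python
import pandas as pd

df_apply = pd.DataFrame({"title": ["book a", "book b"], "pages": [120, 300]})


def page_category(pages):
    """Hypothetical helper: label a book by its page count."""
    return "long" if pages > 200 else "short"


# A named function applied to every value of a Series.
df_apply["category"] = df_apply["pages"].apply(page_category)

# A lambda applied to every value of a Series.
df_apply["title_upper"] = df_apply["title"].apply(lambda s: s.upper())

# With axis=1 the lambda receives one row at a time and can combine columns.
df_apply["label"] = df_apply.apply(
    lambda row: f"{row['title_upper']} ({row['pages']} p.)", axis=1
)

print(df_apply)
```

Where a built-in vectorized method exists, for example `df_apply["title"].str.upper()`, it is usually faster than `apply()`.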