added first notebook

Commit bacd4a76 authored 4 years ago by Prof. Dr. Robert Jäschke
--- a/README.org
+++ b/README.org
+* Examples of Jupyter Notebooks
+
+A place to collect Jupyter notebooks.
+
+All notebooks should work without any additional files,
+use standard Python libraries (pandas, scikit-learn, etc.), and
+ideally fetch their data from the web.
--- a/statistics_top50faculty.ipynb
+++ b/statistics_top50faculty.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Basic statistics using the top 50 faculty dataset\n",
+    "\n",
+    "[Dataset of 2200 faculty in 50 top US Computer Science Graduate Programs](http://cs.brown.edu/people/apapouts/faculty_dataset.html)\n",
+    "\n",
+    "## Preprocessing\n",
+    "\n",
+    "Load the data and get an overview:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "df = pd.read_csv('https://raw.githubusercontent.com/brownhci/drafty/master-node/databaits/data/professors.csv')\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Clean the data (safe to re-run):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bad_rows = df['JoinYear'] == 'Full'  # that row contains an error!\n",
+    "df = df[~bad_rows].copy()            # so remove it by value, not by a hard-coded index\n",
+    "\n",
+    "df['JoinYear'] = pd.to_numeric(df['JoinYear'])  # numeric dtype is required for DataFrame.hist()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exploration\n",
+    "\n",
+    "Plot the distributions of some of the columns:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.hist(column='JoinYear')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Rank'].value_counts().plot(kind='bar')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['University'].value_counts().plot(kind='bar', figsize=(15,5))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Gender'].value_counts().plot(kind='bar')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Apparently, the data needs more cleansing ...**"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analysis\n",
+    "\n",
+    "Questions that could be explored:\n",
+    "\n",
+    "- Is the proportion of female staff increasing over time?\n",
+    "- Are higher ranks predominantly occupied by male staff?\n",
+    "- Which universities or fields have an almost equal gender distribution?"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Basic statistics using the top 50 faculty dataset
+
+[Dataset of 2200 faculty in 50 top US Computer Science Graduate Programs](http://cs.brown.edu/people/apapouts/faculty_dataset.html)
+
+## Preprocessing
+
+Load the data and get an overview:
+
+%% Cell type:code id: tags:
+
+``` 
+import pandas as pd
+df = pd.read_csv('https://raw.githubusercontent.com/brownhci/drafty/master-node/databaits/data/professors.csv')
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+Clean the data (safe to re-run):
+
+%% Cell type:code id: tags:
+
+``` 
+bad_rows = df['JoinYear'] == 'Full'  # that row contains an error!
+df = df[~bad_rows].copy()            # so remove it by value, not by a hard-coded index
+
+df['JoinYear'] = pd.to_numeric(df['JoinYear'])  # numeric dtype is required for DataFrame.hist()
+```
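
An alternative sketch for this cleaning step, assuming that any non-numeric `JoinYear` entry should simply be treated as missing: `pd.to_numeric` with `errors='coerce'` turns such entries into `NaN` instead of raising an error. The toy frame below is made up for illustration; only the column name and the stray value `'Full'` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({'JoinYear': ['1998', '2005', 'Full', None]})

# Non-numeric entries become NaN instead of raising a ValueError.
toy['JoinYear'] = pd.to_numeric(toy['JoinYear'], errors='coerce')

# Drop rows without a usable year; running this twice changes nothing.
toy = toy.dropna(subset=['JoinYear'])
print(toy)
```

Unlike the notebook's cell, this variant also drops rows whose year is missing. Leave out the `dropna` call to keep them.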
+
+%% Cell type:markdown id: tags:
+
+## Exploration
+
+Plot the distributions of some of the columns:
+
+%% Cell type:code id: tags:
+
+``` 
+df.hist(column='JoinYear')
+```
+
+%% Cell type:code id: tags:
+
+``` 
+df['Rank'].value_counts().plot(kind='bar')
+```
+
+%% Cell type:code id: tags:
+
+``` 
+df['University'].value_counts().plot(kind='bar', figsize=(15,5))
+```
+
+%% Cell type:code id: tags:
+
+``` 
+df['Gender'].value_counts().plot(kind='bar')
+```
+
+%% Cell type:markdown id: tags:
+
+**Apparently, the data needs more cleansing ...**
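
The notebook does not say what exactly needs cleaning. One common problem in categorical columns is inconsistent labelling, so here is a minimal sketch under that assumption. The labels and the mapping are hypothetical and would have to be adapted to whatever `df['Gender'].value_counts(dropna=False)` actually shows.

```python
import pandas as pd

# Hypothetical labels; the real values in the dataset may differ.
toy = pd.DataFrame({'Gender': ['Male', 'male ', 'Female', 'F', None]})

# Inspect the raw labels first, including missing ones.
print(toy['Gender'].value_counts(dropna=False))

# Normalise whitespace and case, then map known variants to one label.
# The mapping is an assumption; unknown labels end up as NaN.
mapping = {'male': 'Male', 'm': 'Male', 'female': 'Female', 'f': 'Female'}
cleaned = toy['Gender'].str.strip().str.lower().map(mapping)
print(cleaned.value_counts(dropna=False))
```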
+
+%% Cell type:markdown id: tags:
+
+## Analysis
+
+Questions that could be explored:
+
+- Is the proportion of female staff increasing over time?
+- Are higher ranks predominantly occupied by male staff?
+- Which universities or fields have an almost equal gender distribution?
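
The first two questions above could be approached roughly as follows. This is a sketch on made-up data: only the column names `JoinYear`, `Rank`, and `Gender` come from the notebook, while the rank and gender labels are assumptions. It also uses the decade of joining as a proxy for "over time", which is one possible reading of the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    'JoinYear': [1995, 1998, 2004, 2007, 2012, 2015, 2016, 2018],
    'Rank': ['Full', 'Full', 'Full', 'Associate',
             'Associate', 'Assistant', 'Assistant', 'Assistant'],
    'Gender': ['Male', 'Male', 'Female', 'Male',
               'Female', 'Female', 'Male', 'Female'],
})

is_female = toy['Gender'] == 'Female'

# Share of female staff among those who joined in each decade.
decade = (toy['JoinYear'] // 10) * 10
share_by_decade = is_female.groupby(decade).mean()
print(share_by_decade)

# Gender distribution within each rank (each row sums to 1).
rank_by_gender = pd.crosstab(toy['Rank'], toy['Gender'], normalize='index')
print(rank_by_gender)
```

Grouping by `toy['University']` instead of `toy['Rank']` in the `pd.crosstab` call would give a starting point for the third question.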