Commit 058a1237 authored by Prof. Dr. Robert Jäschke's avatar Prof. Dr. Robert Jäschke

moved amazon reviews example to this repo

%% Cell type:markdown id: tags:
# Analyzing Amazon Reviews with Python
Session material: [amazon_reviews.json](amazon_reviews.json)
Michael Paris, Humboldt-Universität zu Berlin
%% Cell type:code id: tags:
```
# settings for tutorial presentation with RISE
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
    'width': '100%',
    'height': '100%',
    'scroll': True,
    'enable_chalkboard': True,
})
```
%% Cell type:markdown id: tags:
## Scrapy
The data presented to you in this course has been collected using [scrapy](https://scrapy.org/).
Scrapy is an open-source and collaborative framework for extracting the data you need from websites.
If you would like to set up scrapy, just execute the following cells for an example!
Pro tip: run all commands in this Jupyter notebook from a dedicated environment (conda, virtualenv, etc.)
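%% Cell type:markdown id: tags:
Setting up such a dedicated environment takes only a couple of terminal commands. A minimal sketch using `venv` (the name `scrapy-env` is arbitrary; on Windows the activation script lives under `scrapy-env\Scripts` instead):
```
# create an isolated Python environment in the folder scrapy-env
python3 -m venv scrapy-env
# activate it (Linux/macOS)
. scrapy-env/bin/activate
```
Once the environment is active, `pip install` calls affect only this environment and leave your system Python untouched.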
%% Cell type:code id: tags:
```
# package setup
!pip install scrapy
```
%% Cell type:code id: tags:
```
# creating a scrapy project
!scrapy startproject tutorial
```
%% Cell type:code id: tags:
```
# keeping tabs on the project directory
import os
pwd = os.getcwd()
```
%% Cell type:code id: tags:
```
# This is the content of your newly created scrapy project
os.listdir("tutorial")
```
%% Cell type:code id: tags:
```
# Now we have to write a spider class in this directory
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# Writing a spider class, which follows a set of rules and collects specific items.
# It is the definition of the crawler:
os.chdir(pwd)
class_description = """
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = {}
        quotes = response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()
        authors = response.xpath('//div[@class="quote"]/span/small[@class="author"]/text()').extract()
        for author, quote in zip(authors, quotes):
            items[author] = quote
        yield items
"""
with open("tutorial/tutorial/spiders/quotes_spider.py", 'w') as file_handle:
    file_handle.write(class_description)
```
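%% Cell type:markdown id: tags:
The XPath expressions inside `parse` select the quote text and the author name from each `div` with class `quote`. The same selection logic can be sketched with the standard library's limited XPath support, using a toy HTML snippet (not a real response from quotes.toscrape.com):
%% Cell type:code id: tags:
```
import xml.etree.ElementTree as ET

# a hypothetical, well-formed miniature of the page structure
snippet = """<html><body>
<div class="quote"><span class="text">Be yourself.</span>
<span><small class="author">Oscar Wilde</small></span></div>
</body></html>"""

root = ET.fromstring(snippet)
# the same //div[@class="quote"]/... selection as in the spider
quotes = [s.text for s in root.findall(".//div[@class='quote']/span[@class='text']")]
authors = [s.text for s in root.findall(".//div[@class='quote']/span/small[@class='author']")]
items = dict(zip(authors, quotes))
print(items)  # {'Oscar Wilde': 'Be yourself.'}
```
Scrapy's selectors support full XPath and handle messy real-world HTML; `xml.etree` only parses well-formed markup, so this is merely an illustration of the path syntax.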
%% Cell type:code id: tags:
```
# You should see a new spider class here
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# All you have to do now is run the spider by calling its name from within the project directory!
os.chdir("tutorial")
# List all available spiders
!scrapy list
# run the 'quotes'-spider
!scrapy crawl quotes -o quotes.json
```
%% Cell type:code id: tags:
```
# check out your newly created data set!
os.chdir(pwd)
os.listdir("tutorial")
```
%% Cell type:markdown id: tags:
## Data description
Now that you know how scrapy works and what it does, you are ready to look at some real world data.
Due to COVID-19 a significant number of people had to recreate their work environment at home.
To that end, these people required things they could not work without - such as *whiteboards*!
Open the data and investigate what type of format is given to you.
%% Cell type:markdown id: tags:
## Question
We will try to answer the question: **Does the size of the whiteboard impact the reviewer's satisfaction?**
%% Cell type:markdown id: tags:
## Loading scrapy data
Scrapy exports the data as a JSON array of objects, which Python reads as a list of dictionaries
(a.k.a. key-value pairs).
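%% Cell type:markdown id: tags:
A minimal sketch of what such an export looks like when parsed (the records here are invented for illustration, not taken from the real review data):
%% Cell type:code id: tags:
```
import json

# two hypothetical records in the shape of scrapy's JSON export
raw = '[{"stars": 5, "size": "60 x 45 cm"}, {"stars": 3, "size": "90 x 60 cm"}]'
records = json.loads(raw)
print(type(records).__name__, type(records[0]).__name__)  # list dict
print(records[0]["size"])  # 60 x 45 cm
```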
%% Cell type:code id: tags:
```
# setting up pandas (json is part of the Python standard library, no install needed)
!pip install pandas
# setting up matplotlib and seaborn for plotting. "Nothin's better than a figure" - anonymous
!pip install matplotlib seaborn
# scikit-learn for the regression later on
!pip install scikit-learn
```
%% Cell type:code id: tags:
```
# this is needed for viewing graphs inside the Jupyter notebook
%matplotlib inline
import seaborn as sns  # seaborn adds nicer default styles to matplotlib
```
%% Cell type:code id: tags:
```
# loading the json-file and converting its content to a list of dictionaries
import json
with open('amazon_reviews.json') as f:
    data = json.load(f)
data[0]
```
%% Cell type:code id: tags:
```
# A list of dictionaries is exactly what pandas can transform into a dataframe
import pandas as pd
df = pd.DataFrame(data)
df
```
%% Cell type:markdown id: tags:
## Processing the data
Cleaning, reducing, and transforming the dataframe into the right information is crucial for extracting knowledge
in the downstream processes.
We are only interested in the star vs. size dependency, i.e. how the rating distribution changes with the size of the board.
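%% Cell type:markdown id: tags:
On a hypothetical miniature of the data, the grouping step we are about to perform looks like this (values invented for illustration):
%% Cell type:code id: tags:
```
import pandas as pd

# three made-up reviews with the same columns as the real data
demo = pd.DataFrame([
    {"stars": 5, "size": "60 x 45 cm"},
    {"stars": 3, "size": "60 x 45 cm"},
    {"stars": 4, "size": "90 x 60 cm"},
])
# mean rating per board size
mean_per_size = demo.groupby("size")["stars"].mean()
print(mean_per_size["60 x 45 cm"])  # 4.0
```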
%% Cell type:code id: tags:
```
# grouping by the size of the whiteboard
stars_size = df[["stars","size"]]
groupedStars = stars_size.groupby("size")
stars = groupedStars.get_group("60 x 45 cm")
stars
```
%% Cell type:markdown id: tags:
## Formatting the data for display and further statistical analysis
%% Cell type:code id: tags:
```
# creating a better structure for statistical applications
import numpy as np
from collections import OrderedDict
review_stars = OrderedDict()
for group, group_df in groupedStars:
    # parse "60 x 45 cm" into width and height and compute the area in cm²
    width = int(group.split(" ")[0])
    height = int(group.split(" ")[2])
    area_of_board = width * height
    review_stars[area_of_board] = np.array(group_df["stars"])
review_stars
```
%% Cell type:code id: tags:
```
# Who is happiest with their whiteboard?
# A size dependency - reduce the data you want to plot:
# (area in m², mean rating, standard deviation, number of reviews)
size_average_rating_eps_txt = np.array([
    (size / 10000, np.average(starList), np.std(starList), len(starList))
    for size, starList in review_stars.items()
]).transpose()
size_average_rating_eps_txt
```
%% Cell type:code id: tags:
```
# perform a linear regression of mean rating vs. board area
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(size_average_rating_eps_txt[0].reshape(-1, 1),
                             size_average_rating_eps_txt[1])
m = reg.coef_[0]
n = reg.intercept_
x_fit = np.linspace(0, 3, num=100)
y_fit = m * x_fit + n
```
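%% Cell type:markdown id: tags:
As a sanity check, the same slope and intercept can also be recovered with `np.polyfit`. A sketch on points from a known line (synthetic data, not the review data):
%% Cell type:code id: tags:
```
import numpy as np

# four synthetic points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # ~2.0 ~1.0
```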
%% Cell type:code id: tags:
```
# plot your result
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)
# regression line
plt.plot(x_fit, y_fit, "k")
# mean rating per board size
plt.plot(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1],
         '.', mfc='red', mec='green', ms=20, mew=4)
# standard deviation as error bars
plt.errorbar(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1],
             size_average_rating_eps_txt[2], fmt='none', ecolor='b')
# annotate each point with the number of reviews it is based on
for x, y, txt in zip(size_average_rating_eps_txt[0],
                     size_average_rating_eps_txt[1],
                     size_average_rating_eps_txt[-1]):
    plt.annotate(str(int(txt)), (x, y),
                 textcoords="offset points", xytext=(-10, 10),
                 ha='center', fontsize=20)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlim(left=0)
ax.yaxis.grid()
ax.set_xlabel("Size of whiteboard in $m^2$")
ax.set_ylabel("Star rating")
```
%% Cell type:markdown id: tags:
# Answer
We observe a downward trend in the regime of small whiteboards.
This is followed by two data points with a significant deviation from one another and from the overall mean.
The largest board size displays a positive reviewer satisfaction.
%% Cell type:markdown id: tags:
# Questions:
## How does the satisfaction impact the perception of secondary reviewers regarding the helpfulness of the review?
## Can you recognize any spikes in the review posting over time? Do these correlate with any particular event?