Passlida Saila / Notebooks · Commit 058a1237

Commit 058a1237, authored 4 years ago by Prof. Dr. Robert Jäschke

    moved amazon reviews example to this repo

parent 9f32176d

Showing 2 changed files with 644 additions and 0 deletions:

amazon_reviews.ipynb (+401 −0)
data/amazon_reviews.json (+243 −0)

amazon_reviews.ipynb · 0 → 100644 (+401 −0)
%% Cell type:markdown id: tags:
# Analyzing Amazon Reviews with Python
Session material: [amazon_reviews.json](amazon_reviews.json)

Michael Paris, Humboldt Universität Berlin
%% Cell type:code id: tags:
```
# settings for tutorial presentation with RISE
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
'width': '100%',
'height': '100%',
'scroll': True,
'enable_chalkboard': True,
})
```
%% Cell type:markdown id: tags:
## Scrapy
The data presented to you in this course has been collected using [scrapy](https://scrapy.org/).
Scrapy is an open-source and collaborative framework for extracting the data you need from websites.

If you would like to set up Scrapy, just execute the following cells for an example!

Pro tip: run all commands in this Jupyter notebook from a dedicated environment (conda, virtualenv, etc.); a minimal sketch follows.
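%% Cell type:markdown id: tags:
A minimal sketch of such a dedicated environment, assuming the built-in `venv` module is available (the environment name `scraping-env` is just an example):
%% Cell type:code id: tags:
```
# create an isolated environment and install scrapy into it
!python3 -m venv scraping-env
!scraping-env/bin/pip install scrapy
# note: the notebook kernel itself keeps using its own environment;
# to run everything inside scraping-env, start Jupyter from that environment
```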
%% Cell type:code id: tags:
```
# package setup
!pip install scrapy
```
%% Cell type:code id: tags:
```
# creating a scrapy project
!scrapy startproject tutorial
```
%% Cell type:code id: tags:
```
# keeping tabs on the project directory
import os
pwd = os.getcwd()
```
%% Cell type:code id: tags:
```
# This is the content of your newly created scrapy project
os.listdir("tutorial")
```
%% Cell type:code id: tags:
```
# Now we have to write a spider class in this directory
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# Writing a spider class, which will follow a set of rules and collect specific items.
# It is the definition of the crawler:
os.chdir(pwd)
class_description = """
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # one dictionary per page, mapping author -> quote
        # (note: a repeated author on the same page would overwrite the earlier quote)
        items = {}
        quotes = response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()
        authors = response.xpath('//div[@class="quote"]/span/small[@class="author"]/text()').extract()
        for author, quote in zip(authors, quotes):
            items[author] = quote
        yield items
"""
with open("tutorial/tutorial/spiders/quotes_spider.py", 'w') as file_handle:
    file_handle.write(class_description)
```
%% Cell type:code id: tags:
```
# You should see a new spider class here
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# All you have to do now is run the spider by calling its name from within the project directory!
os.chdir("tutorial")
# List all available spiders
!scrapy list
# run the 'quotes' spider and export all collected items to quotes.json
!scrapy crawl quotes -o quotes.json
```
%% Cell type:code id: tags:
```
# check out your newly created data set!
os.chdir(pwd)
os.listdir("tutorial")
```
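%% Cell type:markdown id: tags:
A quick way to peek at the scraped items, assuming the crawl above succeeded (`scrapy crawl ... -o quotes.json` writes a single JSON array):
%% Cell type:code id: tags:
```
import json

# load the exported items and look at the first one
with open("tutorial/quotes.json") as fh:
    quotes = json.load(fh)
quotes[0]
```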
%% Cell type:markdown id: tags:
## Data description
Now that you know how scrapy works and what it does, you are ready to look at some real-world data.

Due to COVID-19, a significant number of people had to recreate their work environment at home.
To that end, these people required things they could not work without - such as *whiteboards*!

Open the data and investigate what type of format is given to you.
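%% Cell type:markdown id: tags:
One way to investigate the format before parsing anything is to look at the raw text, a sketch assuming the file sits next to this notebook as amazon_reviews.json:
%% Cell type:code id: tags:
```
# print the first few hundred characters of the raw file to see its structure
with open("amazon_reviews.json") as fh:
    print(fh.read(500))
```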
%% Cell type:markdown id: tags:
## Question
We will try to answer the question:
**Does the size of the whiteboard impact the reviewer's satisfaction?**
%% Cell type:markdown id: tags:
## Loading scrapy data
Scrapy provides the data as a list of dictionaries - the Python equivalent of a list of JSON objects
(i.e. collections of key-value pairs).
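%% Cell type:markdown id: tags:
For illustration, a single scraped review might look like the dictionary below (the values are made up; only the `stars` and `size` fields are used later):
%% Cell type:code id: tags:
```
# illustrative shape of one item - not real data
example_review = {"stars": 5, "size": "60 x 45 cm"}
```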
%% Cell type:code id: tags:
```
# setting up pandas
!pip install pandas
# json is part of the Python standard library - no install needed
# setting up matplotlib for plotting. "Nothin's better than a figure" - anonymous
!pip install matplotlib
```
%% Cell type:code id: tags:
```
# this is needed for viewing graphs inside the Jupyter notebook
%matplotlib inline
import seaborn as sns  # optional: imported here but not otherwise used below
```
%% Cell type:code id: tags:
```
# load the JSON file and convert it to a list of dictionaries
import json

with open('amazon_reviews.json') as f:
    data = json.load(f)

# inspect the first item
data[0]
```
%% Cell type:code id: tags:
```
# A list of dictionaries is exactly what pandas can transform into a dataframe
import pandas as pd
df = pd.DataFrame(data)
df
```
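%% Cell type:markdown id: tags:
Before processing, it is worth checking which columns and dtypes the frame actually has:
%% Cell type:code id: tags:
```
# column names, dtypes, and non-null counts
df.info()
```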
%% Cell type:markdown id: tags:
## Processing the data

Cleaning, reducing, and transforming the dataframe into the right representation is crucial for extracting knowledge
in the downstream processes.

Here we are only interested in the dependency of the star rating on the board size, i.e. how the rating distribution changes for each size.
%% Cell type:code id: tags:
```
# select the rating and size columns and group by the size of the whiteboard
stars_size = df[["stars", "size"]]
groupedStars = stars_size.groupby("size")
# example: all ratings for one particular size
stars = groupedStars.get_group("60 x 45 cm")
stars
```
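%% Cell type:markdown id: tags:
A quick sanity check before aggregating - how many reviews fall into each size group:
%% Cell type:code id: tags:
```
# number of reviews per size group
groupedStars.size()
```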
%% Cell type:markdown id: tags:
## Formatting the data for display and further statistical analysis
%% Cell type:code id: tags:
```
# creating a better structure for statistical analysis:
# map each board area (in cm^2) to the array of its star ratings
import numpy as np
from collections import OrderedDict

review_stars = OrderedDict()
for group, group_df in groupedStars:   # avoid shadowing the outer df
    width = int(group.split(" ")[0])   # "60 x 45 cm" -> 60
    height = int(group.split(" ")[2])  # "60 x 45 cm" -> 45
    area_of_board = width * height     # area in cm^2
    review_stars[area_of_board] = np.array(group_df["stars"])
review_stars
```
%% Cell type:code id: tags:
```
# Who is happiest with their whiteboard?
# reduce each size group to (area in m^2, mean rating, standard deviation, number of reviews)
size_average_rating_eps_txt = np.array([
    (size / 10000, np.average(starList), np.std(starList), len(starList))
    for size, starList in review_stars.items()]).transpose()
size_average_rating_eps_txt
```
%% Cell type:code id: tags:
```
# perform a linear regression of mean rating on board area
from sklearn.linear_model import LinearRegression  # requires scikit-learn (!pip install scikit-learn)

reg = LinearRegression().fit(size_average_rating_eps_txt[0].reshape(-1, 1),
                             size_average_rating_eps_txt[1].reshape(-1, 1))
m = reg.coef_       # slope
n = reg.intercept_  # intercept
x_fit = np.linspace(0, 3, num=100)
y_fit = np.array(x_fit * m + n).reshape(-1, 1)
```
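%% Cell type:markdown id: tags:
To judge how much of the variation the fitted line actually explains, one can compute R² on the same aggregated points (a rough check, since each point summarizes a different number of reviews):
%% Cell type:code id: tags:
```
# coefficient of determination of the fit
r2 = reg.score(size_average_rating_eps_txt[0].reshape(-1, 1),
               size_average_rating_eps_txt[1].reshape(-1, 1))
print(f"slope = {m.item():.3f} stars per m^2, R^2 = {r2:.3f}")
```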
%% Cell type:code id: tags:
```
# plot the result: mean rating vs. board area, with error bars and review counts
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)

# regression line
plt.plot(x_fit, y_fit, "k")

# mean rating per board size
plt.plot(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1], '.',
         mfc='red', mec='green', ms=20, mew=4)

# standard deviation as error bars
plt.errorbar(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1],
             size_average_rating_eps_txt[2], fmt='none', ecolor='b')

# annotate each point with its number of reviews
for x, y, txt in zip(size_average_rating_eps_txt[0],
                     size_average_rating_eps_txt[1],
                     size_average_rating_eps_txt[-1]):
    plt.annotate(str(int(txt)), (x, y),
                 textcoords="offset points", xytext=(-10, 10),
                 ha='center', fontsize=20)

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlim(left=0)
ax.yaxis.grid()
ax.set_xlabel("Size of whiteboard in $m^2$")
ax.set_ylabel("Star rating")
```
%% Cell type:markdown id: tags:
# Answer
We observe a downward trend in the regime of small whiteboards.
This is followed by two data points with a significant deviation from one another and from the overall mean.
The largest board displays high reviewer satisfaction.
%% Cell type:markdown id: tags:
# Questions:
## How does the satisfaction impact the perception of secondary reviewers regarding the helpfulness of the review?

## Can you recognize any spikes in the review postings over time? Do these correlate with any particular event?
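%% Cell type:markdown id: tags:
A possible starting point for both questions, assuming the scraped reviews also carry a helpful-votes field and a posting date - the column names `helpful` and `date` below are hypothetical, so check `df.columns` for the real ones:
%% Cell type:code id: tags:
```
# hypothetical sketch - adapt the column names to the actual data
if "helpful" in df.columns:
    # average helpful votes per star rating
    print(df.groupby("stars")["helpful"].mean())
if "date" in df.columns:
    # number of reviews posted per week
    df["date"] = pd.to_datetime(df["date"])
    df.set_index("date").resample("W").size().plot(title="Reviews per week")
```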
data/amazon_reviews.json · 0 → 100644 (+243 −0, diff collapsed)