Commit 058a1237 authored by Prof. Dr. Robert Jäschke's avatar Prof. Dr. Robert Jäschke

moved amazon reviews example to this repo

%% Cell type:markdown id: tags:
# Analyzing Amazon Reviews with Python
Session material: [amazon_reviews.json](amazon_reviews.json)
Michael Paris, Humboldt-Universität zu Berlin
%% Cell type:code id: tags:
```
# settings for tutorial presentation with RISE
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
    'width': '100%',
    'height': '100%',
    'scroll': True,
    'enable_chalkboard': True,
})
```
%% Cell type:markdown id: tags:
## Scrapy
The data presented to you in this course has been collected using [scrapy](https://scrapy.org/).
Scrapy is an open-source and collaborative framework for extracting the data you need from websites.
If you would like to set up scrapy, just execute the following cells for an example!
Pro tip: run all commands in this Jupyter notebook from a dedicated environment (conda, virtualenv, etc.)
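%% Cell type:markdown id: tags:
Setting up such a dedicated environment takes only a couple of terminal commands. A minimal sketch using `venv` (the name `scrapy-env` is arbitrary; on Windows the activation script lives under `scrapy-env\Scripts` instead):
```
# create an isolated Python environment in the folder scrapy-env
python3 -m venv scrapy-env
# activate it (Linux/macOS)
. scrapy-env/bin/activate
```
Once the environment is active, `pip install` calls affect only this environment and leave your system Python untouched.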
%% Cell type:code id: tags:
```
# package setup
!pip install scrapy
```
%% Cell type:code id: tags:
```
# creating a scrapy project
!scrapy startproject tutorial
```
%% Cell type:code id: tags:
```
# keeping tabs on the project directory
import os
pwd = os.getcwd()
```
%% Cell type:code id: tags:
```
# This is the content of your newly created scrapy project
os.listdir("tutorial")
```
%% Cell type:code id: tags:
```
# Now we have to write a spider class in this directory
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# Writing a spider class, which follows a set of rules and collects specific items.
# It is the definition of the crawler:
os.chdir(pwd)
class_description = """
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = {}
        quotes = response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()
        authors = response.xpath('//div[@class="quote"]/span/small[@class="author"]/text()').extract()
        for author, quote in zip(authors, quotes):
            items[author] = quote
        yield items
"""
with open("tutorial/tutorial/spiders/quotes_spider.py", 'w') as file_handle:
    file_handle.write(class_description)
```
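%% Cell type:markdown id: tags:
The XPath expressions inside `parse` select the quote text and the author name from each `div` with class `quote`. The same selection logic can be sketched with the standard library's limited XPath support, using a toy HTML snippet (not a real response from quotes.toscrape.com):
%% Cell type:code id: tags:
```
import xml.etree.ElementTree as ET

# a hypothetical, well-formed miniature of the page structure
snippet = """<html><body>
<div class="quote"><span class="text">Be yourself.</span>
<span><small class="author">Oscar Wilde</small></span></div>
</body></html>"""

root = ET.fromstring(snippet)
# the same //div[@class="quote"]/... selection as in the spider
quotes = [s.text for s in root.findall(".//div[@class='quote']/span[@class='text']")]
authors = [s.text for s in root.findall(".//div[@class='quote']/span/small[@class='author']")]
items = dict(zip(authors, quotes))
print(items)  # {'Oscar Wilde': 'Be yourself.'}
```
Scrapy's selectors support full XPath and handle messy real-world HTML; `xml.etree` only parses well-formed markup, so this is merely an illustration of the path syntax.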
%% Cell type:code id: tags:
```
# You should see a new spider class here
os.listdir("tutorial/tutorial/spiders")
```
%% Cell type:code id: tags:
```
# All you have to do now is run the spider by calling its name from within the project directory!
os.chdir("tutorial")
# List all available spiders
!scrapy list
# run the 'quotes'-spider
!scrapy crawl quotes -o quotes.json
```
%% Cell type:code id: tags:
```
# check out your newly created data set!
os.chdir(pwd)
os.listdir("tutorial")
```
%% Cell type:markdown id: tags:
## Data description
Now that you know how scrapy works and what it does, you are ready to look at some real world data.
Due to COVID-19 a significant number of people had to recreate their work environment at home.
To that end, these people required things they could not work without - such as *whiteboards*!
Open the data and investigate what type of format is given to you.
%% Cell type:markdown id: tags:
## Question
We will try to answer the question: **Does the size of the whiteboard impact the reviewer's satisfaction?**
%% Cell type:markdown id: tags:
## Loading scrapy data
Scrapy exports the data as a JSON array of objects, which Python reads as a list of dictionaries
(a.k.a. key-value pairs).
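%% Cell type:markdown id: tags:
A minimal sketch of what such an export looks like when parsed (the records here are invented for illustration, not taken from the real review data):
%% Cell type:code id: tags:
```
import json

# two hypothetical records in the shape of scrapy's JSON export
raw = '[{"stars": 5, "size": "60 x 45 cm"}, {"stars": 3, "size": "90 x 60 cm"}]'
records = json.loads(raw)
print(type(records).__name__, type(records[0]).__name__)  # list dict
print(records[0]["size"])  # 60 x 45 cm
```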
%% Cell type:code id: tags:
```
# setting up pandas (json is part of the Python standard library, no install needed)
!pip install pandas
# setting up matplotlib and seaborn for plotting. "Nothin's better than a figure" - anonymous
!pip install matplotlib seaborn
# scikit-learn for the regression later on
!pip install scikit-learn
```
%% Cell type:code id: tags:
```
# this is needed for viewing graphs inside the Jupyter notebook
%matplotlib inline
import seaborn as sns  # seaborn adds nicer default styles to matplotlib
```
%% Cell type:code id: tags:
```
# loading the json-file and converting its content to a list of dictionaries
import json
with open('amazon_reviews.json') as f:
    data = json.load(f)
data[0]
```
%% Cell type:code id: tags:
```
# A list of dictionaries is exactly what pandas can transform into a dataframe
import pandas as pd
df = pd.DataFrame(data)
df
```
%% Cell type:markdown id: tags:
## Processing the data
Cleaning, reducing, and transforming the dataframe into the right information is crucial for extracting knowledge
in the downstream processes.
We are only interested in the star vs. size dependency, i.e. how the rating distribution changes with the size of the board.
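%% Cell type:markdown id: tags:
On a hypothetical miniature of the data, the grouping step we are about to perform looks like this (values invented for illustration):
%% Cell type:code id: tags:
```
import pandas as pd

# three made-up reviews with the same columns as the real data
demo = pd.DataFrame([
    {"stars": 5, "size": "60 x 45 cm"},
    {"stars": 3, "size": "60 x 45 cm"},
    {"stars": 4, "size": "90 x 60 cm"},
])
# mean rating per board size
mean_per_size = demo.groupby("size")["stars"].mean()
print(mean_per_size["60 x 45 cm"])  # 4.0
```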
%% Cell type:code id: tags:
```
# grouping by the size of the whiteboard
stars_size = df[["stars","size"]]
groupedStars = stars_size.groupby("size")
stars = groupedStars.get_group("60 x 45 cm")
stars
```
%% Cell type:markdown id: tags:
## Formatting the data for display and further statistical analysis
%% Cell type:code id: tags:
```
# creating a better structure for statistical applications
import numpy as np
from collections import OrderedDict
review_stars = OrderedDict()
for group, group_df in groupedStars:
    # parse "60 x 45 cm" into width and height and compute the area in cm²
    width = int(group.split(" ")[0])
    height = int(group.split(" ")[2])
    area_of_board = width * height
    review_stars[area_of_board] = np.array(group_df["stars"])
review_stars
```
%% Cell type:code id: tags:
```
# Who is happiest with their whiteboard?
# A size dependency - reduce the data you want to plot:
# (area in m², mean rating, standard deviation, number of reviews)
size_average_rating_eps_txt = np.array([
    (size / 10000, np.average(starList), np.std(starList), len(starList))
    for size, starList in review_stars.items()
]).transpose()
size_average_rating_eps_txt
```
%% Cell type:code id: tags:
```
# perform a linear regression of mean rating vs. board area
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(size_average_rating_eps_txt[0].reshape(-1, 1),
                             size_average_rating_eps_txt[1])
m = reg.coef_[0]
n = reg.intercept_
x_fit = np.linspace(0, 3, num=100)
y_fit = m * x_fit + n
```
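%% Cell type:markdown id: tags:
As a sanity check, the same slope and intercept can also be recovered with `np.polyfit`. A sketch on points from a known line (synthetic data, not the review data):
%% Cell type:code id: tags:
```
import numpy as np

# four synthetic points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # ~2.0 ~1.0
```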
%% Cell type:code id: tags:
```
# plot your result
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)
# regression line
plt.plot(x_fit, y_fit, "k")
# mean rating per board size
plt.plot(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1],
         '.', mfc='red', mec='green', ms=20, mew=4)
# standard deviation as error bars
plt.errorbar(size_average_rating_eps_txt[0], size_average_rating_eps_txt[1],
             size_average_rating_eps_txt[2], fmt='none', ecolor='b')
# annotate each point with the number of reviews it is based on
for x, y, txt in zip(size_average_rating_eps_txt[0],
                     size_average_rating_eps_txt[1],
                     size_average_rating_eps_txt[-1]):
    plt.annotate(str(int(txt)), (x, y),
                 textcoords="offset points", xytext=(-10, 10),
                 ha='center', fontsize=20)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlim(left=0)
ax.yaxis.grid()
ax.set_xlabel("Size of whiteboard in $m^2$")
ax.set_ylabel("Star rating")
```
%% Cell type:markdown id: tags:
# Answer
We observe a downward trend in the regime of small whiteboards.
This is followed by two data points with a significant deviation from one another and from the overall mean.
The largest board size displays a positive reviewer satisfaction.
%% Cell type:markdown id: tags:
# Questions:
## How does the satisfaction impact the perception of secondary reviewers regarding the helpfulness of the review?
## Can you recognize any spikes in the review posting over time? Do these correlate with any particular event?