From 36d8727923e90e16bc73210f7d4088301a2a8bbe Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Robert=20J=C3=A4schke?= <jaeschke@l3s.de>
Date: Tue, 21 Feb 2023 17:03:23 +0100
Subject: [PATCH] added another mp2 notebook

---
 README.org                               |    2 +
 notebooks/World_Risk_and_Happiness.ipynb | 2274 ++++++++++++++++++++++
 2 files changed, 2276 insertions(+)
 create mode 100644 notebooks/World_Risk_and_Happiness.ipynb

diff --git a/README.org b/README.org
index fcd938f..e6754f2 100644
--- a/README.org
+++ b/README.org
@@ -68,3 +68,5 @@ Exemplary (and excellent) term papers from students of our module:
   Raoul Weber
 - [[file:notebooks/Weinbewertungen_Vivino.ipynb][Weinbewertungen Vivino]] :: /Untersuchung von Weinbewertungen des
   Online-Weinmarktplatzes Vivino/ by Heike Wilhelm
+- [[file:notebooks/World_Risk_and_Happiness.ipynb][World Risk and Happiness]] :: /World Risk Poll 2021 and World
+  Happiness Report 2021/ by Helene Hellmich
diff --git a/notebooks/World_Risk_and_Happiness.ipynb b/notebooks/World_Risk_and_Happiness.ipynb
new file mode 100644
index 0000000..0bad1d2
--- /dev/null
+++ b/notebooks/World_Risk_and_Happiness.ipynb
@@ -0,0 +1,2274 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e45f35b4",
+   "metadata": {},
+   "source": [
+    "<center>Institut für Bibliotheks- und Informationswissenschaft, Humboldt-Universität zu Berlin</center>\n",
+    "<h1 align=\"center\">Modul Datenanalyse & -auswertung: World Risk Poll 2021 and World Happiness Report 2021</h1>\n",
+    "<h2 align=\"center\">Helene Hellmich</h2>\n",
+    "<h2 align=\"center\">2022</h2>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "25bb6246",
+   "metadata": {},
+   "source": [
+    "## Table of Contents\n",
+    "1. [Introduction](#1Einleitung)\n",
+    "2. [Description of the components of the data set](#2Beschreibung)   \n",
+    " 2.1. [The World Risk Poll 2021](#21WorldRiskPoll)    \n",
+    " 2.2. [Attributes of the data set](#22BeschreibungAttribute)\n",
+    "3. [Descriptive analysis of the data set](#3Analyse)  \n",
+    " 3.1. [Analysis of the attributes with nominal scale type](#31NominalAttribute)<br>\n",
+    " 3.2. [Analysis of the attributes with ordinal scale type](#32OrdinalAttribute) <br>\n",
+    " 3.3. [Analysis of the attributes with numerical scale type](#33NumericalAttribute) <br>\n",
+    "4. [Relationships between *WorriedIndex* and other variables](#4Beziehungen)\n",
+    "5. [Inductive analysis](#5InductiveAnalyse)\n",
+    "6. [Worry and Happiness](#6Happiness)\n",
+    "7. [Discussion](#7Diskussion)\n",
+    "8. [References](#8Literatur)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "21b79597",
+   "metadata": {},
+   "source": [
+    "## 1. Introduction <a class=\"anchor\" id=\"1Einleitung\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "132500c0",
+   "metadata": {},
+   "source": [
+    "The colloquial expression \"don't worry, be happy\" implies a connection between worry and happiness. This proverbial truth will be tested using data from the \"World Risk Poll 2021\" and the \"World Happiness Report 2021\". The main focus is on parts of the \"World Risk Poll 2021\", which provides a worry index. The *World Worry Index* indicates how worried respondents are about risks or potential harm and can be used to rank countries by worry. The following analysis takes a closer look at the *World Worry Index* of the \"World Risk Poll 2021\" to determine whether the worry index is related to respondents' answers or demographic factors. As the worry index yields a ranking of countries, special attention is paid to the global region the respondents come from. Besides the demographic information and the questions used here, the \"World Risk Poll 2021\" includes many more questions that inform the *World Worry Index*, but including those is beyond the scope of this analysis. Instead, the ranking of each country obtained from the *World Worry Index* will be compared to its ranking in the \"World Happiness Report\", which is derived from the happiness *ladder score*. The relationship between *ladder score* and *World Worry Index* will also be explored. Like the *World Worry Index*, the *ladder score* is calculated from various sources of information, and analysing those is likewise beyond the scope of this analysis.\n",
+    "\n",
+    "In the following, the \"World Risk Poll 2021\" is described in detail and the \"World Happiness Report 2021\" briefly. Several attributes of the \"World Risk Poll 2021\" are described more closely, and relationships between the *World Worry Index* and other attributes are explored. Additionally, the relationship between the *World Worry Index* and the happiness *ladder score* as well as the country rankings of worry and happiness are examined. Finally, a short discussion summarises the findings and explores limitations. \n",
+    "\n",
+    "Understanding the underlying demographic factors of worry and different factors that cause worry can help to highlight the need for social sustainability and hopefully increase awareness of how worry is distributed around the world. Considering appropriate changes in behaviours and policies could lead to increased happiness so that we can continue to say \"don't worry, be happy\". "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "513deb70",
+   "metadata": {},
+   "source": [
+    "## 2. Description of the components of the data set <a class=\"anchor\" id=\"2Beschreibung\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13f8c672",
+   "metadata": {},
+   "source": [
+    "### 2.1 The World Risk Poll 2021<a class=\"anchor\" id=\"21WorldRiskPoll\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ccb797aa",
+   "metadata": {},
+   "source": [
+    "The Lloyd's Register Foundation is one of the organisations that has tried to assess risk on a global scale by conducting a \"World Risk Poll\" in 2019 as well as in 2021. The \"World Risk Poll\" of the Lloyd's Register Foundation explores what risks people around the world experience and how they perceive these risks. Understanding risk is described as a first step to improve safety in the world. The 2019 \"World Risk Poll\" was the first \"World Risk Poll\" conducted at this scale and was followed by a similar questionnaire in 2021. The research is to be repeated at least two more times in the future. \n",
+    "\n",
+    "The World Risk Poll 2021 was published by the Lloyd’s Register Foundation and was conducted by Gallup in 121 countries, resulting in 125,000 interviews. Due to the pandemic, these interviews were mostly conducted by telephone. The results of the 2021 poll and their documentation (data dictionary) are available at <https://wrp.lrfoundation.org.uk/data-resources/> as a .csv or .sav file. Where applicable, the results from 2019 are also included in this data set. The data of the \"World Risk Poll 2019\", compiled by the Lloyd's Register Foundation, was collected mostly in face-to-face interviews with 150,000 people in 142 countries, including regions of the world that had not been surveyed on the topic of risk before. A spreadsheet of all responses to the 2019 questionnaire is available at <https://wrp.lrfoundation.org.uk/data-resources/>. Because of the different set-up (face-to-face interviews in 2019, telephone interviews in 2021) and some changes in the questions asked in each round, the data collected can vary from one year to the next. \n",
+    "\n",
+    "The compilation of data from 2019 and 2021 in the data set \"World Risk Poll 2021\" provides the basis for three indexes: the *World Worry Index*, the *Experience of Harm Index* and the *Resilience Index*. The Resilience Index was published only recently, in September 2022. A separate document that summarises the three indexes per country can be found at <https://wrp.lrfoundation.org.uk/2021-risk-indexes/> and is provided in .csv format. It lists the three indexes by country and includes 144 countries (those surveyed in 2019, in 2021, or in both years).\n",
+    "\n",
+    "The \"World Happiness Report 2021\" mostly uses data from the Gallup World Poll as the basis for its happiness score and the resulting ranking of countries by happiness. In 2021, some of the data from the Lloyd’s Register Foundation's \"World Risk Poll\" was also used. The \"World Happiness Report 2021\" was published by the Sustainable Development Solutions Network. Information about the World Happiness Report 2021 can be found at https://worldhappiness.report/ed/2021/ and the data can be downloaded from Kaggle: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021.\n",
+    "\n",
+    "To describe the data set \"World Risk Poll 2021\" further, it is first loaded into the Jupyter notebook from the SPSS file. The data set is then cleaned by dropping all columns that are not needed for this analysis and by renaming some columns to prepare them for further use. The data available from the Lloyd's Register Foundation has already been cleaned, and missing values have been replaced by NaN (not a number). To simplify the data set further, the following analysis only includes respondents who answered the questionnaire in 2021; all rows referring to the year 2019 are therefore excluded. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8c5b65bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# import code from thinkstats2\n",
+    "from os.path import basename, exists\n",
+    "\n",
+    "def download(url):\n",
+    "    filename = basename(url)\n",
+    "    if not exists(filename):\n",
+    "        from urllib.request import urlretrieve\n",
+    "\n",
+    "        local, _ = urlretrieve(url, filename)\n",
+    "        print(f'Downloaded {local}')\n",
+    "\n",
+    "\n",
+    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n",
+    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")\n",
+    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/homeworks/utils.py\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "368a2147",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# import packages\n",
+    "%matplotlib inline\n",
+    "\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "sns.set(style='white')  # sets the style of all diagrams plotted with matplotlib and seaborn\n",
+    "\n",
+    "import utils\n",
+    "from utils import decorate\n",
+    "from thinkstats2 import Pmf, Cdf, Hist"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e3babbb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import thinkstats2\n",
+    "import thinkplot"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "83128f92",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas \n",
+    "pandas.__version__"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0bc7dc50",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# read the data file\n",
+    "def read_data(file_path):\n",
+    "    \"\"\"Reads data from the given file path.\n",
+    "    \"\"\"\n",
+    "    return pd.read_spss(file_path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8cb13767",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install pyreadstat"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "82c5f272",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#load the data file\n",
+    "data_all = read_data(\"lrf_wrp_2021_resi_data_trimmed.sav\")\n",
+    "print(data_all.shape)\n",
+    "data_all.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "934e5a55",
+   "metadata": {},
+   "source": [
+    "**The original data set**\n",
+    "\n",
+    "The data set of the \"World Risk Poll 2021\" contains 280106 rows, which corresponds to the number of respondents interviewed (in 2021 as well as in 2019). The 208 columns contain, for example, demographic information about each respondent, the *Worried Index*, *Experienced Index* and *Resilience Index*, and the respondents' answers to the questions about risk. In the following, some of the attributes are described in more detail. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "95bb7636",
+   "metadata": {},
+   "source": [
+    "### 2.2 Attributes of the data set<a class=\"anchor\" id=\"22BeschreibungAttribute\"></a>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "05cae10a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# list the attributes (the output is truncated, so not all 208 are displayed)\n",
+    "print(data_all.keys())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ffe45cbf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# displays all 208 attributes\n",
+    "print(data_all.columns.tolist())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7e9b3130",
+   "metadata": {},
+   "source": [
+    "**Attributes of the data set**\n",
+    "\n",
+    "From the documentation of the data set provided by the Lloyd’s Register Foundation we know that most of the 208 attributes refer to world or continent regions and are not required for this analysis. These attributes, as well as other columns that are not used further here, are dropped in the following table. For more information about these attributes, see the documentation (data dictionary). To make the naming more consistent, some of the column names are changed. From now on, the new names are used to refer to the attributes. \n",
+    "\n",
+    "**Age & Year**\n",
+    "\n",
+    "The variable 'Age' has to be converted to a numeric data type (see code below). In the process, the answers of the 998 respondents who chose (Refused) and did not state their age are converted to NaN. The same is true for the 93 respondents who were 99 years old or older (as the value 99+ is read as a string and cannot be converted to a number). As there were so few respondents aged 99 or over and their exact ages are not known, their ages remain NaN, like those of the respondents who did not state their age, so both groups are excluded from all age-related analyses below.\n",
+    "Further, only respondents from 2021 will be included in the analysis (see code below)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fc736a6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The variable \"Age\" is stored with the wrong data type, so it is converted to a numeric type below.\n",
+    "data_all['Age'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2417452",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_all['Age'].value_counts().sort_index()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "563022bc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# changing 'Age' to numeric value - (Refused) and 99+ will be turned into NaN values as they contain strings\n",
+    "data_all['Age'] = data_all['Age'].apply(pd.to_numeric, errors='coerce')\n",
+    "print(data_all['Age'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9791c9a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#counting number of rows with NaN in the \"Age\" column\n",
+    "data_all['Age'].isna().sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4936cf34",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# convert \"Age\" to a float\n",
+    "data_all['Age'] = data_all['Age'].astype(float)\n",
+    "\n",
+    "data_all['Age'].dtypes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e949128b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create new dataframe \"data\" with only the selected rows (respondents from 2021)\n",
+    "data = data_all[data_all[\"Year\"] == 2021].copy()  # copy so later changes do not affect data_all"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9fe18af4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# range of ages is now 15-98, other values are NaN and therefore excluded\n",
+    "data['Age'].value_counts().sort_index()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b309273a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# selecting columns\n",
+    "data = data[['Country', 'GlobalRegion', 'Age', 'Gender', 'Education', 'IncomeFeelings', 'EMP_2010', 'INCOME_5', 'Q2_1', \"Q4A\", \"Q4B\", \"Q4C\", \"Q4D\", \"Q4E\", \"Q4F\", \"Q4G\", 'Q6', 'Q10Q11Recode', 'resilience_index', 'Worried.Index', 'Experienced.Index']]\n",
+    "print(data.shape)\n",
+    "#data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "703cb803",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# find out data types - relevant later for correlation \n",
+    "data.dtypes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "849bd81b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(data.keys()) #or data.columns would do the same"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f3af9604",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# rename some columns\n",
+    "data = data.rename(columns={'INCOME_5': 'Income5', 'EMP_2010': 'Employment', 'Q2_1': 'GreatestRisk', 'Q6': 'InternetUse', 'Q10Q11Recode': 'IncomeLoss', 'resilience_index': 'ResilienceIndex', 'Worried.Index': 'WorriedIndex', 'Experienced.Index': 'ExperiencedIndex'})\n",
+    "print(data.shape)\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eec295c5",
+   "metadata": {},
+   "source": [
+    "**Final data set description & attributes**\n",
+    "\n",
+    "The data set now contains 125911 rows, representing the respondents of the questionnaire in 2021, and 21 columns, one for each attribute. As the respondents had the option not to reply to questions (Refused) or to choose DK (Don't know) as an answer, the number of respondents for each description or diagram may vary. The data set may also include NaN values in any row or column, which likewise changes the number of respondents used for analysis. Each of the attributes is described further in the data dictionary provided by the Lloyd’s Register Foundation with the download of the data set. The list below provides an overview of the 21 attributes. If the original name of an attribute has been changed, it is given in brackets.\n",
+    "\n",
+    "**Demographic Information**<br>\n",
+    "* **Country**: string variable with the name of each country. The poll includes 144 different countries, but only 121 countries were surveyed in 2021. <br>\n",
+    "* **GlobalRegion**: the global region the respondent lives in including the following 15 regions: Eastern Africa, Central/Western Africa, North Africa, Southern Africa, Latin America & Caribbean, Northern America, Central Asia, East Asia, South-eastern Asia, South Asia, Middle East, Eastern Europe, Northern/Western Europe, Southern Europe, Australia and New Zealand.<br>\n",
+    "* **Age**: the respondent's age at the time of the interview, ranging from 15 to 98. Ages 99+ and (Refused) have been excluded from the following analysis. <br>\n",
+    "* **Gender**: gender of the respondent: male or female<br>\n",
+    "* **Education**: the education level of the respondent 1 = Primary (0-8 years), 2 = Secondary (9-15 years), 3 = Tertiary (16 years or more) or (DK/Refused).<br>\n",
+    "* **IncomeFeelings**: the respondent's feelings about household income: Living comfortably on present income, Getting by on present income, Finding it difficult on present income, Finding it very difficult on present income or (DK) or (Refused).\n",
+    "* **Employment** (EMP_2010): employment status of the respondent: Employed full time for an employer, Employed full time for self, Employed part time do not want full time, Unemployed, Employed part time want full time, Out of workforce<br>\n",
+    "* **Income5** (INCOME_5): Per capita income quintiles: Poorest 20%, Second 20%, Middle 20%, Fourth 20%, Richest 20%.\n",
+    "\n",
+    "**Reply to questions from the poll**<br>\n",
+    "* **GreatestRisk** (Q2_1): Greatest Source of Risk to Safety in Your Daily Life<br>\n",
+    "* **InternetUse** (Q6): Used Internet, Including Social Media, in Past 30 Days<br>\n",
+    "* **IncomeLoss** (Q10Q11Recode): Suppose you lost all of your household income - how long would you be able to cover basic needs?<br>\n",
+    "\n",
+    "**Reply to question 4 from the poll**<br>\n",
+    "\n",
+    "All have the responses: 1 = Very worried, 2 = Somewhat worried, 3 = Not worried or (DK) or (Refused)\n",
+    "* **Q4A**: Worried Food You Eat Could Cause Serious Harm<br>\n",
+    "* **Q4B**: Worried Water You Drink Could Cause Serious Harm<br>\n",
+    "* **Q4C**: Worried Violent Crime Could Cause Serious Harm<br>\n",
+    "* **Q4D**: Worried Severe Weather Events Could Cause Serious Harm<br>\n",
+    "* **Q4E**: Worried Traffic or Roadside Accident Could Cause Serious Harm<br>\n",
+    "* **Q4F**: Worried Mental Health Issues Could Cause Serious Harm<br>\n",
+    "* **Q4G**: Worried The Work You Do Could Cause Serious Harm<br>\n",
+    "\n",
+    "**Indexes**<br>\n",
+    "* **ResilienceIndex** (resilience_index): Resilience Index with range: 0.0-1.0<br>\n",
+    "* **WorriedIndex** (Worried.Index): Worried Index with range: 0.0-1.0<br>\n",
+    "* **ExperiencedIndex** (Experienced.Index): Experienced Index with range: 0.0-1.0<br>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7739749f",
+   "metadata": {},
+   "source": [
+    "**Selecting attributes**\n",
+    "\n",
+    "The data set is now further refined to 15 attributes that, after some testing, seemed most relevant for analysing the *WorriedIndex*. The sub-questions A-G of question 4 are considered as one group, which leaves 8 further attributes to be described and analysed. The attributes *Education*, *Income5*, *Employment*, *GreatestRisk*, *InternetUse* and *IncomeLoss* are dropped. The 15 chosen attributes are described in more detail below. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2eb7a66",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# selecting columns\n",
+    "data = data[['Country', 'GlobalRegion', 'Age', 'Gender', 'IncomeFeelings', 'Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E', 'Q4F', 'Q4G', 'ResilienceIndex', 'ExperiencedIndex', 'WorriedIndex']]\n",
+    "print(data.shape)\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2f798b1c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#counting number of rows with NaN in the \"Age\" column\n",
+    "data['Age'].isna().sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0aea99a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#counting number of rows that contain any NaN values\n",
+    "\n",
+    "#dropNaN_rows = data.dropna()\n",
+    "#print(dropNaN_rows.shape)\n",
+    "#dropNaN_rows.head()\n",
+    "\n",
+    "# 104816 of 125911 rows would be left if all rows containing any NaN were dropped, i.e. about 17% of the rows would be lost"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "647dbe0b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# list of final selected attributes\n",
+    "print(data.keys())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1d77f41d",
+   "metadata": {},
+   "source": [
+    "## 3. Descriptive analysis of the data set <a class=\"anchor\" id=\"3Analyse\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ec9c048c",
+   "metadata": {},
+   "source": [
+    "### 3.1 Analysis of the attributes with nominal scale type<a class=\"anchor\" id=\"31NominalAttribute\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c142f8ca",
+   "metadata": {},
+   "source": [
+    "The attributes *Country*, *GlobalRegion* and *Gender* have a **nominal** scale type: their values are non-numerical categories that have no natural order. Each of the attributes is described in more detail below. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fa7d9040",
+   "metadata": {},
+   "source": [
+    "**Country**\n",
+    "\n",
+    "In total, respondents from 121 different countries were included in the World Risk Poll in 2021. (This is fewer than the 142 countries surveyed in 2019; across both years the data set covers 144 countries.)\n",
+    "From the *value_counts()* of the attribute *Country* we know that around 1000 people were interviewed in most countries. Exceptions are China with the greatest number of respondents (N=3500), followed by India (N=2001) and Russia (N=2001). Jamaica and Iceland have fewer respondents, with 505 and 500 respectively.  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2625573a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['Country'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ad618490",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['Country'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2ad41e8b",
+   "metadata": {},
+   "source": [
+    "**GlobalRegion**\n",
+    "\n",
+    "The countries are divided into 15 global regions with different numbers of countries and therefore differing numbers of respondents in each category. The global region with most respondents is Latin America & Caribbean with 17535 respondents and the smallest category with 2000 respondents is Australia and New Zealand.\n",
+    "\n",
+    "According to the Lloyd's Register Foundation (2022) the geographic regions are very similar to those used by the United Nations Statistics Division (https://unstats.un.org/unsd/methodology/m49/). Only the region \"Middle East\" was renamed and modified by the Lloyd's Register Foundation. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7af5096",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['GlobalRegion'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "572df4ac",
+   "metadata": {},
+   "source": [
+    "**Gender**\n",
+    "\n",
+    "Of the 125911 respondents, 66494 were female and 59417 were male, so slightly more women than men were interviewed. It has to be noted that the gender options were limited to male and female and thus not sensitive to gender diversity."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0b6eca0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['Gender'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1c558b1",
+   "metadata": {},
+   "source": [
+    "### 3.2 Analysis of the attributes with ordinal scale type <a class=\"anchor\" id=\"32OrdinalAttribute\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e8a57eb6",
+   "metadata": {},
+   "source": [
+    "An **ordinal** scale can be found in the following attributes: *IncomeFeelings* and *Q4A-Q4G*. Categories on an ordinal scale have an order to them, but numerical operations cannot be performed on them."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56cb6b3b",
+   "metadata": {},
+   "source": [
+    "**IncomeFeelings**\n",
+    "\n",
+    "18978 respondents feel that they find it very difficult to live on their present income and 30444 respondents feel that they live comfortably on their present income. The greatest number of participants (N=48337) feel that they are getting by on their present income."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "73dfd694",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['IncomeFeelings'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "674a92d0",
+   "metadata": {},
+   "source": [
+    "**Q4**\n",
+    "\n",
+    "Question Q4 was one of the questions of the survey and was asked as follows: \"In general, how WORRIED are you that each of the following things could cause you serious harm? Are you very worried, somewhat worried, or not worried?\" It referred to 7 different categories or hazards (Food, Water, Crime, Weather, Traffic accidents, Mental health, Work), and the respondents indicated their worry by choosing one of three categories: Not worried, Somewhat worried or Very worried.\n",
+    "\n",
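+    "The descriptions of the largest and smallest groups below do not take into account the respondents who replied DK (don't know) or (Refused) and did not give an answer. Note that the counts are computed on the complete data set (*data_all*), which includes respondents from both 2019 and 2021. \n",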
+    "\n",
+    "+ Q4A: Food You Eat - the largest group of respondents (n=113720) was not worried, the smallest group (n=63028) was very worried.\n",
+    "+ Q4B: Water You Drink - the largest group of respondents (n=141735) was not worried, the smallest group (n=57374) was very worried.\n",
+    "+ Q4C: Violent Crime - the largest group of respondents (n=104532) was very worried, the smallest group (n=84104) was somewhat worried.\n",
+    "+ Q4D: Severe Weather Events - the largest group of respondents (n=101868) was very worried, the smallest group (n=78040) was not worried.\n",
+    "+ Q4E: Traffic or Roadside Accident - the largest group of respondents (n=48866) was very worried, the smallest group (n=30302) was not worried.\n",
+    "+ Q4F: Mental Health Issues - the largest group of respondents (n=132677) was not worried, the smallest group (n=62699) was very worried.\n",
+    "+ Q4G: The Work You Do - the largest group of respondents (n=40352) was not worried, the smallest group (n=12990) was very worried.\n",
+    "\n",
+    "This shows that, in comparison, respondents worried more about 'Violent Crime', 'Severe Weather Events' and 'Traffic or Roadside Accidents' than about the other hazards. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "64eea666",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_all[['Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E', 'Q4F', 'Q4G']].apply(pd.Series.value_counts)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4114ddf",
+   "metadata": {},
+   "source": [
+    "### 3.3 Analysis of the attributes with numerical scale type<a class=\"anchor\" id=\"33NumericalAttribute\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2f2faef7",
+   "metadata": {},
+   "source": [
+    "Attributes with **numerical** scale types are: *Age*, *ResilienceIndex*, *ExperiencedIndex* and *WorriedIndex*. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fedb7b55",
+   "metadata": {},
+   "source": [
+    "**Age**\n",
+    "\n",
+    "After cleaning the data, we have age information for 125388 of the 125911 respondents. The youngest respondent is 15 years old and the oldest is 98; the mean age is 42 and the standard deviation is 17.4. The age distribution is visualised in the histogram below. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4206ef9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['Age'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ab607d3f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# age range is now from 15-98\n",
+    "data['Age'].value_counts().sort_index()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a5766975",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(11, 7)) \n",
+    "sns.histplot(data=(data['Age'].dropna()), kde=True, bins=20, stat=\"density\", color = 'skyblue')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2662dec",
+   "metadata": {},
+   "source": [
+    "Using a larger number of bins (20 instead of 10) shows the age distribution in finer detail. In particular, the age group just above 40 is represented in greater numbers than the curve of the kernel density estimation suggests. However, the overall trend of declining numbers with growing age is clear from the diagram: a larger number of respondents was aged around 15 - 45, whereas fewer respondents in the older age groups were interviewed. \n",
+    "\n",
+    "As interviews were conducted all over the world and the different countries are represented by roughly equal numbers of respondents, this distribution is to be expected because the world population overall is growing: more people in the world would be expected to be younger, and fewer people older. \n",
+    "\n",
+    "Different countries and thus different global regions would be expected to have different age distributions, as their population might be ageing and declining, as in Europe, or growing. To check this, the following histograms are used. The histograms show that the age distribution varies a lot by global region, which makes it seem likely that, with regard to the age of the respondents, representative samples have been drawn from the total population. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6b61db5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#plots the histograms and kde of all global regions separately for comparison\n",
+    "g = sns.FacetGrid(data, col=\"GlobalRegion\", height=3, col_wrap=4)\n",
+    "g.map(sns.histplot, \"Age\", kde=True, stat='probability', bins=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3364e587",
+   "metadata": {},
+   "source": [
+    "**ResilienceIndex**\n",
+    "\n",
+    "The resilience index of the World Risk Poll 2021 is an average made up of different aspects of resilience: Individual, Household, Community, and Social. Each of the aspects is assessed in various questions from the World Risk Poll 2021. A high value of the resilience index indicates greater resilience.  \n",
+    "\n",
+    "A *ResilienceIndex* has been provided for only 121222 of the 125388 respondents. The *ResilienceIndex* can range from 0 to 1, which are also the minimum and maximum values in the data; the mean is 0.57 and the standard deviation is 0.17. The shape of the kde resembles a normal distribution with a very slight tail on the left. This means that most participants have a medium resilience and only few participants are very resilient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4f6bc1e5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['ResilienceIndex'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7853af4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(11, 7)) \n",
+    "sns.histplot(data=data, x=(data['ResilienceIndex'].dropna()), bins=15, kde=True, color = \"g\");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "30aac326",
+   "metadata": {},
+   "source": [
+    "**ExperiencedIndex**\n",
+    "\n",
+    "The experience index is generated from responses to certain questions from the World Risk Poll and aims to assess how individuals perceive risk. A high value of *ExperiencedIndex* indicates a greater experience of harm.\n",
+    "\n",
+    "The *ExperiencedIndex* has been calculated for 121815 respondents. It ranges from -0.004 (minimum value) to 0.996 (maximum value). The mean is 0.23 and the standard deviation 0.25. The *ExperiencedIndex* is strongly right-skewed and looks exponentially distributed with a long tail on the right, showing that most respondents did not experience great risk and only very few respondents experienced a lot of risk. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e018597c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['ExperiencedIndex'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "32d54415",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(11, 7)) \n",
+    "sns.histplot(data=data, x=(data['ExperiencedIndex'].dropna()), bins=15, color = \"y\");"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d518557f",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex**\n",
+    "\n",
+    "The *World Worry Index* (here the variable *WorriedIndex*) provides a summary of how worried a respondent is overall regarding the seven hazards that were included in the \"World Risk Poll\". The *WorriedIndex* is generated from question Q4 that asks respondents how worried they are about the hazards: Food, Water, Crime, Weather, Traffic accidents, Mental health and Work. The higher the value of the worry index the greater is the worry of the respondent. \n",
+    "\n",
+    "The *WorriedIndex* is provided for 118782 respondents. The *WorriedIndex* is designed to range from 0 to 1; in the data set the minimum value is 0.006116 and the maximum value is 1.006116. The mean is 0.43 and the standard deviation 0.22. From the visualisation it seems to be roughly normally distributed, with a strong peak around the mean and a slightly longer tail to the right. This indicates that most of the participants are not so worried and fewer participants are very worried."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1eff0c67",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data['WorriedIndex'].describe() "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a135b575",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(11, 7)) \n",
+    "sns.histplot(data=data, x=(data['WorriedIndex'].dropna()), bins=12, palette = 'colorblind');"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78fc4d38",
+   "metadata": {},
+   "source": [
+    "## 4. Relationships between *WorriedIndex* and other variables <a class=\"anchor\" id=\"4Beziehungen\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9e6c7862",
+   "metadata": {},
+   "source": [
+    "As the focus of the analysis is the *World Worry Index*, the following sections examine the relationships of the other variables with the variable *WorriedIndex*. First I will use box plots to visualise the relationships between the nominal variables (*Country, GlobalRegion, Gender*) and *WorriedIndex*. The ordinal variables are stored as categories in the original data set; these will be converted into numbers so that calculations and visualisations can be performed on them. Afterwards the relationships of the ordinal and numerical variables with the *WorriedIndex* will be examined. To do that, the correlation coefficients will be calculated and the relationships will be visualised using scatter plots and lines of best fit. The p-values will be used to examine the statistical significance of the relationships. Several hypotheses and their corresponding null hypotheses are formulated and will be tested. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8afec302",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Country / GlobalRegion / Gender**\n",
+    "\n",
+    "At first I will describe the relationship between *WorriedIndex* and the attributes with a nominal scale type by using boxplots to compare the different categories. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ef07166",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#plot WorriedIndex of countries sorted by mean WorriedIndex\n",
+    "grouped = data.loc[:,['Country', 'WorriedIndex']] \\\n",
+    "    .groupby(['Country']) \\\n",
+    "    .mean() \\\n",
+    "    .sort_values(by='WorriedIndex')\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(17, 30)) \n",
+    "sns.boxplot(x=data.WorriedIndex, y=data.Country, order=grouped.index);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ef5a1d1",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Country**\n",
+    "\n",
+    "The above plot shows that the *WorriedIndex* varies quite a lot between countries. The respondents who worried least come from the United Arab Emirates, Sweden, Denmark, Uzbekistan, Iceland, Finland and Saudi Arabia. The most worried respondents are from Mali, Guinea, Sierra Leone, Mozambique and Congo Brazzaville.\n",
+    "\n",
+    "**WorriedIndex & GlobalRegion** \n",
+    "\n",
+    "This trend of a changing *WorriedIndex* around the world can also be seen in the boxplot of *WorriedIndex* and *GlobalRegion* below. Least worried are respondents from \"Australia and New Zealand\" as well as from \"Northern/Western Europe\". Most worried are respondents from \"Central/Western Africa\", \"Eastern Africa\" and \"Southern Africa\". "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8715a360",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "grouped = data.loc[:,['GlobalRegion', 'WorriedIndex']] \\\n",
+    "    .groupby(['GlobalRegion']) \\\n",
+    "    .mean() \\\n",
+    "    .sort_values(by='WorriedIndex')\n",
+    " \n",
+    "sns.boxplot(x=data.WorriedIndex, y=data.GlobalRegion, order=grouped.index);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8cfa129e",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Gender** \n",
+    "\n",
+    "The respondents also show a slight difference in how much they worry depending on their gender. Female respondents are slightly more worried than male respondents (see boxplot below). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "49557182",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.boxplot(x=\"WorriedIndex\", y=\"Gender\", data=data);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98092163",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & IncomeFeelings / Q4**\n",
+    "\n",
+    "To be able to include ordinal data in the following analysis, the ordinal values have to be converted to numbers, as they are categorical in the original dataset. This is the case for the attributes *IncomeFeelings* and *Q4A*-*Q4G*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bebe240f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_IncomeFeelings(data):\n",
+    "    data['IncomeFeelings'] = data['IncomeFeelings'].replace(['(DK)', '(Refused)', 'Finding it very difficult on present income', 'Finding it difficult on present income', 'Getting by on present income', 'Living comfortably on present income'], \n",
+    "                                                  [99, 99, 1, 2, 3, 4], inplace=False)\n",
+    "    \n",
+    "    data['IncomeFeelings'] = data['IncomeFeelings'].astype('Int64')\n",
+    "    data['IncomeFeelings'].fillna(0, inplace=True)\n",
+    "    data['IncomeFeelings'] = data['IncomeFeelings'].astype(int)\n",
+    "    \n",
+    "    data['IncomeFeelings'] = data['IncomeFeelings'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_IncomeFeelings(data)    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de9f91b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4A(data):\n",
+    "    data['Q4A'] = data['Q4A'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4A'] = data['Q4A'].astype('Int64')\n",
+    "    data['Q4A'].fillna(0, inplace=True)\n",
+    "    data['Q4A'] = data['Q4A'].astype(int)\n",
+    "    \n",
+    "    data['Q4A'] = data['Q4A'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4A(data)    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dc7ed5b5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4B(data):\n",
+    "    data['Q4B'] = data['Q4B'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4B'] = data['Q4B'].astype('Int64')\n",
+    "    data['Q4B'].fillna(0, inplace=True)\n",
+    "    data['Q4B'] = data['Q4B'].astype(int)\n",
+    "    \n",
+    "    data['Q4B'] = data['Q4B'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4B(data) "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11e3237f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4C(data):\n",
+    "    data['Q4C'] = data['Q4C'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4C'] = data['Q4C'].astype('Int64')\n",
+    "    data['Q4C'].fillna(0, inplace=True)\n",
+    "    data['Q4C'] = data['Q4C'].astype(int)\n",
+    "    \n",
+    "    data['Q4C'] = data['Q4C'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4C(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e67a746",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4D(data):\n",
+    "    data['Q4D'] = data['Q4D'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4D'] = data['Q4D'].astype('Int64')\n",
+    "    data['Q4D'].fillna(0, inplace=True)\n",
+    "    data['Q4D'] = data['Q4D'].astype(int)\n",
+    "    \n",
+    "    data['Q4D'] = data['Q4D'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4D(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "29f2e020",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4E(data):\n",
+    "    data['Q4E'] = data['Q4E'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4E'] = data['Q4E'].astype('Int64')\n",
+    "    data['Q4E'].fillna(0, inplace=True)\n",
+    "    data['Q4E'] = data['Q4E'].astype(int)\n",
+    "    \n",
+    "    data['Q4E'] = data['Q4E'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4E(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d85b4541",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4F(data):\n",
+    "    data['Q4F'] = data['Q4F'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4F'] = data['Q4F'].astype('Int64')\n",
+    "    data['Q4F'].fillna(0, inplace=True)\n",
+    "    data['Q4F'] = data['Q4F'].astype(int)\n",
+    "    \n",
+    "    data['Q4F'] = data['Q4F'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4F(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1f9dade7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# converts column categories to numerical values\n",
+    "def convert_Q4G(data):\n",
+    "    data['Q4G'] = data['Q4G'].replace(['(DK)', '(Refused)', 'Very worried', 'Somewhat worried', 'Not worried'], \n",
+    "                                                  [99, 99, 3, 2, 1], inplace=False)\n",
+    "    \n",
+    "    data['Q4G'] = data['Q4G'].astype('Int64')\n",
+    "    data['Q4G'].fillna(0, inplace=True)\n",
+    "    data['Q4G'] = data['Q4G'].astype(int)\n",
+    "    \n",
+    "    data['Q4G'] = data['Q4G'].replace(99, np.nan) # Refused answers, shouldn't be included = NaN\n",
+    "    \n",
+    "convert_Q4G(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c730659a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "245b19c6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#make sure all attributes are converted to float type\n",
+    "def convert_types(data):\n",
+    "    data['Age'] = data['Age'].astype(float)\n",
+    "    data['IncomeFeelings'] = data['IncomeFeelings'].astype(float)\n",
+    "    data['Q4A'] = data['Q4A'].astype(float)\n",
+    "    data['Q4B'] = data['Q4B'].astype(float)\n",
+    "    data['Q4C'] = data['Q4C'].astype(float)\n",
+    "    data['Q4D'] = data['Q4D'].astype(float)\n",
+    "    data['Q4E'] = data['Q4E'].astype(float)\n",
+    "    data['Q4F'] = data['Q4F'].astype(float)\n",
+    "    data['Q4G'] = data['Q4G'].astype(float)\n",
+    "    data['ResilienceIndex'] = data['ResilienceIndex'].astype(float)\n",
+    "    data['WorriedIndex'] = data['WorriedIndex'].astype(float)\n",
+    "    data['ExperiencedIndex'] = data['ExperiencedIndex'].astype(float)\n",
+    "    \n",
+    "convert_types(data)\n",
+    "data.dtypes"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cdecc9f4",
+   "metadata": {},
+   "source": [
+    "**Correlation coefficients for relationships with *WorriedIndex***\n",
+    "\n",
+    "Spearman's rank correlation is used, as Downey (2014) describes it as a more robust method for data that might have a non-linear relationship, skewed distributions or outliers.\n",
+    "\n",
+    "The correlation coefficient indicates a strong positive relationship between two variables if it is near 1 and a strong negative relationship if it is near -1. In the following, coefficients between 0.6 and 1 are treated as a strong correlation, between 0.4 and 0.6 as a moderate correlation and between 0 and 0.4 as a weak correlation; the same applies to negative correlations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dc07779c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#show correlation coefficients for all attributes in relation to WorriedIndex\n",
+    "data.select_dtypes('number').corr(method=\"spearman\")['WorriedIndex']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c5444f0",
+   "metadata": {},
+   "source": [
+    "As should be expected, only the variables from question 4, which are directly related to the *WorriedIndex*, have relatively high correlation coefficients. With correlation coefficients ranging from 0.66 to 0.7, six of these variables (Food you eat, Water you drink, Violent crime, Severe weather events, Traffic or roadside accidents, Mental health issues) have a strong correlation with *WorriedIndex*. The only exception is Q4G (the work you do), which only has a coefficient of 0.29 and can therefore be considered weakly correlated with *WorriedIndex*. None of the other variables have high correlation coefficients, so there are only weak correlations between those variables and the *WorriedIndex*. \n",
+    "\n",
+    "**Strong positive correlation:**\n",
+    "\n",
+    "+ *WorriedIndex & Q4A, Q4B, Q4C, Q4D, Q4E, Q4F*\n",
+    "\n",
+    "This indicates that if worry about food, water, violent crime, severe weather events, traffic or roadside accidents and mental health issues increases, the *WorriedIndex* also increases.\n",
+    "\n",
+    "**Weak positive correlation:**\n",
+    "\n",
+    "+ *WorriedIndex & Q4G*\n",
+    "+ *WorriedIndex & ExperiencedIndex*\n",
+    "\n",
+    "This indicates that: \n",
+    "\n",
+    "+ If worry about work increases, the *WorriedIndex* would also increase slightly.\n",
+    "+ If *ExperiencedIndex* increases, the *WorriedIndex* would also increase slightly. \n",
+    "\n",
+    "**Weak negative correlation:**\n",
+    "+ *WorriedIndex & Age*\n",
+    "+ *WorriedIndex & IncomeFeelings*\n",
+    "+ *WorriedIndex & ResilienceIndex*\n",
+    "\n",
+    "This indicates that:\n",
+    "\n",
+    "+ If *Age* increases, the *WorriedIndex* would decrease slightly, making older respondents less worried than younger ones.\n",
+    "+ If the *IncomeFeelings* increase, the *WorriedIndex* would decrease slightly, meaning that the more comfortable respondents felt that their income would cover their cost of living, the more their worry would decrease. \n",
+    "+ If *ResilienceIndex* increases, the *WorriedIndex* would decrease slightly. This means that a respondent who is more resilient would worry slightly less. \n",
+    "\n",
+    "In the next chapter these relationships will be visualised using scatter plots. To see whether there is a linear relationship between the variables, linear regression is used, and the p-values are computed to analyse whether the results are statistically significant. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "34528bbe",
+   "metadata": {},
+   "source": [
+    "## 5. Inductive analysis <a class=\"anchor\" id=\"5InductiveAnalyse\"></a>\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf60e4d9",
+   "metadata": {},
+   "source": [
+    "Even though some of the correlation coefficients are very small, indicating only weak relationships between the *WorriedIndex* and the analysed variables, they will be included in the following hypotheses alongside those variables that have a strong correlation with *WorriedIndex*. \n",
+    " \n",
+    "**Hypotheses**\n",
+    "\n",
+    "+ Hypothesis 1: if *Q4A-Q4F* increases, *WorriedIndex* also increases.\n",
+    "+ Hypothesis 2: if *Q4G* increases, the *WorriedIndex* would increase slightly.\n",
+    "+ Hypothesis 3: if *ExperiencedIndex* increases, the *WorriedIndex* would increase slightly.\n",
+    "+ Hypothesis 4: if *Age* increases, the *WorriedIndex* would decrease slightly.\n",
+    "+ Hypothesis 5: if *IncomeFeelings* increase, the *WorriedIndex* would decrease slightly.\n",
+    "+ Hypothesis 6: if *ResilienceIndex* increases, the *WorriedIndex* would decrease slightly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b3b35dd1",
+   "metadata": {},
+   "source": [
+    "To examine the relationship between *WorriedIndex* and the other attributes, the following null hypotheses were formulated.\n",
+    "\n",
+    "The **null hypotheses** are: \n",
+    "\n",
+    "+ H01: Q4A-Q4F do not influence the WorriedIndex.\n",
+    "+ H02: Q4G does not influence the WorriedIndex.\n",
+    "+ H03: ExperiencedIndex does not influence the WorriedIndex.\n",
+    "+ H04: Age does not influence the WorriedIndex.\n",
+    "+ H05: IncomeFeelings does not influence the WorriedIndex.\n",
+    "+ H06: The ResilienceIndex does not influence the WorriedIndex.\n",
+    "\n",
+    "To test the hypotheses, the p-value will be computed for the relationship between the *WorriedIndex* and each variable. If p < 0.05, the result is considered statistically significant and the null hypothesis can be rejected, which supports the hypothesis that has been formulated. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f38c2c26",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Q4A-Q4F**\n",
+    "\n",
+    "The following description of the relationship between *WorriedIndex* and *Q4A-Q4F* (worry about A-Food you eat, B-Water you drink, C-Violent crime, D-Severe weather events, E-Traffic or roadside accidents, F-Mental health issues) is based on the 6 diagrams and the corresponding calculations below. \n",
+    "\n",
+    "All of the scatter plots show that there does not seem to be a clear linear relationship between the 6 hazards (food, water, crime, weather, traffic accidents, mental health) that respondents worry about and *WorriedIndex*, as the data points are scattered all over the diagrams. Note, however, that the jitter added to both axes (normally distributed noise with a standard deviation of 2) is large compared to the ranges of the variables and contributes to this impression. The lines of best fit all have very similar positive slopes between 2.34 and 2.68. The intercepts range from 0.77 to 1.11. The slope means that if the *WorriedIndex* increased by one unit, the worry about the respective risk would increase by the value of the slope; the intercept is the predicted worry for a *WorriedIndex* of 0. The largest intercept of 1.11 is found for E-Traffic or roadside accidents and the smallest for A-Food, B-Water and F-Mental health issues.\n",
+    "\n",
+    "The p-values for all 6 variables *Q4A-Q4F* are reported as 0. This is unusual, as the p-value could be expected to be very small but is usually not exactly 0. This might indicate some form of error, either in the data provided, in the type of data used, in the relationship of the data not being linear or in the way the data was used in this analysis. It might also be the case that the p-value is displayed as 0 due to the large quantity of data used here, meaning that the number is so small that the computer cannot distinguish it from 0. Assuming that 0 is the correct p-value, the results are statistically significant (p-values below 0.05), leading to a rejection of the null hypothesis H01: *Q4A-Q4F* do not influence the *WorriedIndex*. This means that *Q4A-Q4F* have a statistically significant relationship with *WorriedIndex*: the worry respondents have about food, water, crime, weather, traffic accidents and mental health issues is related to their worry index. Each of the factors had a strong positive correlation, and therefore, as the worry about one of the factors (food, water, crime, weather, traffic accidents and mental health issues) increases, the *WorriedIndex* also increases."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ec5f500",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "from scipy.stats import linregress\n",
+    "\n",
+    "subset_Q4A = data.dropna(subset=['WorriedIndex', 'Q4A'])\n",
+    "xs = subset_Q4A['WorriedIndex']\n",
+    "ys = subset_Q4A['Q4A']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8ee85535",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4A_jitter = data['Q4A'] + np.random.normal(0, 2, size=len(data['Q4A']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4A_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4A')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "152ae7db",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4B = data.dropna(subset=['WorriedIndex', 'Q4B'])\n",
+    "xs = subset_Q4B['WorriedIndex']\n",
+    "ys = subset_Q4B['Q4B']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da220c18",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4B_jitter = data['Q4B'] + np.random.normal(0, 2, size=len(data['Q4B']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4B_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4B')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3dfaf69",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4C = data.dropna(subset=['WorriedIndex', 'Q4C'])\n",
+    "xs = subset_Q4C['WorriedIndex']\n",
+    "ys = subset_Q4C['Q4C']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2ae8e5a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4C_jitter = data['Q4C'] + np.random.normal(0, 2, size=len(data['Q4C']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4C_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4C')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bb85425c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4D = data.dropna(subset=['WorriedIndex', 'Q4D'])\n",
+    "xs = subset_Q4D['WorriedIndex']\n",
+    "ys = subset_Q4D['Q4D']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85e7a5ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4D_jitter = data['Q4D'] + np.random.normal(0, 2, size=len(data['Q4D']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4D_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4D')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5cca1321",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4E = data.dropna(subset=['WorriedIndex', 'Q4E'])\n",
+    "xs = subset_Q4E['WorriedIndex']\n",
+    "ys = subset_Q4E['Q4E']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "443750dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4E_jitter = data['Q4E'] + np.random.normal(0, 2, size=len(data['Q4E']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4E_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4E')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4177752c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4F = data.dropna(subset=['WorriedIndex', 'Q4F'])\n",
+    "xs = subset_Q4F['WorriedIndex']\n",
+    "ys = subset_Q4F['Q4F']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e562278",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4F_jitter = data['Q4F'] + np.random.normal(0, 2, size=len(data['Q4F']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4F_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4F')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "050a467d",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Q4G**\n",
+    "\n",
+    "The scatter plot of *WorriedIndex & Q4G* (worry about work causing serious harm) also does not seem to indicate a linear relationship between the two, as the data points are scattered all over the diagram. The line of best fit, however, does indicate a positive correlation. This is in line with the weak positive correlation shown by the correlation coefficient of 0.29. \n",
+    "The slope is 2.42, similar to the slopes for *Q4A-Q4F*, which ranged between 2.34 and 2.68: if the *WorriedIndex* went up by one unit, the worry about work would go up by about 2.4. The intercept is 1.11, the same as for *Q4E* (worry about traffic causing serious harm), so if the *WorriedIndex* was 0, *Q4G* would be about 1.1. The p-value is again 0, either indicating an error or, if it is taken to be true, meaning that the result is statistically significant and the null hypothesis, H02: *Q4G* does not influence the *WorriedIndex*, can be rejected. This means that an increase in the worry about work causing serious harm and an increase in the *WorriedIndex* are related. But given the low correlation coefficient, this relationship is not very strong."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "40bab9e5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_Q4G = data.dropna(subset=['WorriedIndex', 'Q4G'])\n",
+    "xs = subset_Q4G['WorriedIndex']\n",
+    "ys = subset_Q4G['Q4G']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "286a0953",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "Q4G_jitter = data['Q4G'] + np.random.normal(0, 2, size=len(data['Q4G']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, Q4G_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 5])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Q4G')\n",
+    "plt.axis([0, 1, 0, 5]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75297192",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & ExperiencedIndex**\n",
+    "\n",
+    "The scatter plot below shows no clear linear relationship between the two attributes, as the data points are spread all over the figure. This implies that the fitted line is not a good model for the relationship. However, the slope is positive, confirming a positive relationship between *WorriedIndex* and *ExperiencedIndex*. As the *WorriedIndex* goes up by one unit, the *ExperiencedIndex* goes up by 0.47 (slope). The p-value is 0.0, which either indicates that there is an error in the calculation or that the result is statistically significant, as p-values below 0.05 are assumed to be statistically significant. The intercept is 0.032, so if the *WorriedIndex* is 0, the *ExperiencedIndex* would be 0.032 and therefore also very low. The null hypothesis, H03: *ExperiencedIndex* does not influence the *WorriedIndex*, can be rejected: experience of risk and worry about risk are related, meaning that if one increases, the other also increases. However, the correlation coefficient indicated that there is only a weak positive relationship between *WorriedIndex* and *ExperiencedIndex*."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "639fc6b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_exp = data.dropna(subset=['WorriedIndex', 'ExperiencedIndex'])\n",
+    "xs = subset_exp['WorriedIndex']\n",
+    "ys = subset_exp['ExperiencedIndex']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0736ab6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "exp_jitter = data['ExperiencedIndex'] + np.random.normal(0, 2, size=len(data['ExperiencedIndex']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, exp_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 1])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Experienced Index')\n",
+    "plt.axis([0, 1, 0, 1]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e02a381",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Age**\n",
+    "\n",
+    "As the points of the scatter plot below are spread all over the figure, the relationship might not be linear, so the line of best fit might not be the best model for the relationship. The line of best fit has a slope of -10.62, which shows a negative relationship between *Age* and *WorriedIndex*: as the *WorriedIndex* increases by one unit, the *Age* of the respondents decreases by almost 11 years. The intercept is 46.4, meaning that if the *WorriedIndex* was 0, the *Age* would be around 46 years. Again, the p-value is 0.0. This means the null hypothesis, H04: *Age* does not influence the *WorriedIndex*, can be rejected, as the p-value is lower than 0.05. There is a negative correlation between *WorriedIndex* and *Age*. The effect, however, is very weak, as the low correlation coefficient indicated. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e27920c4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_age = data.dropna(subset=['WorriedIndex', 'Age'])\n",
+    "xs = subset_age['WorriedIndex']\n",
+    "ys = subset_age['Age']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "371bc1c4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "age_jitter = data['Age'] + np.random.normal(0, 2, size=len(data['Age']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, age_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 15, 100])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.7)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Age in years')\n",
+    "plt.axis([0, 1, 15, 100]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9601f5e8",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & IncomeFeelings**\n",
+    "\n",
+    "As before, the relationship may not be linear, as the points are spread all over the figure. The slope is -1.25, confirming a negative relationship, and the p-value is again 0. If this is not an error, then the null hypothesis, H05: *IncomeFeelings* does not influence the *WorriedIndex*, can be rejected. There is a relationship between *WorriedIndex* and *IncomeFeelings*, but it is not very strong, as the very low correlation coefficient indicates. \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7a1cef53",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_feel = data.dropna(subset=['WorriedIndex', 'IncomeFeelings'])\n",
+    "xs = subset_feel['WorriedIndex']\n",
+    "ys = subset_feel['IncomeFeelings']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "454d9008",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter the data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "feel_jitter = data['IncomeFeelings'] + np.random.normal(0, 2, size=len(data['IncomeFeelings']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, feel_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 6])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.5)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('IncomeFeelings')\n",
+    "plt.axis([0, 1, 0, 6]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1b68a3fa",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & ResilienceIndex**\n",
+    "\n",
+    "Again, the data points for the two variables are scattered, so they do not seem to have a linear relationship. \n",
+    "The slope is -0.19, so if the *WorriedIndex* increases by one unit, the *ResilienceIndex* decreases by 0.19. The intercept is 0.66, so if the *WorriedIndex* was 0, the *ResilienceIndex* would be 0.66. The p-value is again 0.0. As it is below 0.05, assuming that it is really 0, the null hypothesis, H06: The *ResilienceIndex* does not influence the *WorriedIndex*, can be rejected and a relationship between *WorriedIndex* and *ResilienceIndex* exists, even though it is a weak one, as the very small correlation coefficient indicates. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbc85089",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# drop missing values, regression, find p-value\n",
+    "subset_res = data.dropna(subset=['WorriedIndex', 'ResilienceIndex'])\n",
+    "xs = subset_res['WorriedIndex']\n",
+    "ys = subset_res['ResilienceIndex']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c14d6d80",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# jitter data\n",
+    "worry_jitter = data['WorriedIndex'] + np.random.normal(0, 2, size=len(data['WorriedIndex']))\n",
+    "res_jitter = data['ResilienceIndex'] + np.random.normal(0, 2, size=len(data['ResilienceIndex']))\n",
+    "\n",
+    "# make the scatter plot\n",
+    "plt.plot(worry_jitter, res_jitter, 'o', markersize=2, alpha=0.08)\n",
+    "plt.axis([0, 1, 0, 1])\n",
+    "\n",
+    "# plot the line of best fit\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "plt.plot(fx, fy, '-', alpha=0.5)\n",
+    "\n",
+    "# label the axes\n",
+    "plt.xlabel('Worried Index')\n",
+    "plt.ylabel('Resilience Index')\n",
+    "plt.axis([0, 1, 0, 1]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3dceeff1",
+   "metadata": {},
+   "source": [
+    "**Summary of inductive analysis**\n",
+    "\n",
+    "The p-value for all hypothesis tests was 0. Most likely the p-values were simply extremely small; alternatively, some error may have occurred in the data provided, the type of data used or the analysis, or the result might be due to the non-linear relationship of the data. If the p-values are taken to be the correct results of the hypothesis tests, a p-value of 0 would mean that all results were statistically significant and all null hypotheses should be rejected. Even if the null hypotheses are rejected, the correlation between *WorriedIndex* and the variables *Q4G, Age, IncomeFeelings, ResilienceIndex* and *ExperiencedIndex* would be very weak. This is indicated by the low or very low correlation coefficients. That would mean that the *WorriedIndex* changes only very slightly with the respondents' worry about risk at work, their experience of risk, their resilience, how they feel about their income or how long they could live if they lost their income, and how old they are. The correlation between *WorriedIndex* and *Q4A-Q4F* is stronger, as we see from the larger correlation coefficients: if the respondents' worry about food, water, crime, weather, traffic accidents or mental health increases, the *WorriedIndex* also increases."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf1ca175",
+   "metadata": {},
+   "source": [
+    "**WorriedIndex & Q4A-Q4F & GlobalRegion**\n",
+    "\n",
+    "For better visibility, the following diagrams were plotted in two rows. They show the role the global region of the respondents plays in the relationship between *WorriedIndex* and the worry that one of the 6 hazards that were found to have a strong correlation with *WorriedIndex* (food, water, crime, weather, traffic accidents, mental health) could cause them serious harm. With the pair grid used here, the *WorriedIndex* is unfortunately displayed on the y-axis, but this way *Q4A-Q4F* can be compared well. The plots show that the slope may vary quite a bit by *GlobalRegion*, indicating that the amount of worry about each hazard, and thus the *WorriedIndex*, changes according to the global region. This relationship will not be tested further here, as it would exceed the scope of this analysis. It is also visible that 'Central/Western Africa' always has the greatest *WorriedIndex*, indicating a high level of worry, while, for example, 'Northern/Western Europe' has a low *WorriedIndex* and thus respondents from this region are less worried."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3952e959",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# plot correlation for a series of variables\n",
+    "# plot line of best fit for all categories of one variable\n",
+    "# place the legend at a specific position outside the last plot\n",
+    "h = sns.PairGrid(data, y_vars=[\"WorriedIndex\"], x_vars=[\"Q4A\", \"Q4B\",\"Q4C\"], hue=\"GlobalRegion\", height=4)\n",
+    "h.map(sns.regplot)\n",
+    "plt.legend(bbox_to_anchor=(1.3, 1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d36087ee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# plot correlation for a series of variables\n",
+    "# plot line of best fit for all categories of one variable\n",
+    "# place the legend at a specific position outside the last plot\n",
+    "h = sns.PairGrid(data, y_vars=[\"WorriedIndex\"], x_vars=[\"Q4D\", \"Q4E\",\"Q4F\"], hue=\"GlobalRegion\", height=4)\n",
+    "h.map(sns.regplot)\n",
+    "plt.legend(bbox_to_anchor=(1.3, 1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7bcd76f2",
+   "metadata": {},
+   "source": [
+    "## 6. Worry and happiness<a class=\"anchor\" id=\"6Happiness\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4212669e",
+   "metadata": {},
+   "source": [
+    "In this chapter I will compare worry to happiness by combining two data sets. The first one is the data from the \"World Happiness Report 2021\" and the second one is a data set derived from the \"World Risk Poll 2021\" used above. The latter aggregates the *World Worry Index* (as well as the *Experience Index* and the *Resilience Index*) by country and thus makes a ranking of countries possible. Similarly, the attribute *Ladder score* from the \"World Happiness Report 2021\" corresponds to a ranking of the countries. \n",
+    "To use the two data sets together, they will first be loaded, then a rank attribute will be added to each data set, one for worry and one for happiness. Finally, the two data sets will be combined and the new aggregated data will be analysed. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7200eb8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load happiness data\n",
+    "data_happy = pd.read_csv(\"world-happiness-report-2021_rank.csv\")\n",
+    "print(data_happy.shape)\n",
+    "data_happy.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50670fbd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# adding an attribute 'Happy_rank' ranking the listed countries (149 in this file) from 1 to 149\n",
+    "# 1 means the most happy and 149 the least happy; this assumes the rows are already sorted by happiness\n",
+    "data_happy['Happy_rank'] = range(1, len(data_happy) + 1)\n",
+    "data_happy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4bb48985",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# adding an attribute 'Country' with the same content as 'Country name'\n",
+    "# this makes it possible to merge both tables on the name of the country\n",
+    "data_happy['Country'] = data_happy['Country name']\n",
+    "data_happy.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85b4d3c7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# read worry index data\n",
+    "data_worry = pd.read_csv(\"WRP 2021 Indexes_WRP 2021 Indexes_Tabelle_original.csv\")\n",
+    "print(data_worry.shape)\n",
+    "data_worry.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c0a8bcd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# adding an attribute 'Worry_rank' ranking the listed countries (144 in this file) from 1 to 144\n",
+    "# 1 means the most worried and 144 the least worried; this assumes the rows are already sorted by worry\n",
+    "data_worry['Worry_rank'] = range(1, len(data_worry) + 1)\n",
+    "data_worry"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea59e099",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# the two data sets are merged on the variable 'Country'; the outer join keeps countries missing from either data set\n",
+    "df = pd.merge(data_worry, data_happy, on=\"Country\", how='outer')\n",
+    "# choose the columns that are required for the analysis\n",
+    "df = df[['Country', \"Regional indicator\", 'Worry Index', \"Worry_rank\", 'Ladder score', 'Happy_rank']]\n",
+    "\n",
+    "# uncomment the next line to display all rows and columns\n",
+    "# pd.set_option(\"display.max_rows\", None, \"display.max_columns\", None)\n",
+    "print(df.shape)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42a1d271",
+   "metadata": {},
+   "source": [
+    "**Number of countries**\n",
+    "\n",
+    "The two data sets, the \"World Happiness Report 2021\" and the \"World Risk Poll 2021\" indexes, cover a different number of countries (149 for the \"World Happiness Report\" and 144 for the \"World Risk Poll\"), and each contains a few missing values. Rows with missing values are therefore removed to allow a comparison of the countries where possible. \n",
+    "This leaves a data set with 139 rows, corresponding to 139 countries that were in both data sets and contained no NaN values. This means that some of the ranks are left out, as the corresponding countries are no longer listed in the data set. It is also worth noting that the index data set of the \"World Risk Poll 2021\" contains countries that only had respondents in the 2019 survey and are thus not included in the analysis of the *WorriedIndex* in the chapters above. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f3f0284",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# dropping all rows that include NaN values. \n",
+    "# now there are 139 countries listed in the data set\n",
+    "df = df.dropna()\n",
+    "print(df.shape)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf5f8892",
+   "metadata": {},
+   "source": [
+    "**Description of the happy/worry data set**\n",
+    "\n",
+    "The new data set contains 139 rows referring to 139 different countries. It has 6 columns corresponding to 6 attributes (*Country, Regional indicator, Worry Index, Worry_rank, Ladder score* and *Happy_rank*), which are described briefly below. \n",
+    "\n",
+    "**Ladder score & Worry Index:** \n",
+    "\n",
+    "The minimum value of *Ladder score*, which represents each country's happiness, is 2.52 and the maximum value is 7.84. The mean is 5.58 and the standard deviation is 1.07. For the *Worry Index* the minimum value is 0.2 and the maximum value is 0.66. The mean for the *Worry Index* is 0.43 and the standard deviation is 0.1. \n",
+    "\n",
+    "As the *Ladder score* increases, happiness increases and the *Happy_rank* number decreases, meaning that countries with a low rank are the happiest and countries with a high rank are the least happy. Finland, Denmark and Iceland are the happiest countries and Afghanistan, Zimbabwe and Rwanda the least happy. \n",
+    "\n",
+    "As the *Worry Index* increases, worry increases and the *Worry_rank* number decreases, meaning that countries with a low rank are the most worried and countries with a high rank are the least worried. Mali, Guinea and Mozambique are the most worried countries with *Worry_rank* 1 to 3, and Denmark, Uzbekistan and Sweden are the least worried.\n",
+    "\n",
+    "*Ladder score* and *Worry Index* both have roughly a normal distribution. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "78ac7543",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Ladder score'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "55c9de1f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Worry Index'].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c8debe3b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, ax = plt.subplots(1, 2, figsize=(17, 6))\n",
+    "sns.histplot(data=df, x='Worry Index', kde=True, bins=12, ax=ax[0])\n",
+    "sns.histplot(data=df, x='Ladder score', kde=True, bins=12, color='skyblue', ax=ax[1]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4ddf092",
+   "metadata": {},
+   "source": [
+    "**Regional indicator**\n",
+    "\n",
+    "Information about the region of each country is also included in the data set. The *Regional indicator* is, however, taken from the data provided in the \"World Happiness Report 2021\" and is not exactly the same as the *GlobalRegion* in the data of the \"World Risk Poll 2021\". There are 10 regions identified in the data set, each containing a different number of countries. 'Sub-Saharan Africa', containing 31 countries, is the category with the most countries, while 'East Asia' and 'North America and ANZ' each contain only 4 of the 139 countries described in the data set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1bcd4520",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Regional indicator'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6a4b6e6b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['Regional indicator'].describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "015e73cf",
+   "metadata": {},
+   "source": [
+    "**Relationship worry / happiness & Regional Indicator**\n",
+    "\n",
+    "The box plots of *Ladder score* and *Worry Index* below show that worry and happiness vary not only between individual countries but also between the global regions of the countries. \n",
+    "\n",
+    "Following this, the question of whether there is a relationship between worry and happiness will be explored by comparing the variables *Worry_rank* and *Happy_rank* as well as the variables *Worry Index* and *Ladder score*. Finally, it will be explored how the correlation of *Worry Index* and *Ladder score* is affected by the *Regional indicator*. \n",
+    "\n",
+    "To do this, the correlation coefficients are first determined for each pair. As before, the Spearman correlation coefficient is used. An absolute value of the coefficient between 0.6 and 1 indicates a strong correlation, between 0.4 and 0.6 a moderate correlation and between 0 and 0.4 a weak correlation; the same threshold values apply to negative coefficients. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f803131",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.boxplot(x=\"Worry Index\", y=\"Regional indicator\", data=df);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f0e212af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.boxplot(x=\"Ladder score\", y=\"Regional indicator\", data=df);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dc95f716",
+   "metadata": {},
+   "source": [
+    "**Relationship between worry & happiness**\n",
+    "\n",
+    "To analyse worry and happiness, the relationships between *Worry_rank* and *Happy_rank* as well as between *Worry Index* and *Ladder score* will be explored below. Their correlation coefficients will be examined, and then scatter plots and lines of best fit will be used to visualise the relationships. The p-values will help to evaluate the statistical significance of the tests of the null hypotheses. Both sets of results will be evaluated together below.  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e0ee6226",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Spearman rank correlation between the numeric columns of the merged data set\n",
+    "df.select_dtypes('number').corr(method=\"spearman\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1491263e",
+   "metadata": {},
+   "source": [
+    "**Correlation**\n",
+    "\n",
+    "The correlation coefficient for *Worry_rank* and *Happy_rank* is -0.56, which is a moderate negative correlation between the variables. As one of the ranks increases, the other decreases. \n",
+    "The correlation coefficient for *Worry Index* and *Ladder score* is -0.55, which indicates a moderate negative correlation between the two, meaning that as one of the variables increases, the other decreases. \n",
+    "\n",
+    "The **hypotheses** are therefore: \n",
+    "+ Hypothesis WH-H1: if *Worry_rank* increases then *Happy_rank* decreases moderately.\n",
+    "+ Hypothesis WH-H2: if *Worry Index* increases then *Ladder score* decreases moderately.\n",
+    "\n",
+    "The corresponding **null hypotheses** are: \n",
+    "\n",
+    "+ WH-0H1: There is no correlation between *Worry_rank* and *Happy_rank*.\n",
+    "+ WH-0H2: There is no correlation between *Worry Index* and *Ladder score*."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "647e8233",
+   "metadata": {},
+   "source": [
+    "**Worry_rank & Happy_rank**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6957d939",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scipy.stats import linregress\n",
+    "\n",
+    "subset = df.dropna(subset=['Worry_rank', 'Happy_rank'])\n",
+    "xs = subset['Worry_rank']\n",
+    "ys = subset['Happy_rank']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67e4f59b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "happyr_jitter = ys + np.random.normal(0, 0.3, len(xs))\n",
+    "plt.plot(xs, happyr_jitter, 'o', markersize=5, alpha=0.3)\n",
+    "\n",
+    "plt.xlabel('Worry rank')\n",
+    "plt.ylabel('Happy rank')\n",
+    "\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "\n",
+    "plt.plot(fx, fy, '-', color='C2')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb8e0765",
+   "metadata": {},
+   "source": [
+    "**Worry Index & Ladder score**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b2e2113",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "subset_index = df.dropna(subset=['Worry Index', 'Ladder score'])\n",
+    "xs = subset_index['Worry Index']\n",
+    "ys = subset_index['Ladder score']\n",
+    "\n",
+    "res = linregress(xs, ys)\n",
+    "res"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bf2a96cd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "happyi_jitter = ys + np.random.normal(0, 0.3, len(xs))\n",
+    "plt.plot(xs, happyi_jitter, 'o', markersize=5, alpha=0.3)\n",
+    "\n",
+    "plt.xlabel('Worry Index')\n",
+    "plt.ylabel('Ladder score')\n",
+    "\n",
+    "fx = np.array([xs.min(), xs.max()])\n",
+    "fy = res.intercept + res.slope * fx\n",
+    "\n",
+    "plt.plot(fx, fy, '-', color='C1')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "44389773",
+   "metadata": {},
+   "source": [
+    "**Interpretation** \n",
+    "\n",
+    "The scatterplots show that while the data points are spread, especially with *Worried Index* and *Ladder Score*, a concentration of points can be made out in the middle of the plot. As the *Worry_rank* and *Happy_rank* depend on the *Worry Index* and *Ladder score*, it would be expected that the correlation behaves similarly. The slope for  *Worry_rank* and *Happy_rank* is -0.58. And the intercept 116, meaning if the *Worry_rank* was 0 the *Happy_rank* would be 116, which is close to the highest rank. And with Worry rank increasing by one, *Happy_rank* decreases by roughly 0.6. The p-vale is very close to 0 so it is statistically significant and the null hypothesis, WH-0H1: There is no correlation between *Worry_rank* and *Happy_rank*, can be rejected and a relationship between *Worry_rank* and *Happy_rank* is established.  \n",
+    "\n",
+    "The slope for *Worry Index* and *Ladder score* is -5.9, indicating that as the *Worry Index* goes up by 1 unit, the *Ladder score* decreases by 5.9. The intercept is 8.1, which means that if the *Worry Index* were 0, the *Ladder score* would be around 8, so at around the maximum value. The p-value is very close to 0, so the result is statistically significant. The null hypothesis WH-0H2 (there is no correlation between *Worry Index* and *Ladder score*) can therefore be rejected, and there is a relationship between *Worry Index* and *Ladder score*. Therefore, according to the data from the \"World Risk Poll 2021\" and the \"World Happiness Report 2021\", a relationship between happiness and worry can be confirmed.\n",
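+    "\n",
+    "As a worked illustration of the fitted line, using the rounded coefficients reported above and a *Worry Index* of 0.5 that is assumed here only as a hypothetical example value:\n",
+    "\n",
+    "$$\\text{Ladder score} \\approx 8.1 - 5.9 \\cdot \\text{Worry Index} = 8.1 - 5.9 \\cdot 0.5 = 5.15$$\n",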
+    "\n",
+    "**Influence of Regional indicator**\n",
+    "\n",
+    "Finally, the plots below of the relationship between *Worry Index* and *Ladder score*, separated by *Regional indicator*, show how differently happiness and worry are distributed across the regions of the world.\n",
+    "If the points are concentrated near the top left corner, the region has a high ladder score and a low worry index, meaning that happiness is prevalent and worry is low. Examples of happy and unworried global regions are 'Western Europe' and 'North America and ANZ'. If the data points are concentrated in the bottom right-hand corner of an individual plot, the countries have a low ladder score and a high worry index, meaning that they are less happy and more worried. This is, for example, the case for the regions 'Sub-Saharan Africa' and 'South Asia'. Testing these relationships and determining their statistical significance is not within the scope of this analysis, but could be an interesting extension to this work."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fbd66c81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "region = sns.lmplot(x=\"Worry Index\", y=\"Ladder score\", col=\"Regional indicator\", hue=\"Regional indicator\", data=df, col_wrap=4, height=4)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7ac7404",
+   "metadata": {},
+   "source": [
+    "## 7. Discussion<a class=\"anchor\" id=\"7Diskussion\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a68398e8",
+   "metadata": {},
+   "source": [
+    "The most significant finding of the analysis is that *Ladder score* and *Worry Index*, and thus *Happy_rank* and *Worry_rank*, are strongly correlated and that this relationship is statistically significant. A relationship between worry and happiness could therefore be established on the basis of the *World Worry Index* of the \"World Risk Poll 2021\" and the *Ladder score* of the \"World Happiness Report 2021\". This means that, on a global scale, countries that are less worried are happier or, conversely, less happy countries are more worried. Additionally, the comparison of worry and happiness by country and global region revealed differences between global regions. These differences were, however, not tested, as this was outside the scope of this analysis; they should be explored in further research.\n",
+    "\n",
+    "On an individual scale, the relationship between the respondents' demographic data, their responses to questions in the \"World Risk Poll 2021\" and its indexes is less obvious. The relationships between *WorriedIndex* and the other examined attributes, especially the demographic attributes and the two other indexes, were not very strong. Differences in worry between countries and global regions could be observed but were not tested for significance, as this was beyond the scope of this analysis. It was found that six of the seven factors identified by the \"World Risk Poll 2021\" (food, water, traffic accidents, weather, mental health) correlate with worry. A weak correlation may also exist with the worry that work might cause serious harm.\n",
+    "\n",
+    "Still, there are some shortcomings in the data used in the analysis above. To calculate the *WorriedIndex*, a matrix of only 7 options was provided to the participants, and the amount of worry felt towards each was only rated from 1 to 3 (Very worried, Somewhat worried or Not worried). The analysis of the *WorriedIndex* also did not include all of the attributes provided in the \"World Risk Poll\" and was thus limited. Additionally, there might have been a problem with the data provided or with the analysis conducted using the *WorriedIndex* in the first part of the analysis: the p-value for all tests in this part was 0, which, as has been explained, might indicate an error. For the data sorted by country in the analysis of worry and happiness, this issue did not occur, and the p-values, although close to 0, were not exactly 0.\n",
+    "\n",
+    "The data provided could be further enhanced. For example, the data provided by the \"World Happiness Report 2021\" did not include any demographic data, as it is not derived from interviews with respondents but compiled from various information about each country. It would be interesting to compare demographic data of participants for worry and happiness; the relationships between happiness, experience of harm and resilience could also be explored. It should also be examined what overlap of data exists between the \"World Happiness Report 2021\" and the \"World Risk Poll 2021\" and whether this affects the results of the analysis.\n",
+    "\n",
+    "Additionally, it has to be noted that no control group was used to test the hypotheses. Moreover, the same data was used both to formulate and to test the hypotheses, which ideally should not be the case. This means that all the results should be used with caution and that no statements can be made about one variable causing another (Downey, 2014).\n",
+    "\n",
+    "Finally, the *Worry Index* and *Ladder score* can only assess worry and happiness on a surface level. Each of the seven hazards that might lead to worry, and many more, could be addressed on their own. Equally, the challenges of each global region or country should be addressed individually. The indexes and the above analysis can only provide a rough overview of worry and happiness. Yet, hopefully, it will still contribute to inspiring change in the world to increase happiness for the individuals and global regions that are most affected by worry, so that the phrase \"don't worry, be happy\" can continue to have meaning."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0fa44667",
+   "metadata": {},
+   "source": [
+    "## 8. References<a class=\"anchor\" id=\"8Literatur\"></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "893b101a",
+   "metadata": {},
+   "source": [
+    "Downey, A. B. (2014), *Think Stats: Exploratory Data Analysis* (2nd ed.). Sebastopol, CA: O'Reilly Media, Inc.\n",
+    "\n",
+    "Helliwell, J. F.; Layard, R.; Sachs, J.; De Neve, J.-E. (eds.) (2021), World Happiness Report 2021. New York: Sustainable Development Solutions Network.\n",
+    "\n",
+    "The Lloyd’s Register Foundation (2022), 2021 Lloyd’s Register Foundation World Risk Poll Methodology, The Lloyd’s Register Foundation. https://wrp.lrfoundation.org.uk/LRF_2021_Methodology_online_version.pdf\n",
+    "\n",
+    "The Lloyd’s Register Foundation (2022), World Risk Poll 2021: A Changed World? Perceptions and experiences of risk in the Covid age, The Lloyd’s Register Foundation. https://wrp.lrfoundation.org.uk/LRF_2021_report_risk-in-the-covid-age_online_version.pdf\n",
+    "\n",
+    "The Lloyd’s Register Foundation (2022), World Risk Poll 2021: A Resilient World? Understanding vulnerability in a changing climate, The Lloyd’s Register Foundation. https://wrp.lrfoundation.org.uk/LRF_2022_report2-resilience_online_version.pdf\n",
+    "\n",
+    "The Lloyd’s Register Foundation. Understanding the Poll: Guidance for risk communicators. https://wrp.lrfoundation.org.uk/2019-world-risk-poll/understanding-the-poll/\n",
+    "\n",
+    "The Lloyd’s Register Foundation. About the Lloyd’s Register Foundation World Risk Poll: The first global study of people’s perceptions and experiences of risks to their personal safety. https://wrp.lrfoundation.org.uk/about-the-lloyds-register-foundation-world-risk-poll/\n",
+    "\n",
+    "The Lloyd’s Register Foundation. 2021 Risk indexes: At the heart of the World Risk Poll are three new indexes which track risk across the world. https://wrp.lrfoundation.org.uk/2021-risk-indexes/"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
-- 
GitLab