During our end-of-studies internship, we were asked to participate to this data challenge in order to get comfortable with the R language, the Shiny package and geographic data handling (the subject of our internship). In this document, we will present the work we’ve put into this project throughout the month of February.

Along with this report, we developed a Shiny App that you can use to explore the data and discover our analysis. We hosted the app on the Shiny server and it can be used actively for 25 hours.

We worked on this project as a group of three data scientist:

Choosing a theme


Within the proposed dataset, we started to look for a subject based on the file that explains each indicator precisely. We quickly knew we wanted to focus on diabetes because of the amount of indicators available. From there, we started to look for scientific articles which would link diabetes to certain factors that we could find in the available indicators. That way, we were able to select and keep only the interesting and revealing indicators. Before explaining the choice of each indicator, we need to define what diabetes is and why it’s interesting for our study.

What is diabetes

Diabetes is a disease caused by a malfunction of the regulation of blood glucose coming from food, which leads to a chronic hyperglycemia after eating. In normal cases, Beta pancreatic cells joined in islets of Langerhans are endocrine, which means that they can produce hormones that are able to regulate blood sugar level. Amongst these hormones is the famous Insulin. This hormone allows to stock glucose in adipocytes (fat cells), liver cells and skeletal muscle cells in which glucose is converted to glycogen or triglycerides (or both). Other hormones as Glucagon have the opposite role and release glucose in the blood when needed (between meals). This way, blood sugar level remain around 1 g/L between meals and is below 1.4 g/L in normal health conditions.

pancreas


However, this regulation system is disturbed for diabetic people. This disturbance is caused by different factors depending on the type of diabetes. (ref)

Type 1 diabetes

This type of diabetes is an auto-immune disease which is still a mystery in terms of causes. The auto-immune reaction simply destroys Beta pancreatic cells, unabling them to produce any Insulin and therefore causing hyperglycemia. Only less than 10% of diabetics are type 1.

We will not focus on this type because there are no factors known to have a potential influence on it.

Type 2 diabetes

Type 2 diabetes consists of an Insulin resistance corresponding to a decrease of sensitivity towards Insulin from cells that were supposed to stock converted Glucose (fat cells, liver and muscle cells). In this case, Insulin is often over-produced by pancreatic cells because of the lack of sensitivity, which is the body’s way of trying to regulate its blood sugar level, in vain. It has been proven that deleterious health choices and habits are the principal factors causing this type of diabetes. Common factors are obesity, a lack of physical activity, heredity factors, and others that will be detailed and explained below.

This type of diabetes involves many long term health problems including renal failure, strokes, heart attacks, neurodegeneration and arthritis.

There also is a pregnancy diabetes which we won’t go through but does exist. Even though all diabetes types are deadly, only type 2 can be avoided by watching out on health habits.

As we mentionned earlier, we can only extract outside factors for type 2 diabetes. Note that the diabetes prevalence indicator doesn’t differentiate type 1 and type 2 diabetes, but since type 1 diabetes is way less frequent than type 2 diabetes (only 10% of diabetes are type 1), we can study the correlation between diabetes in general with outside factors, with only a minor calculation error.

Why is it particularly interesting to study diabetes’ prevalence

1_11 skul



Even if treated with insulinotherapy, which is the administration of Insulin with or without other helping components to try to regulate blood sugar level, type 2 diabetes still kills many people and has an incredibly high prevalence (1/11 people is diabetic). The mortality rate is indeed 1.8 times superior for a type 2 diabetic person compared to a healthy individual (ref), and 1.6 million deaths were directly caused by this disease in 2016 (ref).

4x


Besides being terribly deadly, another mind blowing fact is that diabetes’ prevalence quadrupled between 1980 and 2017 (from 108 to 425 million people), and its economical impact is also exponential. Indeed, it seems that worldwide GNP was directly impacted of about 1700 billion dollars. Knowing that this disease cannot be inter-individually transmitted amplified our focus towards this lethal pathology, and it seems that these numbers would be lowered by the fact that they are based on only diagnosed or treated cases, but in the first few years of the disease, it is barely ever detected.

Problematic

Since diabetes is exponentially evolving in the world, some studies tried to explain it, and we often saw that these tend to describe socio-economically poor countries (poorer education, poorer healthcare system …) as being more propitious for developing diabetes (ref). Right away, we had the intuition that this assumption was not reliable because we thought that, in the case of diabetes, richer-country habits may favor diabetes development.

We will try to answer this problematic by finding correlations between diabetes and other indicators and how they evolve.

Choice of indicators

First we of course chose the indicator corresponding to diabetes’ prevalence, and we also immediately chose to select the obesity indicator because obesity and overweight are amongst the major triggering elements of diabetes (ref).

Then, we selected nutrition-related indicators such as the percentage of malnutrition which has been proven to cause nutritional stress, which is directly linked to the risk of obesity and therefore diabetes (ref).

We also found out that a lack of iodized salt can lead to hormonal disruptions which enable an optimal regulation of glycemia (blood sugar level) so we decided to also include it in our study (ref).

The next few chosen indicators were selected because they are linked to our antigenous repertory’s expense, like breastfeeding which substantially increases it and decreases the probability of developing diabetes, or a correct nutrition during childhood which also shapes the kid’s future nutrition habits (ref). Moreover, having access to over-protective sanitary accommodations has been linked to a decrease of the size of the antigenous repertory (ref), which is why we also chose to include sanitary-related indicators.

Then, we thought it would also be important and appropriate to include indicators that are part of everyday life for certain people, as tobacco and alcohol which have previously been linked to diabetes in many studies. Undeniably, an alcohol intake modifies blood sugar level and thus elevates the risks of having diabetes (ref). The exact mechanisms of the influence of smoking on diabetes hasn’t exactly be proven but there sure is a link (ref).

We also decided to include anemia in our choice of indicators because diabetic people have greater risks of developping anemia than healthy people (ref).

Finally, we of course selected each country’s Gross National Product (GNP) and income group to be able to study their impact on diabete’s prevalence as our problematic suggests.

Data preparation


In this section, we’ll present the data preparation we’ve done before getting into the visualization and analysis.

Packages

# Loading packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(leaflet)
library(rgdal)
library(FSA)

First, we’ve imported a few packages.

  • dplyr and tidyr: dplyr is a package for making tabular data manipulation easier by using a limited set of functions that can be combined to extract and summarize insights from data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. reference

  • ggplot2: This is a very popular package for plotting. Since we’ve done some data visualization during this project, we’ve decided to use this widely used package to make beautiful plots. reference

  • leaflet: This package allows R users to plot maps and interact with them. We chose this one because it was recommended by Shiny. reference

  • rgdal: In order to manipulate country coloring easily, we followed the advice of Leaflet and went with a GeoJSON that we loaded with the rgdal package (again, following the recommendation of Leaflet). reference

  • FSA: This package loads the Dunn test which we used in our analysis.

Then, we went on with importing the dataset and prepping some personal datasets.

Loading and prepping the data

# Loading the dataset
DF <- read.table("data/raw/HNP_StatsData.csv",
  header = TRUE,
  sep = ",",
  quote = "\"",
  dec = ".",
  fill = TRUE,
  comment.char = "",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

DF <- DF %>%
  rename(
    Country_Name = `Country Name`,
    Country_Code = `Country Code`,
    Indicator_Name = `Indicator Name`,
    Indicator_Code = `Indicator Code`
  )

# Dictionary to translate area name to area code
area_name_to_area_code <- unique(DF$Country_Code)
names(area_name_to_area_code) <- unique(DF$Country_Name)

# Dictionary to translate area code to area name
area_code_to_area_name <- unique(DF$Country_Name)
names(area_code_to_area_name) <- unique(DF$Country_Code)

# Dictionary to translate indicator code to indicator name
indicator_code_to_indicator_name <- unique(DF$Indicator_Name)
names(indicator_code_to_indicator_name) <- unique(DF$Indicator_Code)

We loaded the data into the variable DF, and renamed some columns so they’re more easily manageable. We also introduced dictionaries, named vectors to easily match codes and names. The next step was to filter the indicators: there were 405 indicators, and surely not as many were relevant to our problematic as we mentionned in the first section.

# List of indicators related to diabetes according to our research
indicators_diabetes <- c(
  "SH.STA.BRTW.ZS", # % of babies born underweighted
  "SN.ITK.DEFC", # number of malnourished people
  "SN.ITK.DEFC.ZS", # % of malnourished people
  "SN.ITK.SALT.ZS", # % houses consuming ioded salt
  "SH.ANM.ALLW.ZS", # % anemia prevalence 15-49 YO women
  "SH.ANM.CHLD.ZS", # % anemia prevalence - 5 YO children
  "SH.ANM.NPRG.ZS", # % anemia prevalence  15-49 YO non pregnant woman
  "SH.PRG.ANEM", # % anemia prevalence pregnant women
  "SH.STA.ANV4.ZS", # % pregnant women 4+ prenatal care
  "SH.STA.ANVC.ZS", # % pregnant women with prenatal care
  "SH.STA.DIAB.ZS", # % diabetes 20-79 YO
  "SH.STA.OWAD.FE.ZS", # overweighted 18+ women
  "SH.STA.OWAD.MA.ZS", # overweighted 18+ men
  "SH.STA.OWAD.ZS", # overweighted 18+ global
  "SH.STA.OWGH.FE.ZS", # overweighted - 5 ans women
  "SH.STA.OWGH.MA.ZS", # overweighted - 5 ans men
  "SH.STA.OWGH.ZS", # overweighted - 5 ans global
  "SH.ALC.PCAP.FE.LI", # alcool consumption women
  "SH.ALC.PCAP.LI",
  "SH.ALC.PCAP.MA.LI",
  "SH.PRV.SMOK", # % smoker
  "SH.PRV.SMOK.FE",
  "SH.PRV.SMOK.MA",
  "NY.GNP.PCAP.CD", # gross national product per capita
  "SH.H2O.BASW.RU.ZS", # % clean water access rural zone
  "SH.H2O.BASW.UR.ZS",
  "SH.H2O.BASW.ZS",
  "SH.H2O.SMDW.RU.ZS",
  "SH.H2O.SMDW.UR.ZS",
  "SH.H2O.SMDW.ZS",
  "SH.STA.BASS.RU.ZS", # sanitary access rural zone
  "SH.STA.BASS.UR.ZS",
  "SH.STA.BASS.ZS",
  "SH.STA.HYGN.RU.ZS", # handwashing access rural zone
  "SH.STA.HYGN.UR.ZS",
  "SH.STA.HYGN.ZS",
  "SH.STA.ODFC.RU.ZS", # % without toilets rural zone
  "SH.STA.ODFC.UR.ZS",
  "SH.STA.ODFC.ZS",
  "SH.STA.BFED.ZS", # % children - 6 months breastfed
  "SH.STA.IYCF.ZS" # % children 6 - 23 months correctly nurrished
)

# Keeping only diabetes-related information
DF_diab <- DF %>%
  filter(Indicator_Code %in% indicators_diabetes)

So we listed relevant indicators and kept only those, which greatly reduced the size of our dataset.

Some of us found more intuitive a 3D representation of the data, so we created DF_diab_3D on top of DF_diab, the structure of which is explained in the comments.

# DF_diab_3D
# Building a 3D array to host the data
#   1st dim : areas
#   2nd dim : indicators
#   3rd dim : dates
area_dictionary <- unique(DF_diab$Country_Code)
indicator_dictionary <- unique(DF_diab$Indicator_Code)
dates <- c(1960:2018)

nb_of_areas <- length(area_dictionary)
nb_of_indicators <- length(indicator_dictionary)
nb_of_years <- length(dates)

# Empty array
DF_diab_3D <- array(rep(NA, nb_of_areas * nb_of_indicators * nb_of_years),
  dim = c(nb_of_areas, nb_of_indicators, nb_of_years)
)
dimnames(DF_diab_3D) <- list(area_dictionary, indicator_dictionary, dates)

# Filling up the array
for (i in 1:nb_of_areas) {
  row_begin <- nb_of_indicators * (i - 1) + 1
  row_end <- nb_of_indicators * i
  area_code <- DF_diab$Country_Code[row_begin]
  DF_diab_3D[area_code, , ] <- as.matrix(DF_diab[row_begin:row_end, 5:63])
}

rm(area_dictionary, 
   indicator_dictionary, 
   dates,
   nb_of_areas, 
   nb_of_indicators, 
   nb_of_years, 
   i, 
   row_begin, 
   row_end, 
   area_code)

Here, we created DF_diab_3D which contains the data relevant to the diabetes’ study. You can access any information with DF_diab_3D[Country_Code, Indicator_Code, Year].

Lastly, for reasons we will explain in the Data visualization section, we created two matrices containing the average and trend of every time series.

# Getting the average of every time series
DF_diab_time_average <- apply(DF_diab_3D, c(1, 2), mean, na.rm = TRUE)

# Getting the trend of every time series
trend <- function(time_series) {
  if (all(is.na(time_series))) {
    return(NA)
  } else {
    time_series <- na.omit(time_series)
    data <- data.frame(years = as.numeric(names(time_series)), values = unname(time_series))
    return(unname(lm(values ~ years, data)$coefficient["years"]))
  }
}
DF_diab_time_trend <- apply(DF_diab_3D, c(1, 2), trend)

Visualization with GeoJSON

For the visualization, we wanted to be able to see color-coded countries on a map. For that, we are using the leaflet package, recommended by Shiny. In order to plot the world map and to be able to interact with it, we had to load a GeoJSON with the help of the rgdal package (recommended this time by Leaflet!).

# Loading the GeoJSON
countries <- readOGR("data/raw/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\gabrielle.devaux\Documents\Conferences\useR 2019\datathon\user2019datathonAPP\data\raw\ne_50m_admin_0_countries\ne_50m_admin_0_countries.shp", layer: "ne_50m_admin_0_countries"
## with 241 features
## It has 94 fields
## Integer64 fields read as strings:  POP_EST NE_ID
# List of correspondance between WBG country names and GeoJSON country names
correspondance_countries <- read.table(
  "data/raw/countries_correspondance.csv",
  header = TRUE,
  sep = ";",
  stringsAsFactors = FALSE,
  quote = "\""
)

# Renaming GeoJSON's ADMIN column
geojson_to_WBG_country <- na.omit(correspondance_countries)$Country_Name
names(geojson_to_WBG_country) <- na.omit(correspondance_countries)$id_on_map

ADMIN <- as.character(countries$ADMIN)
WBG <- unname(geojson_to_WBG_country[ADMIN])
WBG[is.na(WBG)] <- ADMIN[is.na(WBG)]

countries$ADMIN <- factor(WBG)
rm(geojson_to_WBG_country, ADMIN, WBG)

# geojson_countries has the list of the GeoJSON's country names
geojson_countries <- as.character(countries$ADMIN)

countries contains the GeoJSON which will allow us to plot the world map later on. We decided to rename the GeoJSON countries that matched a World Bank Group’s area in order to have increased compatibility. Lastly, we stored all the GeoJSON’s country names in geojson_countries.

Analysis dataset

For the analysis, we decided to link the diabetes’ prevalance to the listed income group of each country. In order to do that, we exctrated the Income group from the dataset (information available in the Countries tab of the excel file).

# country_diabetes_income
# Dataset to do a study on diabetes linked to income
country_income_group <- read.table("data/raw/HNP_StatsCountry.csv",
  header = TRUE,
  sep = ",",
  quote = "\"",
  dec = ".",
  fill = TRUE,
  comment.char = "",
  check.names = FALSE,
  stringsAsFactors = FALSE
)
country_income_group <- country_income_group %>%
  select("Table Name", "Income Group")
colnames(country_income_group) <- c("Country_Name", "Income_group")

country_diabetes_2017 <- DF_diab %>%
  filter(Indicator_Code == "SH.STA.DIAB.ZS") %>%
  select(Country_Name, "2017")

country_diabetes_income <- merge(country_diabetes_2017, country_income_group, by = "Country_Name")
colnames(country_diabetes_income)[2] <- "Diab"
country_diabetes_income[country_diabetes_income == ""] <- NA
country_diabetes_income <- country_diabetes_income %>% drop_na(Income_group)
country_diabetes_income$Income_group <- trimws(gsub("income", "", country_diabetes_income$Income_group))
country_diabetes_income$Income_group <- factor(country_diabetes_income$Income_group,
  levels = c("Low", "Lower middle", "Upper middle", "High"),
  ordered = TRUE
)

rm(country_income_group, country_diabetes_2017)

country_diabetes_income is a data frame containing everything we need to confront a country’s diabetes prevalence to its income group:

head(country_diabetes_income)
##     Country_Name  Diab Income_group
## 1    Afghanistan  9.59          Low
## 2        Albania 10.08 Upper middle
## 3        Algeria  6.73 Upper middle
## 4 American Samoa    NA Upper middle
## 5        Andorra  7.97         High
## 6         Angola  3.94 Lower middle
summary(country_diabetes_income)
##  Country_Name            Diab              Income_group
##  Length:213         Min.   : 0.990   Low         :33   
##  Class :character   1st Qu.: 5.480   Lower middle:45   
##  Mode  :character   Median : 7.210   Upper middle:56   
##                     Mean   : 8.607   High        :79   
##                     3rd Qu.:10.750                     
##                     Max.   :30.530                     
##                     NA's   :10

Later on, we will need this dataset striped of its outliers, which is why we are creating country_diabetes_income_clean:

# Removing NAs and outliers
country_diabetes_income_clean <- na.omit(country_diabetes_income)
outliers <- which(country_diabetes_income_clean$Diab %in% boxplot(Diab ~ Income_group, data = country_diabetes_income_clean, plot = FALSE)$out)
country_diabetes_income_clean <- country_diabetes_income_clean[-outliers,]
rm(outliers)

Then, in order to have a nice visualization of the income group on a map, we created a list containing vectors of countries belonging to each income group. This list, countries_income, takes the following format:

  • low: character vector containing all the countries in the low income group
  • lowermid: character vector containing all the countries in the lower-middle income group
  • uppermid: character vector containing all the countries in the upper-middle income group
  • high: character vector containing all the countries in the high income group
# Make subset of countries per income
countries_low <- country_diabetes_income %>% filter(Income_group == "Low") %>% pull(Country_Name)
# only keep the countries existing in the geojson file
countries_low <- countries_low[countries_low %in% geojson_countries]

countries_lowermid <- country_diabetes_income %>% filter(Income_group == "Lower middle") %>% pull(Country_Name)
countries_lowermid <- countries_lowermid[countries_lowermid %in% geojson_countries]

countries_uppermid <- country_diabetes_income %>% filter(Income_group == "Upper middle") %>% pull(Country_Name)
countries_uppermid <- countries_uppermid[countries_uppermid %in% geojson_countries]

countries_high <- country_diabetes_income %>% filter(Income_group == "High") %>% pull(Country_Name)
countries_high <- countries_high[countries_high %in% geojson_countries]

# countries_income is a list with the 4 vectors of coutries per income
countries_income <- list(
  low = countries_low,
  lowermid = countries_lowermid,
  uppermid = countries_uppermid,
  high = countries_high
)
rm(countries_low, countries_lowermid, countries_uppermid, countries_high)

# Colors for each income on the map and plots
palette <- c("peru", "salmon", "pink", "moccasin")

That way, you can access any group of countries very simply:

countries_income$low
##  [1] "Afghanistan"              "Benin"                   
##  [3] "Burkina Faso"             "Burundi"                 
##  [5] "Central African Republic" "Chad"                    
##  [7] "Comoros"                  "Congo, Dem. Rep."        
##  [9] "Eritrea"                  "Ethiopia"                
## [11] "Gambia, The"              "Guinea"                  
## [13] "Guinea-Bissau"            "Haiti"                   
## [15] "Liberia"                  "Madagascar"              
## [17] "Malawi"                   "Mali"                    
## [19] "Mozambique"               "Nepal"                   
## [21] "Niger"                    "Rwanda"                  
## [23] "Senegal"                  "Sierra Leone"            
## [25] "Somalia"                  "South Sudan"             
## [27] "Syrian Arab Republic"     "Tajikistan"              
## [29] "Tanzania"                 "Togo"                    
## [31] "Uganda"                   "Yemen, Rep."             
## [33] "Zimbabwe"

Data vizualisation


In this section, we’ll look into the dataset and discuss the different indicators we selected for our diabetes subject.

With a first look at the data, we were able to see that some indicators had enough values to be considered as time series, and that others didn’t. To prove that, we made available in the Overview tab of our Shiny App two time-series plots: for each of those, you can select, among the 42 diabetes-related indicators that we selected, an indicator to plot. That way, you will be able to have a global view of our indicators and which ones are quality indicators in terms of data availability. Note that the x and y axis limits adapt for each indicator, giving you an idea of the span of availability of the data.

To illustrate those time-series plots in this report, we picked five countries:

Those five countries belong to different income groups and span over several continents, so we thought that they were good contrastive examples.

interesting_countries <- c("United States", "Brazil", "China", "Somalia", "Cambodia")
country_diabetes_income[country_diabetes_income$Country_Name %in% interesting_countries, c("Country_Name", "Income_group")]
##      Country_Name Income_group
## 28         Brazil Upper middle
## 35       Cambodia Lower middle
## 45          China Upper middle
## 203       Somalia          Low
## 237 United States         High

First, let’s have a look at our data, that consists mostly of time series.

Observing time series and discussing availability

Let’s write a function to plot any time series:

time_series_plot <- function(country_names, indicator_code, years_interval = c(1960, 2018)) {
  area_codes <- area_name_to_area_code[country_names]
  indicator_interval <- c(
    min(DF_diab_3D[, indicator_code, ], na.rm = TRUE),
    max(DF_diab_3D[, indicator_code, ], na.rm = TRUE)
  )

  if (length(area_codes) == 1) {
    data <- data.frame(DF_diab_3D[area_codes, indicator_code, ])
  } else {
    data <- data.frame(t(DF_diab_3D[area_codes, indicator_code, ]))
  }
  colnames(data) <- country_names
  data$dates <- as.numeric(rownames(data))
  data <- gather(data, variable, value, -dates)
  
  title_plot <- indicator_code_to_indicator_name[indicator_code]
  title_plot <- strwrap(title_plot, width = 50)
  title_plot <- paste(title_plot, collapse = '\n')

  ggplot(data, aes(x = dates, y = value, colour = variable)) +
    geom_line(na.rm = TRUE) +
    geom_point(na.rm = TRUE) +
    xlim(years_interval) +
    ylim(indicator_interval) + 
      ggtitle(title_plot) + 
      theme(legend.position = 'bottom',
            plot.title = element_text(hjust = 0.5),
            legend.title = element_blank()) + 
      xlab("") + 
      ylab("")
}

And let’s first have a look at the GNI for each of those countries:

time_series_plot(interesting_countries, "NY.GNP.PCAP.CD")

Notice here that Somalia’s GNI was not calculated at all: we do not have any datapoint. On the other hand, we can notice that, as could be expected, USA’s GNI is the highest.

Now, let’s look at the evolution of diabetes over the years. It would be interesting to get a glimpse of how it evolved, right?

time_series_plot(interesting_countries, "SH.STA.DIAB.ZS")
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

Too bad, diabetes’ prevalence was only reported for the year of 2017! This is what I meant when I said that some indicators could be considered as time series, and some others couldn’t. Not to worry though, we knew going in that we only had one data point for diabetes’ prevalence. We decided to look at it as a recent study on diabetes’ prevalence, and to treat indicators as time-series only when it was possible.

So we have some indicators completely missing from certain countries (Somalia’s GNI), some indicators that were filled only once (diabetes’ prevalence), and some indicators that are scattered, and it’s unclear how to use them:

time_series_plot(interesting_countries, "SH.STA.OWGH.ZS")

In those cases, we have several options:

  • Not using that indicator at all
  • Using only the latest data point as a unique value for each country for this indicator
  • Using an average of the ‘time series’ as a unique value for each country for this indicator

At this point, if you explore the different time series in the Overview tab of our Shiny App, you’ll notice that most of the time series are pretty much linear, and so, could be sumed up by their mean and their trend (over the whole span of years or over the last twenty years for instance). This is why we created the DF_diab_time_average and DF_diab_time_trend matrices in the Data preparation section.

DF_diab_time_average[1:5, 1:4]
##     SN.ITK.SALT.ZS SH.STA.DIAB.ZS SH.STA.BFED.ZS NY.GNP.PCAP.CD
## ARB            NaN      12.077020       32.56044       4284.872
## CSS            NaN      11.859823       20.78333       4298.724
## CEB            NaN       6.863499            NaN       7953.446
## EAR       67.18727       9.326121       41.67644       1237.389
## EAS       84.94861       8.572328       42.05511       3263.828
DF_diab_time_trend[1:5, 1:4]
##     SN.ITK.SALT.ZS SH.STA.DIAB.ZS SH.STA.BFED.ZS NY.GNP.PCAP.CD
## ARB             NA             NA     -0.2830895      266.32814
## CSS             NA             NA             NA      193.36232
## CEB             NA             NA             NA      554.44337
## EAR             NA             NA      0.7954439       58.10336
## EAS             NA             NA     -2.4601487      173.67350

This data representation would enable us to treat every time series as a pair (trend, average) since most of them are linear. For the scattered time series, the average is pretty relevant.

And of course, it is absolutely possible to get this information for the last twenty years:

DF_diab_time_average_9818 <- apply(DF_diab_3D[, , as.character(1998:2018)], c(1, 2), mean, na.rm = TRUE)
DF_diab_time_trend_9818 <- apply(DF_diab_3D[, , as.character(1998:2018)], c(1, 2), trend)

As mentionned in the introduction of this report, we started the data challenge in February, and we also discovered the Shiny package during the development of our project. For this reason, the analysis part doesn’t actually use this data representation and the point is moot: we’ve thought about how we would handle the data and further discussion would be interesting, but irrelevant for the span of our analysis.

Relevant indicators

As mentionned earlier, in the Overview tab of our Shiny App, you can play around with the time series plot and discover every indicator. However, for this report, we’ve listed a few indicators that looked interesting and probably deserved more digging, if we had the time.

interesting_indicators <- c(
  "SH.STA.BASS.ZS",
  "SH.PRV.SMOK",
  "SH.STA.ODFC.ZS",
  "SH.STA.OWAD.ZS"
)

time_series_plot(interesting_countries, interesting_indicators[1])

time_series_plot(interesting_countries, interesting_indicators[2])

time_series_plot(interesting_countries, interesting_indicators[3])

time_series_plot(interesting_countries, interesting_indicators[4])

As you can see, those indicators give a pretty good image of the countries we selected, and we witness a good contrast among them.

Income groups

Since our main interest for the analysis is going to be the income group of the country, we decided to plot the income group on a map to have a intuitive view over each income group.

palette <- c("peru", "salmon", "pink", "moccasin")

m <- leaflet(countries) %>%
  setView(lng = 0, lat = 40, zoom = 1) %>%
  addPolygons(
    layerId = ~ADMIN,
    label = ~ADMIN,
    weight = 1,
    fillColor = "white",
    fillOpacity = 0.2,
    color = "#000519"
  )

m <- m %>% addPolygons(layerId = ~ADMIN, label = ~ADMIN, fillColor = palette[1], fillOpacity = 0.6,
                       color = "#000519", weight = 1,
                       data = subset(countries, countries$ADMIN %in% countries_income$low))

m <- m %>% addPolygons(layerId = ~ADMIN, label = ~ADMIN, fillColor = palette[2], fillOpacity = 0.6,
                       color = "#000519", weight = 1,
                       data = subset(countries, countries$ADMIN %in% countries_income$lowermid))

m <- m %>% addPolygons(layerId = ~ADMIN, label = ~ADMIN, fillColor = palette[3], fillOpacity = 0.6,
                       color = "#000519", weight = 1,
                       data = subset(countries, countries$ADMIN %in% countries_income$uppermid))

m <- m %>% addPolygons(layerId = ~ADMIN, label = ~ADMIN, fillColor = palette[4], fillOpacity = 0.6,
                       color = "#000519", weight = 1,
                       data = subset(countries, countries$ADMIN %in% countries_income$high))

m <- m %>% addLegend(position = "bottomright",
                     colors = palette,
                     labels = levels(country_diabetes_income$Income_group),
                     title = "Income groups",
                     opacity = 0.6)

m
Income groups
Low
Lower middle
Upper middle
High

Now that we have a good overview of the data in general and of the features we’re interested in (notably, the income group), let’s jump into our analysis.

Analysis


In this section, we’ll present our analysis on the subject, linking diabetes to the country’s income group.

A first look: boxplots

To be able to answer our problematic, which was to determine if a country’s richness has an impact on diabetes’ prevalence, we produced a boxplot of diabetes’ prevalence for each income group.

# Diabetes' prevalence by income group, boxplot 
ggplot(na.omit(country_diabetes_income),
       aes(x = Income_group, y = Diab, fill = Income_group)) +
  stat_boxplot(geom = 'errorbar', color = 'black', width = 0.3) +
  geom_boxplot() +
  scale_fill_manual(values = palette) +
  theme(legend.position = 'none',
        plot.title = element_text(size = 14, face = 'bold', hjust = 0.5),
        axis.title.y = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  scale_x_discrete(name = "income groups",
                   labels =c('Low', 'Lower\nmiddle', 'Upper\nmiddle', 'High')) +
  scale_y_continuous(name = '% of diabetes prevalence') +
  ggtitle("")

Looking at this boxplot, we can see that previous beliefs about poorer countries being enclined to a higher diabetes’ prevalence were wrong. Indeed it looks like the opposite is true. Richer income groups have a higher diabetes’ prevalence.

However, we do see some outliers on this boxplot, and an analysis on a dataset deprived of outliers is more reliable, so let’s look at the same boxplot without the outliers:

# Diabetes' prevalence by income group, boxplot excluding outliers 
ggplot(country_diabetes_income_clean,
       aes(x = Income_group, y = Diab, fill = Income_group)) +
  stat_boxplot(geom = 'errorbar', color = 'black', width = 0.3) +
  geom_boxplot() +
  scale_fill_manual(values = palette) +
  theme(legend.position = 'none',
        plot.title = element_text(size = 14, face = 'bold', hjust = 0.5),
        axis.title.y = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  scale_x_discrete(name = "income groups",
                   labels =c('Low', 'Lower\nmiddle', 'Upper\nmiddle', 'High')) +
  scale_y_continuous(name = '% of diabetes prevalence') +
  ggtitle("")

Our assumptions still seem correct on this new boxplot.

To be able to validate the boxplot’s assumptions, we can perform an ANOVA in order to check if each group is statistically independent from the others.

For each of the following tests, the interpretation is based on the p-value: if it’s lower than 0.05, we reject the null hypothesis. Otherwise, we don’t reject the null hypothesis.

ANOVA requirements

To be able to run an ANOVA test, there are a few requirements.

  • Independence of samples: We assume samples are randomly and independantly chosen

  • Normality test: We need to check for normal distribution of the samples to be able to apply an ANOVA. To make sure that the values are normally distributed, there’s the Shapiro test. We have to be careful that the values for each income group is normally distributed. If the values for one group aren’t, we cannot perform the test.

(Null hypothesis: the population is normally distributed)

# Shapiro test
high_income_countries <- country_diabetes_income$Income_group == "High"
shapiro.test(country_diabetes_income[high_income_countries, ]$Diab)
## 
##  Shapiro-Wilk normality test
## 
## data:  country_diabetes_income[high_income_countries, ]$Diab
## W = 0.89812, p-value = 2.902e-05

The p-value of this Shapiro test is below 0.05 so we reject our null hypothesis H(0). This is too bad, since this is a requirement to perform an ANOVA.

  • Homocedasticity test: We need to do a Bartlett test to check if our variables’ variance are homogeneous. This means that variances should be equal for all samples.

(Null hypothesis: homoscedasticity)

# Bartlett test 
bartlett.test(country_diabetes_income$Diab, country_diabetes_income$Income_group)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  country_diabetes_income$Diab and country_diabetes_income$Income_group
## Bartlett's K-squared = 15.233, df = 3, p-value = 0.001628

Again, we do not have the p-value necessary, and we reject our null hypothesis. Since this third requirement is stronger than the second, had it been met, we still could have performed an ANOVA (just to have a look, it would still have been followed by other analysis). But since both requirements are not met, we cannot.

The alternative in this case is to perform a non-parametric test.

Non-parametric test

Let’s directly dive into the Kruskal-Wallis test:

(Null hypothesis: the diabetes’ prevalence is the same in between income groups)

# Non-parametric kruskal test
kruskal.test(Diab ~ Income_group, data = country_diabetes_income_clean)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Diab by Income_group
## Kruskal-Wallis chi-squared = 36.554, df = 3, p-value = 5.718e-08

This test here confirms that there is a difference of diabetes’ prevalence in between income groups, which is what we assumed when we looked at the boxplots.

Let’s figure out exactly which groups are different from one another with the dunn test:

(Null hypothesis: the diabetes’ prevalence is the same in between each income groups)

# Dunn test
dunnTest(Diab ~ Income_group, data = country_diabetes_income_clean)
## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Holm method.
##                    Comparison         Z      P.unadj        P.adj
## 1                  High - Low  4.855476 1.200976e-06 6.004881e-06
## 2         High - Lower middle  1.652945 9.834215e-02 1.966843e-01
## 3          Low - Lower middle -3.034570 2.408792e-03 9.635167e-03
## 4         High - Upper middle -1.444483 1.486031e-01 1.486031e-01
## 5          Low - Upper middle -5.759947 8.414037e-09 5.048422e-08
## 6 Lower middle - Upper middle -2.816696 4.852037e-03 1.455611e-02
# Clearing up the results of the Dunn test
dunn_test <- dunnTest(Diab ~ Income_group, data = country_diabetes_income_clean)
mean_equality <- ifelse(dunn_test$res['P.adj'] < 0.05, 'no', 'yes')
results <- cbind(dunn_test$res['Comparison'], round(dunn_test$res['P.adj'], 5), mean_equality)
colnames(results) <- c("Income Groups", "p-value", "Mean equality ?")
results
##                 Income Groups p-value Mean equality ?
## 1                  High - Low 0.00001              no
## 2         High - Lower middle 0.19668             yes
## 3          Low - Lower middle 0.00964              no
## 4         High - Upper middle 0.14860             yes
## 5          Low - Upper middle 0.00000              no
## 6 Lower middle - Upper middle 0.01456              no

Here, we can see that’s it’s actually really the low income group that is different from all the other income groups.

Conclusion


Throughout this project, as mentionned in the problematic in the Theme section, we wanted to find the factors that would favor the development of diabetes. Sadly, due to a lack of time, our analysis was limited to studying diabetes with regards to income.

In the Analysis section, you will find a boxplot that displays this relation and a statistical test that confirms that there exists a real difference in diabetes’ prevalence in between countries from different income groups. The biggest discovery is that unlike what is believed for low income countries (that more people are sick due to poorer health facilities), more people are sick with diabetes in higher income countries. This makes sense since diabetes is caused by unhealthy habits only accessible with a higher income.

Our analysis could have continued on the development of a “diabetes profile”, extracting the main factors that cause diabetes. This would have digged deeper into this current analysis: what factors cause diabetes, and how are they favored by high income?

If you haven’t already, don’t forget to visit our Shiny App to explore our work!