Wine Quality Report

Ben Straub

Wine Quality Report was an applied qualifying exam for Penn State. I did it as a practice exam in preparation for taking my own applied master's exam, which is also provided on my RPubs account.

The dataset is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/). The original data consist of variants of the Portuguese Vinho Verde wine, with 1599 observations of red wine and 4898 observations of white wine. For each wine we have the quality (scored between 0 and 10) and eleven quantitative chemical attributes: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Part 3: Devise and implement a method to determine whether a wine is Red or White given its chemical attributes information. Describe any limitations with your method.

3. Preprocessing and Exploratory Data Analysis

Part 1: Download and combine the red wine and white wine datasets, code the red wine by 1 and the white wine by 0. Put them in a usable format and display the first 10 observations. Describe any abnormal character of the dataset, and pre-process the dataset if necessary.
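A minimal sketch of how the download-and-combine step could look in base R. The file names `winequality-red.csv` and `winequality-white.csv` and the semicolon separator are assumptions about the repository layout, and the helper names `combine_wines` and `load_wines` are hypothetical; the indicator column is called `code` to match the regression output later in the report. The small demo frames are made up so the block runs without network access.

```r
# Base URL quoted in the report; the two file names below are assumed.
base_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"

# Stack the two data frames and add the indicator `code` (red = 1, white = 0).
combine_wines <- function(red, white) {
  red$code <- 1L
  white$code <- 0L
  rbind(red, white)
}

# Download both files and combine them (needs network access).
# read.csv turns headers such as "fixed acidity" into fixed.acidity.
load_wines <- function(url = base_url) {
  red   <- read.csv(paste0(url, "winequality-red.csv"), sep = ";")
  white <- read.csv(paste0(url, "winequality-white.csv"), sep = ";")
  combine_wines(red, white)
}

# Tiny made-up frames that show the shape of the result without downloading.
red_demo   <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white_demo <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))
demo <- combine_wines(red_demo, white_demo)
print(head(demo, 10))
```

With network access, `wine <- load_wines()` followed by `head(wine, 10)` and `dim(wine)` would show the first 10 rows and the dimensions of the combined data.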

3.1 Quality

Above are the first 10 observations of the combined data frame of the red and white wine files. There are 4898 observations of white wine and 1599 observations of red wine, each with 13 variables, giving a 6497 by 13 data frame. This report explores the relationship between wine quality and the chemical attributes. First, I look at the variable quality. Each expert graded the wine quality between 0 (very bad) and 10 (excellent). Below, I see that the bulk of the wines have a quality of 5, 6 or 7. There are no observations below a quality of 3 and none above 9.

3.2 Correlation

Now, I look at the correlation between the continuous variables. A common concern in data analysis is multicollinearity, where one predictor variable is highly correlated with another. Multicollinearity makes parameter estimates unstable and makes it difficult to understand the effect that a predictor has on the response. In this EDA, I seek out highly correlated variables and remove them from the later analysis. Below are two correlation plots. The first shows the relationship between all 13 variables. The second shows the relationship after the highly correlated variables are removed.

The first correlation plot shows that alcohol and density have a linear correlation of -0.9 and that free.sulfur.dioxide and total.sulfur.dioxide have a linear correlation of 0.96. Such highly correlated variables would prove problematic in the analysis. Therefore, density and free.sulfur.dioxide are eliminated from the analysis.
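The screening step can be sketched in base R as follows. This is only an illustration on simulated data: the 0.9 cutoff, the helper name `high_cor_pairs`, and the rule of dropping the second variable of each flagged pair are my assumptions here, not something the report specifies.

```r
# Simulated data in which free.sulfur.dioxide is built to be nearly
# collinear with total.sulfur.dioxide; the other columns are independent.
set.seed(1)
n <- 200
total_so2 <- rnorm(n, mean = 115, sd = 55)
sim <- data.frame(
  total.sulfur.dioxide = total_so2,
  free.sulfur.dioxide  = 0.25 * total_so2 + rnorm(n, sd = 2),
  alcohol              = rnorm(n, mean = 10.5, sd = 1.2),
  pH                   = rnorm(n, mean = 3.2, sd = 0.15)
)

# List variable pairs whose absolute Pearson correlation exceeds `cutoff`.
high_cor_pairs <- function(df, cutoff = 0.9) {
  r <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_cor_pairs(sim, cutoff = 0.9)
print(pairs_found)

# Drop one member of each flagged pair (here, the second one listed).
reduced <- sim[, !(names(sim) %in% pairs_found$var2), drop = FALSE]
print(names(reduced))
```

Which member of a pair to drop is a judgment call; keeping the variable that is easier to interpret is one reasonable choice.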

3.3 Outliers

The next concern is the presence of outliers in the data. Outliers can complicate the analysis because the model(s) could be skewed towards those extreme values. I use the outlier function available in the psych package to look for outliers. The function computes the squared Mahalanobis distance, \(D^2 = (x-\mu)^T \Sigma^{-1} (x-\mu)\), where \(\Sigma\) is the covariance of the X matrix. \(D^2\) is used as a way of detecting outliers in the distribution. Large \(D^2\) values, compared to the expected chi-square values, indicate an unusual response pattern. Below is the plot. There are a few concerns, but I focus only on the last five observations in the plot, as these are extreme values.

Observe the five extreme values, identified as observations 152, 259, 4381, 107, and 82 in the dataset. These will be temporarily removed and then put back into the analysis to assess their effect on the estimates and predictions.
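To make the \(D^2\) computation concrete, here is a small sketch using base R's `mahalanobis()` as a stand-in for the psych function named above. The data are simulated with five planted outliers, and the 0.999 chi-square quantile used as a cutoff is an assumption chosen for illustration.

```r
# Simulated 3-variable data with five obvious outliers planted in rows 1 to 5.
set.seed(42)
x <- matrix(rnorm(300 * 3), ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))
x[1:5, ] <- x[1:5, ] + 8

# Squared Mahalanobis distance of every row from the column means,
# using the sample covariance matrix.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under multivariate normality, D^2 is approximately chi-square distributed
# with p = ncol(x) degrees of freedom, so a high quantile gives a rough cutoff.
cutoff <- qchisq(0.999, df = ncol(x))
flagged <- which(d2 > cutoff)

# The five largest distances, analogous to the five points singled out above.
top5 <- order(d2, decreasing = TRUE)[1:5]
print(sort(top5))

# Refit-with-and-without comparisons can then use x[-top5, ] versus x.
x_trimmed <- x[-top5, , drop = FALSE]
```

Because the outliers themselves inflate the mean and covariance, the distances of extreme points are somewhat understated; a robust covariance estimate would be an alternative if that is a concern.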

3.4 Conclusions for EDA

I have done the following for the EDA:

Looked at the distribution of quality
Removed highly correlated variables – density and free.sulfur.dioxide
Identified five outliers that will be included in one analysis and left out of another
???

Part 2 (30 points). Explore the relationships between the Wine Quality and the chemical attributes.

Explore and describe the marginal relationships. (10 points)

The variable quality ranges from 0 to 10 in increments of 1. However, the observations in the data set only go from 3 to 9. I look at the marginal relationships between quality and the chemical attributes through boxplots, since there are only 7 groups. Remember that the bulk of the quality scores are 5, 6 and 7, with 3 and 9 having very few observations.


   3    4    5    6    7    8    9 
  30  216 2138 2836 1079  193    5
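A count table like the one above, and the boxplots by quality level, can be produced along these lines. The data here are simulated purely to illustrate the calls; the sampling weights and the linear link between `alcohol` and `quality` are made up and carry no information about the real wines.

```r
# Made-up quality scores (3 to 9) and alcohol values, for illustration only.
set.seed(7)
sim <- data.frame(
  quality = sample(3:9, 500, replace = TRUE,
                   prob = c(1, 4, 33, 44, 16, 3, 1))
)
sim$alcohol <- 8 + 0.4 * sim$quality + rnorm(500, sd = 0.8)

# Number of observations at each quality level.
counts <- table(sim$quality)
print(counts)

# Median of one attribute within each quality level: the numeric
# counterpart of the centre lines in a boxplot.
medians <- tapply(sim$alcohol, sim$quality, median)
print(round(medians, 2))

# One boxplot per quality level (drawn only in an interactive session).
if (interactive()) {
  boxplot(alcohol ~ quality, data = sim,
          xlab = "quality", ylab = "alcohol")
}
```

Repeating the `boxplot()` call for each chemical attribute gives the full set of marginal views; levels with very few observations should be read with caution.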

Fit a regression model to explore those relationships with the Wine Quality being the response variable. (10 points)


Call:
lm(formula = quality ~ ., data = wine1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4016 -0.4680 -0.0400  0.4602  3.0562 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.1614867  0.2883928  -0.560    0.576    
fixed.acidity        -0.0073736  0.0102901  -0.717    0.474    
volatile.acidity     -1.6569954  0.0798923 -20.740  < 2e-16 ***
citric.acid          -0.1054766  0.0795778  -1.325    0.185    
residual.sugar        0.0246347  0.0023526  10.471  < 2e-16 ***
total.sulfur.dioxide -0.0003267  0.0002633  -1.241    0.215    
pH                    0.0970343  0.0715077   1.357    0.175    
sulphates             0.5632967  0.0733636   7.678 1.85e-14 ***
alcohol               0.3313103  0.0097396  34.017  < 2e-16 ***
code                  0.2470158  0.0499257   4.948 7.70e-07 ***
log.chlorides        -0.1286734  0.0320861  -4.010 6.13e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7386 on 6486 degrees of freedom
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2847 
F-statistic: 259.6 on 10 and 6486 DF,  p-value: < 2.2e-16
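For reference, a fit of this form can be sketched as below. The data are simulated and the coefficients used to generate them are made up; the only parts taken from the output above are the formula `quality ~ .` and the term names. That `log.chlorides` is the natural log of chlorides is my reading of the term name, not something stated in this section.

```r
# Simulated stand-in for the analysis data frame (the real one is wine1).
set.seed(3)
n <- 400
sim <- data.frame(
  volatile.acidity = runif(n, 0.1, 1.0),
  alcohol          = runif(n, 8, 14),
  chlorides        = rlnorm(n, meanlog = -3, sdlog = 0.5),
  code             = rbinom(n, 1, 0.25)   # red = 1, white = 0
)
sim$quality <- 2.5 - 1.5 * sim$volatile.acidity +
  0.33 * sim$alcohol + rnorm(n, sd = 0.7)

# Replace chlorides by its log (assumed transformation, see lead-in).
sim$log.chlorides <- log(sim$chlorides)
sim$chlorides <- NULL

# Regress quality on every remaining column.
fit <- lm(quality ~ ., data = sim)
print(summary(fit)$coefficients)
```

Since quality is an integer score from 0 to 10, treating it as a continuous response is a simplification; an ordinal model would be one alternative.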

Are the important relationships consistent across the wine types? (10 points)