Decoding Wine

Analyzing Trends in Taste, Price, and Quality

Author

Vadym Musiienko

Published

September 18, 2024

Introduction

The world of wine is rich with complexity, offering a diverse range of flavors, prices, and qualities. In this project we will analyze trends in wine characteristics by examining a Kaggle dataset of over 150,000 wine reviews. Through this exploration, we aim to uncover insights about the relationship between price, quality, and the descriptive notes used by wine experts. By analyzing factors such as the average price of wine by country, the correlation between price and ratings, and the impact of specific flavor notes on wine quality, this project seeks to provide a comprehensive view of what influences wine preferences and how they correlate to market values.

Analysis

Import Libraries and Data

The tidyverse collection of packages handles data import, manipulation, and visualization, while ggrepel adds non-overlapping text labels to plots.

library(tidyverse)
library(ggrepel)

# Load the data
wine_data <- read_csv("data/winemag-data_first150k.csv")

Inspect the Dataset

Let’s take a look at the first 10 records.

head(wine_data, n = 10)
# A tibble: 10 × 11
    ...1 country description designation points price province region_1 region_2
   <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>    <chr>   
 1     0 US      This treme… Martha's V…     96   235 Califor… Napa Va… Napa    
 2     1 Spain   Ripe aroma… Carodorum …     96   110 Norther… Toro     <NA>    
 3     2 US      Mac Watson… Special Se…     96    90 Califor… Knights… Sonoma  
 4     3 US      This spent… Reserve         96    65 Oregon   Willame… Willame…
 5     4 France  This is th… La Brûlade      95    66 Provence Bandol   <NA>    
 6     5 Spain   Deep, dens… Numanthia       95    73 Norther… Toro     <NA>    
 7     6 Spain   Slightly g… San Román       95    65 Norther… Toro     <NA>    
 8     7 Spain   Lush cedar… Carodorum …     95   110 Norther… Toro     <NA>    
 9     8 US      This re-na… Silice          95    65 Oregon   Chehale… Willame…
10     9 US      The produc… Gap's Crow…     95    60 Califor… Sonoma … Sonoma  
# ℹ 2 more variables: variety <chr>, winery <chr>

Clean the Data

From inspecting the dataset, it appears that the country variable is of type <chr> (character), but we want it to be <fct> (a factor) so that it can be ordered. Additionally, we will be analyzing prices by country, so we remove any rows where price or country is missing (NA). Finally, this analysis will not include regions and wineries, so we will exclude those columns.

# Clean up the data
wine_data <- wine_data |> filter(!(is.na(price) | is.na(country))) |>
  mutate(country = as.factor(country)) |>
  select(-c(region_1, region_2, winery))
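
Before filtering, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a small made-up tibble (toy_wine is hypothetical and only stands in for wine_data); it counts the NA values per column and then applies the same kind of filter as the cleaning step above.

```r
library(tidyverse)

# Hypothetical toy data standing in for wine_data (values are made up)
toy_wine <- tibble(
  country = c("US", NA, "France", "Spain"),
  points  = c(96, 90, 88, 91),
  price   = c(235, 20, NA, 15)
)

# Count the missing values in every column
na_counts <- toy_wine |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Keep only rows where both price and country are present,
# then convert country to a factor
toy_clean <- toy_wine |>
  filter(!is.na(price), !is.na(country)) |>
  mutate(country = as.factor(country))

print(na_counts)
print(toy_clean)
```

Running the same summarize() call on the real data would show whether points also contains missing values that need handling.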

Average Price by Country

A natural first question is which countries produce the most expensive and the cheapest wines. To answer it, let’s plot the average wine price for each country in the dataset.

wine_data |> 
  group_by(country) |>
  summarize(avr_price = mean(price)) |>
  ggplot(aes(fct_reorder(country, avr_price, .desc = TRUE), avr_price,
             fill = country)) +
  geom_col(color = "black", width = 0.8, show.legend = FALSE) +
  labs(
    title = "Average price of wine",
    x = NULL,
    y = "Price in USD"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The countries that produce the most expensive wines are the US, France, England, Hungary, and Luxembourg, with average prices ranging from 40 to 50 USD. On the other hand, the cheapest wine producers are Montenegro, Lithuania, Bulgaria, Bosnia, and Ukraine, where wines typically cost between 10 and 15 USD.
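
An average can be misleading for countries that have only a handful of reviews. As a rough sketch (the toy data and the min_reviews threshold are assumptions made for illustration, not part of the original analysis), the summary can carry the number of reviews and the median price alongside the mean:

```r
library(tidyverse)

# Hypothetical toy data (values are made up)
toy_wine <- tibble(
  country = c("US", "US", "US", "France", "France", "Ukraine"),
  price   = c(40, 60, 50, 30, 50, 12)
)

# Hypothetical cutoff: ignore countries with fewer reviews than this
min_reviews <- 2

country_summary <- toy_wine |>
  group_by(country) |>
  summarize(
    n_reviews = n(),
    avr_price = mean(price),
    med_price = median(price)
  ) |>
  filter(n_reviews >= min_reviews) |>
  arrange(desc(avr_price))

print(country_summary)
```

On the full dataset, a suitable threshold would depend on how the reviews are distributed across countries.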

Correlation between Price and Quality

Another interesting question to explore is the correlation between price and quality. Are more expensive wines actually better than cheaper ones? To find out, let’s visualize the relationship between prices and ratings (points).

wine_data |>
  filter(price < 500) |>
  group_by(price) |>
  summarize(avr_rating = mean(points)) |>
  ggplot(aes(price, avr_rating)) +
  geom_point(alpha = 0.5) +
  geom_smooth(color = "red") +
  labs(
    title = "Price vs. Quality",
    x = "Price (USD)",
    y = "Average rating (Points)"
  ) +
  theme_minimal()

The plot shows a clear positive association between price and average rating: more expensive wines tend to score higher. The relationship is strongest in the 0 to 100 USD range, where average ratings climb steeply with price. Beyond 100 USD, average ratings level off, and higher prices bring only a slight improvement.
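
The visual impression can be backed by a number. The sketch below, run on made-up values (toy_wine is hypothetical), computes three correlation coefficients with base R's cor(): Pearson on the raw price, Pearson on the log of the price, and Spearman's rank correlation. A relationship that flattens out at high prices, like the one in the plot, is usually better captured by the last two.

```r
library(tidyverse)

# Hypothetical toy data: ratings rise quickly at low prices, then level off
toy_wine <- tibble(
  price  = c(10, 15, 20, 40, 80, 150, 300),
  points = c(84, 85, 87, 89, 91, 92, 93)
)

# Pearson correlation on the raw price (assumes a linear relationship)
pearson_raw <- cor(toy_wine$price, toy_wine$points)

# Pearson correlation on log10(price), which straightens the curve
pearson_log <- cor(log10(toy_wine$price), toy_wine$points)

# Spearman rank correlation only assumes a monotonic relationship
spearman <- cor(toy_wine$price, toy_wine$points, method = "spearman")

print(c(pearson_raw = pearson_raw, pearson_log = pearson_log, spearman = spearman))
```

The same three calls could be applied to the price and points columns of the cleaned dataset.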

Effect of Flavor Notes on Price and Quality

Wine can have a wide variety of flavors, described using different flavor notes. To determine which flavor notes are the most desirable, we identified 24 of the most commonly used adjectives to describe wine flavors. You can find the article with this information here.

First, let’s see if these adjectives are actually used in the wine reviews and which ones are more common than others. We’ll plot the frequency of each adjective mentioned in the description section of the reviews.

# Create a vector with the adjectives
adjectives <- c("Acidic", "Astringent", "Barnyard", "Buttery", "Chewy",
                "Earthy", "Flabby", "Fruity", "Fruit-forward",
                "Herbaceous", "Jammy", "Juicy", "Musty", "Nutty",
                "Oaky", "Opulent", "Perfumed", "Racy", "Spicy",
                "Supple", "Tannic", "Toasty", "Vegetal", "Velvety")

# Create a regex pattern to search for adjectives
pattern <- paste0("(?i)", paste(adjectives, collapse = "|"))

# Create a data frame with wine reviews where those adjectives were used
wine_with_adj <- wine_data |> 
  mutate(
    adjective = factor(tolower(str_extract(description, pattern)))
    ) |>
  filter(!is.na(adjective))

# Plot the frequency of each adjective
wine_with_adj |> group_by(adjective) |>
  summarise(count = n()) |>
  ggplot(aes(count, fct_reorder(adjective, count, .desc = TRUE), 
             fill = adjective)) +
  geom_col(color = "black", width = 0.8, show.legend = FALSE) +
  labs(
    title = "Adjective Frequency in Wine Reviews",
    y = NULL,
    x = "Number of times mentioned"
  ) +
  theme_minimal()

We can see that most of these adjectives are widely used among wine experts. “Fruity,” “Spicy,” “Tannic,” and “Juicy” appear in over 5,000 reviews, making them very common. On the other hand, adjectives like “Musty,” “Flabby,” and “Fruit-forward” are relatively rare.
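
Two details of the matching are worth keeping in mind. str_extract() returns only the first match in each description, so a review that mentions several adjectives is counted once. And without word boundaries, a short adjective such as "racy" can also match inside a longer word such as "accuracy". The following sketch shows one possible variant on made-up reviews (toy_reviews, pattern_wb, and the shortened adjective list are hypothetical): it wraps the alternatives in \b anchors and uses str_extract_all() to count every mention.

```r
library(tidyverse)

# Shortened, lowercase adjective list for illustration
adjectives <- c("fruity", "spicy", "racy", "juicy")

# \\b anchors restrict matches to whole words; ignore_case replaces (?i)
pattern_wb <- regex(
  paste0("\\b(", paste(adjectives, collapse = "|"), ")\\b"),
  ignore_case = TRUE
)

# Hypothetical reviews (made up)
toy_reviews <- tibble(
  id = 1:3,
  description = c(
    "Fruity and spicy, with a juicy finish.",
    "Made with great accuracy, but not much flavor.",
    "Spicy nose; the palate is racy."
  )
)

# str_extract_all() returns every match per review as a list column;
# unnest() turns it into one row per mention and drops reviews with no match
all_mentions <- toy_reviews |>
  mutate(adjective = str_extract_all(description, pattern_wb)) |>
  unnest(adjective) |>
  mutate(adjective = tolower(adjective)) |>
  count(adjective, sort = TRUE)

print(all_mentions)

# The unanchored pattern matches inside "accuracy"; the anchored one does not
unanchored_hit <- str_detect("great accuracy", regex("racy", ignore_case = TRUE))
anchored_hit <- str_detect("great accuracy", pattern_wb)
```

Counting every mention changes what the bars mean: they become the number of mentions rather than the number of reviews, so the two versions are not directly comparable.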

Now that we know wine experts frequently use these words, let’s explore which flavor notes are associated with more sophisticated wines. To visualize this, we’ll plot the average price and rating for each adjective.

# Plot average price and rating depending on a flavor note
wine_with_adj |> group_by(adjective) |>
  summarise(avr_price = mean(price), avr_points = mean(points)) |>
  ggplot(aes(avr_price, avr_points)) +
  geom_point(size = 3) +
  geom_label_repel(aes(label = adjective),
                   size = 2.5, color = "black", fill = "lightyellow", 
                   label.size = 0.25, label.padding = 0.25, 
                   box.padding = 0.25) +
  labs(
    title = "Average Price and Rating of Wine by Flavor Note",
    x = "Average Price (USD)",
    y = "Average Rating (Points)"
  ) +
  theme_minimal()

The clear winner among the flavor notes is “opulent,” which is often associated with the most sophisticated wines. Opulent wines not only have the highest prices but also receive the best ratings. Close contenders are “velvety” and “astringent.” Both velvety and astringent wines are also quite expensive, with velvety wines having higher ratings. In contrast, the cheapest and lowest-quality wines are described as “musty,” “flabby,” and “vegetal.”

Berry Notes

While reviewing the descriptions, we noticed that wine is often described as having various berry flavor notes. Let’s conduct a similar analysis to determine which berry flavor is considered the most “expensive.”

The regular expression we used for matching berries is: (?i)\b\w+erry\b. It matches, case-insensitively, any whole word that ends in "erry".
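
To get a feel for what this pattern captures, here is a small check on a few made-up sentences (toy_descriptions is hypothetical). It shows that the pattern also picks up words such as "cherry" and "sherry", which end in "erry" without being berries, and that str_extract() keeps only the first match per description.

```r
library(tidyverse)

berry_pattern <- "(?i)\\b\\w+erry\\b"

# Hypothetical descriptions (made up)
toy_descriptions <- c(
  "Ripe Blackberry and cassis.",
  "Bright cherry with a hint of sherry.",
  "Mixed berry fruit.",
  "Raspberry, then strawberry on the finish.",
  "Citrus and stone fruit."
)

# First match per description, lowercased (NA when nothing matches)
first_match <- tolower(str_extract(toy_descriptions, berry_pattern))

# All matches per description, as a list of character vectors
all_matches <- str_extract_all(toy_descriptions, berry_pattern)

print(first_match)
print(all_matches)
```

In the fourth sentence only "raspberry" is returned as the first match, so the "strawberry" mention would not be counted by an analysis based on str_extract().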

# A data frame with wines that were described with some berry
berries <- wine_data |>
  filter(str_detect(description, "(?i)\\b\\w+erry\\b")) |>
  mutate(berry = tolower(str_extract(description, "(?i)\\b\\w+erry\\b")))

# The 6 most frequent berry terms in the descriptions (excluding plain "berry")
topberries <- berries |> group_by(berry) |>
  summarize(count = n()) |> 
  arrange(desc(count)) |>
  filter(berry != "berry") |>
  slice_head(n = 6)

# Plot the price of wine with different berry notes
berries |> 
  filter(berry %in% topberries$berry) |>
  group_by(berry)|>
  summarize(avr_price = mean(price)) |>
  ggplot(aes(fct_reorder(berry, avr_price, .desc = TRUE), avr_price,
             fill = berry)) +
  geom_col(color = "black", show.legend = FALSE) +
  coord_cartesian(ylim = c(25, 45)) +
  labs(
    title = "Average Price of Wine with Different Berry Notes",
    y = "Average price (USD)",
    x = NULL
  ) +
  geom_text(aes(label = berry), color = "black", 
            position = position_stack(vjust = 0.9),
            size = 3.5,
            alpha = 0.8) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

Wines with blackberry flavor tend to be the most expensive, with an average price of around $42. In contrast, strawberry-flavored wines are the cheapest, averaging $32.

Conclusion

This project explored the relationships between wine characteristics, including price, quality, and flavor notes, based on a dataset of over 150,000 wine reviews. By examining the average price of wine by country, we uncovered interesting trends in global wine pricing. We also explored the correlation between price and quality, revealing that while more expensive wines tend to have higher ratings, the relationship is less pronounced at higher price points. Further, our analysis of flavor notes showed that descriptors like “opulent,” “velvety,” and “astringent” are linked to higher-priced and higher-rated wines, while terms like “musty” and “flabby” correlate with lower-quality wines. Additionally, berry flavor notes such as “blackberry” were associated with the most expensive wines. Overall, this exploration provides a comprehensive look at the factors that influence wine preferences and their market value, offering useful insights for both wine enthusiasts and producers.
