This post follows directly on from my last, “A First Look at Visualising Data With R and ggplot2”, so if you are new to ggplot2 check that one out first!

Today we continue with the Palmer Archipelago penguins dataset, this time creating scatter plots.

As before, the dataset can be downloaded (in CSV format) from Kaggle or loaded straight into R with the palmerpenguins package.

Artwork by @allison_horst

Importing the Data

As before, we need the ggthemes and ggplot2 packages. This time we also use the dplyr package for data manipulation and the ggExtra package for plotting marginal distributions.

library(dplyr)
library(ggplot2)
library(ggthemes)
library(ggExtra)

penguins <- read.csv("penguins_size.csv")
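If you would rather not download the CSV, the palmerpenguins route might look like the sketch below, used in place of read.csv(). It assumes the package is installed. Its penguins data names the bill columns bill_length_mm and bill_depth_mm and carries an extra year column, so they are adjusted here to match the culmen_* names used in the rest of this post. In the package version sex is recorded in lower case and the three categorical columns are already factors.

```r
library(dplyr)
library(palmerpenguins)

# Assumption: palmerpenguins is installed. Its `penguins` data uses bill_*
# column names, so rename them to the culmen_* names used in this post.
penguins <- palmerpenguins::penguins |>
  rename(culmen_length_mm = bill_length_mm,
         culmen_depth_mm  = bill_depth_mm) |>
  select(-year)   # the CSV used in this post has no year column

names(penguins)
```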

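Next we convert the ‘species’, ‘island’ and ‘sex’ variables to factors. We then drop rows with missing values, capitalise those three column names and remove any rows where sex is recorded as ‘.’.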

penguins[c('species', 'island', 'sex')] <- 
  lapply(penguins[c('species', 'island', 'sex')], as.factor)
penguins <- penguins |> na.omit() |>
  rename('Island' = 'island', 'Species' = 'species', 'Sex' = 'sex') |>
  filter(!(Sex == "."))
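To see what each step of that pipeline does, here is a small sketch on a made-up data frame (the toy rows are invented for illustration, not taken from the dataset). It also adds droplevels(), an optional extra that is not used above: filtering out the "." rows still leaves "." behind as an unused factor level.

```r
library(dplyr)

# Hypothetical rows mimicking the shape of the CSV
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  sex     = c("MALE", NA, ".", "FEMALE"),
  culmen_length_mm = c(39.1, 46.1, 44.5, 46.5)
)

toy[c("species", "island", "sex")] <-
  lapply(toy[c("species", "island", "sex")], as.factor)

cleaned <- toy |>
  na.omit() |>                # drops the row with a missing sex
  rename(Island = island, Species = species, Sex = sex) |>
  filter(Sex != ".") |>       # drops the row recorded as "."
  droplevels()                # optional: removes the now-unused "." level

nrow(cleaned)        # 2
levels(cleaned$Sex)  # "FEMALE" "MALE"
```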

Simple Scatter

To create a basic scatter plot we can use geom_point().

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point()

Basic scatter

Making Improvements

We can improve this plot by adding labels, colour and a theme.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point(shape = 16, colour = "#FF4F00")+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)", 
      title = "Culmen Length vs. Depth in Penguins in the Palmer Archipelago") +
  theme_hc() +
  geom_rangeframe()

Improved scatter

Looking at our plot, there appear to be some distinct clusters. Before investigating these, let’s add a line of best fit with geom_smooth(method="lm") and compute the overall correlation between the two variables.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
  geom_point(shape = 16, colour = "#FF4F00")+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth in Penguins in the Palmer Archipelago") +
  geom_smooth(method="lm")+
  theme_economist_white()+
  geom_rangeframe()

print(cor(penguins$culmen_depth_mm, penguins$culmen_length_mm))

Trend scatter

Output:

-0.2286256
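For reference, cor() returns Pearson's correlation coefficient by default, and the slope of the straight line fitted by lm() (which is what geom_smooth(method = "lm") draws) is that coefficient rescaled by the ratio of the standard deviations. The sketch below checks both facts on a few made-up numbers; x and y are illustrative values, not penguin measurements.

```r
# Made-up values purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r computed from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Least-squares slope from a simple linear model
slope <- unname(coef(lm(y ~ x))["x"])

all.equal(r_manual, cor(x, y))               # TRUE
all.equal(slope, r_manual * sd(y) / sd(x))   # TRUE
```

Because the slope and the correlation always share a sign, a negative overall correlation goes hand in hand with a downward-sloping fitted line.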

Marginal Plots

We can add marginal plots with ggMarginal() from the ggExtra package to show the distribution of each variable alongside the scatter plot.

plot <- penguins |>
        ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm))+
        geom_point(shape = 16, colour = "#FF4F00")+
        labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
            title = "Culmen Length vs. Depth in Penguins in the Palmer Archipelago") +
        geom_smooth(method="lm")+
        theme_calc()

ggMarginal(plot, type="histogram", fill = "#FF4F00", size=5, bins = 12)

Marginal histogram

ggMarginal(plot, type="boxplot", fill = "#FF4F00", size=15)

Marginal boxplot

ggMarginal(plot, type="density", fill = "#FF4F00", size=10)

Marginal density
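ggMarginal() can also split the marginal plots by group when the scatter plot maps colour to a variable, via its groupColour and groupFill arguments. A minimal sketch, using the built-in iris data as a stand-in so that it runs on its own:

```r
library(ggplot2)
library(ggExtra)

# iris is used only as a stand-in dataset for illustration
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  theme(legend.position = "bottom")

# One marginal density per species, coloured to match the points
m <- ggMarginal(p, type = "density", groupColour = TRUE, groupFill = TRUE)
m
```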

Investigating Clustering

To investigate the clusters we can add colour = Species to aes.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Species))+
  geom_point(shape = 16)+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Species") +
  geom_smooth(method="lm")+
  scale_colour_few()+
  theme_few()

Species clusters
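The difference from the earlier plots is that colour now sits inside aes(), so it is mapped to a column and gets a legend, rather than being set to one fixed value in geom_point(). A small sketch of the two forms, using the built-in iris data as a stand-in:

```r
library(ggplot2)

# Setting: every point gets the same fixed colour, and no legend is drawn
p_set <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(colour = "#FF4F00")

# Mapping: colour varies with a column, and a legend is added.
# The discrete colour mapping also groups the data, so geom_smooth()
# fits one line per species instead of one line overall.
p_map <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm")

p_set
p_map
```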

correlation <- penguins |>
  group_by(Species) |>
  summarise(correlation = cor(culmen_length_mm, culmen_depth_mm))

print(correlation)

Output:

Species    correlation
Adelie     0.3858132
Chinstrap  0.6535362
Gentoo     0.6540233
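The overall correlation was negative, yet within each species it is positive. A likely reason is that the species differ in their typical culmen length and depth, so pooling them mixes differences between groups with the trend inside each group (an effect often called Simpson's paradox). A toy sketch with two invented groups shows the same pattern:

```r
library(dplyr)

# Invented data: within each group y rises with x,
# but group B sits at higher x and lower y than group A
toy <- data.frame(
  group = rep(c("A", "B"), each = 3),
  x = c(1, 2, 3, 6, 7, 8),
  y = c(7, 8, 9, 1, 2, 3)
)

# Pooled correlation is negative
cor(toy$x, toy$y)

# Within each group the correlation is positive
by_group <- toy |>
  group_by(group) |>
  summarise(correlation = cor(x, y))

print(by_group)
```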

We can colour the points by Island in the same way.

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Island))+
  geom_point(shape = 16)+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Island") +
  geom_smooth(method="lm")+
  theme_foundation()

Island clusters

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Sex") +
  geom_smooth(method="lm")+
  theme_solarized()

Sex plot

penguins |>
  ggplot(aes(x = culmen_length_mm, y = culmen_depth_mm, colour = Sex, shape=Species))+
  geom_point()+
  labs(x = "Culmen Length (mm)", y = "Culmen Depth (mm)",
      title = "Culmen Length vs. Depth by Sex and Species") +
  geom_smooth(method="lm")+
  theme_excel_new()

Sex and species

Facet Plots

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Species))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_grid(Island~Species, scales="free", space="free_x") + 
  labs(x="Culmen Length (mm)", y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species and Island")+
  theme_base()

Facet by island and species
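In facet_grid() the formula reads rows ~ columns, and every row and column combination gets a panel even if no data fall in it. Not every species is found on every island in this dataset, so some panels in the grid above are empty. facet_wrap() is one alternative that only draws the combinations that actually occur. A sketch on invented data (the toy values and column names are purely illustrative):

```r
library(ggplot2)

# Invented data: species "b" never occurs on island "Y"
toy <- data.frame(
  island  = c("X", "X", "X", "X", "Y", "Y"),
  species = c("a", "a", "b", "b", "a", "a"),
  length  = c(1, 2, 3, 4, 1.5, 2.5),
  depth   = c(2, 3, 1, 2, 2.5, 3.5)
)

base <- ggplot(toy, aes(x = length, y = depth)) +
  geom_point()

# Rows are islands, columns are species: a 2 x 2 grid with one empty panel
grid_plot <- base + facet_grid(island ~ species)

# Only the three island/species combinations present in the data are drawn
wrap_plot <- base + facet_wrap(~ island + species)

grid_plot
wrap_plot
```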

ggplot(penguins, aes(x=culmen_length_mm, y = culmen_depth_mm, colour = Sex))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_grid(Sex~Species, scales="free", space="free_x") + 
  labs(x="Culmen Length (mm)", y="Culmen Depth (mm)",
       title="Culmen Length vs Depth by Species and Sex")+
  theme_stata()

Facet by sex and species

Conclusion

I hope that this post has been a useful introduction to scatter plots with ggplot2. Why not try investigating the relationships between some of the other numeric variables for this data such as body mass or flipper length?