Scatterplot with labels and text repel

  • A custom scatterplot with auto-positioned labels to explore the palmerpenguins dataset made with Python and Matplotlib

Load libraries

Today's visualization is made with matplotlib and the library adjustText to automatically adjust the position of labels in the plot. In addition, load_penguins() is imported from the palmerpenguins library to have access to the popular palmerpenguins dataset, and the Faker class is imported from the faker library to generate random names.

Note: the adjustText library can be installed with pip install adjustText

# !pip install --upgrade setuptools
# !pip install adjustText faker palmerpenguins

import matplotlib.pyplot as plt
import numpy as np

from adjustText import adjust_text
from faker import Faker
from palmerpenguins import load_penguins

Load and prepare data

The dataset used today was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. It was popularized by Allison Horst in her R package palmerpenguins, with the goal of offering an alternative to the iris dataset for data exploration and visualization.

penguins = load_penguins()
penguins.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

Today's visualization does not require much data preparation. The only step is to drop observations with missing values.

penguins = penguins.dropna()
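To see what dropna() removes, it can help to compare row counts before and after. The sketch below uses a tiny hand-made data frame (an illustrative stand-in, not the real dataset) with one incomplete row:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data frame (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, 39.5, np.nan, 36.7],
    "sex": ["male", "female", np.nan, "female"],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
clean = toy.dropna()
print(f"{len(toy) - len(clean)} row(s) dropped, {len(clean)} kept")
```

Note that the index keeps its original labels after dropna(), which is one reason to use positional access such as .iloc when looping over rows by position later on.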

Basic Scatterplot

Let's start by defining some important values that are used throughout the plot.

# Colors
BG_WHITE = "#fbf9f4"
GREY_LIGHT = "#b4aea9"
GREY50 = "#7F7F7F"
GREY30 = "#4d4d4d"
BLUE_DARK = "#1B2838"
BLUE = "#2a475e"
COLORS = ["#386cb0", "#fdb462", "#7fc97f" ] # A color for each species

# A list with the species names
SPECIES = sorted(penguins["species"].unique())

# Horizontal lines
HLINES = [40, 50, 60]

# Vertical and horizontal tick marks
YTICKS = [40, 50, 60]
XTICKS = [160, 170, 180, 190, 200, 210, 220, 230, 240]

# List of random names
# Create a Faker instance and seed it so the random names are reproducible
faker = Faker()
faker.seed_instance(11)
NAMES = [faker.first_name() for _ in range(len(penguins))]

# The markers we use in the scatterplot
MARKERS = ["o", "^", "s"] # circle, triangle, square
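Because SPECIES is sorted alphabetically, the position of each species decides which color and marker it gets once the three lists are zipped together. Here is a small sketch of that pairing, using a hand-written species column (an assumption for illustration) instead of the real data:

```python
import pandas as pd

# Hand-written stand-in for penguins["species"] (order deliberately shuffled)
species_column = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])

SPECIES = sorted(species_column.unique())
COLORS = ["#386cb0", "#fdb462", "#7fc97f"]
MARKERS = ["o", "^", "s"]

# zip() pairs items by position: first species with first color and marker, etc.
style_by_species = {
    species: (color, marker)
    for species, color, marker in zip(SPECIES, COLORS, MARKERS)
}
for species, (color, marker) in style_by_species.items():
    print(species, color, marker)
```

Keep in mind that zip() silently stops at the shortest input, so it is worth checking that the three lists have the same length.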

Then, let's initialize the chart layout, add a background color and the basic scatterplot:

# Initialize layout ----------------------------------------------
fig, ax = plt.subplots(figsize=(14, 10))

# Background color
fig.patch.set_facecolor(BG_WHITE)
ax.set_facecolor(BG_WHITE)

# Horizontal lines in the background
for h in HLINES:
    ax.axhline(h, color=GREY50, ls=(0, (5, 5)), alpha=0.8, zorder=0)

# Add scatterplot -----------------------------------------------
# Loop through species, colors, and markers.
# Each species receives a different combination of color and marker.
for species, color, marker in zip(SPECIES, COLORS, MARKERS):
    data = penguins[penguins["species"] == species]
    ax.scatter(
        "flipper_length_mm", "bill_length_mm", s=50, color=color,
        marker=marker, alpha=0.8, data=data
    )
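The scatter call passes column names as strings together with data=, and lets Matplotlib look the columns up. As a hedged illustration with made-up values (the toy frame and the Agg backend are assumptions so the sketch runs outside a notebook), the string form should draw the same points as passing the columns directly:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up measurements, only for illustration
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 195.0],
    "bill_length_mm": [39.1, 39.5, 40.3],
})

fig, ax = plt.subplots()

# Form used in this post: column names plus the data= keyword
by_name = ax.scatter("flipper_length_mm", "bill_length_mm", data=toy)

# Equivalent form: pass the columns themselves
by_column = ax.scatter(toy["flipper_length_mm"], toy["bill_length_mm"])

# Both calls place markers at the same (x, y) positions
same_points = np.allclose(by_name.get_offsets(), by_column.get_offsets())
print(same_points)
```

When data= is used without an explicit label=, Matplotlib appears to derive a default label from the y column, which would explain why a later ax.legend() call finds one entry per scatter call.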


That's a pretty good start, but let's make it better!

Add labels with no overlap

What's truly missing here are labels. It's very frustrating not knowing which item is hidden under a data point, isn't it!?

It is pretty challenging to add many labels on a plot since labels tend to overlap each other, making the figure unreadable. Fortunately, the adjustText package is here to help us. It provides an algorithm that will automatically place the labels for us. Let's do it!

# Add name labels ------------------------------------------------
# Only names that start with the letter "C" are added.
# `ax.text()` outputs are appended to the `TEXTS` list.
# This list is passed to `adjust_text()` to repel the labels and add arrows.
TEXTS = []
for i in range(len(penguins)):
    if NAMES[i].startswith("C"):
        x = penguins["flipper_length_mm"].iloc[i]
        y = penguins["bill_length_mm"].iloc[i]
        text = NAMES[i]
        TEXTS.append(ax.text(x, y, text, color=GREY30, fontsize=14, fontname="Arial"))


# Adjust text position and add arrows ----------------------------
# 'expand_points' is a tuple with two multipliers by which to expand
# the bounding box of texts when repelling them from points

# 'arrowprops' receives a dictionary with all the properties we want for the arrows
adjust_text(
    TEXTS,
    expand_points=(2, 2),
    arrowprops=dict(
        arrowstyle="->",
        color=GREY50,
        lw=2
    ),
    ax=fig.axes[0]
)
fig

Note the ax=fig.axes[0] argument in the adjust_text() call. It is not always necessary. Here the plot is built in several steps for demonstration purposes, so the axis object must be passed explicitly. When ax isn't passed, adjust_text() uses plt.gca() to obtain the axis. If all the code that generates your plot sits in the same chunk of code as adjust_text(), there's no need to pass any axis.
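To make the plt.gca() fallback concrete, here is a small Matplotlib-only sketch. It does not call adjust_text() itself, and the two-figure setup is just an assumption for illustration. plt.gca() returns whichever axes is current, which may not be the one holding your labels:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt

fig_a, ax_a = plt.subplots()  # the axes we actually want to annotate
fig_b, ax_b = plt.subplots()  # creating a second figure makes it the current one

# plt.gca() follows the current figure, not the one we have in mind
current = plt.gca()
print(current is ax_a)
print(current is ax_b)

# Grabbing the axes from the figure removes the ambiguity
target_ax = fig_a.axes[0]
print(target_ax is ax_a)
```

Passing the axes explicitly, as done with ax=fig.axes[0] in this post, sidesteps the question of which axes happens to be current.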

Final chart

The chart above is pretty close to being publication-ready. What's needed now is a good title, a legend to make colors and shapes more insightful, and some axis customization:

# Add title and subtitle -----------------------------------------
# Title
fig.suptitle(
    "Palmer Penguins Data Visualization",
    x=0.122,
    y=0.975,
    ha="left",
    fontsize=32,
    fontname="Helvetica",
    color=BLUE,
    weight="bold",
)

# Subtitle
ax.set_title(
    "Scatter plot of flipper length vs bill length",
    loc="left",
    ha="left",
    fontsize=24,
    fontname="Arial",
    color=BLUE_DARK,
    weight="bold",
    pad=10
)

# Add legend -----------------------------------------------------

# We store the result of `ax.legend()` for further customizations
legend = ax.legend(
    loc=(0.85, 0.025),  # bottom-right
    labelspacing=1.5,   # add space between labels
    markerscale=1.5,    # increase marker size
    frameon=False       # don't put a frame
)

# Iterate through text elements and species names
for text, species in zip(legend.get_texts(), SPECIES):
    text.set_text(species)      # Set species name
    text.set_fontname("Arial")  # Change default font
    text.set_fontsize(16)       # Change default font size

# Grab title and customize its appearance.
legend.set_title("Species")
legend_title = legend.get_title()
legend_title.set_fontname("Helvetica")
legend_title.set_fontweight("bold")
legend_title.set_fontsize(18)
legend_title.set_ha("left")

# Customize layout -----------------------------------------------
# Hide spines
ax.spines["right"].set_color("none")
ax.spines["top"].set_color("none")

# Customize spines color
ax.spines["left"].set_color(GREY_LIGHT)
ax.spines["left"].set_linewidth(2)
ax.spines["bottom"].set_color(GREY_LIGHT)
ax.spines["bottom"].set_linewidth(2)

# Customize labels and ticks
ax.tick_params(length=0)

ax.set_yticks(YTICKS)
ax.set_yticklabels(YTICKS, fontname="Helvetica", size=15)
ax.set_ylabel("Bill length (mm)", fontname="Helvetica", size=18, weight="bold")

ax.set_xticks(XTICKS)
ax.set_xticklabels(XTICKS, fontname="Helvetica", size=15)
ax.set_xlabel("Flipper length (mm)", fontname="Helvetica", size=18, weight="bold")

fig


References

This page showcases the work of Tuo Wang that introduces packages to make ggplot2 plots more beautiful. You can find the original R code on Tuo's blog here.

Thanks also to Tomás Capretto who translated this work from R to Python! 🙏🙏