As I near the end of my PhD, my advisor Dr David Kahle submitted my information to the Mathematics Genealogy Project. The MGP tracks academic lineages through dissertation advisors — like a doctoral family tree.

It was surprisingly cool to explore the academic lineage I come from. With some help from ChatGPT, I put together a script to scrape and visualize my genealogy. Most of the code came together through a series of back-and-forths with ChatGPT, so I won’t walk through every line here — but you can expand the code block below if you’re curious. In short…
Code (click to expand)
View Code
library(tidyverse)
library(rvest)
library(xml2)
library(stringr)
library(igraph)
library(ggraph)
library(memoise)
# ---------------------------
# 1) Single-Pass Page Parser
# ---------------------------
# Memoized so each page is fetched only once
raw_parse_mgp_page <- function(url) {
Sys.sleep(0.5)
message("Fetching page: ", url)
page_html <- tryCatch(read_html(url), error=function(e) NULL)
if (is.null(page_html)) {
return(list(name=NA_character_, year=NA_character_, advisors=character(0)))
}
# Name
h2_node <- html_node(page_html, "h2")
nm <- NA_character_
if (!is.null(h2_node) && !is.na(h2_node)) {
nm <- str_squish(html_text(h2_node, trim=TRUE))
}
# Year
div_node <- html_node(page_html, xpath="//div[contains(., 'Ph.D. ')]")
yr <- NA_character_
if (!is.null(div_node) && !is.na(div_node)) {
txt <- html_text(div_node, trim=TRUE)
found <- str_extract(txt, "\\b\\d{4}\\b")
if (!is.na(found)) {
yr <- found
}
}
# Advisors
adv_pars <- html_nodes(page_html, xpath="//p[contains(., 'Advisor')]")
adv_links <- html_nodes(adv_pars, "a")
advs <- character(0)
if (length(adv_links) > 0) {
hrefs <- na.omit(html_attr(adv_links, "href"))
abs_links <- vapply(hrefs, function(x) url_absolute(x, url), character(1))
pat <- "id\\.php\\?id=[0-9]+$"
advs <- abs_links[grepl(pat, abs_links)]
}
list(name=nm, year=yr, advisors=advs)
}
parse_mgp_page <- memoise(raw_parse_mgp_page)
# ---------------------------
# 2) One-level parse for secondary advisors
# ---------------------------
collect_secondary_one_level <- function(sec_url, visited, name_map, year_map) {
if (sec_url %in% visited) {
return(list(
visited=visited,
name_map=name_map,
year_map=year_map,
edges=data.frame(child=character(0), advisor=character(0))
))
}
visited <- c(visited, sec_url)
info_sec <- parse_mgp_page(sec_url)
name_map[[sec_url]] <- info_sec$name
year_map[[sec_url]] <- info_sec$year
edges_here <- data.frame(child=character(0), advisor=character(0))
if (length(info_sec$advisors)>0) {
firstA <- info_sec$advisors[1]
edges_here <- data.frame(child=sec_url, advisor=firstA, stringsAsFactors=FALSE)
if (!(firstA %in% visited)) {
visited <- c(visited, firstA)
infoA <- parse_mgp_page(firstA)
name_map[[firstA]] <- infoA$name
year_map[[firstA]] <- infoA$year
}
}
list(visited=visited, name_map=name_map, year_map=year_map, edges=edges_here)
}
# ---------------------------
# 3) collect_ancestors()
# - Full recursion on first advisor
# - partial or full recursion for secondaries
# ---------------------------
collect_ancestors <- function(start_url,
visited=NULL,
name_map=NULL,
year_map=NULL,
partial_secondary=TRUE) {
if (is.null(visited)) visited <- character(0)
if (is.null(name_map)) name_map <- list()
if (is.null(year_map)) year_map <- list()
if (start_url %in% visited) {
return(list(
visited=visited,
name_map=name_map,
year_map=year_map,
edges=data.frame(child=character(0), advisor=character(0))
))
}
visited <- c(visited, start_url)
info <- parse_mgp_page(start_url)
name_map[[start_url]] <- info$name
year_map[[start_url]] <- info$year
if (length(info$advisors)>0) {
edges_here <- data.frame(
child = rep(start_url, length(info$advisors)),
advisor = info$advisors,
stringsAsFactors=FALSE
)
} else {
edges_here <- data.frame(child=character(0), advisor=character(0))
}
all_edges <- edges_here
if (length(info$advisors) > 0) {
# fully recurse on first advisor
firstA <- info$advisors[1]
r_first <- collect_ancestors(
firstA, visited, name_map, year_map,
partial_secondary=partial_secondary
)
visited <- r_first$visited
name_map <- r_first$name_map
year_map <- r_first$year_map
all_edges <- rbind(all_edges, r_first$edges)
# secondary advisors
if (length(info$advisors) > 1) {
for (sec in info$advisors[-1]) {
if (partial_secondary) {
# parse them once
r_sec <- collect_secondary_one_level(sec, visited, name_map, year_map)
} else {
# same recursion
r_sec <- collect_ancestors(sec, visited, name_map, year_map,
partial_secondary=FALSE)
}
visited <- r_sec$visited
name_map <- r_sec$name_map
year_map <- r_sec$year_map
all_edges <- rbind(all_edges, r_sec$edges)
}
}
}
list(
visited=visited,
name_map=name_map,
year_map=year_map,
edges=all_edges
)
}
# ---------------------------
# 4) Build a single genealogical "tree" for childUrl
# => color = child's last name
# ---------------------------
get_last_name <- function(full_name) {
if (is.na(full_name) || !nzchar(full_name)) return("Unknown")
parts <- str_split(full_name, "\\s+")[[1]]
tail(parts,1)
}
build_one_tree <- function(child_url, partial_secondary=TRUE) {
# parse child's page once to get child's last name
child_info <- parse_mgp_page(child_url)
child_name <- child_info$name
child_last <- get_last_name(child_name)
if (!nzchar(child_last)) child_last <- "Unknown"
# gather
res <- collect_ancestors(child_url, partial_secondary=partial_secondary)
visited_urls <- res$visited
edges_df <- res$edges
node_df <- data.frame(
url = visited_urls,
name = sapply(visited_urls, function(u) res$name_map[[u]]),
year = sapply(visited_urls, function(u) res$year_map[[u]]),
color = child_last, # entire set => child's last name
stringsAsFactors=FALSE
)
edges_igraph <- edges_df %>% rename(from=advisor, to=child)
list(nodes=node_df, edges=edges_igraph)
}
# ---------------------------
# 5) Merge genealogies => single igraph with color mixture
# ---------------------------
merge_two_trees <- function(treeA, treeB) {
merged_nodes <- full_join(treeA$nodes, treeB$nodes, by="url", suffix=c(".A",".B"))
merged_nodes <- merged_nodes %>%
mutate(
final_name = coalesce(name.A, name.B),
final_year = coalesce(year.A, year.B),
color_combined = case_when(
!is.na(color.A) & !is.na(color.B) ~ "shared",
!is.na(color.A) ~ color.A,
!is.na(color.B) ~ color.B,
TRUE ~ "unknown"
)
) %>%
select(
url,
name = final_name,
year = final_year,
color = color_combined
)
merged_edges <- bind_rows(treeA$edges, treeB$edges) %>% distinct()
merged_edges <- merged_edges %>%
filter(from %in% merged_nodes$url, to %in% merged_nodes$url)
g <- graph_from_data_frame(merged_edges, directed=TRUE, vertices=merged_nodes)
return(g)
}
# ---------------------------
# 6) A single function that either:
# - Plots a single genealogical tree, or
# - Merges two genealogies & plots them
# ---------------------------
plotGenealogy <- function(childA_url,
childB_url=NULL,
partialA=TRUE,
partialB=TRUE) {
# If only one genealogical set => no legend needed
single_mode <- is.null(childB_url)
# Build graph
if (single_mode) {
treeA <- build_one_tree(childA_url, partial_secondary=partialA)
g <- graph_from_data_frame(treeA$edges, directed=TRUE, vertices=treeA$nodes)
} else {
# two genealogies => merge
treeA <- build_one_tree(childA_url, partial_secondary=partialA)
treeB <- build_one_tree(childB_url, partial_secondary=partialB)
g <- merge_two_trees(treeA, treeB)
}
# if "label" is missing => name+year
if (! "label" %in% vertex_attr_names(g)) {
V(g)$label <- paste0(
ifelse(is.na(V(g)$name), "", V(g)$name),
ifelse(is.na(V(g)$year), "", paste0(" (", V(g)$year, ")"))
)
}
# layered approach => 'sugiyama'
layout_sugiyama <- create_layout(g, layout="tree")
p <- ggraph(layout_sugiyama) +
geom_edge_link(
arrow=arrow(length=unit(2,"mm")),
end_cap=circle(2,"mm"),
alpha=0.6,
check_overlap=TRUE
) +
geom_node_point(aes(color=color), size=4) +
geom_node_text(aes(label=label), hjust=0, nudge_x=0.1, check_overlap=TRUE, size = 3.5) +
# geom_node_text(aes(label = label), hjust = 0, nudge_x = 0.1, size = 3.5) +
theme_void()
# Gather the color categories
node_cols <- unique(V(g)$color)
# special => "shared","unknown"
special_cols <- c("shared","unknown")
last_names <- setdiff(node_cols, special_cols)
# Build color map
color_map <- c()
i <- 1
# quick palette
color_for_name <- function(nm, i) {
if (i==1) return("red")
if (i==2) return("blue")
# fallback => others
palette <- c("green","orange","cyan","brown","hotpink","black")
if ((i-2) <= length(palette)) {
return(palette[i-2])
}
return("black")
}
for (ln in last_names) {
color_map[ln] <- color_for_name(ln, i)
i <- i+1
}
color_map["shared"] <- "purple"
color_map["unknown"] <- "grey"
# label map
label_map <- c()
for (ln in last_names) {
label_map[ln] <- ln
}
label_map["shared"] <- "shared"
label_map["unknown"] <- "unknown"
# add scale
p <- p + scale_color_manual(
name = "Genealogy",
values = color_map,
breaks = names(label_map),
labels = label_map
)
# If single genealogical set => hide the legend
if (single_mode) {
p <- p + theme(legend.position="none")
}
xvals <- layout_sugiyama$x
yvals <- layout_sugiyama$y
# compute expanded range
xmid <- mean(range(xvals))
ymid <- mean(range(yvals))
xhalf <- (max(xvals)-min(xvals))/2 * 1.5
yhalf <- (max(yvals)-min(yvals))/2 * 1.5
xlim <- c(min(xvals), xmid + xhalf)
ylim <- c(min(yvals), max(yvals))
p <- p + coord_cartesian(xlim=xlim, ylim=ylim, expand=TRUE, clip="off")
print(p)
invisible(p)
}
The script uses rvest
and xml2
to scrape advisor and graduation data from individual profiles on the MGP site. It builds a directed graph where each edge represents an advisor-advisee relationship. The graph construction is recursive: it follows the first listed advisor all the way up the tree, while secondary advisors are optionally included just one level deep to avoid visual clutter.
To manage multiple fetches efficiently, the scraper is memoized with memoise
, and lineage data is tracked using URL-based mappings for names and years. The graph is visualized with igraph
and ggraph
, and each node is colored by genealogical origin — for example, red for one person’s tree, blue for another, and purple for shared nodes. The plotting layout uses a tree for clean hierarchy and merge points.
Dr David Kahle’s Lineage
Here’s a look at my advisor’s academic lineage. I followed the first advisor all the way up the tree, but limited second advisors to just one level to keep the graph from getting too cluttered.
Kahle <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=159440"
plotGenealogy(Kahle)
Dr Rodney Sturdivant’s Lineage
And here’s a look at my co-advisor Dr. Sturdivant, who also has an interesting but separate academic tree.
Sturdivant <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=34072"
plotGenealogy(Sturdivant)
We can see some big names in both trees.
Comparing Genealogies
These genealogies don’t overlap…
plotGenealogy(childA_url = Kahle, childB_url = Sturdivant)
…but it’s fascinating when they do — especially across countries, centuries, and disciplines. Here’s a look at Dr Kahle with Dr Hadley Wickham - coauthors of the ggmap package.
Wickham <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=145799"
plotGenealogy(childA_url = Kahle, childB_url = Wickham)
For good measure, though messy, I show Dr Kahle’s full tree by fully recursing through all advisors by setting partialA = FALSE
.
plotGenealogy(Kahle, partialA = FALSE)
Final Thoughts
This turned out to be a fun little side project — part historical research, part data wrangling. The Mathematics Genealogy Project is a neat way to explore where your academic lineage comes from. If you’re curious, it’s worth poking around — you might find some unexpected connections.