Mathematical Genealogy

As I near the end of my PhD, my advisor Dr David Kahle submitted my information to the Mathematics Genealogy Project. The MGP tracks academic lineages through dissertation advisors — like a doctoral family tree.

_{Academic lineage via the Mathematics Genealogy Project.}

It was surprisingly cool to explore the academic lineage I come from. With some help from ChatGPT, I put together a script to scrape and visualize my genealogy. Most of the code came together through a series of back-and-forths with ChatGPT, so I won’t walk through every line here — but you can expand the code block below if you’re curious. In short…

Code (click to expand)

View Code

library(tidyverse)  
library(rvest)
library(xml2)
library(stringr)
library(igraph)
library(ggraph)
library(memoise)

# ---------------------------
# 1) Single-Pass Page Parser
# ---------------------------
# Memoized so each page is fetched only once
raw_parse_mgp_page <- function(url) {
  Sys.sleep(0.5)
  message("Fetching page: ", url)
  
  page_html <- tryCatch(read_html(url), error=function(e) NULL)
  if (is.null(page_html)) {
    return(list(name=NA_character_, year=NA_character_, advisors=character(0)))
  }
  
  # Name
  h2_node <- html_node(page_html, "h2")
  nm <- NA_character_
  if (!is.null(h2_node) && !is.na(h2_node)) {
    nm <- str_squish(html_text(h2_node, trim=TRUE))
  }
  
  # Year
  div_node <- html_node(page_html, xpath="//div[contains(., 'Ph.D. ')]")
  yr <- NA_character_
  if (!is.null(div_node) && !is.na(div_node)) {
    txt <- html_text(div_node, trim=TRUE)
    found <- str_extract(txt, "\\b\\d{4}\\b")
    if (!is.na(found)) {
      yr <- found
    }
  }
  
  # Advisors
  adv_pars  <- html_nodes(page_html, xpath="//p[contains(., 'Advisor')]")
  adv_links <- html_nodes(adv_pars, "a")
  advs <- character(0)
  if (length(adv_links) > 0) {
    hrefs     <- na.omit(html_attr(adv_links, "href"))
    abs_links <- vapply(hrefs, function(x) url_absolute(x, url), character(1))
    pat       <- "id\\.php\\?id=[0-9]+$"
    advs      <- abs_links[grepl(pat, abs_links)]
  }
  
  list(name=nm, year=yr, advisors=advs)
}
parse_mgp_page <- memoise(raw_parse_mgp_page)

# ---------------------------
# 2) One-level parse for secondary advisors
# ---------------------------
collect_secondary_one_level <- function(sec_url, visited, name_map, year_map) {
  if (sec_url %in% visited) {
    return(list(
      visited=visited,
      name_map=name_map,
      year_map=year_map,
      edges=data.frame(child=character(0), advisor=character(0))
    ))
  }
  visited <- c(visited, sec_url)
  
  info_sec <- parse_mgp_page(sec_url)
  name_map[[sec_url]] <- info_sec$name
  year_map[[sec_url]] <- info_sec$year
  
  edges_here <- data.frame(child=character(0), advisor=character(0))
  if (length(info_sec$advisors)>0) {
    firstA <- info_sec$advisors[1]
    edges_here <- data.frame(child=sec_url, advisor=firstA, stringsAsFactors=FALSE)
    
    if (!(firstA %in% visited)) {
      visited <- c(visited, firstA)
      infoA <- parse_mgp_page(firstA)
      name_map[[firstA]] <- infoA$name
      year_map[[firstA]] <- infoA$year
    }
  }
  
  list(visited=visited, name_map=name_map, year_map=year_map, edges=edges_here)
}

# ---------------------------
# 3) collect_ancestors()
#    - Full recursion on first advisor
#    - partial or full recursion for secondaries
# ---------------------------
collect_ancestors <- function(start_url,
                              visited=NULL,
                              name_map=NULL,
                              year_map=NULL,
                              partial_secondary=TRUE) {
  if (is.null(visited))  visited  <- character(0)
  if (is.null(name_map)) name_map <- list()
  if (is.null(year_map)) year_map <- list()
  
  if (start_url %in% visited) {
    return(list(
      visited=visited,
      name_map=name_map,
      year_map=year_map,
      edges=data.frame(child=character(0), advisor=character(0))
    ))
  }
  visited <- c(visited, start_url)
  
  info <- parse_mgp_page(start_url)
  name_map[[start_url]] <- info$name
  year_map[[start_url]] <- info$year
  
  if (length(info$advisors)>0) {
    edges_here <- data.frame(
      child   = rep(start_url, length(info$advisors)),
      advisor = info$advisors,
      stringsAsFactors=FALSE
    )
  } else {
    edges_here <- data.frame(child=character(0), advisor=character(0))
  }
  all_edges <- edges_here
  
  if (length(info$advisors) > 0) {
    # fully recurse on first advisor
    firstA <- info$advisors[1]
    r_first <- collect_ancestors(
      firstA, visited, name_map, year_map,
      partial_secondary=partial_secondary
    )
    visited   <- r_first$visited
    name_map  <- r_first$name_map
    year_map  <- r_first$year_map
    all_edges <- rbind(all_edges, r_first$edges)
    
    # secondary advisors
    if (length(info$advisors) > 1) {
      for (sec in info$advisors[-1]) {
        if (partial_secondary) {
          # parse them once
          r_sec <- collect_secondary_one_level(sec, visited, name_map, year_map)
        } else {
          # same recursion
          r_sec <- collect_ancestors(sec, visited, name_map, year_map,
                                     partial_secondary=FALSE)
        }
        visited   <- r_sec$visited
        name_map  <- r_sec$name_map
        year_map  <- r_sec$year_map
        all_edges <- rbind(all_edges, r_sec$edges)
      }
    }
  }
  
  list(
    visited=visited,
    name_map=name_map,
    year_map=year_map,
    edges=all_edges
  )
}

# ---------------------------
# 4) Build a single genealogical "tree" for childUrl
#    => color = child's last name
# ---------------------------
get_last_name <- function(full_name) {
  if (is.na(full_name) || !nzchar(full_name)) return("Unknown")
  parts <- str_split(full_name, "\\s+")[[1]]
  tail(parts,1)
}

build_one_tree <- function(child_url, partial_secondary=TRUE) {
  # parse child's page once to get child's last name
  child_info   <- parse_mgp_page(child_url)
  child_name   <- child_info$name
  child_last   <- get_last_name(child_name)
  if (!nzchar(child_last)) child_last <- "Unknown"
  
  # gather
  res <- collect_ancestors(child_url, partial_secondary=partial_secondary)
  
  visited_urls <- res$visited
  edges_df     <- res$edges
  
  node_df <- data.frame(
    url   = visited_urls,
    name  = sapply(visited_urls, function(u) res$name_map[[u]]),
    year  = sapply(visited_urls, function(u) res$year_map[[u]]),
    color = child_last,   # entire set => child's last name
    stringsAsFactors=FALSE
  )
  edges_igraph <- edges_df %>% rename(from=advisor, to=child)
  
  list(nodes=node_df, edges=edges_igraph)
}

# ---------------------------
# 5) Merge genealogies => single igraph with color mixture
# ---------------------------
merge_two_trees <- function(treeA, treeB) {
  merged_nodes <- full_join(treeA$nodes, treeB$nodes, by="url", suffix=c(".A",".B"))
  merged_nodes <- merged_nodes %>%
    mutate(
      final_name = coalesce(name.A, name.B),
      final_year = coalesce(year.A, year.B),
      color_combined = case_when(
        !is.na(color.A) & !is.na(color.B) ~ "shared",
        !is.na(color.A)                   ~ color.A,
        !is.na(color.B)                   ~ color.B,
        TRUE                              ~ "unknown"
      )
    ) %>%
    select(
      url,
      name  = final_name,
      year  = final_year,
      color = color_combined
    )
  
  merged_edges <- bind_rows(treeA$edges, treeB$edges) %>% distinct()
  merged_edges <- merged_edges %>%
    filter(from %in% merged_nodes$url, to %in% merged_nodes$url)
  
  g <- graph_from_data_frame(merged_edges, directed=TRUE, vertices=merged_nodes)
  return(g)
}

# ---------------------------
# 6) A single function that either:
#    - Plots a single genealogical tree, or
#    - Merges two genealogies & plots them
# ---------------------------
plotGenealogy <- function(childA_url,
                          childB_url=NULL,
                          partialA=TRUE,
                          partialB=TRUE) {
  # If only one genealogical set => no legend needed
  single_mode <- is.null(childB_url)

  # Build graph
  if (single_mode) {
    treeA <- build_one_tree(childA_url, partial_secondary=partialA)
    g <- graph_from_data_frame(treeA$edges, directed=TRUE, vertices=treeA$nodes)
  } else {
    # two genealogies => merge
    treeA <- build_one_tree(childA_url, partial_secondary=partialA)
    treeB <- build_one_tree(childB_url, partial_secondary=partialB)
    g     <- merge_two_trees(treeA, treeB)
  }

  # if "label" is missing => name+year
  if (! "label" %in% vertex_attr_names(g)) {
    V(g)$label <- paste0(
      ifelse(is.na(V(g)$name), "", V(g)$name),
      ifelse(is.na(V(g)$year), "", paste0(" (", V(g)$year, ")"))
    )
  }

  # layered approach => 'sugiyama'
  layout_sugiyama <- create_layout(g, layout="tree")

  p <- ggraph(layout_sugiyama) +
    geom_edge_link(
      arrow=arrow(length=unit(2,"mm")),
      end_cap=circle(2,"mm"),
      alpha=0.6,
      check_overlap=TRUE
    ) +
    geom_node_point(aes(color=color), size=4) +
    geom_node_text(aes(label=label), hjust=0, nudge_x=0.1, check_overlap=TRUE, size = 3.5) +
    # geom_node_text(aes(label = label), hjust = 0, nudge_x = 0.1, size = 3.5) +

    theme_void()

  # Gather the color categories
  node_cols <- unique(V(g)$color)
  # special => "shared","unknown"
  special_cols <- c("shared","unknown")
  last_names <- setdiff(node_cols, special_cols)

  # Build color map
  color_map <- c()
  i <- 1
  # quick palette
  color_for_name <- function(nm, i) {
    if (i==1) return("red")
    if (i==2) return("blue")
    # fallback => others
    palette <- c("green","orange","cyan","brown","hotpink","black")
    if ((i-2) <= length(palette)) {
      return(palette[i-2])
    }
    return("black")
  }
  for (ln in last_names) {
    color_map[ln] <- color_for_name(ln, i)
    i <- i+1
  }
  color_map["shared"]  <- "purple"
  color_map["unknown"] <- "grey"

  # label map
  label_map <- c()
  for (ln in last_names) {
    label_map[ln] <- ln
  }
  label_map["shared"]  <- "shared"
  label_map["unknown"] <- "unknown"

  # add scale
  p <- p + scale_color_manual(
    name   = "Genealogy",
    values = color_map,
    breaks = names(label_map),
    labels = label_map
  )

  # If single genealogical set => hide the legend
  if (single_mode) {
    p <- p + theme(legend.position="none")
  }
  xvals <- layout_sugiyama$x
  yvals <- layout_sugiyama$y
  
  # compute expanded range
  xmid <- mean(range(xvals))
  ymid <- mean(range(yvals))
  xhalf <- (max(xvals)-min(xvals))/2 * 1.5
  yhalf <- (max(yvals)-min(yvals))/2 * 1.5
  
  xlim <- c(min(xvals), xmid + xhalf)
  ylim <- c(min(yvals), max(yvals))
  
  p <- p + coord_cartesian(xlim=xlim, ylim=ylim, expand=TRUE, clip="off")
  
  print(p)
  invisible(p)

}

The script uses rvest and xml2 to scrape advisor and graduation data from individual profiles on the MGP site. It builds a directed graph where each edge represents an advisor-advisee relationship. The graph construction is recursive: it follows the first listed advisor all the way up the tree, while secondary advisors are optionally included just one level deep to avoid visual clutter.

To manage multiple fetches efficiently, the scraper is memoized with memoise, and lineage data is tracked using URL-based mappings for names and years. The graph is visualized with igraph and ggraph, and each node is colored by genealogical origin — for example, red for one person’s tree, blue for another, and purple for shared nodes. The plotting layout uses a tree for clean hierarchy and merge points.

Dr David Kahle’s Lineage

Here’s a look at my advisor’s academic lineage. I followed the first advisor all the way up the tree, but limited second advisors to just one level to keep the graph from getting too cluttered.

Kahle <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=159440"
plotGenealogy(Kahle)

Dr Rodney Sturdivant’s Lineage

And here’s a look at my co-advisor Dr. Sturdivant, who also has an interesting but separate academic tree.

Sturdivant <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=34072"
plotGenealogy(Sturdivant)

We can see some big names in both trees.

Comparing Genealogies

These genealogies don’t overlap…

plotGenealogy(childA_url = Kahle, childB_url = Sturdivant)

…but it’s fascinating when they do — especially across countries, centuries, and disciplines. Here’s a look at Dr Kahle with Dr Hadley Wickham - coauthors of the ggmap package.

Wickham <- "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=145799"
plotGenealogy(childA_url = Kahle, childB_url = Wickham)

For good measure, though messy, I show Dr Kahle’s full tree by fully recursing through all advisors by setting partialA = FALSE.

plotGenealogy(Kahle, partialA = FALSE)

Final Thoughts

This turned out to be a fun little side project — part historical research, part data wrangling. The Mathematics Genealogy Project is a neat way to explore where your academic lineage comes from. If you’re curious, it’s worth poking around — you might find some unexpected connections.