Introduction

Over Christmas 2017, my wife, Jill, resolved that our family (to include babysitters) would read our children over 1000 new books in 2018. More specifically, 29 December 2017 - 28 December 2018.

A goal like this takes dedication, love, and of course, a Data Science-y approach.

Special thanks to Ally, Zia, and Bookbuddy for help in accomplishing this goal!

Also, special thanks to drob and juliasilge who privde the tidytext package and the entire R open-source community I rely on to do this analysis.

Since I use this page as a way to log my ideas and teach others (and future me) how I did this, I’ll include the code.

Exploratory Data Analysis

Libraries

The libraries I will use for this analysis are the following:

library(tidyverse)
library(lubridate) 
library(janitor)
library(stringr)
library(tidytext)
library(forcats)
library(zoo)
library(kableExtra)
library(DT)

Read in Data

Lets read in the data and get a feel for what information we have. Since my wife, our babysitter, and I have scanned the books in bookbuddy, I need to upload all the data and join it together. We have the data in many different spreadsheets to I’ll have to read them all in separately. I had to change some of the columns to character to make joining these dataframes go more smooth later on.

book = read_csv("data/children_book/BookBuddy.csv") %>% mutate(ISBN = as.numeric(ISBN))
book1 = read_csv("data/children_book/BookBuddy (1).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book2 = read_csv("data/children_book/BookBuddy (2).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book3 = read_csv("data/children_book/BookBuddy (3).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book4 = read_csv("data/children_book/BookBuddy (4).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book5 = read_csv("data/children_book/BookBuddy (5).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book6 = read_csv("data/children_book/BookBuddy (6).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book7 = read_csv("data/children_book/BookBuddy (7).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book8 = read_csv("data/children_book/BookBuddy (8).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book9 = read_csv("data/children_book/BookBuddy (9).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book10 = read_csv("data/children_book/BookBuddy (10).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book11 = read_csv("data/children_book/BookBuddy (11).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book12 = read_csv("data/children_book/BookBuddy (12).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>%  mutate(`Date Added` = as.character(`Date Added`))
book13 = read_csv("data/children_book/BookBuddy (13).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book14 = read_csv("data/children_book/BookBuddy (14).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book15 = read_csv("data/children_book/BookBuddy (15).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))

Join and Clean Data

Now that we’ve read it in, I need to do a few things to support future analysis.

  1. Many of the column names have spaces which does not support data munging. I’ll use the clean_names command from the janitor package to clean this up.

  2. I have a lot of duplicate books. I’ll keep only the rows that have distinct Title and Author. I used the distinct command because some of us scanned the same books. We do not want to account for them twice.

  3. I need to convert the DatePublished and DateAdded to characters (don’t worry, I’ll bring it back later.)

  4. It seemed like every Berenstain Bear book had a little bit different Author. Lets clean that up a little too.

  5. There were some DateAdded dates that do not exist. For a while, when one of our babysitters was saving the data, this information was not being saved. We assigned the book the most recent data from when we caught the error (as seen in the last line below). This is not a reasonable assumption, but its the best we can do.

  6. Lastly, I’ll only use the following columns: Title, Author, Publisher, YearPublished, Genre, Summary, DateAdded, DatePublished, and GoogleVolumeId. The rest of the columns contain blank information.

booksread =
  book %>% bind_rows(book1) %>% bind_rows(book2) %>% bind_rows(book3)  %>%
  bind_rows(book4) %>% bind_rows(book5) %>% bind_rows(book6) %>% 
  bind_rows(book7) %>% bind_rows(book8) %>% bind_rows(book9) %>% 
  bind_rows(book10) %>%  bind_rows(book11) %>% bind_rows(book12) %>% 
  bind_rows(book13) %>% bind_rows(book14) %>% bind_rows(book15) %>%
  clean_names(case = "upper_camel") %>% 
  distinct(Title, Author, .keep_all = TRUE) %>% 
  mutate(DatePublished = ymd(str_sub(DatePublished,1,10)))%>%
  mutate(DateAdded = ymd(str_sub(DateAdded,1,10))) %>%
  select(Title, Author, Publisher, YearPublished, Genre, Summary, NumberOfPages, DateAdded, DatePublished,GoogleVolumeId) %>%
  mutate(Author = ifelse(grepl("Bere", Author)==TRUE, "Berenstein", Author)) %>%
  mutate(DateAdded = na.locf(.$DateAdded, fromlast = TRUE))

Here is the data for you to explore:

booksread %>% 
  select(-Summary) %>% 
  datatable()

Analysis

To answer the primary question: How many books did we read?

nrow(booksread)
## [1] 1070

What is the span of the dates of our our books?

booksread$DateAdded %>% min
## [1] "2017-12-29"
booksread$DateAdded %>% max
## [1] "2018-12-18"

Top Authors

What other interesting things can we find? Who were our top authors?

booksread %>%
  group_by(Author) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>% 
  filter(n>5) %>%
  kable()
Author n
Berenstein 72
Mercer Mayer 17
Dr. Seuss 15
James Dean 14
Victoria Kann 14
Cynthia Rylant 11
Jane O’Connor 9
Gail Gibbons 6
Kate Banks 6
Rosemary Wells 6
booksread %>%
  group_by(Author) %>%
  summarise(n = n()) %>%
  filter(n>5) %>%
  ggplot(aes(x=fct_reorder(Author, desc(n)), y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Read Authors", x = "Authors", y = "Number of Books") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I’m a big fan of the Berenstain Bears, Dr. Seuss, and Pete The Cat (James Dean), but not so much Fancy Nancy (Thanks Jane O’Connor)….

Timeline

booksread %>%
  group_by(DateAdded) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(BooksByDate = cumsum(n)) %>%
  ggplot(aes(x=DateAdded, y = BooksByDate)) +
  geom_line() + 
  geom_point() +
  labs(title = "Books Read in 2018", x = "Date", y = "Total Books Read")

There are a few things to note from the data above.

  1. We were more vigilant at scanning books as we read them at the beginning – vs scanning them all at once at the end.

  2. We had a couple droughts of book reading – in particular August was a slow period.

  3. The big jump in December was a data collection error mentioned above. These were likely spread across the previous 3 months.

booksread %>%
  mutate(month = month(DateAdded)) %>%
  group_by(month) %>%
  summarise(BooksInMonth = n()) %>%
  kable()
month BooksInMonth
1 98
2 59
3 90
4 95
5 58
6 90
7 69
8 22
9 109
10 23
11 225
12 132

A Book by… God?

And just as a side note, I happen to see we had a book authored by ‘God’. Lets investigate.

booksread %>%
  filter(Author=="God") %>%
  select(Title, Author) %>%
  kable()
Title Author
The Lord’s prayer God

Okay then.

Publication Date

When were our books published?

ggplot(data = booksread, aes(YearPublished)) +
  geom_bar() +
  labs(title = "Books By Year", y = "Total Books", x = "Year Published")

It appears we’ve read some classics… as well as some of the more ‘pop culture’ children’s books.

Length of Books

I can hear you now…“So you’ve read over 1000 books this year. Its nice that they are all picture books…”"

ggplot(data = booksread, aes(NumberOfPages)) +
  geom_histogram() +
  ggtitle("Books By Number of Pages") + ylab("Total Books")

What?! We’ve read books over 100 pages to our kids…

booksread %>%
  filter(NumberOfPages>100) %>%
  arrange(desc(NumberOfPages)) %>%
  select(Title, Author, NumberOfPages) %>%
  kable()
Title Author NumberOfPages
Disney’s Once Upon a Princess Various, 592
Brick City Warren Elsmore 256
The Curious Kid’s Science Book Asia Citro 224
Good Night Stories for Rebel Girls Elena Favilli, Francesca Cavallo 212
LEGO Adventure Book, Vol. 1 Megan H. Rothrock 200
The LEGO Ideas Book Daniel Lipkowitz 198
Lots of Look & Finds Disney Princess Author 193
Little Critter’s Read-it-yourself Storybook Mercer Mayer 192
Pinkalicious: The Pinkamazing Storybook Collection Victoria Kann 192
Space Encyclopedia Christine Pulliam, Patricia Daniels 191
My Friend the Moon André Dahan 160
Big Book of the Berenstain Bears Berenstein 160
Yoga games for children Danielle Bersma 146
Yoga Games for Children Danielle Bersma, Marjoke Visscher 146
The LEGO Build-It Book, Vol. 1 Nathanael Kuipers, Mattia Zamboni 132

Book Genres

What types of books do we read our kids?

It appears a pretty interesting array of genres. And some uniquely named ones if you search around.

booksread %>%
  filter(!is.na(Genre)) %>%
  group_by(Genre) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  datatable()

Text Analysis

So, that was some interesting descriptive analysis. Lets look at the undertones our children are absorbing from what we are reading.

I’d really like to know what type of words / sentiment we are reading our children – but until I can figure out a way to get the words from all the books, I’ll have to rely on the Summary provided by BookBuddy.

Here’s an example of what we’ll be looking at:

booksread %>%
  filter(Title=="Mama Cat Has Three Kittens") %>%
  select(Summary)
## # A tibble: 1 x 1
##   Summary                                                                                                                                             
##   <chr>                                                                                                                                               
## 1 Some kittens march to the beat of a different drummer.Mama Cat has three kittens, Fluffy, Skinny, and Boris. Where Mama Cat leads, Fluffy and Skinn~

Preparation

So, lets prepare the data. I’d like to bring along some information while I tokenize the summaries (Titles, Authors, and Genres).

text_df = data_frame(line = 1:length(booksread$Summary), text = as.character(booksread$Summary), book = booksread$Title, Genre = booksread$Genre, Author = booksread$Author)

cleanedbooks = text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

cleanedbooks %>% datatable()