Introduction

This write-up was prompted by a question in the R4DS Online Learning Community some time ago as to how one can document the data frames used in a typical data analysis exercise. Of course we have excel as a spreadsheet tool to do this, but what if we need to do it using R.

So let us watch how we can go about creating a basic documentation of our data frames. An analysis involves a number of data frames and a documentation, accompanied by a entity relationship diagram lends a great deal of readability to the data analysis. We will use the palmer penguins data set to explain this approach.

Let us load the libraries first

library(dm, warn.conflicts = FALSE)
library(palmerpenguins, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
library(DiagrammeR, warn.conflicts = FALSE)

We are using palmerpenguins package which has the dataset, data.table to use for data wrangling and dm, DiagrammeR to help create the relationship and draw the relationship diagram.

Palmer Penguins Data Set

So let us load the Palmer Penguins data set.

df.penguins <- as.data.table(penguins)
str(df.penguins)
## Classes 'data.table' and 'data.frame':   344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The str function shows the structure of the data frame at a glance.

The comment function

Not let us come to the first part, the use of comment function. We will use the comment function to document the variables/columns in the Palmer Penguins data set and this is how it works.

comment(df.penguins$species) <- "This field stores the species of the Penguin"

comment(df.penguins$island) <- "This field stores the island of Palmer Archipelago"

comment(df.penguins$bill_length_mm) <- "This field stores the penguin's bill length in mm"


comment(df.penguins$bill_depth_mm) <- "This field stores the penguin's bill depth in mm"

comment(df.penguins$flipper_length_mm) <- "This field stores the penguin's flipper length in mm"

comment(df.penguins$body_mass_g) <- "This field stores body mass of the penguin in gms"

comment(df.penguins$sex) <- "This field stores gender of the penguin"


comment(df.penguins$year) <- "This field is the year when the observation was recorded"

And how does the data frame look like after adding the comments ?

str(df.penguins)
## Classes 'data.table' and 'data.frame':   344 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "comment")= chr "This field stores the species of the Penguin"
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##   ..- attr(*, "comment")= chr "This field stores the island of Palmer Archipelago"
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##   ..- attr(*, "comment")= chr "This field stores the penguin's bill length in mm"
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##   ..- attr(*, "comment")= chr "This field stores the penguin's bill depth in mm"
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##   ..- attr(*, "comment")= chr "This field stores the penguin's flipper length in mm"
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##   ..- attr(*, "comment")= chr "This field stores body mass of the penguin in gms"
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##   ..- attr(*, "comment")= chr "This field stores gender of the penguin"
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##   ..- attr(*, "comment")= chr "This field is the year when the observation was recorded"
##  - attr(*, ".internal.selfref")=<externalptr>

So we can now see that each variable includes a comment attribute describing the variable. The output does not look like being much readable but as we show below we can make them tidy with a little bit of wrangling.

Tidying the Descriptions

Let us create a new data frame which will hold the descriptions of the data frame columns.

df.cols <- as.data.frame(do.call(rbind, 
                                 lapply(df.penguins, function(x) 
                                   attributes(x)$comment)))

df.cols <- as.data.frame(cbind(dfname = rep("df.penguins", 
                                            nrow(df.cols)), 
                               cols = rownames(df.cols), 
                               desc = df.cols$V1))

rownames(df.cols) <- 1:nrow(df.cols)

knitr::kable(x = df.cols, format = "simple", row.names = FALSE)
dfname cols desc
df.penguins species This field stores the species of the Penguin
df.penguins island This field stores the island of Palmer Archipelago
df.penguins bill_length_mm This field stores the penguin’s bill length in mm
df.penguins bill_depth_mm This field stores the penguin’s bill depth in mm
df.penguins flipper_length_mm This field stores the penguin’s flipper length in mm
df.penguins body_mass_g This field stores body mass of the penguin in gms
df.penguins sex This field stores gender of the penguin
df.penguins year This field is the year when the observation was recorded

Our documentation is now taking shape !!! and df.cols is one of the data frames that holds the documentation.

Data Types

A neat documentation should also contain the data type of the columns. Now let us get the data types of the data frame columns.

df.coltypes <- as.data.frame(do.call(rbind, 
                                     lapply(df.penguins, 
                                            class)))


df.coltypes <- as.data.frame(cbind(dfname = rep("df.penguins", 
                                                nrow(df.coltypes)), 
                                   cols = rownames(df.coltypes), 
                                   datatype = df.coltypes$V1))

rownames(df.coltypes) <- 1:nrow(df.coltypes)


all.df.cols <- merge(df.cols, 
                     df.coltypes, 
                     by = c("dfname", "cols")) |> (\(x){cbind(x[, 1:2], 
               dtype = x$datatype, 
               description = x$desc)})()

knitr::kable(x = all.df.cols, format = "simple", row.names = FALSE)
dfname cols dtype description
df.penguins bill_depth_mm numeric This field stores the penguin’s bill depth in mm
df.penguins bill_length_mm numeric This field stores the penguin’s bill length in mm
df.penguins body_mass_g integer This field stores body mass of the penguin in gms
df.penguins flipper_length_mm integer This field stores the penguin’s flipper length in mm
df.penguins island factor This field stores the island of Palmer Archipelago
df.penguins sex factor This field stores gender of the penguin
df.penguins species factor This field stores the species of the Penguin
df.penguins year integer This field is the year when the observation was recorded

The dtype column now shows the data type of the columns. Please remember that df.coltypes is another data frame that holds the column types of the data frame.

Documenting the Keys

Since we cannot set keys on a data frame but a typical table will have keys, therefore we have used data.table and we would set some keys and include them in the documentation.

setkey(df.penguins, year, species, island)

df.colkeys <- data.frame(dfname = "df.penguins", 
                         keys = key(df.penguins), 
                         flag = rep(1, 
                                    length(key(df.penguins))))



all.df.cols <- merge(all.df.cols, 
                     df.colkeys, 
                     by.x = c("dfname","cols"), 
                     by.y = c("dfname", "keys"), 
                     all.x = TRUE)

colnames(all.df.cols)[5] <- "iskey"

all.df.cols$iskey <- ifelse(is.na(all.df.cols$iskey), "No", "Yes")

df.colkeys holds the documentation of the keys of the data frame. we will come back to these data frame in the ERD section below to check out the model that we have built to enable this documentation.

Now we have documented the penguins data frame with all its bits and pieces. Lets check out the documentation in a tabular form.

knitr::kable(all.df.cols[order(all.df.cols$dfname, 
-as.numeric(as.factor((all.df.cols$iskey)))), c(1:3, 5, 4)], 
             format = "simple", 
             row.names = FALSE)
dfname cols dtype iskey description
df.penguins island factor Yes This field stores the island of Palmer Archipelago
df.penguins species factor Yes This field stores the species of the Penguin
df.penguins year integer Yes This field is the year when the observation was recorded
df.penguins bill_depth_mm numeric No This field stores the penguin’s bill depth in mm
df.penguins bill_length_mm numeric No This field stores the penguin’s bill length in mm
df.penguins body_mass_g integer No This field stores body mass of the penguin in gms
df.penguins flipper_length_mm integer No This field stores the penguin’s flipper length in mm
df.penguins sex factor No This field stores gender of the penguin

Isn’t it nice ? The more the data frames involved in the analysis, the more is the value that we may get out of this simple documentation.

The ERD

Now that we have shown how a documentation looks like, we would check out how a data model diagram looks like and how we can build that in R. Data frames df.cols, df.colkeys and df.coltypes have been used as underlying data frames to enable the above documentation.

So, let us create a relationship diagram involving them. Well, this will not require any custom code, but will require the use of the beautiful dm package in R and in this way as below.

# create a dm object 
dict.dm <- dm(df.cols, 
              df.coltypes, 
              df.colkeys)

# add the primary keys 
dict.dm.pks <- dict.dm |> 
  dm_add_pk(df.cols, 
            c("dfname", "cols")) |> 
  dm_add_pk(df.coltypes, 
            c("dfname", "cols")) |> 
  dm_add_pk(df.colkeys, 
            c("dfname" , "keys"))

# add foreign keys
dict.dm.allkeys <- dict.dm.pks |> 
  dm_add_fk(df.coltypes, 
            c("dfname", "cols"), 
            df.cols) |> 
  dm_add_fk(df.colkeys, 
            c("dfname", "keys"), 
            df.cols)

# draw the model 
dm_draw(dm = dict.dm.allkeys, 
        rankdir = "LR", 
        view_type = "all", 
        graph_name = "Dictionary ERD")

So, coupled with the data frame documentation table, the above diagram gives us a picture of the data frames involved, with a reasonable description of it’s important columns and a model diagram showing the relationships between them.

So this is how, without using Excel or any other commercial tools we can still generate a basic documentation of the data frames using the R Programming Language.