This is an independent project in which I created a database based on my eBird life list. A user’s personal life list is easily exported as a CSV from the eBird website. While the database itself doesn’t serve much practical purpose, the goal was to practice generating a spatial database de novo using Postgres w/ PostGIS, hosting that database on a server, and creating an application that interacts with the hosted database. It also provided the opportunity to work on data collection using a couple of APIs (the eBird and Open-Meteo APIs, via the rebird and openmeteo R packages), and to derive attributes by integrating disparate data sets.

I used dbdiagram.io to generate the database structure and exported the DBML code. I was eager to see how well I could translate the schema into a functional database. In particular, I was interested in using the RPostgres and DBI packages to interact with and modify the database directly from R (DBI allows SQL to be executed within the R environment). I used the pgAdmin GUI to deploy a database instance, then used the command line to generate a database backup and import the .sql file into AWS RDS. I am currently working on a Shiny application that will interact with the database.


Database structure

Table species {
  species_id integer [primary key]
  common_name varchar
  speciesCode varchar // banding code or common name code
  scientific_name varchar
  family varchar
  familySci varchar // scientific family name
  conservation_status varchar
  obs_id integer // foreign key
}

Table observations {
  obs_id integer [primary key] // lifer number
  common_name varchar
  date date
  loc_id varchar
  checklist_id varchar // foreign key
}

Table locations {
  loc_id varchar [primary key] // eBird L identifier
  name varchar
  lat double
  long double
  biome varchar
  ecoregion varchar
}

Table checklists {
  sub_id varchar [primary key]
  checklist_id varchar
  duration integer
  num_species integer
  date date
  temperature float
  wind float
  precipitation float
  cloudcover float
  loc_id varchar
}

Ref: observations.obs_id > species.obs_id

Ref: observations.loc_id > locations.loc_id // each obs at a single location, but one location can be the site of many obs

Ref: observations.checklist_id > checklists.sub_id // each obs belongs to a single checklist, but a single checklist can hold many obs

Ref: locations.loc_id < checklists.loc_id // each location can be the site of many checklists



Data Preparation

library(dplyr)
library(rebird)

lifelist <- readr::read_csv("lifelist.csv")

# Exclude non-species-level records and exotics/escapes
lifelist <- lifelist %>%
  slice(-(867:nrow(lifelist))) %>%
  mutate(`Row #` = row_number())

# Set ObservationID field
lifelist <- lifelist %>%
  mutate(ObsID = sort(`Row #`, decreasing = TRUE))


Derive X,Y Coords for Locations

Lat/long data will be stored in the ‘Locations Table’ of the database.

Start with the low-hanging fruit: official hotspots.

## ---- Custom func to return single-row df with x and y coords based on hotspot LocID
hotspot_xy <- function(hotspot_id) {
  
  info <- tryCatch(ebirdregioninfo(hotspot_id),
                   error = function(e) {
                     NULL  # If an error occurs, just return NULL
                   })

  # Check if info is NULL
  if (is.null(info)) {
    # Return df with NAs if there was an error
    return(data.frame(LocID = hotspot_id,
                      lat = NA,
                      long = NA))
  } else {
    # Extract coordinates and return them if no error occurred
    y_coord <- info$latitude
    x_coord <- info$longitude
    return(data.frame(LocID = hotspot_id,
                      lat = y_coord,
                      long = x_coord))
  }
}


## ----- Apply custom func to get xy data from official hotspots
# df to hold location ID and resultant xy data
locations <- as.data.frame(unique(lifelist$LocID))
colnames(locations)[1] <- "LocID"


for (i in 1:nrow(locations)) {
  coords <- hotspot_xy(locations$LocID[i])
  
  locations$lat[i] <- coords$lat
  locations$long[i] <- coords$long
}


Many observations are not tied to recognized hotspots and thus require a bit more legwork to access their location data.

Some location names have xy data embedded in the name itself; regex + stringr can be used to extract those lat/long strings.

library(stringr)

# Subset locations that still lack spatial data
loc_noCoords <- locations %>%
  filter(is.na(lat)) %>%
  left_join(lifelist %>%
              group_by(LocID) %>%
              slice_head() %>%
              select(Location, LocID), by = "LocID")

#  Define regex pattern to identify lat and long strings
pattern <- "-?\\d+\\.\\d+,\\s*-?\\d+\\.\\d+"

loc_noCoords$extractedCoords <- str_extract(loc_noCoords$Location, pattern)

# Split string so lat and long are separated; this creates list object
coords_split <- str_split(loc_noCoords$extractedCoords, ",")

for (i in 1:nrow(loc_noCoords)) {
  if (!is.na(coords_split[[i]][1])) {
    loc_noCoords$lat[i] <- as.numeric(coords_split[[i]][1])
    loc_noCoords$long[i] <- as.numeric(coords_split[[i]][2])
  } else {
    loc_noCoords$lat[i] <- NA
    loc_noCoords$long[i] <- NA
  }
}


## ---- 19 locations still lack coordinates. These will need to be looked up manually.
stillMissing <- loc_noCoords %>%
  filter(is.na(lat))

# Spatial data manually obtained from the eBird 'My Locations' page
missingXY_vec <- c("-40.430443603861484, -71.52732714595885",
                   "21.962500388966415, -80.06262240958718",
                   "37.01274054205935, -76.45330601760207",
                   "-34.40817439312428, -58.58968011287809",
                   "38.96130967755057, -78.73640236441761", #5  George Washington 
                   "36.89192374233853, -76.44309073294882",
                   "-25.599437557304018, -54.56900095919036",
                   "32.1168246, -110.478799",
                   "31.8657879, -109.4118481",
                   "31.8610354, -109.3382328",  #10
                   "31.970572, -109.3321154",
                   "18.373368, -68.815373",
                   "18.70751, -68.44678",
                   "-33.4984608, -58.7883568",
                   "-23.7007266, -65.6840385",    # 15 Cuesta de Lipan
                   "-34.5230615, -58.4680527",  
                   "42.4587784, -73.3565011",
                   "37.0455199, -76.4641169",   
                   "-34.9688638, -58.5851879"   # 19 Estancia Villa Maria
   
                                   )
stillMissing$extractedCoords <- missingXY_vec

# Apply same str_split approach as above
coords_split <- str_split(stillMissing$extractedCoords, ",")

for (i in 1:nrow(stillMissing)) {
    stillMissing$lat[i] <- as.numeric(coords_split[[i]][1])
    stillMissing$long[i] <- as.numeric(coords_split[[i]][2])
}

# Join all coords together in a single df
loc1 <- locations %>%
  filter(!LocID %in% loc_noCoords$LocID)

loc2 <- loc_noCoords %>%
  filter(!is.na(lat)) %>%
  rbind(stillMissing) %>%
  select(LocID, lat, long)

loc_coords <- rbind(loc1, loc2)

# Join coords to life list df
lifelist <- lifelist %>%
  left_join(loc_coords, by = "LocID")


Collect Weather Data

Weather data will be incorporated into the ‘Checklist Table’.

API Wrapper & Custom Functions

rebird currently has no wrapper for accessing individual checklist information. This function accesses a checklist based on the SubID field from the life list…

library(httr)
library(jsonlite)

api_token <- "jrel4sdsnpqv"

## -- Custom func to get checklist info using eBird api
getChecklist <- function(subId) {
    url <- paste0("https://api.ebird.org/v2/product/checklist/view/", subId)
    response <- httr::GET(url, httr::add_headers(`X-eBirdApiToken` = api_token))
    if (httr::status_code(response) == 200) {
        data <- jsonlite::fromJSON(httr::content(response, "text", encoding = "UTF-8"))
        return(data)
    } else {
        stop("Failed to fetch data: ", httr::status_code(response))
    }
}


…this function then takes that checklist information and accesses weather information using the Open-Meteo API (via the openmeteo package).

## ------- Custom func to get weather information given an eBird checklist as input
getWeather <- function(checklist) {
  locid <- checklist$locId
  coords.df <- lifelist %>%
    filter(LocID == locid)

  lat <- as.vector(unlist(coords.df[1, "lat"]))
  long <- as.vector(unlist(coords.df[1, "long"]))
  
  obs_datetime <- checklist$obsDt
  obs_POSIXct <- as.POSIXct(obs_datetime)
  
  split_datetime <- strsplit(obs_datetime, " ")
  obsDate <- unlist(split_datetime)[1]
  
  # Open-Meteo API to get historical weather records
  weatherInfo <- openmeteo::weather_history(location = c(lat, long), 
                                            start = obsDate, 
                                            end = obsDate, 
                                            hourly = c("temperature_2m", "windspeed_10m", "cloudcover", "precipitation")
                                            )
  
  weatherInfo$datetime_char <- as.character(weatherInfo$datetime)
  
  # Round the checklist time to nearest hour to be able to match with a datetime from the hourly weather output
  rounded_obsTime <- as.character(round(obs_POSIXct, units = "hours"))
  
  weatherInfo <- weatherInfo %>%
    filter(datetime_char == rounded_obsTime) %>%
    select(-datetime, -datetime_char)
  
  return(weatherInfo)
}


Store Weather Data

# Create df to hold weather data with unique identifier
weather.df <- data.frame(checklistID = unique(lifelist$SubID),
                         hourly_temperature_2m = NA,
                         hourly_windspeed_10m = NA,
                         hourly_cloudcover = NA,
                         hourly_precipitation = NA)


# Utilize custom functions in for loop to access weather data for each checklist
for (i in seq_along(unique(lifelist$SubID))) {
  subID <- unique(lifelist$SubID)[i]
  checklist <- getChecklist(subID)
  weather <- tryCatch(getWeather(checklist),
                      error = function(e) {
                        cat("Error in iteration:", i, "\n checklist:", subID, "\n")
                        NULL  # return NULL so the loop can keep going
                      }
  )
  
  # Skip assignment if the API call failed or no matching hour was returned
  if (is.null(weather) || nrow(weather) == 0) next
  
  weather.df$hourly_temperature_2m[i] <- weather$hourly_temperature_2m
  weather.df$hourly_windspeed_10m[i] <- weather$hourly_windspeed_10m
  weather.df$hourly_cloudcover[i] <- weather$hourly_cloudcover
  weather.df$hourly_precipitation[i] <- weather$hourly_precipitation
}


Biome & Subbiome for Obs. Locations

National Geographic MapMaker has an excellent biomes and ecoregions layer available through the ArcGIS portal (2017), which I initially exported from Pro as a shapefile to bring into R. (A vector layer I had downloaded from the journal “Vegetation Classification and Survey” - before realizing a similar and superior layer was available through ArcGIS - was inadequate for the task; its inaccurate boundaries at small scales resulted in many observations not being captured in the spatial join.)

For some reason the spatial join I attempted in R failed. In the interest of time I simply reverted to Pro to carry out the join and determine the biome and subbiome within which each observation location falls. This information will be used in the ‘Locations Table’ of the database.

library(sf)

# Read in biome vectors and convert to sf object
shp <- st_read("biomes_ecoregions/biomes.shp") # ESRI biomes layer
biomes_sf <- st_as_sf(shp)

# Convert life list to sf object
lifers_sf <- st_as_sf(lifelist, coords = c("long", "lat"), crs = 4326)

# Intersect
# lifelist_biomes <- st_intersection(lifers_sf, biomes_sf)

# Still some points not captured in intersection (70 obs total)
## This process took forever (stopped it after 10 min) -- much more efficient to export lifers as shp and run the join in ArcGIS Pro
# lifers_biomes <- st_is_within_distance(lifers_sf, biomes_sf, dist = 2000)
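

For reference, a cheaper all-R route would have been a left spatial join with sf rather than a full pairwise intersection. The sketch below is untested here (and assumes the layer carries the BIOME_NAME and ECO_NAME fields used later); it joins each point to its nearest biome polygon, which would also catch points that fall just outside a polygon boundary.

library(sf)

# Sketch only: attach the attributes of the nearest biome polygon to each
# observation point. st_nearest_feature avoids the expensive pairwise
# intersection and handles points slightly offshore or on boundary slivers.
lifers_biomes_sf <- st_join(lifers_sf,
                            biomes_sf[c("BIOME_NAME", "ECO_NAME")],
                            join = st_nearest_feature)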


Bring Joined Biome df into R

This code simply connects to the geodatabase and reads in the spatial join layer, which then gets incorporated into the Locations Table.

library(arcgisbinding)
arc.check_product()
## product: ArcGIS Pro (13.2.0.49743)
## license: Advanced
## version: 1.0.1.306
gdb_path <- "C:/Users/andre/Desktop/myeBird/Ebird_workingProj/Ebird_workingProj.gdb"
fc <- arc.open(paste0(gdb_path, "/lifers_SpatialJoin"))
biomes_df <- arc.select(fc)


Banding/Name Codes & Families

rebird can be used to access taxonomic information.

tax <- rebird::ebirdtaxonomy()
#tax <- tax %>%
  #select(sciName, comName, comNameCodes, familyComName, familySciName)

myLifers <- lifelist$`Scientific Name`

taxonomy.df <- tax %>%
  filter(sciName %in% myLifers)

# 4 letter codes are spread across two fields in eBird taxonomy dataset (banding code and common name code)
# Note that what I am doing here returns something different from the species_code() function already available in `rebird`
## Data frame to hold 4-letter codes
birdCodes <- data.frame(sciName = myLifers,
                        code = NA)

for (i in seq_along(myLifers)) {
   currentSpecies <- myLifers[i]
   taxSubset <- taxonomy.df %>%
     filter(sciName == currentSpecies) %>%
     select(bandingCodes, comNameCodes)
   
   # Select a 4-letter code with preference for banding codes; default to NA if none available
   speciesCode <- if (length(taxSubset$bandingCodes) > 0 && !is.na(taxSubset$bandingCodes)) {
     taxSubset$bandingCodes
   } else if (length(taxSubset$comNameCodes) > 0 && !is.na(taxSubset$comNameCodes)) {
     taxSubset$comNameCodes
   } else {
     NA
   }
   
   birdCodes$code[birdCodes$sciName == currentSpecies] <- speciesCode
}

# Join codes back to taxonomy df
taxonomy.df <- taxonomy.df %>%
  left_join(birdCodes, join_by("sciName")) %>%
  select(sciName, comName, familyComName, familySciName, code, speciesCode)


Additional Checklist Data

Fields to be used in ‘Checklists Table’: outing duration, number of species seen, and date.

# Use function getChecklist(), created above
## Many obs from same checklist; create vector of unique checklist IDs
uniqueChecklists <- lifelist %>%
  distinct(SubID) %>%
  pull(SubID)

# Data frame to hold checklist information
checklist.df <- data.frame(subID = uniqueChecklists,
                           checklist_id = NA,
                           loc_id = NA,
                           duration = NA,
                           numSpecies = NA,
                           dateTime = NA,
                           comments = NA)

for (i in seq_len(nrow(checklist.df))) {
  currentChecklist <- uniqueChecklists[i]
  listInfo <- getChecklist(currentChecklist)
  
  checklist.df$checklist_id[i] <- listInfo$checklistId
  checklist.df$loc_id[i] <- listInfo$locId
  checklist.df$duration[i] <- if (!is.null(listInfo$durationHrs)) {
    listInfo$durationHrs
  } else {
    NA
  }
  checklist.df$numSpecies[i] <- listInfo$numSpecies
  checklist.df$dateTime[i] <- listInfo$obsDt
  checklist.df$comments[i] <- if (!is.null(listInfo$comments)) {
    listInfo$comments
  } else {
    NA
  }
}



Create Separate Database Tables

Split the data into the four database tables and export as CSV to integrate into Postgres.

‘Species Table’

CREATE TABLE "species" (
  "species_id" integer PRIMARY KEY,
  "commonName" varcahr,
  "speciesCode" varchar,
  "sciName" varchar,
  "family" varchar,
  "familySci" varchar,
  "conservation_status" varchar,   # still waiting for response from IUCN to my request for data access
  "obs_id" integer
);


I want this table ordered taxonomically, so the first step is to add fields from the taxonomy df, then join the missing fields from the lifelist df based on scientific or common name, and finally reorder the columns to fit the desired database schema. The ‘species_id’ primary key field will be the eBird alphanumeric species code.

speciesTable <- data.frame(species_id = taxonomy.df$speciesCode,
                           sciName = taxonomy.df$sciName,
                           speciesCode = taxonomy.df$code,
                           family = taxonomy.df$familyComName,
                           familySci = taxonomy.df$familySciName,
                           conservationStatus = NA
          )

# Subset desired lifelist fields 
p <- lifelist %>%
  select(`Scientific Name`, `Common Name`, ObsID)

speciesTable <- speciesTable %>%
  left_join(p, join_by("sciName" == "Scientific Name")) %>%
  relocate(`Common Name`, .before = sciName) %>%
  rename(obs_id= ObsID,
         commonName = `Common Name`)


‘Observations Table’

CREATE TABLE "observations" (
  "obs_id" integer PRIMARY KEY,
  "common_name" varcahr,
  "date" date,
  "loc_id" varchar,
  "checklist_id" varchar
);


obsTable <- lifelist %>%
  select(ObsID, `Common Name`, `Date`, LocID, SubID) %>%
  rename(obs_id = ObsID,
         common_name = `Common Name`,
         date = Date,
         loc_id = LocID,
         sub_id = SubID) %>%
  mutate(date = lubridate::dmy(date))

‘Locations Table’

CREATE TABLE "locations" (
  "loc_id" varchar PRIMARY KEY,
  "name" varchar,
  "lat" double,
  "long" double,
  "biome" varchar,
  "ecoregion" varchar
);


# Subset biome data
biome <- biomes_df %>%
  select(LocID, BIOME_NAME, ECO_NAME)

locTable <- lifelist %>%
  select(LocID, Location, lat, long) %>%
  left_join(biome, by = "LocID") %>%
  rename(loc_id = LocID, 
         name = Location, 
         biome = BIOME_NAME, 
         ecoregion = ECO_NAME) %>%
  group_by(loc_id) %>%  # Ensure only 1 row per location
  slice(1)


‘Checklists Table’

CREATE TABLE "checklists" (
  "checklist_id" varchar PRIMARY KEY,
  "duration" integer,
  "num_species" integer,
  "date" date,
  "temperature" float,
  "wind" float,
  "precipitation" float,
  "cloudcover" float,
  "loc_id" varchar
);


checklistTable <- checklist.df %>%
  left_join(weather.df, by = c("subID" = "checklistID")) %>%
  rename(sub_id = subID,
         temperature = hourly_temperature_2m,
         wind = hourly_windspeed_10m,
         precipitation = hourly_precipitation,
         cloudcover = hourly_cloudcover) %>%
  mutate(dateTime = lubridate::ymd_hm(dateTime))



Integration with PostgreSQL

Integrating with Postgres requires the RPostgres package. The database must be created first, either via the command line or a Postgres GUI like pgAdmin.

Using command line:

  • To connect to Postgres: psql -U postgres
  • To create a new database: CREATE DATABASE lifelist;
  • To connect to the new database: \c lifelist

Having created the database, you can connect to it directly from R using the RPostgres and DBI libraries.

library(RPostgres)
library(DBI)

connection <- DBI::dbConnect(RPostgres::Postgres(), 
                        dbname = "lifelist", 
                        host = "localhost", 
                        port = 5432, 
                        user = "postgres",
                        password = "10161485")



Write Dataframes as Database Tables

DBI::dbWriteTable(connection, "species", speciesTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "observations", obsTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "locations", locTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "checklists", checklistTable, overwrite = TRUE, row.names = FALSE)

query_test <- dbGetQuery(connection, 'SELECT "sciName" FROM species LIMIT 5')
print(query_test)



Define Primary and Foreign Keys

Successful execution of the code will return “0”, indicating that 0 rows were affected by the operation (dbExecute() returns the number of affected rows, and these DDL statements affect none).

# Define primary keys
dbExecute(connection, "ALTER TABLE species ADD PRIMARY KEY (species_id);")
dbExecute(connection, "ALTER TABLE observations ADD PRIMARY KEY (obs_id);")
dbExecute(connection, "ALTER TABLE locations ADD PRIMARY KEY (loc_id);")
dbExecute(connection, "ALTER TABLE checklists ADD PRIMARY KEY (sub_id);")


The syntax for defining foreign keys is a bit less intuitive. Consider the following:

# Define foreign keys
dbExecute(connection, "
          ALTER TABLE species 
          ADD CONSTRAINT fk_obs_id 
          FOREIGN KEY (obs_id) REFERENCES observations(obs_id);"
          )


ALTER TABLE species:

Specifies that the species table is to be altered.

ADD CONSTRAINT fk_obs_id:

Instructs Postgres to add a new constraint to the species table (i.e., a rule that must be followed by the data). The constraint is named fk_obs_id.

FOREIGN KEY (obs_id) REFERENCES observations(obs_id):

Defines obs_id as a foreign key column in the species table. This field references a primary key in another table; in this case, obs_id in the observations table.

# Define the rest of the foreign keys
dbExecute(connection, "
          ALTER TABLE observations 
          ADD CONSTRAINT fk_loc_id 
          FOREIGN KEY (loc_id) REFERENCES locations(loc_id);"
          )

dbExecute(connection, "
          ALTER TABLE observations 
          ADD CONSTRAINT fk_checklist_id 
          FOREIGN KEY (sub_id) REFERENCES checklists(sub_id);"
          )

dbExecute(connection, "
          ALTER TABLE checklists 
          ADD CONSTRAINT fk_loc_id 
          FOREIGN KEY (loc_id) REFERENCES locations(loc_id);"
          )



Make it Spatial: PostGIS Extension

PostGIS adds support for geographic objects, allowing location queries to be run in SQL. This requires geometries to be stored somewhere in the database. In this case, the most sensible table in which to incorporate a geom field is the Locations Table.

Once PostGIS is installed, the extension must be enabled for each new database that will hold spatial data.

dbExecute(connection, "CREATE EXTENSION IF NOT EXISTS postgis;")

# Check the PostGIS version to confirm installation
postgis_version <- dbGetQuery(connection, "SELECT PostGIS_full_version();")
print(postgis_version)


Now that it is enabled for the lifelist db, the geom field can be added to the locations table.

dbExecute(connection, "ALTER TABLE locations ADD COLUMN geom geometry(Point, 4326);")

# Update the geometry column with latitude and longitude data
dbExecute(connection, "UPDATE locations SET geom = ST_SetSRID(ST_MakePoint(long, lat), 4326);")

# ST_SetSRID is a PostGIS function, more or less equivalent to st_set_crs() from `sf` package
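

With the geometry column populated, location queries can now be run directly in SQL. The example below is purely illustrative (the point and 50 km radius are arbitrary, not part of the original build): ST_DWithin finds every location within a given distance of a point, and a GiST index on geom keeps such queries fast.

# Illustrative spatial query: locations within 50 km of an arbitrary point.
# Casting to geography makes ST_DWithin interpret the distance in meters.
nearby <- dbGetQuery(connection, "
  SELECT loc_id, name
  FROM locations
  WHERE ST_DWithin(geom::geography,
                   ST_SetSRID(ST_MakePoint(-76.45, 37.01), 4326)::geography,
                   50000);
")

# Add a spatial (GiST) index on the geometry column to speed up such queries
dbExecute(connection, "CREATE INDEX IF NOT EXISTS locations_geom_idx ON locations USING GIST (geom);")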



Database Hosting

Because the objective here was to create an application that interacts with my database, I created a free tier account with Amazon Web Services RDS to host it.

Generate a Database Backup

To generate a .sql file from the local database (i.e., ‘database dump’), access the system command line (not the PostgreSQL terminal, ‘psql’) and use the pg_dump command.

C:\Users\andre>pg_dump -h localhost -U postgres -d lifelist -f lifelist_dump.sql -W


Cloud Deployment and Database Import

After creating a new database instance on AWS RDS, I imported the database dump file using psql from the command line by first connecting to the database instance:

psql -h [endpoint of instance] -p 5432 -U postgres -d lifelist -W

Then, within the psql session in the command shell, I ran the following command to import the database dump:

\i lifelist_dump.sql

Trying to run a SQL query to test the success of the import initially returned an error that no such relation existed. This was because the ‘search_path’ parameter was empty; the script generated by pg_dump clears search_path for the session as a security measure. To fix this, I ran the following command, where ‘public’ refers to the schema in which the tables are stored.

SET search_path TO public;
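
SET only applies to the current session. If the empty search_path resurfaces in new sessions, it can also be set persistently at the database level (assuming the database is named lifelist):

ALTER DATABASE lifelist SET search_path TO public;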



Shiny Application w/ map

Eventually I will get around to this last step…!