This is an independent project I recently worked on in which I created a database based on my eBird life list. A user’s personal life list is easily exported as a CSV from the eBird website. While the database itself doesn’t serve much practical purpose, the goal was to practice generating a spatial database de novo using Postgres with PostGIS, host that database on a server, and create an application that interacts with the hosted database. It also provided the opportunity to work on data collection using a couple of APIs (rebird and openmeteo), and to derive attributes by integrating disparate data sets.
I used dbdiagram.io to generate the database structure and exported the DBML code. I was eager to see how well I could translate the schema into a functional database. In particular, I was interested in using the RPostgres and DBI packages to interact with and modify the database directly from R (DBI allows for the execution of SQL code within the R environment). I used the pgAdmin GUI to deploy a database instance to AWS RDS, then used the command line to generate a database backup and import the .sql file into AWS. I am currently working on building out an application using Shiny that will interact with the database.
Table species {
species_id integer [primary key]
common_name varchar
speciesCode varchar // one of banding or common name code
scientific_name varchar
family varchar
familySci varchar // scientific family name
conservation_status varchar
obs_id integer // foreign key
}
Table observations {
obs_id integer [primary key] // lifer number
common_name varchar
date date
loc_id varchar
checklist_id varchar // foreign key
}
Table locations {
loc_id varchar [primary key] // eBird L identifier
name varchar
lat double
long double
biome varchar
ecoregion varchar
}
Table checklists {
sub_id varchar [primary key]
checklist_id varchar
duration integer
num_species integer
date date
temperature float
wind float
precipitation float
cloudcover float
loc_id varchar
}
Ref: species.obs_id > observations.obs_id // each species record points to its lifer observation
Ref: observations.loc_id > locations.loc_id // each obs at a single location, but one location can be the site of many obs
Ref: observations.checklist_id > checklists.sub_id // each obs to a single checklist, but a single checklist can hold many obs
Ref: locations.loc_id < checklists.loc_id // each location can be the site of many checklists
library(dplyr)
library(rebird)
lifelist <- readr::read_csv("lifelist.csv")
# Exclude non-species-level records and exotics/escapes
lifelist <- lifelist %>%
slice(-(867:nrow(lifelist))) %>%
mutate(`Row #` = row_number())
# Set ObservationID field
lifelist <- lifelist %>%
mutate(ObsID = sort(`Row #`, decreasing = TRUE))
Lat/long data will be stored in the ‘Locations’ table of the database.
Start with the low-hanging fruit: official hotspots.
## ---- Custom func to return single-row df with x and y coords based on hotspot LocID
hotspot_xy <- function(hotspot_id) {
info <- tryCatch(ebirdregioninfo(hotspot_id),
error = function(e) {
NULL # If error, just return NULL
})
# Check if info is NULL
if (is.null(info)) {
# Return df with NAs if there was an error
return(data.frame(LocID = hotspot_id,
lat = NA,
long = NA))
} else {
# Extract coordinates and return them if no error occurred
y_coord <- info$latitude
x_coord <- info$longitude
return(data.frame(LocID = hotspot_id,
lat = y_coord,
long = x_coord))
}
}
## ----- Apply custom func to get xy data from official hotspots
# df to hold location ID and resultant xy data
locations <- as.data.frame(unique(lifelist$LocID))
colnames(locations)[1] <- "LocID"
for (i in 1:nrow(locations)) {
coords <- hotspot_xy(locations$LocID[i])
locations$lat[i] <- coords$lat
locations$long[i] <- coords$long
}
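The explicit loop works at this scale, but the same result could be had more compactly; a sketch with purrr, assuming the package is available:
# Row-bind the one-row data frames returned by hotspot_xy()
locations <- purrr::map_dfr(unique(lifelist$LocID), hotspot_xy)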
Many observations are not tied to recognized hotspots and thus require a bit more legwork to access their location data.
Some location names have coordinates embedded in the name itself – regex + stringr can be used to extract those lat/long strings.
library(stringr)
# Subset locations that still lack spatial data
loc_noCoords <- locations %>%
filter(is.na(lat)) %>%
left_join(lifelist %>%
group_by(LocID) %>%
slice_head() %>%
select(Location, LocID), by = "LocID")
# Define regex pattern to identify lat and long strings
pattern <- "-?\\d+\\.\\d+,\\s*-?\\d+\\.\\d+"
loc_noCoords$extractedCoords <- str_extract(loc_noCoords$Location, pattern)
# Split string so lat and long are separated; this creates list object
coords_split <- str_split(loc_noCoords$extractedCoords, ",")
for (i in 1:nrow(loc_noCoords)) {
if (!is.na(coords_split[[i]][1])) {
loc_noCoords$lat[i] <- as.numeric(coords_split[[i]][1])
loc_noCoords$long[i] <- as.numeric(coords_split[[i]][2])
} else {
loc_noCoords$lat[i] <- NA
loc_noCoords$long[i] <- NA
}
}
## ---- 19 locations still lack coordinates; these will need to be sought out manually.
stillMissing <- loc_noCoords %>%
filter(is.na(lat))
# Spatial data manually obtained from eBird 'My Locations page'
missingXY_vec <- c("-40.430443603861484, -71.52732714595885",
"21.962500388966415, -80.06262240958718",
"37.01274054205935, -76.45330601760207",
"-34.40817439312428, -58.58968011287809",
"38.96130967755057, -78.73640236441761", #5 George Washington
"36.89192374233853, -76.44309073294882",
"-25.599437557304018, -54.56900095919036",
"32.1168246, -110.478799",
"31.8657879, -109.4118481",
"31.8610354, -109.3382328", #10
"31.970572, -109.3321154",
"18.373368, -68.815373",
"18.70751, -68.44678",
"-33.4984608, -58.7883568",
"-23.7007266, -65.6840385", # 15 Cuesta de Lipan
"-34.5230615, -58.4680527",
"42.4587784, -73.3565011",
"37.0455199, -76.4641169",
"-34.9688638, -58.5851879" # 19 Estancia Villa Maria
)
stillMissing$extractedCoords <- missingXY_vec
# Apply same str_split approach as above
coords_split <- str_split(stillMissing$extractedCoords, ",")
for (i in 1:nrow(stillMissing)) {
stillMissing$lat[i] <- as.numeric(coords_split[[i]][1])
stillMissing$long[i] <- as.numeric(coords_split[[i]][2])
}
# Join all coords together in a single df
loc1 <- locations %>%
filter(!LocID %in% loc_noCoords$LocID)
loc2 <- loc_noCoords %>%
filter(!is.na(lat)) %>%
rbind(stillMissing) %>%
select(LocID, lat, long)
loc_coords <- rbind(loc1, loc2)
# Join coords to life list df
lifelist <- lifelist %>%
left_join(loc_coords, by = "LocID")
Weather data will be incorporated into the ‘Checklists’ table. rebird currently has no wrapper for accessing individual checklist information. This function accesses a checklist based on the SubID field from the life list…
library(httr)
library(jsonlite)
api_token <- "jrel4sdsnpqv" # personal eBird API token
## -- Custom func to get checklist info using eBird api
getChecklist <- function(subId) {
url <- paste0("https://api.ebird.org/v2/product/checklist/view/", subId)
response <- httr::GET(url, httr::add_headers(`X-eBirdApiToken` = api_token))
if (httr::status_code(response) == 200) {
data <- jsonlite::fromJSON(httr::content(response, "text", encoding = "UTF-8"))
return(data)
} else {
stop("Failed to fetch data: ", httr::status_code(response))
}
}
…this function then takes that checklist information and accesses weather information using the Open-Meteo API.
## ------- Custom func to get weather information given an eBird checklist as input
getWeather <- function(checklist) {
locid <- checklist$locId
coords.df <- lifelist %>%
filter(LocID == locid)
lat <- as.vector(unlist(coords.df[1, "lat"]))
long <- as.vector(unlist(coords.df[1, "long"]))
obs_datetime <- checklist$obsDt
obs_POSIXct <- as.POSIXct(obs_datetime)
split_datetime <- strsplit(obs_datetime, " ")
obsDate <- unlist(split_datetime)[1]
# Open-Meteo API to get historical weather records
weatherInfo <- openmeteo::weather_history(location = c(lat, long),
start = obsDate,
end = obsDate,
hourly = c("temperature_2m", "windspeed_10m", "cloudcover", "precipitation")
)
weatherInfo$datetime_char <- as.character(weatherInfo$datetime)
# Round the checklist time to nearest hour to be able to match with a datetime from the hourly weather output
rounded_obsTime <- as.character(round(obs_POSIXct, units = "hours"))
weatherInfo <- weatherInfo %>%
filter(datetime_char == rounded_obsTime) %>%
select(-datetime, -datetime_char)
return(weatherInfo)
}
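Chained together, the two functions turn a submission ID into a one-row data frame of hourly weather (the SubID below is a hypothetical placeholder):
# Hypothetical submission ID; any SubID from the life list works
chk <- getChecklist("S00000000")
getWeather(chk) # temperature, windspeed, cloudcover, precipitation at the rounded hour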
# Create df to hold weather data with unique identifier
weather.df <- data.frame(checklistID = unique(lifelist$SubID),
hourly_temperature_2m = NA,
hourly_windspeed_10m = NA,
hourly_cloudcover = NA,
hourly_precipitation = NA)
# Utilize custom functions in for loop to access weather data for each checklist
for (i in seq_along(unique(lifelist$SubID))) {
  subID <- unique(lifelist$SubID)[i]
  checklist <- getChecklist(subID)
  weather <- tryCatch(getWeather(checklist),
                      error = function(e) {
                        cat("Error in iteration:", i, "\n checklist:", subID, "\n")
                        NULL # return NULL so the guard below skips this row
                      }
  )
  # Skip assignment if the API call failed or no hour matched
  if (is.null(weather) || nrow(weather) == 0) next
  weather.df$hourly_temperature_2m[i] <- weather$hourly_temperature_2m
  weather.df$hourly_windspeed_10m[i] <- weather$hourly_windspeed_10m
  weather.df$hourly_cloudcover[i] <- weather$hourly_cloudcover
  weather.df$hourly_precipitation[i] <- weather$hourly_precipitation
}
National Geographic MapMaker has an excellent biomes and ecoregions layer available through the ArcGIS portal (2017), which I initially exported from Pro as a shapefile to bring into R. (A vector layer I had downloaded from the journal “Vegetation Classification and Survey” - before realizing a similar and superior layer was available through ArcGIS - was inadequate for the task, i.e., inaccurate boundaries at small scale resulted in many birds not being captured in the spatial join.)
For some reason the spatial join I attempted in R failed. In the interest of time I simply reverted to Pro to carry out the join and determine the biome and ecoregion within which each observation location falls. This information will be used in the ‘Locations’ table of the database.
library(sf)
# Read in biome vectors and convert to sf object
shp <- st_read("biomes_ecoregions/biomes.shp") # ESRI biomes layer
biomes_sf <- st_as_sf(shp)
# Convert life list to sf object
lifers_sf <- st_as_sf(lifelist, coords = c("long", "lat"), crs = 4326)
# Intersect
# lifelist_biomes <- st_intersection(lifers_sf, biomes_sf)
# Still some points not captured in intersection (70 obs total)
## This process took forever (stopped it after 10 min) -- much more efficient to export lifers as shp and run the join in ArcGIS Pro
# lifers_biomes <- st_is_within_distance(lifers_sf, biomes_sf, dist = 2000)
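For reference, had the all-R route panned out, one workable approach is to take the first intersecting polygon per point, then fall back to the nearest polygon for points landing just outside a boundary. An untested sketch (assuming the shapefile carries the same BIOME_NAME/ECO_NAME fields as the ArcGIS layer, and noting st_nearest_feature() may also be slow on a global polygon layer):
# Index of the first intersecting biome polygon per point (NA if none)
idx <- sapply(st_intersects(lifers_sf, biomes_sf),
              function(z) if (length(z) > 0) z[1] else NA_integer_)
# Assign points that missed every polygon to their nearest polygon
miss <- is.na(idx)
idx[miss] <- st_nearest_feature(lifers_sf[miss, ], biomes_sf)
lifers_sf$BIOME_NAME <- biomes_sf$BIOME_NAME[idx]
lifers_sf$ECO_NAME <- biomes_sf$ECO_NAME[idx]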
This code simply connects to the geodatabase and reads in the spatial join layer, which then gets incorporated into the Locations Table.
library(arcgisbinding)
arc.check_product()
## product: ArcGIS Pro (13.2.0.49743)
## license: Advanced
## version: 1.0.1.306
gdb_path <- "C:/Users/andre/Desktop/myeBird/Ebird_workingProj/Ebird_workingProj.gdb"
fc <- arc.open(paste0(gdb_path, "/lifers_SpatialJoin"))
biomes_df <- arc.select(fc)
rebird can be used to access taxonomic information.
tax <- rebird::ebirdtaxonomy()
#tax <- tax %>%
#select(sciName, comName, comNameCodes, familyComName, familySciName)
myLifers <- lifelist$`Scientific Name`
taxonomy.df <- tax %>%
filter(sciName %in% myLifers)
# 4 letter codes are spread across two fields in eBird taxonomy dataset (banding code and common name code)
# Note that what I am doing here returns something different from the species_code() function already available in `rebird`
## Data frame to hold 4-letter codes
birdCodes <- data.frame(sciName = myLifers,
code = NA)
for (i in seq_along(myLifers)) {
currentSpecies <- myLifers[i]
taxSubset <- taxonomy.df %>%
filter(sciName == currentSpecies) %>%
select(bandingCodes, comNameCodes)
# Select a 4-letter code with preference for banding codes; default to NA if none available
# Check length first to avoid a zero-length condition if the species is missing from the subset
speciesCode <- if (length(taxSubset$bandingCodes) > 0 && !is.na(taxSubset$bandingCodes)) {
taxSubset$bandingCodes
} else if (length(taxSubset$comNameCodes) > 0 && !is.na(taxSubset$comNameCodes)) {
taxSubset$comNameCodes
} else {
NA
}
birdCodes$code[birdCodes$sciName == currentSpecies] <- speciesCode
}
# Join codes back to taxonomy df
taxonomy.df <- taxonomy.df %>%
left_join(birdCodes, join_by(sciName)) %>%
select(sciName, comName, familyComName, familySciName, code, speciesCode)
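As a point of comparison, rebird’s built-in species_code() returns the eBird alphanumeric code rather than a 4-letter code:
# Returns the eBird species code (e.g. "bkcchi" for Black-capped Chickadee),
# not a banding/common-name code
rebird::species_code("Poecile atricapillus")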
Fields to be used in ‘Checklists Table’: outing duration, number of species seen, and date.
# Use function getChecklist(), created above
## Many obs from same checklist; create vector of unique checklist IDs
uniqueChecklists <- lifelist %>%
distinct(SubID) %>%
pull(SubID)
# Data frame to hold checklist information
checklist.df <- data.frame(subID = uniqueChecklists,
checklist_id = NA,
loc_id = NA,
duration = NA,
numSpecies = NA,
dateTime = NA,
comments = NA)
for (i in seq_len(nrow(checklist.df))) {
currentChecklist <- uniqueChecklists[i]
listInfo <- getChecklist(currentChecklist)
checklist.df$checklist_id[i] <- listInfo$checklistId
checklist.df$loc_id[i] <- listInfo$locId
checklist.df$duration[i] <- if (!is.null(listInfo$durationHrs)) {
listInfo$durationHrs
} else {
NA
}
checklist.df$numSpecies[i] <- listInfo$numSpecies
checklist.df$dateTime[i] <- listInfo$obsDt
checklist.df$comments[i] <- if (!is.null(listInfo$comments)) {
listInfo$comments
} else {
NA
}
}
Split the data into four database tables and export as CSV to integrate into Postgres.
CREATE TABLE "species" (
"species_id" integer PRIMARY KEY,
"commonName" varchar,
"speciesCode" varchar,
"sciName" varchar,
"family" varchar,
"familySci" varchar,
"conservation_status" varchar, -- still waiting for a response from IUCN to my request for data access
"obs_id" integer
);
I want this table ordered taxonomically, so the first step is to add fields from the taxonomy df, then join missing fields from the lifelist df based on scientific or common name, and finally reorder columns to fit the desired database schema. The ‘species_id’ primary key field will be defined as the eBird alphanumeric species code.
speciesTable <- data.frame(species_id = taxonomy.df$speciesCode,
sciName = taxonomy.df$sciName,
speciesCode = taxonomy.df$code,
family = taxonomy.df$familyComName,
familySci = taxonomy.df$familySciName,
conservationStatus = NA
)
# Subset desired lifelist fields
p <- lifelist %>%
select(`Scientific Name`, `Common Name`, ObsID)
speciesTable <- speciesTable %>%
left_join(p, join_by("sciName" == "Scientific Name")) %>%
relocate(`Common Name`, .before = sciName) %>%
rename(obs_id = ObsID,
commonName = `Common Name`)
CREATE TABLE "observations" (
"obs_id" integer PRIMARY KEY,
"common_name" varchar,
"date" date,
"loc_id" varchar,
"checklist_id" varchar
);
obsTable <- lifelist %>%
select(ObsID, `Common Name`, `Date`, LocID, SubID) %>%
rename(obs_id = ObsID,
common_name = `Common Name`,
date = Date,
loc_id = LocID,
sub_id = SubID) %>%
mutate(date = lubridate::dmy(date))
CREATE TABLE "locations" (
"loc_id" varchar PRIMARY KEY,
"name" varchar,
"lat" double precision,
"long" double precision,
"biome" varchar,
"ecoregion" varchar
);
# Subset biome data
biome <- biomes_df %>%
select(LocID, BIOME_NAME, ECO_NAME)
locTable <- lifelist %>%
select(LocID, Location, lat, long) %>%
left_join(biome, by = "LocID") %>%
rename(loc_id = LocID,
name = Location,
biome = BIOME_NAME,
ecoregion = ECO_NAME) %>%
group_by(loc_id) %>% # Ensure only 1 row per location
slice(1)
CREATE TABLE "checklists" (
"sub_id" varchar PRIMARY KEY,
"checklist_id" varchar,
"duration" integer,
"num_species" integer,
"date" date,
"temperature" float,
"wind" float,
"precipitation" float,
"cloudcover" float,
"loc_id" varchar
);
checklistTable <- checklist.df %>%
left_join(weather.df, by = c("subID" = "checklistID")) %>%
rename(sub_id = subID,
temperature = hourly_temperature_2m,
wind = hourly_windspeed_10m,
precipitation = hourly_precipitation,
cloudcover = hourly_cloudcover) %>%
mutate(dateTime = lubridate::ymd_hm(dateTime))
Integrating with Postgres requires the RPostgres package. The database must be created first, either via the command line or a Postgres GUI like pgAdmin.
Using command line:
psql -U postgres
CREATE DATABASE lifelist;
\c lifelist;
Having created the database, you can connect to it directly from R using the RPostgres and DBI libraries.
library(RPostgres)
library(DBI)
connection <- DBI::dbConnect(RPostgres::Postgres(),
dbname = "lifelist",
host = "localhost",
port = 5432,
user = "postgres",
password = "10161485")
DBI::dbWriteTable(connection, "species", speciesTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "observations", obsTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "locations", locTable, overwrite = TRUE, row.names = FALSE)
DBI::dbWriteTable(connection, "checklists", checklistTable, overwrite = TRUE, row.names = FALSE)
query_test <- dbGetQuery(connection, 'SELECT "sciName" FROM species LIMIT 5')
print(query_test)
Successful execution of each statement below will result in an output of “0”, indicating that 0 rows were affected by the operation.
# Define primary keys
dbExecute(connection, "ALTER TABLE species ADD PRIMARY KEY (species_id);")
dbExecute(connection, "ALTER TABLE observations ADD PRIMARY KEY (obs_id);")
dbExecute(connection, "ALTER TABLE locations ADD PRIMARY KEY (loc_id);")
dbExecute(connection, "ALTER TABLE checklists ADD PRIMARY KEY (sub_id);")
The syntax for defining foreign keys is a bit less intuitive. Consider the following:
# Define foreign keys
dbExecute(connection, "
ALTER TABLE species
ADD CONSTRAINT fk_obs_id
FOREIGN KEY (obs_id) REFERENCES observations(obs_id);"
)
ALTER TABLE species:
Specifies that the species table is to be altered.
ADD CONSTRAINT fk_obs_id:
Instructs Postgres to add a new constraint to the species table (i.e., a rule that must be followed by the data). The constraint is named fk_obs_id.
FOREIGN KEY (obs_id) REFERENCES observations(obs_id):
Defines a new foreign key for the species table on the column obs_id. This field is intended to reference a primary key in another table; in this case, the observations table.
# Define the rest of the foreign keys
dbExecute(connection, "
ALTER TABLE observations
ADD CONSTRAINT fk_loc_id
FOREIGN KEY (loc_id) REFERENCES locations(loc_id);"
)
dbExecute(connection, "
ALTER TABLE observations
ADD CONSTRAINT fk_checklist_id
FOREIGN KEY (sub_id) REFERENCES checklists(sub_id);"
)
dbExecute(connection, "
ALTER TABLE checklists
ADD CONSTRAINT fk_loc_id
FOREIGN KEY (loc_id) REFERENCES locations(loc_id);"
)
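With the constraints in place, Postgres will reject rows that violate them. For example, inserting an observation that points at a nonexistent location or checklist should now fail with a foreign key violation (the IDs below are hypothetical):
# Expected to error: 'L_FAKE' and 'S_FAKE' do not exist in the referenced tables
dbExecute(connection, "
INSERT INTO observations (obs_id, common_name, date, loc_id, sub_id)
VALUES (99999, 'Test Bird', '2024-01-01', 'L_FAKE', 'S_FAKE');"
)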
PostGIS adds support for geographic objects, allowing location queries to be run in SQL. This requires geometries to be stored somewhere in the database. In this case, the most sensible table in which to incorporate a geom field is the locations table.
Once PostGIS is installed, the extension must be enabled for each new database that will hold spatial data.
dbExecute(connection, "CREATE EXTENSION IF NOT EXISTS postgis;")
# Check the PostGIS version to confirm installation
postgis_version <- dbGetQuery(connection, "SELECT PostGIS_full_version();")
print(postgis_version)
Now that it is enabled for the lifelist db, the geom field can
be added to the locations table.
dbExecute(connection, "ALTER TABLE locations ADD COLUMN geom geometry(Point, 4326);")
# Update the geometry column with latitude and longitude data
dbExecute(connection, "UPDATE locations SET geom = ST_SetSRID(ST_MakePoint(long, lat), 4326);")
# ST_SetSRID is a PostGIS function, more or less equivalent to st_set_crs() from `sf` package
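With the geometry column populated, location queries can now run directly in SQL. For example, a radius search around an arbitrary point (casting to geography makes ST_DWithin interpret the distance in meters):
# Locations within ~50 km of an arbitrary point (long = -76.5, lat = 37.0)
dbGetQuery(connection, "
SELECT loc_id, name
FROM locations
WHERE ST_DWithin(geom::geography,
ST_SetSRID(ST_MakePoint(-76.5, 37.0), 4326)::geography,
50000);"
)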
Because the objective here was to create an application that interacts with my database, I created a free tier account with Amazon Web Services RDS to host it.
To generate a .sql file from the local database (i.e., a ‘database dump’), access the system command line (not the PostgreSQL terminal, psql) and use the pg_dump command.
C:\Users\andre>pg_dump -h localhost -U postgres -d lifelist -f lifelist_dump.sql -W
After creating a new database instance on AWS RDS, I imported the database dump file using psql from the command line by first connecting to the database instance:
psql -h [endpoint of instance] -p 5432 -U postgres -d lifelist -W
Then, within the psql session in the command shell, I ran the following command to import the database dump:
\i lifelist_dump.sql
Trying to run a SQL query to test the success of the import initially returned an error that no such relation existed. This was because the ‘search_path’ config parameter was not set. To fix this, I ran the following command; here, ‘public’ refers to the schema in which the tables are stored.
SET search_path TO public;
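With the search path set, a quick count confirms the tables made it across:
SELECT COUNT(*) FROM observations;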
Eventually I will get around to this last step…!
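In the meantime, here is a minimal sketch of what the app’s database plumbing might look like (the RDS endpoint is a placeholder, and the password is read from an environment variable rather than hard-coded):
library(shiny)
library(DBI)
library(RPostgres)
# Placeholder endpoint; the real value comes from the RDS console
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "lifelist",
                 host = "my-instance.xxxxxxxx.us-east-1.rds.amazonaws.com",
                 port = 5432,
                 user = "postgres",
                 password = Sys.getenv("LIFELIST_DB_PW"))
ui <- fluidPage(
  selectInput("family", "Family:", choices = NULL),
  tableOutput("species")
)
server <- function(input, output, session) {
  # Populate the dropdown from the species table
  updateSelectInput(session, "family",
                    choices = dbGetQuery(con, "SELECT DISTINCT family FROM species ORDER BY family;")$family)
  # Parameterized query ($1) guards against SQL injection from user input
  output$species <- renderTable({
    req(input$family)
    dbGetQuery(con, 'SELECT "commonName", "sciName" FROM species WHERE family = $1;',
               params = list(input$family))
  })
}
shinyApp(ui, server)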