PostgreSQL is a relational database management system, similar to MySQL, and PostGIS is an extension that adds support for geographic objects. I have frequently heard about the benefits of using spatial databases rather than a collection of shapefiles, and the PostgreSQL/PostGIS combo was particularly attractive since it's open source. Despite this, I've avoided using these tools because I use R almost exclusively for my spatial work and it seemed like a pain to connect the two. Well, no more: I've invested the time to get these tools up and running and talking to each other on my system.
This is the first in a series of posts on working with PostGIS in R. Here I'll focus on getting set up.
A note on R
This is primarily a tutorial on using PostGIS through R; however, much of the information should be useful to those using PostGIS through another interface. Throughout this post I'll use the following R packages:
library(dplyr)
library(RPostgreSQL) # installation instructions below
library(sp)          # defines R's spatial classes
library(raster)      # makes for nicer printing of spatial objects
library(rgdal)       # installation instructions below
library(rgeos)       # installation instructions below
library(viridis)     # better colour palettes
Installation and Setup
PostgreSQL and PostGIS
KyngChaos has precompiled Mac OS X binaries for all sorts of software, including PostgreSQL and PostGIS. After installation I needed to add the PostgreSQL binaries to my PATH by adding a line to .bash_profile in my home directory.
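For example, the line might look like the following; note that the directory is an assumption here, so substitute wherever your installer actually placed the PostgreSQL binaries:

```shell
# Append the PostgreSQL binaries to the PATH.
# NOTE: /usr/local/pgsql/bin is an assumed install location; adjust as needed.
export PATH="/usr/local/pgsql/bin:$PATH"
```

After editing .bash_profile, open a new Terminal session (or run source ~/.bash_profile) so the change takes effect.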
Creating a database
Connect to PostgreSQL using the command line tool psql and the default user postgres:

psql -U postgres
Next you can add a new database named gistest:

CREATE DATABASE gistest;
The command \l lists existing databases, so you can check that the new database was created correctly. You can connect to the new database with \c gistest.
According to the official PostGIS installation instructions, “PostGIS is an optional extension that must be enabled in each database you want to use it in before you can use it”. To enable it the following commands need to be run:
-- Enable PostGIS (includes raster)
CREATE EXTENSION postgis;
-- Enable Topology
CREATE EXTENSION postgis_topology;
Now spatial objects can be created and topological operations performed. Again referring to the official PostGIS installation instructions, you can run the following queries to check that everything is running properly:
-- create table with spatial column
CREATE TABLE mytable (
  id SERIAL PRIMARY KEY,
  geom GEOMETRY(Point, 26910),
  name VARCHAR(128)
);

-- add a spatial index
CREATE INDEX mytable_gix ON mytable USING GIST (geom);

-- add points
-- the srid argument specifies the spatial reference system, i.e. projection
INSERT INTO mytable (geom, name) VALUES
  (ST_GeomFromText('POINT(0 0)', 26910), 'Site A'),
  (ST_GeomFromText('POINT(1500 1500)', 26910), 'Site B'),
  (ST_GeomFromText('POINT(1500 -1500)', 26910), 'Site C');

-- query for nearby points
SELECT id, name
FROM mytable
WHERE ST_DWithin(
  geom,
  ST_GeomFromText('POINT(0 0)', 26910),
  1000
);
Next I installed the R package RPostgreSQL, which allows R to interact with PostgreSQL:

install.packages("RPostgreSQL")
Connecting to a PostgreSQL database
To work with a PostgreSQL database with RPostgreSQL you'll need to create a connection object, which will be passed as a parameter to functions querying the database.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user = "postgres", dbname = "gistest",
                 host = "localhost")
The majority of RPostgreSQL tutorials I've found online (including the official one here) show connecting with default parameters, e.g. dbConnect(drv, dbname = 'spatial'). This didn't work for me since it defaults to host = 'local' instead of host = 'localhost' and uses the username you're logged in to your system with (in my case matt) rather than the default PostgreSQL username postgres.
Now that we have a valid connection we can check that it’s working by getting a list of available tables.
dbListTables(con)
#>  "spatial_ref_sys" "topology" "layer"
#>  "mytable" "spatial_table" "constrained_table"
#>  "countries" "countries2"
The first three tables (spatial_ref_sys, topology, and layer) are system tables generated when installing the PostGIS spatial extensions. mytable is the table we created when testing PostGIS at the command line. Everything appears to be working!
You’ll need to close the connection when you’re done using it.
dbDisconnect(con)
#>  TRUE
RPostgreSQL is a general purpose tool for connecting to PostgreSQL databases, but it isn't spatially aware and doesn't have explicit methods for working with the spatial objects provided by the PostGIS extensions. The Geospatial Data Abstraction Library (GDAL) is a set of open source tools for translating between vector and raster spatial formats, including the PostGIS format. rgdal is an R interface to GDAL.
Getting GDAL and rgdal up and running on Mac OS X can be a challenge. These are the steps I followed. It's likely that you'll also want rgeos, the R interface to GEOS, which provides lots of great tools for topology operations (e.g. union, intersection, etc.). Installing GEOS and rgeos is an almost identical process, so it's worth doing them both at once:
- First you’ll need to download and install the GDAL framework, and KyngChaos again comes to the rescue with precompiled binaries.
- Once the framework is installed, you can try installing rgdal, being sure to use type = 'source'.
install.packages("rgdal", type = "source")
- If this miraculously works, you're done. More likely it'll fail, and you'll have to download the package source (something like rgdal_1.1-10.tar.gz) from CRAN.
- Open the Terminal app and navigate to the directory where you downloaded the source.
- Install the package by running the following in the Terminal, making sure to replace the tar.gz filename with the one you downloaded:

sudo R CMD INSTALL rgdal_1.1-10.tar.gz \
  --configure-args='--with-gdal-config=/Library/Frameworks/GDAL.framework/unix/bin/gdal-config --with-proj-include=/Library/Frameworks/PROJ.framework/unix/include --with-proj-lib=/Library/Frameworks/PROJ.framework/unix/lib'
These last few steps are courtesy of this StackOverflow response.
Checking the connection to PostGIS
Pulling vector data from the database into R with rgdal just requires constructing a DataSource Name (DSN) string specific to your database. Details on constructing the DSN are available from GDAL. rgdal::readOGR() is used to convert vector files into R's standard spatial objects as defined in the sp package.
dsn <- "PG:dbname='gistest' host='localhost' port='5432' user='postgres'"

# list available spatial tables in database
ogrListLayers(dsn)
#>  "mytable" "spatial_table" "constrained_table"
#>  "countries" "countries2"
#> attr(,"driver")
#>  "PostgreSQL"
#> attr(,"nlayers")
#>  5

# read from PostGIS into R
pts <- readOGR(dsn = dsn, "mytable")
#> OGR data source with driver: PostgreSQL
#> Source: "PG:dbname='gistest' host='localhost' port='5432' user='postgres'", layer: "mytable"
#> with 3 features
#> It has 1 fields

pts
#> class       : SpatialPointsDataFrame
#> features    : 3
#> extent      : 0, 1500, -1500, 1500 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=utm +zone=10 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> variables   : 1
#> names       : name
#> min values  : Site A
#> max values  : Site C
The last example was borrowed from this useful StackOverflow response.
Working with PostGIS
You should now have all the tools necessary to work with PostGIS within R. Next I’ll demonstrate some of the functionality. Start by setting up a connection to the database.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user = "postgres", dbname = "gistest",
                 host = "localhost")
Reading and writing data
dbWriteTable() writes a data frame to a table in the PostgreSQL database, dbListTables() lists all tables in a given database, and dbListFields() lists all the fields in a given table. Here I write the built-in iris data frame to the Postgres database.
dbWriteTable(con, "iris", iris)
#>  TRUE
dbListTables(con)
#>  "spatial_ref_sys" "topology" "layer"
#>  "mytable" "spatial_table" "constrained_table"
#>  "countries" "countries2" "iris"
dbListFields(con, "iris")
#>  "row.names" "Sepal.Length" "Sepal.Width" "Petal.Length"
#>  "Petal.Width" "Species"
An entire table can be read from the database into a data frame with dbReadTable():

db_iris <- dbReadTable(con, "iris")
head(db_iris) %>%
  kable()
# compare the database table with the original data frame
glimpse(db_iris)
#> Observations: 150
#> Variables: 5
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
#> $ Species      <chr> "setosa", "setosa", "setosa", "setosa", "setosa",...
glimpse(iris)
#> Observations: 150
#> Variables: 5
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
#> $ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
Note that the factor variable iris$Species has been converted to character, since the database can't store factor variables.
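This round trip can be simulated without touching the database: converting a factor to character mimics what comes back from PostgreSQL, and the factor can be rebuilt manually. A quick sketch using the built-in iris data (note that factor() sorts levels alphabetically, which happens to match the original ordering here):

```r
# What the database returns: the factor flattened to a character vector
chr_species <- as.character(iris$Species)
class(chr_species)
#> [1] "character"

# Rebuild the factor by hand after reading the table back
restored <- factor(chr_species)
identical(levels(restored), levels(iris$Species))
#> [1] TRUE
```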
In addition to extracting entire tables, it’s also possible to send an SQL query to the database and get the results in a data frame.
dbSendQuery() submits an SQL query to the database and returns a result set, but it doesn't extract any records directly. dbFetch() returns the next chunk of records from the result set, dbHasCompleted() indicates whether the end of the result set has been reached, and dbClearResult() clears the result set when you're finished with it.
For those unfamiliar with SQL, it's super easy to learn the basics and there are a variety of tutorials online.
rs <- dbSendQuery(con, "SELECT * FROM iris WHERE \"Species\" = 'setosa';")
iris_chunk <- dbFetch(rs, n = 2)
kable(iris_chunk)
# get next 2 records
iris_chunk <- dbFetch(rs, n = 2)
kable(iris_chunk, row.names = F)
dbGetRowCount(rs)
#>  4
dbHasCompleted(rs)
#>  FALSE
# get all remaining records
iris_chunk <- dbFetch(rs, n = -1)
nrow(iris_chunk)
#>  46
dbHasCompleted(rs)
#>  TRUE
dbClearResult(rs)
#>  TRUE
dbFetch(rs, n = -1) returns all pending rows in the result set.
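Putting these functions together, a common pattern is to loop over the result set in fixed-size chunks until it's exhausted. This is a sketch only: it assumes the open connection con and the iris table from above, so it needs a live database to run.

```r
# Process a large result set chunk by chunk instead of loading it all at once
rs <- dbSendQuery(con, "SELECT * FROM iris;")
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, n = 50)  # grab the next 50 records
  # ... process chunk here ...
}
dbClearResult(rs)
```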
This example also brings up an oddity of PostgreSQL that may be unfamiliar to those coming from MySQL. As with SQL databases in general, PostgreSQL is case insensitive, so select col from table is the same as SELECT COL FROM TABLE. However, unlike MySQL, it seems PostgreSQL goes one step further and folds all unquoted identifiers in queries to lowercase. The catch is that RPostgreSQL creates tables with column names exactly as they appear in the data frame, including uppercase letters. For example, the variable iris$Species retains the uppercase S in the database and can't be referred to directly, because SELECT Species FROM iris is executed as SELECT species FROM iris.
The way to get around this is outlined in this blog post. The trick: single and double quotes have different uses in PostgreSQL. Single quotes are used to indicate a text string, while double quotes are used to name an identifier without changing its case. This is the reason behind the odd syntax \"Species\" = 'setosa'. Finally, don't forget to escape quotes when they're nested inside another set of the same type of quotes, e.g. "he said \"hi\" to her".
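To see how the two quote styles combine when building a query string in R, here's a minimal, database-free sketch (the column and value are just illustrative):

```r
# Double quotes preserve the identifier's case; single quotes delimit the
# SQL string literal. sprintf() keeps the escaping readable.
col <- "Species"  # a mixed-case column name
val <- "setosa"   # the value to match
query <- sprintf("SELECT * FROM iris WHERE \"%s\" = '%s';", col, val)
cat(query)
#> SELECT * FROM iris WHERE "Species" = 'setosa';
```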
If you want all the results of a query at once, dbGetQuery() submits the query, fetches the results, and clears the result set in a single call.
iris_all <- dbGetQuery(con, 'SELECT * FROM iris WHERE "Sepal.Length" > 7.6;')
head(iris_all)
#>   row.names Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1       118          7.7         3.8          6.7         2.2 virginica
#> 2       119          7.7         2.6          6.9         2.3 virginica
#> 3       123          7.7         2.8          6.7         2.0 virginica
#> 4       132          7.9         3.8          6.4         2.0 virginica
#> 5       136          7.7         3.0          6.1         2.3 virginica
Finally, tables can be deleted with dbRemoveTable():

dbRemoveTable(con, 'iris')
#>  TRUE
I’ll begin by creating some sample polygons to play with, in particular a random array of circular patches of varying sizes with randomly generated attribute data.
set.seed(1)
e <- extent(c(0, 100, 0, 100))
e <- as(e, 'SpatialPolygons')
proj4string(e) <- '+proj=wag4 +lon_0=0 +units=km' # projected coordinates
pts <- spsample(e, 100, type = 'random')
patches <- gBuffer(pts, byid = T, width = rnorm(length(pts), mean = 1)^2)
patches$continuous <- rnorm(length(patches))
patches$categorical <- sample(letters[1:10], length(patches), replace = T)
spplot(patches, 'continuous', col.regions = viridis(256))
Use rgdal::writeOGR() to load this spatial object into a PostGIS database. The layer parameter determines the name of the table that this object will be stored in.
dsn <- "PG:dbname='gistest' host='localhost' port='5432' user='postgres'"
writeOGR(patches, dsn, layer = 'patches', driver = 'PostgreSQL')
as.character(ogrListLayers(dsn))
#>  "mytable" "spatial_table" "constrained_table"
#>  "countries" "countries2" "patches"
The spatial object is now stored as a table in the database, with each feature occupying a row and each attribute a column. Two additional fields are created: wkb_geometry stores the geometry of the feature in Well-Known Binary (WKB) format and ogc_fid stores a unique feature ID. The structure of this table can be examined with RPostgreSQL:
dbListTables(con)
#>  "spatial_ref_sys" "topology" "layer"
#>  "mytable" "spatial_table" "constrained_table"
#>  "countries" "countries2" "patches"
dbListFields(con, 'patches')
#>  "ogc_fid" "wkb_geometry" "continuous" "categorical"
rs <- dbSendQuery(con, "SELECT * FROM patches;")
#> Warning in postgresqlExecStatement(conn, statement, ...): RS-DBI driver
#> warning: (unrecognized PostgreSQL field type geometry (id:50703) in column
#> 1)
dbColumnInfo(rs)
#> Warning in postgresqlDescribeFields(res, ...): RS-DBI driver warning:
#> (unknown (50703))
#>           name    Sclass    type len precision scale nullOK
#> 1      ogc_fid   integer INTEGER   4        -1    -1  FALSE
#> 2 wkb_geometry character UNKNOWN  -1 230633996    -1   TRUE
#> 3   continuous    double  FLOAT8   8        -1    -1   TRUE
#> 4  categorical character VARCHAR  -1        -1    -1   TRUE
dbClearResult(rs)
#>  TRUE
A warning is raised because RPostgreSQL is not spatially aware, so it doesn't recognize geometry fields. If we want to get this spatial data into R we'll need to use rgdal, which reads the data from the PostGIS database and converts the table into an R object.
dsn <- "PG:dbname='gistest' host='localhost' port='5432' user='postgres'"
patches_postgis <- readOGR(dsn = dsn, "patches")
#> OGR data source with driver: PostgreSQL
#> Source: "PG:dbname='gistest' host='localhost' port='5432' user='postgres'", layer: "patches"
#> with 100 features
#> It has 2 fields
class(patches_postgis)
#>  "SpatialPolygonsDataFrame"
#> attr(,"package")
#>  "sp"
Keys and indexes
In the context of relational databases (such as PostgreSQL and MySQL), a primary key is a column in a table that uniquely identifies rows in that table. The primary key is used to access records in the table and to define relationships between tables.
Indexes are a means of optimizing database performance. They are generally created based on columns in the table that are frequently queried and, like an index in a book, provide faster lookup within these columns.
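As a concrete illustration in PostgreSQL syntax, a primary key can be declared when a table is created and an index added afterwards. The sites table below is hypothetical, made up for this sketch and not used elsewhere in this post:

```sql
-- hypothetical table: the primary key is declared at creation time
CREATE TABLE sites (
    site_id SERIAL PRIMARY KEY,
    name    VARCHAR(64)
);

-- an index on a column that will be queried frequently
CREATE INDEX sites_name_idx ON sites (name);
```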
We can look at the primary key and indexes for the patches table created above by logging into the database via the Terminal:

psql -U postgres
\c gistest
Then, executing the command \d patches will provide information about the table:
gistest=# \d patches
                Table "public.patches"
    Column    |           Type           |                         Modifiers
--------------+--------------------------+-----------------------------------------------------------
 ogc_fid      | integer                  | not null default nextval('patches_ogc_fid_seq'::regclass)
 wkb_geometry | geometry(Polygon,900914) |
 continuous   | double precision         |
 categorical  | character varying        |
Indexes:
    "patches_pkey" PRIMARY KEY, btree (ogc_fid)
    "patches_wkb_geometry_geom_idx" gist (wkb_geometry)
The type of wkb_geometry indicates that it contains polygon geometries, and the number (900914) gives the Spatial Reference ID (SRID), i.e. the projection and datum of the polygons. In addition to the column names and data types, we see that the feature ID column (ogc_fid) has been set as the primary key for this table by GDAL. Under Modifiers, it is also specified that ogc_fid is not allowed to be NULL and that, when new rows are inserted, its value is automatically set to the next integer in a sequence to ensure uniqueness.
In addition, GDAL has created an index based on the feature geometry (wkb_geometry). These spatial indexes are one of the strengths of using a spatial database: they index the spatial relationships between the bounding boxes of features, which speeds up topological operations. Have a look at this great discussion of spatial indexing for further details.
Finally, I delete the tables I created during this tutorial and disconnect from the database.
dbRemoveTable(con, 'iris')
#>  FALSE
dbRemoveTable(con, 'patches')
#>  TRUE
dbDisconnect(con)
#>  TRUE
dbUnloadDriver(drv)
#>  TRUE