August 14, 2018 From rOpenSci (https://deploy-preview-119--ropensci.netlify.app/technotes/2018/08/14/mongolite-20/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
This week version 2.0 of the mongolite package has been released to CRAN. Major new features in this release include support for MongoDB 4.0, GridFS, running database commands, and connection pooling.
Mongolite is primarily an easy-to-use client to get data in and out of MongoDB. However it supports increasingly many advanced features like aggregation, indexing, map-reduce, streaming, encryption, and enterprise authentication. The mongolite user manual provides a great introduction with details and worked examples.
New in version 2.0 is support for the MongoDB GridFS system. A GridFS is a special type of Mongo collection for storing binary data, such as files. To the user, a GridFS looks like a key-value server with potentially very large values.
We support two interfaces for sending/receiving data from/to GridFS. The fs$read()
and fs$write()
methods are the most flexible and can stream data from/to an R connection, such as a file, socket or url. These methods support a progress counter and can be interrupted if needed. These methods are recommended for reading or writing single files.
# Assume 'mongod' is running on localhost
fs <- gridfs()
buf <- serialize(nycflights13::flights, NULL)
fs$write(buf, 'flights')
#> [flights]: written 45.11 MB (done)
#> List of 6
#> $ id : chr "5b70a37a47a302506117c352"
#> $ name : chr "flights"
#> $ size : num 45112028
#> $ date : POSIXct[1:1], format: "2018-08-12 23:15:38"
#> $ type : chr NA
#> $ metadata: chr NA
# Read serialized data:
out <- fs$read('flights')
flights <- unserialize(out$data)
# [flights]: read 45.11 MB (done)
#> A tibble: 6 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
#> 1 2013 1 1 517 515 2 830 819 11 UA 1545
#> 2 2013 1 1 533 529 4 850 830 20 UA 1714
#> 3 2013 1 1 542 540 2 923 850 33 AA 1141
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
#> 5 2013 1 1 554 600 -6 812 837 -25 DL 461
#> 6 2013 1 1 554 558 -4 740 728 12 UA 1696
#> # ... with 8 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
The fs$upload()
and fs$download()
methods on the other hand copy directly between GridFS and your local disk. This API is vectorized so it can transfer many files at once. Hover over to the gridfs chapter in the manual for more examples.
MongoDB exposes an enormous number of database commands, and mongolite cannot provide wrappers for each command. As a compromise, the new version of mongolite exposes an api to run raw commands, so you can manually run the commands for which we do not expose wrappers.
The result data from the commannd automatically gets simplified into nice R data structures using the jsonlite simplification system (but you can set simplify = FALSE
if you prefer json structures).
m <- mongo()
col$run('{"ping": 1}')
#> $ok
#> [1] 1
For example we can run the listDatabases
command to list all db’s on the server:
admin <- mongo(db = "admin")
admin$run('{"listDatabases":1}')
#> $databases
#> name sizeOnDisk empty
#> 1 admin 32768 FALSE
#> 2 config 73728 FALSE
#> 3 local 73728 FALSE
#> 4 test 72740864 FALSE
#>
#> $totalSize
#> [1] 72921088
#>
#> $ok
#> [1] 1
The server tools chapter in the manual has some more examples.
Finally another much requested feature was connection pooling. Previously, mongolite would open a new database connection for each collection object in R. However as of version 2.0, mongolite will use existing connections to the same database when possible.
# This will use a single database connection
db.flights <- mongolite::mongo("flights")
db.iris <- mongolite::mongo("iris")
db.files <- mongolite::gridfs()
A database connection is automatically closed when all objects that were using it have either been removed, or explicitly disconnected with the disconnect()
method. For example using the example above:
# Connection still alive
rm(db.flights)
db.files$disconnect()
# Now it will disconnect
db.iris$disconnect()
Mongolite collection and GridFS objects automatically reconnect if when needed if they are disconnected (either explicitly or automatically e.g. when restarting your R session). For example:
> db.files$find()
Connection lost. Trying to reconnect with mongo...
#> id name size date type metadata
#> 1 5b70a37a47a302506117c352 flights 45112028 2018-08-12 23:15:38 <NA> <NA>