The intent of my lecture in the Certificate of Advanced Studies in Data Analysis at the Bern University of Applied Sciences is to present a practitioner's perspective, along with some introductory background on open data, the open data movement, and several real-world projects - with details of the data involved, legal conditions and technical challenges. This post is a refresh of part II of last year's course notes; the introductory lecture has changed to a lesser degree.
If you’re interested in taking the course, you can sign up for a future semester at CAS Datenanalyse | BFH
In the first week, we started the module by considering how attention to certain types of questions leads to virtuous cycles of data, information and knowledge, and how the opening of data activates this cycle. We covered definitions of open data, as well as the various types of licenses, guidelines and publication standards involved. Then our focus turned to Switzerland: the origins of the open data movement here, and the opportunities and challenges that exist with regard to public and government data.
After discussing the role of the community in validating use cases, we learned in part II how to use open data ourselves in a hands-on way, looking at what happens behind the scenes in open data portals and trying out some open source tools on datasets we researched together. But first, we began the class with a minute of silence for Alain Nadeau, a Swiss pioneer of open data whose obituary had been posted just a few hours earlier.
The screenshot above shows an example script from class, suggested as a homework assignment last week: use the ckanr library to search for and directly download open data through the opendata.swiss portal's CKAN API:
install.packages("ckanr")
library(ckanr)
# Initialise the CKAN library with a remote portal
ckanr_setup(url = "https://opendata.swiss")
# Run a search to get some data packages
x <- package_search(q = "name:arbeitslosenquote", rows = 1)
# Note that on the Swiss server the titles are multilingual
x$results[[1]]$title$de
# Get the URL of the first resource in the first package
tsv_url <- x$results[[1]]$resources[[1]]$download_url
# Download the remote (Tab Separated Values) data file
# ..and parse it in one step
raw_data <- read.csv(tsv_url, header = TRUE, sep = "\t")
# Plot the first column against the second
plot(raw_data[,2], raw_data[,1], type = "b")
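As a quick follow-up - a sketch reusing the same x search result as above - we can list the declared formats and names of every resource in a package before deciding what to download:

# List the declared format of each resource in the first package found
sapply(x$results[[1]]$resources, function(r) r$format)
# Each CKAN resource also carries fields such as name, url and description
sapply(x$results[[1]]$resources, function(r) r$name)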
As a “bonus”, students were also encouraged to try the “next generation” library datapackage-r described in the Frictionless Data Field Guide, with one of the data packages on datahub.io or openfood.schoolofdata.ch. Students reportedly had difficulty getting the (still new and not thoroughly tested) library to work. I explained that the example code shown with datahub.io datasets uses the jsonlite library directly to work with the Data Package specification, and will switch to the official Data Package library once it is mature enough. Here is a code snippet from datahub.io:
library("jsonlite")
json_file <- "http://datahub.io/core/cofog/datapackage.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
path_to_file = json_data$resources$path[1][1]
data <- read.csv(url(path_to_file))
print(data)
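Beyond the data itself, the descriptor carries metadata worth inspecting before reading any files - a small sketch, reusing the json_data object from above (fields such as licenses appear only where the publisher included them):

# Title and resource names from the datapackage.json descriptor
json_data$title
json_data$resources$name
# Licensing information, where the publisher has included it
json_data$licenses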
We then went through the CKAN code in some more detail, and ran some additional searches which led us to the issue of indirect linking. A dataset that interested the students (“Schutzwald”) was labeled as “CSV format”, yet its download link takes one to a separate geo-portal, on which it is possible (though not very intuitive) to get access to CSV formatted data. As we saw, this can be quite misleading for developers, who would expect direct links to the data.
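One way to spot such indirect links before parsing anything is to check the content type the server reports - a sketch using the httr package, with a hypothetical search for the dataset mentioned above:

library(httr)
# Take the first resource of the first package found
res <- package_search(q = "schutzwald", rows = 1)$results[[1]]$resources[[1]]
# Ask the server for the headers only, without downloading the body
head_response <- HEAD(res$download_url)
# A direct CSV link should report text/csv (or similar), while
# text/html usually means we have landed on a portal page instead
headers(head_response)$`content-type`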
To round off our technical discussion, prompted by student interest, I also ran a demonstration of the Wikidata Query Service (covered in some detail here), working through some SPARQL requests and explaining the interface - making note of its facility to generate code snippets such as this one, which makes use of Linked Open Data in R through the SPARQL library:
library(SPARQL) # SPARQL querying package
library(ggplot2)
endpoint <- "https://query.wikidata.org/sparql"
query <- '#Cats, with pictures
#added before 2016-10
#defaultView:ImageGrid
SELECT ?item ?itemLabel ?pic WHERE {
  ?item wdt:P31 wd:Q146.
  ?item wdt:P18 ?pic.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'
qd <- SPARQL(endpoint, query)
df <- qd$results
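A quick sanity check on the result: depending on the version, the SPARQL library may return URIs and image links wrapped in angle brackets, so a little post-processing can help (a small sketch; the gsub call is harmless if no brackets are present):

# How many cats with pictures came back?
nrow(df)
# Strip any angle brackets around the image URLs for easier handling
df$pic <- gsub("^<|>$", "", df$pic)
head(df)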
The differences between the CKAN-based opendata.swiss portal and other CKAN portals like the old DataHub.io, as well as the competing OpenDataSoft and Socrata platforms and the emerging Frictionless Data ecosystem, were outlined to the participants. In addition to covering some of the main features of open data portals, we discussed how to use R effectively to work with datasets from different sources.
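To illustrate that last point, here is a sketch - with a hypothetical dataset identifier - of how the portal type changes the download pattern while the R side stays much the same. OpenDataSoft portals expose a per-dataset CSV export endpoint, so no package metadata lookup is needed:

# OpenDataSoft: construct the CSV export URL directly
# ("some-dataset-id" is a hypothetical identifier)
ods_url <- paste0("https://data.opendatasoft.com/explore/dataset/",
                  "some-dataset-id", "/download/?format=csv")
# OpenDataSoft CSV exports are semicolon-separated by default
ods_data <- read.csv(ods_url, sep = ";")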
We then had a discussion on what makes an exemplary open data project. During the introduction, I mentioned the School of Data workshop that took place the previous day, and explained how our regular hackathon events work to crowdsource ideas and prototypes from the community. We looked in particular at last year's Open Data Day hackathon organised by the Zurich R User Group, where teams like Predict Delays used open data sources to create analytical models and publish them as easy-to-use Shiny apps, sharing the open source code on GitHub.
These hallmarks of open data development led us to launch into a mini-hackathon during the second half of the class, inspired by the make.opendata.ch events. We divided into teams of 3-4 people and took up roles (Expert - Designer - Developer), brainstormed and researched open data sources, and built rapid prototypes against a ticking countdown clock. About 45 minutes were spent on the whole exercise, followed by presentations and discussion of everything that was found - and, more interestingly, a hard look at the barriers that prevented teams from getting closer to the challenge they picked or using the data they wanted.
Notes on each of the four topics we addressed, along with the ideas, references and datasets researched and presented by the students, are shared in an internal Etherpad. I was impressed by the students' tenacity during this exercise and their willingness to work with and learn from each other; the observations from the group presentations made for excellent closing material to this semester's Open Data class.
Many thanks once again to all of the students who enthusiastically took part in the module, to the BFH staff and my fellow teachers of the CAS.