My Jupyter notebook for the Wikidata API workshop is now available on GitHub. Looking forward to all the sessions this morning!
Where it says “A Jupyter Widget” below, an image appears in its place when the notebook is run live inside Jupyter.
Wikidata API
This notebook is part of a Wikidata information & training event at the University of Zurich on 14.9.2017 (more details here). Our goal is to help people interested in Wikidata, coding and open data in general get started with Wikidata as a source and platform for their projects. This workshop focuses on using the capabilities of the MediaWiki API, as presented by Cristina Sarasua (slides), to extract entities and linked concepts from Wikidata.
Prepared by Oleg Lavrovsky (Datalets.ch), facilitating on behalf of the School of Data working group at Opendata.ch. If you have any feedback or questions, please raise them directly to me or on our forum. This notebook can also be viewed and forked on GitHub.
Prerequisites
We will be using the Python programming language for our demonstration. Popular with academic, industrial and hobby/beginner users alike, it lends itself well to the task of connecting to Internet data sources and doing some analysis or visualisation. The example code here should run in any modern Python interpreter, available on most computing platforms (Windows, Mac, Linux, etc.). We particularly encourage you to try Jupyter, which makes working with this kind of Python project especially pleasant.
For participants of the workshop, we have set up a shared “Jupyter Hub” server, so you don’t need to prepare anything except Internet connectivity and a reasonably modern web browser on your laptop. For those who wish to set up their own environment in a short amount of time, check out the official installation guide which suggests the use of Anaconda, or cloud-based offerings from Sandstorm.io, Google Cloud, Microsoft Azure or Domino Data Lab.
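If you do set up your own environment, the handful of packages used in this notebook can be installed in one go. A minimal sketch, assuming you already have Python and pip available:
pip install jupyter requests wikidata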
Goalposts
*Chroicocephalus ridibundus* by [Ximeg](https://commons.wikimedia.org/wiki/File:Chroicocephalus_ridibundus_in_Z%C3%BCrich_01.JPG), CC BY-SA 3.0
A little bird lives in the big city of Zürich, thirsting for knowledge. He doesn’t know how to read or to google, or even to ask a librarian for advice. Luckily for him, Wikidata is here to help: by creating structured data, all of the collected knowledge of Wikipedia can be reused in infinite ways. Through the magic of APIs, this data can also be accessed in a variety of ways. Let’s dig in and find out how!
Accessing data
Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.
As mentioned in the Introduction, Wikidata is central to the architecture of projects maintained by the Wikimedia Foundation. It is connected to the system behind Wikipedia and its many sibling sites. Read up on Data access to find out about the different approaches available to extract data. The API Sandbox along with the tips here can help us to come up with some interesting queries.
MediaWiki’s API is language-agnostic: it can be accessed using any programming language which supports the widespread RESTful API standard. You shouldn’t need to know much about this unless you are a Web developer, and can assume that it will probably “just work”. Depending on the programming language, there may be libraries which give you a little more support. In Python, such APIs are often accessed using Requests, which comes bundled with many Python distributions and frameworks, or can easily be added to your system.
To get started, we just:
import requests
Let us begin with the “bread and butter” of wikis and try the example from Cristina’s presentation, querying for the contributions of a Wikidata user with the usercontribs query (documentation), which takes a ucuser parameter with the user’s name, and a uclimit for the number of results to return:
USERID = "Loleg"
URL = "https://www.wikidata.org/w/api.php?action=query&list=usercontribs&format=json&ucuser=%s&uclimit=5"
r = requests.get(URL % USERID)
for uc in r.json()['query']['usercontribs']:
    print(uc['title'] + " " + uc['comment'])
Wikidata:Sandbox just a Hello, World!
Q4115189 /* wbsetclaim-create:2||1 */ [[Property:P195]]: [[Q9259]]
User:Loleg
Wikidata:Events/Wikidata Zurich /* Wikidata workshop */ added Loleg user link
User:Loleg Added link
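Note that the API returns at most uclimit results per call. To page through a longer history, the response includes a continue block whose values can simply be merged into the parameters of the next request. A minimal sketch of this pattern (the while loop and the page cap are my own scaffolding, not part of the API):
params = {"action": "query", "list": "usercontribs", "format": "json",
          "ucuser": USERID, "uclimit": 5}
pages = 0
while pages < 3:  # cap the number of pages for this demo
    data = requests.get("https://www.wikidata.org/w/api.php", params=params).json()
    for uc in data['query']['usercontribs']:
        print(uc['title'])
    if 'continue' not in data:
        break  # no more results to fetch
    params.update(data['continue'])  # carry uccontinue into the next call
    pages += 1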
Climbing the tree
To be more precise and avoid the ambiguity of free-text search, it helps to use these facilities to define our interests in Wikidata by Items, each identified by the letter Q followed by a number, such as Q72 for the city of Zürich.
Having some starting places in the tree of knowledge, we can run some API queries to get information about these subjects. The call to the Wikidata API to obtain details (wbgetentities) on our first item, Q72, would then look like this:
# identifier of the item we are looking for
ITEM = "Q72"
# from the MediaWiki API documentation, a "query string"
URL = "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=%s&format=json"
r = requests.get(URL % ITEM) # here we make the actual call to the remote server
r.status_code # if it worked, the next line should read 200...
200
You can now look at the structure of the data by running the r.json() command, but we will skip right to the juicy details:
Zurich = r.json()['entities']['Q72']
Zurich['aliases']['en'][0]['value'] # should equal City of Zurich
'City of Zurich'
Zurich['descriptions']['en']['value']
'capital of the canton of Zürich, Switzerland'
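The same JSON structure carries labels and descriptions in every language that editors have provided, so switching language is just a matter of picking a different dictionary key (which keys are present depends on the entity):
Zurich['labels']['de']['value']  # 'Zürich', the German label
Zurich['labels']['en']['value']  # 'Zurich'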
Digging deeper into the returned properties, we find the city’s flag (property P41), and use the Image widget from ipywidgets to render it in this notebook:
from ipywidgets import Image
flagfile = Zurich['claims']['P41'][0]['mainsnak']['datavalue']['value']
# We need to do a little bit of work on that value to make it into a full URL
flagfile = flagfile.replace(' ', '_')
flagurl = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/%s/200px-%s.png" % (flagfile, flagfile)
Image(value=requests.get(flagurl).content, width=100, height=100)
A Jupyter Widget
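The thumbnail URL above only works because we happen to know the /9/9b/ hash directory for this particular file. A sturdier alternative is MediaWiki’s Special:FilePath page, which redirects to the actual file location; a sketch of the same lookup using it:
# Special:FilePath resolves the hash directory for us, and the
# width parameter requests a scaled-down thumbnail
flagurl = "https://commons.wikimedia.org/wiki/Special:FilePath/%s?width=200" % flagfile
Image(value=requests.get(flagurl).content, width=100, height=100)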
How libraries help
Dictionaries created from JSON are a popular way to work with data for developers, but for some people seeing so many brackets can be a maddening experience. Happily, there are nice people in the software community who create libraries that can help streamline working with APIs.
The advantage is that you are working with (hopefully nicely documented) classes and functions. Libraries can help you tune your use of a web service for performance, and solve a lot of other headaches that other developers have run into.
The main issue to think about is that you are relying on an intermediary which may not keep up with the latest changes and new features of the API. Case in point.
A mature project called Pywikibot was developed to do all kinds of useful things in Wikipedia, including automating queries. But it also packs in tons of functionality which we probably don’t need, addressing other use cases. So let us try the much younger Wikidata library, which focuses specifically on providing access to the Wikidata API, and is being actively developed on GitHub.
To install it, we first run the pip install wikidata command in our project folder. Then we can do things like this:
from wikidata.client import Client
wikiclient = Client()
entity = wikiclient.get('Q72', load=True)
entity
<wikidata.entity.Entity Q72 'Zurich'>
If you didn’t get any errors at this point, things get quite a bit easier. Take a look at the wikidata library docs to help make sense of what’s going on in these functions:
entity.description
m'capital of the canton of Zürich, Switzerland'
image_prop = wikiclient.get('P18')
image = entity[image_prop]
image.image_url
'https://upload.wikimedia.org/wikipedia/commons/3/3c/Zuerich_Fraumuenster_St_Peter.jpg'
Image(value=requests.get(image.image_url).content, width=100, height=100)
A Jupyter Widget
We can also combine the requests and wikidata based API calls for slightly more complex queries, such as:
URL = "https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=%s&property=%s&format=json"
q = URL % ('Q72', 'P31')
r = requests.get(q)
j = r.json()
r.status_code == 200
True
for c in j['claims']['P31']:
    QID = c['mainsnak']['datavalue']['value']['id']
    print(wikiclient.get(QID, load=True).description)
large and permanent human settlement
primary city of a political entity (country, state, county, etc)
capital or administrative center of a canton of Switzerland
community dominated by its university population
smallest government division in Switzerland
city with a population of more than 100,000 inhabitants
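The same loop can fetch each item’s name as well: the library exposes a label attribute alongside description. A small sketch (assuming, as with the descriptions above, that the label prints as plain text):
for c in j['claims']['P31']:
    QID = c['mainsnak']['datavalue']['value']['id']
    e = wikiclient.get(QID, load=True)
    print(e.label, '-', e.description)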
Down to business
Yay! We were able to reproduce the basic use of the API. Now let us try something potentially a little more interesting for our feathered friend… The Wikidata Query Service allows us to execute SPARQL (a semantic query language for databases) queries, answering questions like: how many water fountains are there in Zürich? A great way to dive in is to use query.wikidata.org to formulate and test your query first, since it helps you juggle items and properties using a visual tool.
Once you have a query like this:
query = '''PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX psv: <http://www.wikidata.org/prop/statement/value/>
SELECT DISTINCT ?item ?name ?creator ?coordinate_location ?image WHERE {
  ?item wdt:P131* wd:Q72.
  ?item (wdt:P31/wdt:P279*) wd:Q43483.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
    ?item rdfs:label ?name.
  }
  OPTIONAL { ?item wdt:P2561 ?name. }
  OPTIONAL { ?item wdt:P170 ?creator. }
  OPTIONAL { ?item wdt:P625 ?coordinate_location. }
  OPTIONAL { ?item wdt:P18 ?image. }
}
ORDER BY ?name
'''
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
data = requests.get(url, params={'query': query, 'format': 'json'}).json()
results = data['results']['bindings']
len(results)
284
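Each entry in results is a dictionary keyed by the variables of our SELECT clause, with the actual data sitting under 'value'; variables from OPTIONAL clauses may be absent. For instance, to list the first few fountains by name and location:
for row in results[:5]:
    name = row['name']['value'] if 'name' in row else '(unnamed)'
    where = row['coordinate_location']['value'] if 'coordinate_location' in row else ''
    print(name, where)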
# A hacked together thumbnail function
def getThumbnailFromImage(imgname):
    if '/' in imgname:
        imgname = imgname.split('/')[-1]
    print(imgname)
    imgname = imgname.replace(' ', '_')
    # TODO: fixme - this is going to be different for every image ..
    thumburl = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/%s/320px-%s"
    return thumburl % (imgname, imgname)
# Let's have a look at the first result
imageurl = results[0]['image']['value']
Image(value=requests.get(getThumbnailFromImage(imageurl)).content, width=320, height=200)
Sirius-Annemie-Fontana-Oerlikon-1.jpeg
A Jupyter Widget
Fly, bird, fly!
In this tutorial we took a look at some different approaches to extracting structured data from the Wikidata APIs. Take a moment to try this yourself - and let me know how far you get.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.