My Jupyter notebook for the Wikidata API workshop is now available on GitHub. Looking forward to all the sessions this morning!
Where it says “A Jupyter Widget” below, an image appears in its place when the notebook is run live inside Jupyter.
Wikidata API
This notebook is part of a Wikidata information & training event at the University of Zurich on 14.9.2017 (more details here). Our goal is to help people interested in Wikidata, coding and open data in general get started with Wikidata as a source and platform for their projects. This workshop focuses on using the capabilities of the MediaWiki API, as presented by Cristina Sarasua (slides), to extract entities and linked concepts from Wikidata.
Prepared by Oleg Lavrovsky (Datalets.ch), facilitating on behalf of the School of Data working group at Opendata.ch. If you have any feedback or questions, please raise them directly to me or on our forum. This notebook can also be viewed and forked on GitHub.
Prerequisites
We will be using the Python programming language for our demonstration. Popular with academic, industrial and hobby/beginner users alike, it lends itself well to the task of connecting to Internet data sources and doing some analysis or visualisation. The example code here should run in any modern Python interpreter, available on most computing platforms (Windows, Mac, Linux, etc.). We particularly encourage you to try Jupyter, which makes working with this kind of Python project especially pleasant.
For participants of the workshop, we have set up a shared “Jupyter Hub” server, so you don’t need to prepare anything except Internet connectivity and a reasonably modern web browser on your laptop. For those who wish to set up their own environment in a short amount of time, check out the official installation guide which suggests the use of Anaconda, or cloud-based offerings from Sandstorm.io, Google Cloud, Microsoft Azure or Domino Data Lab.
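If you do set up your own environment, the handful of packages used in this notebook can be installed in one go. A minimal sketch, assuming you already have Python and pip available:
pip install jupyter requests wikidata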
Goalposts
*Chroicocephalus ridibundus* by [Ximeg](https://commons.wikimedia.org/wiki/File:Chroicocephalus_ridibundus_in_Z%C3%BCrich_01.JPG), CC BY-SA 3.0
A little bird lives in the big city of Zürich, thirsting for knowledge. He doesn’t know how to read or to google, or even to ask a librarian for advice. Luckily for him, Wikidata is here to help: by creating structured data, all of the collected knowledge of Wikipedia can be reused in infinite ways. Through the magic of APIs, this data can also be accessed in a variety of ways. Let’s dig in and find out how!
Accessing data
Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.
As mentioned in the Introduction, Wikidata is central to the architecture of projects maintained by the Wikimedia Foundation. It is connected to the system behind Wikipedia and its many sibling sites. Read up on Data access to find out about the different approaches available to extract data. The API Sandbox along with the tips here can help us to come up with some interesting queries.
MediaWiki’s API is language-agnostic: it can be accessed using any programming language which supports the widespread RESTful API standard. You shouldn’t need to know much about this unless you are a Web developer, and can assume that it will probably “just work”. Depending on the programming language, there may be libraries which give you a little more support. In Python, such APIs are often accessed using Requests, which comes bundled with many Python distributions and frameworks, or can easily be added to your system.
To get started, we just:
import requests
Let us begin with the “bread and butter” of wikis and try the example from Cristina’s presentation, querying for the contributions of a Wikidata user with the usercontribs query (documentation), which takes a ucuser parameter with the user’s name, and a uclimit for the number of results to return:
USERID = "Loleg"
URL = "https://www.wikidata.org/w/api.php?action=query&list=usercontribs&format=json&ucuser=%s&uclimit=5"
r = requests.get(URL % USERID)
for uc in r.json()['query']['usercontribs']:
    print(uc['title'] + " " + uc['comment'])
Wikidata:Sandbox just a Hello, World!
Q4115189 /* wbsetclaim-create:2||1 */ [[Property:P195]]: [[Q9259]]
User:Loleg
Wikidata:Events/Wikidata Zurich /* Wikidata workshop */ added Loleg user link
User:Loleg Added link
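Note that the API returns at most uclimit results per call. To page through a longer history, the response includes a continue block whose values can simply be merged into the parameters of the next request. A minimal sketch of this pattern (the while loop and the page cap are my own scaffolding, not part of the API):
params = {"action": "query", "list": "usercontribs", "format": "json",
          "ucuser": USERID, "uclimit": 5}
pages = 0
while pages < 3:  # cap the number of pages for this demo
    data = requests.get("https://www.wikidata.org/w/api.php", params=params).json()
    for uc in data['query']['usercontribs']:
        print(uc['title'])
    if 'continue' not in data:
        break  # no more results to fetch
    params.update(data['continue'])  # carry uccontinue into the next call
    pages += 1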
Climbing the tree
To be more precise and avoid the ambiguity of free-text search, it helps to use these facilities to define our interests in Wikidata by Items, each identified by the letter Q followed by a number, such as Q72 for the city of Zürich.
Having some starting places in the tree of knowledge, we can run some API queries to get information about these subjects. The call to the Wikidata API to obtain details (wbgetentities) on our first item, Q72, would then look like this:
# identifier of the item we are looking for
ITEM = "Q72"
# from the MediaWiki API documentation, a "query string"
URL = "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=%s&format=json"
r = requests.get(URL % ITEM) # here we make the actual call to the remote server
r.status_code # if it worked, the next line should read 200...
200
You can now look at the structure of the data by running the r.json() command, but we will skip right to the juicy details:
Zurich = r.json()['entities']['Q72']
Zurich['aliases']['en'][0]['value'] # should equal City of Zurich
'City of Zurich'
Zurich['descriptions']['en']['value']
'capital of the canton of Zürich, Switzerland'
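The same JSON structure carries labels and descriptions in every language that editors have provided, so switching language is just a matter of picking a different dictionary key (which keys are present depends on the entity):
Zurich['labels']['de']['value']  # 'Zürich', the German label
Zurich['labels']['en']['value']  # 'Zurich'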
Digging deeper into the returned properties, we find the city’s flag (property P41), and use the Image widget from ipywidgets to render it in this notebook:
from ipywidgets import Image
flagfile = Zurich['claims']['P41'][0]['mainsnak']['datavalue']['value']
# We need to do a little bit of work on that value to make it into a full URL
flagfile = flagfile.replace(' ', '_')
flagurl = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/%s/200px-%s.png" % (flagfile, flagfile)
Image(value=requests.get(flagurl).content, width=100, height=100)
A Jupyter Widget
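The thumbnail URL above only works because we happen to know the /9/9b/ hash directory for this particular file. A sturdier alternative is MediaWiki’s Special:FilePath page, which redirects to the actual file location; a sketch of the same lookup using it:
# Special:FilePath resolves the hash directory for us, and the
# width parameter requests a scaled-down thumbnail
flagurl = "https://commons.wikimedia.org/wiki/Special:FilePath/%s?width=200" % flagfile
Image(value=requests.get(flagurl).content, width=100, height=100)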
How libraries help
Dictionaries created from JSON are a popular way to work with data for developers, but for some people seeing so many brackets can be a maddening experience. Happily, there are nice people in the software community who create libraries that can help streamline working with APIs.
The advantage is that you are working with (hopefully nicely documented) classes and functions. Libraries can help you tune your use of a web service for performance, and solve a lot of other headaches that other developers have run into.
The main issue to think about is that you are relying on an intermediary which may not keep up with the latest changes and new features of the API. Case in point.
A mature project called Pywikibot was developed to do all kinds of useful things in Wikipedia, including automating queries. But it also packs in tons of functionality which we probably don’t need, addressing other use cases. So let us try the much younger Wikidata library, which focuses specifically on providing access to the Wikidata API, and is being actively developed on GitHub.
To install it, we first run the pip install wikidata command in our project folder. Then we can do things like this:
from wikidata.client import Client
wikiclient = Client()
entity = wikiclient.get('Q72', load=True)
entity
<wikidata.entity.Entity Q72 'Zurich'>
If you didn’t get any errors at this point, things get quite a bit easier. Take a look at the wikidata library docs to help make sense of what’s going on in these functions:
entity.description
m'capital of the canton of Zürich, Switzerland'
image_prop = wikiclient.get('P18')
image = entity[image_prop]
image.image_url
'https://upload.wikimedia.org/wikipedia/commons/3/3c/Zuerich_Fraumuenster_St_Peter.jpg'
Image(value=requests.get(image.image_url).content, width=100, height=100)
A Jupyter Widget
We can also combine the requests and wikidata based API calls for slightly more complex queries, such as:
URL = "https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=%s&property=%s&format=json"
q = URL % ('Q72', 'P31')
r = requests.get(q)
j = r.json()
r.status_code == 200
True
for c in j['claims']['P31']:
    QID = c['mainsnak']['datavalue']['value']['id']
    print(wikiclient.get(QID, load=True).description)
large and permanent human settlement
primary city of a political entity (country, state, county, etc)
capital or administrative center of a canton of Switzerland
community dominated by its university population
smallest government division in Switzerland
city with a population of more than 100,000 inhabitants
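The same loop can fetch each item’s name as well: the library exposes a label attribute alongside description. A small sketch (assuming, as with the descriptions above, that the label prints as plain text):
for c in j['claims']['P31']:
    QID = c['mainsnak']['datavalue']['value']['id']
    e = wikiclient.get(QID, load=True)
    print(e.label, '-', e.description)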
Down to business
Yay! We were able to reproduce the basic use of the API. Now let us try something potentially a little more interesting for our feathered friend… The Wikidata Query Service allows us to execute SPARQL (a semantic query language for databases) queries, answering questions like: how many water fountains are there in Zürich? A great way to dive in is to use query.wikidata.org to formulate and test your query first, since it helps you juggle items and properties using a visual tool.
Once you have a query like this:
query = '''PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX psv: <http://www.wikidata.org/prop/statement/value/>
SELECT DISTINCT ?item ?name ?creator ?coordinate_location ?image WHERE {
  ?item wdt:P131* wd:Q72.
  ?item (wdt:P31/wdt:P279*) wd:Q43483.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
    ?item rdfs:label ?name.
  }
  OPTIONAL { ?item wdt:P2561 ?name. }
  OPTIONAL { ?item wdt:P170 ?creator. }
  OPTIONAL { ?item wdt:P625 ?coordinate_location. }
  OPTIONAL { ?item wdt:P18 ?image. }
}
ORDER BY ?name
'''
url = 'https://query.wikidata.org/bigdata/namespace/wdq/sparql'
data = requests.get(url, params={'query': query, 'format': 'json'}).json()
results = data['results']['bindings']
len(results)
284
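Each entry in results is a dictionary keyed by the variables of our SELECT clause, with the actual data sitting under 'value'; variables from OPTIONAL clauses may be absent. For instance, to list the first few fountains by name and location:
for row in results[:5]:
    name = row['name']['value'] if 'name' in row else '(unnamed)'
    where = row['coordinate_location']['value'] if 'coordinate_location' in row else ''
    print(name, where)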
# A hacked together thumbnail function
def getThumbnailFromImage(imgname):
    if '/' in imgname:
        imgname = imgname.split('/')[-1]
    print(imgname)
    imgname = imgname.replace(' ', '_')
    # TODO: fixme - this is going to be different for every image ..
    thumburl = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/%s/320px-%s"
    return thumburl % (imgname, imgname)
# Let's have a look at the first result
imageurl = results[0]['image']['value']
Image(value=requests.get(getThumbnailFromImage(imageurl)).content, width=320, height=200)
Sirius-Annemie-Fontana-Oerlikon-1.jpeg
A Jupyter Widget
Fly, bird, fly!
In this tutorial we took a look at some different approaches to extracting structured data from the Wikidata APIs. Take a moment to try this yourself - and let me know how far you get.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.