Data Package as a Service

oleg · 18 November 2022 10:01

This is a mirror of a challenge pitched for the DINAcon HACKnight next week. See also:

Find the latest version and other project ideas here:

Open-data-by-default web applications like Flask-based CKAN or dribdat (which runs this site), Django-based SDPP, search engines like OpenSearch, etc., offer full-text search of their content and other APIs as a standard feature. For quickly sharing single datasets, or for developing ‘single page applications’ (SPAs) or visualizations, such a large backend application may be excessive.

I support Portal.js and Livemark, which accomplish this very well, but I sometimes want something even simpler and better integrated into my data stack of Python and Pandas. There are portal previews, linked data endpoints, and wonderful tools like Datasette to dive into a resource, but these might not be ideal for tinkering with data in code. Providing data services through a statically-generated site like JKAN or Datacentral is another cheap and cheerful option. You may already be working on a data science notebook in Jupyter, R Shiny or Observable, but have trouble preparing your data on your own.

While working with Frictionless Data, a global initiative to improve the way quality open data is crowdsourced, I often wished that there was a way to put a quick API around a Data Package. On top of it, a user interface …or a data science notebook …or a natural language interface could be built. The proposed project, DaatS, is a service in Python and Pandas which instantly turns a Data Package into an API.

You can see the idea in action, combined with Wikidata, as part of a recent project: Living Herbarium (GLAMhack 2022). Another example, this time with data scraping automation, is Baumkadaster.

{ hacknight challenges }

Create a Data Package. It might be your first or your 99th. It is easy and fun to scrape some data off the web and put some shiny wrapping and “nutritional” guidance around it. Ask @loleg if you need some help here, or see this or this detailed guide.
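If you want a feel for what the "wrapping" looks like, here is a minimal sketch that writes a bare-bones datapackage.json for a single CSV using only the Python standard library. The helper names (`infer_type`, `describe_csv`, `write_descriptor`) and the file names are made up for illustration, and the type guessing is deliberately rough; the Frictionless tooling can describe and validate packages far more thoroughly.

```python
import csv
import json
from pathlib import Path


def infer_type(values):
    """Guess a Table Schema type from string values (a very rough heuristic)."""
    for cast, type_name in ((int, "integer"), (float, "number")):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def describe_csv(csv_path, package_name):
    """Build a minimal Data Package descriptor for one CSV file."""
    csv_path = Path(csv_path)
    with csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    fields = [
        {"name": col, "type": infer_type([row[col] for row in rows])}
        for col in columns
    ]
    return {
        "name": package_name,
        "resources": [
            {
                # Assumes the file stem is already a valid resource name
                # (lowercase letters, digits, ".", "-", "_").
                "name": csv_path.stem,
                "path": csv_path.name,
                "schema": {"fields": fields},
            }
        ],
    }


def write_descriptor(descriptor, target="datapackage.json"):
    """Save the descriptor next to the data as pretty-printed JSON."""
    Path(target).write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
```

Calling `write_descriptor(describe_csv("trees.csv", "my-trees"))` would leave a datapackage.json beside a hypothetical trees.csv. From there, add the "nutritional" parts by hand: title, description, license and sources.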

Use the DaatS template to add a repo with boilerplate code on your GitHub account. Or just download the repository to your local machine. Follow the README to install the packages, and drop in your datapackage.json and CSV dataset. Use your browser or an API testing tool to run some queries, and you should see it paginating and searching your data.
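To give a rough idea of what "paginating and searching" a Data Package with Pandas can mean, here is a small sketch. It is not the actual DaatS code: the function names, the paging fields and the choice to search every column as text are all assumptions made for illustration.

```python
import json
from pathlib import Path

import pandas as pd


def load_first_resource(package_dir="."):
    """Read the first CSV resource listed in datapackage.json into a DataFrame."""
    package_dir = Path(package_dir)
    descriptor = json.loads(
        (package_dir / "datapackage.json").read_text(encoding="utf-8")
    )
    resource = descriptor["resources"][0]
    return pd.read_csv(package_dir / resource["path"])


def search(df, query):
    """Return rows where any column contains `query` (case-insensitive)."""
    if not query:
        return df
    as_text = df.astype(str)
    mask = as_text.apply(
        lambda col: col.str.contains(query, case=False, regex=False)
    ).any(axis=1)
    return df[mask]


def paginate(df, page=1, per_page=10):
    """Return one page of rows plus simple paging metadata."""
    page = max(page, 1)
    start = (page - 1) * per_page
    total = len(df)
    return {
        "page": page,
        "per_page": per_page,
        "total": total,
        "pages": -(-total // per_page),  # ceiling division
        "data": df.iloc[start:start + per_page].to_dict(orient="records"),
    }
```

A web framework route would then only need to parse the query string, call `paginate(search(df, q), page, per_page)` and return the resulting dictionary as JSON.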

Write a converter to patch your DaatS service into a no-code workflow. This could be a Proxeus or Node-RED node, an Airtable or Slack workflow step, a GitHub Action, etc. Whatever would potentially scratch your own itch. Make it super easy for users to connect a dataset and invoke search queries, or even statistical and data science functions, as steps embedded in their process.
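Most workflow tools boil down to "inputs in, outputs out", so a converter can start life as a single function. The sketch below shows one possible shape using only the standard library. The URL layout and the parameter names `q`, `page` and `per_page` are assumptions, not the documented DaatS API, so check the README of your own instance before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, resource, query="", page=1, per_page=10):
    """Assemble a search URL (hypothetical path and parameter names)."""
    params = {"page": page, "per_page": per_page}
    if query:
        params["q"] = query
    return f"{base_url.rstrip('/')}/{resource}?{urlencode(params)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def run_step(inputs, fetch=fetch_json):
    """One workflow step: a dict of inputs in, a dict of outputs out."""
    url = build_query_url(
        inputs["base_url"],
        inputs["resource"],
        query=inputs.get("query", ""),
        page=int(inputs.get("page", 1)),
        per_page=int(inputs.get("per_page", 10)),
    )
    return {"url": url, "result": fetch(url)}
```

Passing `fetch` as a parameter keeps the step testable without a running service, and the same `run_step` could be wrapped by whatever glue your workflow platform expects.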

The idea behind connecting this project to workflows is to treat the API-fication of a dataset as a data transformation step: something a user might add to their data collection with a couple of clicks, in order to benefit from Frictionless Data tools and other components in the open data ecosystem.