At the Applied Machine Learning Days at EPFL towards the end of January, I will be running a morning workshop for anyone interested in using and publishing open data in ML projects. The description is below; feedback and suggestions are welcome!
The Frictionless Data program at Open Knowledge is the leading community-based effort to update and support open data publication processes worldwide. Building on the experience of developing the technology behind thousands of data integration projects and portals like opendata.swiss, we are working on an extensible, coherent set of standards, together with a collection of cross-platform libraries and tools in multiple programming languages, to make working with diverse data sources easier, smoother, and more reliable.
This workshop will start with an introduction to the philosophy, concepts, and roadmap of the initiative, contrasted with parallel efforts in data containerization. We will then dive in to explore new data sources that support the Frictionless Data specifications, and help you start a data exploration and machine learning project. Our focus will be on the principles of data exchange and comparability, so more experienced participants can bring their own tools to check compatibility. We will also show easy ways for beginners to start exploring and extending open data in Julia or Python, and share learning waypoints.
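To give a flavour of what a data source that follows the specifications looks like, here is a minimal sketch in Python using only the standard library. It assumes a descriptor shaped like a `datapackage.json` file with a Table Schema; the package name, the field names, and the sample rows are invented for illustration, and the `read_typed_rows` helper is hypothetical rather than part of any Frictionless Data library.

```python
import csv
import io
import json

# Hypothetical descriptor, written in the style of a datapackage.json file.
DESCRIPTOR = json.loads("""
{
  "name": "example-stations",
  "resources": [
    {
      "name": "stations",
      "path": "stations.csv",
      "schema": {
        "fields": [
          {"name": "station", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "mean_temp", "type": "number"}
        ]
      }
    }
  ]
}
""")

# Made-up sample rows standing in for the file named in "path".
SAMPLE_CSV = "station,year,mean_temp\nBern,2017,9.6\nLugano,2017,13.4\n"

# Only the three field types used in this sketch are handled.
CASTS = {"string": str, "integer": int, "number": float}


def read_typed_rows(resource, text):
    """Parse CSV text and cast each column to the type declared in the schema."""
    fields = resource["schema"]["fields"]
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {f["name"]: CASTS[f["type"]](record[f["name"]]) for f in fields}
        )
    return rows


rows = read_typed_rows(DESCRIPTOR["resources"][0], SAMPLE_CSV)
print(rows[0])
```

The point of the sketch is that the descriptor, not the code, carries the knowledge about column names and types, which is what lets different tools and languages read the same data consistently.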
In the second half of the workshop, we will look at the question of reproducibility in data science, discuss the challenges involved, and learn how to (re)publish both our code and data as efficient distributed workflows - both to improve accessibility for other users, and to help ensure the authenticity of the results. Several case studies, including work in progress, will be shared with the group, followed by a facilitated discussion about the opportunities of 'industrial-strength' open data.
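One simple building block for checking authenticity is a checksum published alongside the data. The following is a small sketch of that idea, not a description of the workshop's actual tooling: the helper names and the sample bytes are assumptions made for illustration, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Check that the data still matches the digest recorded at publication."""
    return sha256_hex(data) == recorded_digest


# Made-up dataset contents; in practice these bytes would be read from a file.
published = b"station,year,mean_temp\nBern,2017,9.6\n"
recorded = sha256_hex(published)

# A single changed value is enough to produce a different digest.
tampered = published.replace(b"9.6", b"9.7")

print(is_unchanged(published, recorded))
print(is_unchanged(tampered, recorded))
```

Anyone who downloads the data can recompute the digest and compare it with the recorded one before rerunning an analysis.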
Learn from a practitioner about the latest trends at the intersection of crowdsourcing and data science. Gain experience with useful tools and methods. Take steps toward becoming a more active member of the open data community.
- laptop with an up-to-date web browser
- optionally a Python, Julia, R, Node.js, Clojure or Go development environment