Scraping to process search results

ake · 25 March 2019 16:18

As a beginner, I can't manage to implement what I have in mind:
I would like to capture the search results on e-periodica.ch (a full-text search of the Schweizerische Lehrerzeitung for the keyword Weltausstellung) in a list: at least the links to the individual issues of the Lehrerzeitung, and ideally also the text snippets and the associated page references.

With Python, this is as far as I've got:

from bs4 import BeautifulSoup
import lxml
import requests
import html5lib
import csv

source = requests.get('https://www.e-periodica.ch/digbib/hitlist?p=f2d3').text  # URL of the search results

soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())

That returns something, but I can't find my way around the output and don't see the journal titles etc. in it.
It would of course also be nice to capture the >800 hits in a single pass rather than in batches of 100; the site limits the display of search results to 100 at a time.

Tips would be welcome.

oleg · 1 May 2019 13:26

First of all, in any scraping project, even one with a personal or educational intent, it is worth checking the legal conditions. https://www.e-periodica.ch/digbib/about3 explains that the content does not belong to them, and that any ‘Systematic storage’ requires written consent of the rights holders, i.e. the publications themselves, not the library. IANAL, but this sounds like legally murky waters to me. Get advice before you go any further here.

Secondly, in any scraping project I would look for the path of least resistance, and often that is not reading out the raw HTML. There are no indications of an API or support for machine-readable formats like RSS on the site. The link you have in the code snippet also does not work. But the frontend makes web service requests to some kind of Ajax API, which would be much easier and faster to work with.

For example, the output of https://www.e-periodica.ch/digbib/ajax/jinfo?id=psi-001&id=psf-001 is reasonably nicely structured JSON:

[
  {
    "collectionLink": "https://www.e-periodica.ch/digbib/browse5_8",
    "collectionTitle": "Medicine",
    "editor": "Pro Senectute Suisse",
    "expandedJournalIds": null,
    "infoLink": "https://www.e-periodica.ch/digbib/info_psf",
    "issn": "1664-3976",
    "journalId": "psf-001",
    "link": "https://www.e-periodica.ch/digbib/volumes?UID=psf-001",
    "publisher": "",
    "title": "PS info : nouvelles de Pro Senectute Suisse",
    "volNumRange": "",
    "yearRange": "1999-2011",
    "zdb": "",
    "movingWall": "0",
    "coverImageUri": "resources/psf/2011_000/psf-001_2011_000_0001.jpg",
    "coverImageWidth": 2429,
    "coverImageHeight": 3449,
    "nextJournal": null,
    "previousJournal": null,
    "volumes": [
      {
        "movingWallInfoLink": "https://www.e-periodica.ch/digbib/jinfo?UID=psf-001",
        "available": true,
        "volNum": "-",
        "year": "2011",
        "link": "https://www.e-periodica.ch/digbib/view?pid=psf-001:2011:0",
        "volumeTitle": "PS info : nouvelles de Pro Senectute Suisse",
        "journalTitle": "PS info : nouvelles de Pro Senectute Suisse",
        "journalId": "psf-001",
        "thumbnailId": "psf-001:2011:0::3",
        "coverImageUri": null,
        "coverImageWidth": 0,
        "coverImageHeight": 0,
        "title": "PS info : nouvelles de Pro Senectute Suisse",
        "num": "-"
      },
      ...
    ],
    "id": "psf-001"
  },
...
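
A minimal sketch of reading that response in Python, assuming the field names shown above (`volumes`, `available`, `year`, `link`, `title`) stay as they are; the endpoint is undocumented, so treat this as guesswork. `fetch_journal_info` and `volume_links` are names made up for illustration. Only the standard library is used here, but `requests.get(url).json()` would do the same job.

```python
import json
from urllib.request import urlopen

# URL taken from the example above; the endpoint is undocumented and may change.
JINFO_URL = "https://www.e-periodica.ch/digbib/ajax/jinfo?id=psi-001&id=psf-001"


def fetch_journal_info(url=JINFO_URL):
    """Download and parse the JSON list of journals (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def volume_links(journals):
    """Return (journal title, year, link) for every volume marked available."""
    rows = []
    for journal in journals:
        # "volumes" may be missing or null, so fall back to an empty list.
        for volume in journal.get("volumes") or []:
            if volume.get("available"):
                rows.append(
                    (journal.get("title"), volume.get("year"), volume.get("link"))
                )
    return rows


# Usage (performs a real HTTP request, so it is left commented out):
# for title, year, link in volume_links(fetch_journal_info()):
#     print(year, title, link)
```

Keeping the download and the extraction in separate functions means the extraction can be tried out on a saved copy of the JSON without hitting the server again.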

However, without more digging I can’t tell what kind of API this is, or whether it was made just for a display widget or is available across all resources. It would be great if the site owner would share some documentation to avoid guesswork. I think having a discussion with them at least would be very useful, especially as I recall the ETH libraries being very supportive of OpenGLAM initiatives in the past.

To respond to your technical question: I don’t know if it is possible to go beyond 100 search results per request. I don’t see it being possible with the search engine queries, anyway. And there may be good reasons for the limit, such as keeping the load on the server down. In scraping projects it is quite standard practice to keep your requests small to avoid being a nuisance to the site owner. So I would suggest you look at Scrapy, in particular the Scrapy tutorial, which explains how to write an agent that systematically crawls through every page of the results. You can still use BeautifulSoup or whatever library you want to process the HTML, but Scrapy will help take care of the background processing for you.
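
To show the shape of such a crawl without pulling in Scrapy, here is a plain-Python sketch of the paging loop: fetch one result page at a time, collect the links, pause, and stop when a page comes back empty. This is not Scrapy's API, just the same idea in a few lines. Everything site-specific is an assumption: `fetch_page` is a hypothetical callable that returns the HTML for a given page number (the real URL parameters of the hit list would have to be read from the browser), and the `/digbib/view` filter is guessed from the links in the JSON above.

```python
import time
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose URL contains a given fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.fragment in href:
                self.links.append(href)


def extract_links(html, fragment="/digbib/view"):
    """Return all matching links found in one page of HTML."""
    parser = LinkCollector(fragment)
    parser.feed(html)
    return parser.links


def crawl(fetch_page, max_pages=20, delay=2.0):
    """Fetch pages 1, 2, ... until a page yields no links or max_pages is hit."""
    all_links = []
    for page in range(1, max_pages + 1):
        links = extract_links(fetch_page(page))
        if not links:
            break  # an empty page is taken to mean the results are exhausted
        all_links.extend(links)
        time.sleep(delay)  # pause between requests to keep server load low
    return all_links
```

Passing `fetch_page` in as an argument keeps the loop independent of how the pages are downloaded, so it can be exercised with saved HTML files first and pointed at the live site only once the rights question is settled.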