diff --git a/notebooks/10 Data_acquisition.ipynb b/notebooks/10 Data_acquisition.ipynb index d99a7f9..196ad34 100644 --- a/notebooks/10 Data_acquisition.ipynb +++ b/notebooks/10 Data_acquisition.ipynb @@ -6,7 +6,11 @@ "source": [ "# 10. Data acquisition\n", "\n", - "Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available on the web by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets.\n", + "Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available on the web by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets using code.\n", + "\n", + "Many data sets can be downloaded manually through your browser, for example, from data portals or repositories. [Re3data](https://www.re3data.org/) is a large overview of repositories for research data.\n", + "However, there are good reasons for downloading data sets using a script. Some data sets may consist of many files, or you may want to download files that are updated frequently.\n", + "These are just some examples of when manually downloading is not ideal for your research.\n", "\n", "In this tutorial, we distinguish three methods of data acquisition: downloading data files, accessing data through APIs and webscraping. You usually choose one of these methods to acquire your data, based on what the data provider offers.\n", "\n", @@ -28,7 +32,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The `requests` library can be used to make requests according to the Hypertext Transfer Protocol (HTTP), which was developed to enable the exchange of information across computers. 
The computer that can provide information is typically referred to as a server, and the computer that requests information from this server is referred to as a client. In the HTTP protocol, the GET method is used to request data from a specified server. \n", + "The `requests` library can be used to make requests according to the [Hypertext Transfer Protocol (HTTP)](https://en.wikipedia.org/wiki/HTTP), which was developed to enable the exchange of information between computers. The computer that can provide information is typically referred to as a server, and the computer that requests information from this server is referred to as a client. In the HTTP protocol, the GET method is used to request data from a specified server. \n", "\n", "In Python, such a GET request can be sent to a server using the `get()` method in `requests`, as demonstrated below. Evidently, it is important that you are online when you run this code." ] @@ -48,11 +52,11 @@ "source": [ "This method returns a so-called `Response` object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named `response`.\n", "\n", - "Once this `Response` object has been created successfully, you can use various pieces of information about the resource that was downloaded.\n", + "Once this `Response` object has been created successfully, you can use various pieces of information about the resource that was requested.\n", "The property `status_code`, for instance, indicates the HTTP status code that was returned by the server.\n", "The status code 200 indicates that the request was successful and the infamous status code 404 indicates that the file was not found.\n", "\n", - "If the status code is indeed 200, the contents of the resource is accessible in the response's `body` property. However, this property holds the contents as bytes. 
Typically, when we downloaded a webpage, we want to work with the data as text. In these cases, the `text` property of the `Response` object contains the full contents of the downloaded website, dataset or other kind of file as a string.\n", + "If the status code is indeed 200, the contents of the resource are accessible in the response's `content` property. However, this property holds the contents as bytes. Typically, when we download a webpage, we want to work with the data as text. In these cases, the `text` property of the `Response` object contains the full contents of the downloaded website, dataset or other kind of file as a string.\n", "\n", "Note that `requests` may not always understand a file's [character encoding](https://www.w3.org/International/questions/qa-what-is-encoding) automatically. You can set the correct character encoding explicitly using the `encoding` property.\n", "\n", @@ -108,6 +112,55 @@ "The `requests` library can also be used to retrieve data from an API." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10.1.\n", + "\n", + "The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.\n", + "\n", + "```\n", + "urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,\n", + "'https://www.gutenberg.org/files/1400/1400-0.txt' ,\n", + "'https://www.gutenberg.org/files/786/786-0.txt' ,\n", + "'https://www.gutenberg.org/files/766/766-0.txt' \n", + "]\n", + "```\n", + "\n", + "Write a program in Python that downloads all the files in this list and stores them in the current directory.\n", + "As filenames, use the same names that are used by Project Gutenberg (e.g. 
'580-0.txt' or '1400-0.txt').\n", + "The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "import os.path\n", + "\n", + "# Recreate the given list using copy and paste\n", + "urls = [ \n", + "]\n", + "\n", + "# We use a for-loop to take the same steps for each item in the list:\n", + "for url in urls:\n", + " # 1. Download the file contents\n", + " \n", + " # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding\n", + " \n", + " # 2. Use basename to get a suitable filename\n", + " \n", + " # 3. Open the file in write mode and write the downloaded file contents to the file\n", + " \n", + " # 4. Close the file\n", + " \n", + " " + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -120,7 +173,7 @@ "\n", "The communication between the sender and the recipient of such requests needs to take place according to a specific protocol. The requests need to be formulated according to certain rules. \n", "\n", - "For many APIs, you need to create an access key before you can send requests. This is the case, for instance, for the Twitter API. " + "For many APIs, you need to create an access key (which may or may not require payment) before you can send requests. This is the case, for instance, for the Twitter API. " ] }, { @@ -129,11 +182,11 @@ "source": [ "### Example: MusicBrainz\n", "\n", - "There are also many APIs which are fully open, however. One example is the [MusicBrainz](https://musicbrainz.org/doc/MusicBrainz_API) API. *MusicBrainz* is a large online encyclopedia containing information about musicians and their work. You can send requests to this API without having to provide an access key. \n", + "There are also many APIs that are open, i.e. 
that do not require registration. For example, the [MusicBrainz API](https://musicbrainz.org/doc/MusicBrainz_API) is free for non-commercial use. *MusicBrainz* is a large online encyclopedia containing information about musicians and their work. You can send requests to this API without having to provide an access key. \n", "\n", - "The root URL of this API is [https://musicbrainz.org/ws/2/](https://musicbrainz.org/ws/2/)\n", + "The root URL of this API is <https://musicbrainz.org/ws/2/>\n", "\n", - "On *MusicBrainz*, you can request information a number of different entities, including artists, genres, instruments, labels and releases. The enity type you are interested in firstly needs to be appended the root URL. If you want to want to search for information about an artist, for example, you need to work with the following URL structure: https://musicbrainz.org/ws/2/artist\n", + "On MusicBrainz, you can request information about different entities, including artists, genres, instruments, labels and releases. The entity type you are interested in firstly needs to be appended to the root URL. If you want to search for information about an artist, for example, you need to work with the following URL structure: `https://musicbrainz.org/ws/2/artist[?parameters]`\n", "\n", "You can then work with the following parameters:\n", "\n", @@ -143,13 +196,13 @@ "limit = [integer]\n", "```\n", "\n", - "Following the `query` parameter, you can supply the name of the artist you want to search for. Using the `fmt` parameter, you can specify whether you want to receive the result in [XML](https://www.w3.org/XML/) or in [JSON](https://www.json.org/) format. The API returns XML data by default. If the API results many results, you can reduce the number of results by working with the `limit` parameter. \n", + "Following the `query` parameter, you can supply the name of the artist you want to search for. 
Using the `fmt` parameter, you can specify whether you want to receive the result in [XML](https://www.w3.org/XML/) or in [JSON](https://www.json.org/) format. The API returns XML data by default. If the API returns many results, you can reduce the number of results by working with the `limit` parameter. \n", "\n", "The following API call returns information about *The Beatles* in the JSON format. \n", "\n", "https://musicbrainz.org/ws/2/artist?query=The%20Beatles&fmt=json\n", "\n", - "In Python, you can also send out such API calls using the requests library. " + "In Python, because this API is a Web API, you can also send out such API calls using the requests library. " ] }, { @@ -195,8 +248,8 @@ "musicbrainz_results = response.json()\n", "\n", "for artist in musicbrainz_results['artists']:\n", - " name = artist.get('name','[unknown]')\n", - " artist_type = artist.get('type','[unknown]')\n", + " name = artist.get('name', '[unknown]')\n", + " artist_type = artist.get('type', '[unknown]')\n", " print(f'{name} ({artist_type})')\n" ] }, @@ -204,57 +257,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Webscraping\n", + "### Exercise 10.4.\n", "\n", - "When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on a site by making use of web scraping. It is a process in which a computer program tries to process the contents of given webpage, and to extract the data values that are needed. The aim of such an application is generally to copy information on a web page and to paste it into a local database.\n", + "Find the coordinates for each address in the given list using [OpenStreetMap](https://www.openstreetmap.org/)'s Nominatim API.\n", "\n", - "To get the most out of webscraping, you need to have a basic understanding of HTML. Many [basic introductions](https://bookandbyte.universiteitleiden.nl/DMT/PDF/HTML.pdf) can be found on the web. 
Web scraping should be used with caution, because it may be not be allowed to download large quantities of data from a specific website. In this tutorial we will only look at extracting information from single pages.\n", + "The Nominatim API can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is <https://nominatim.openstreetmap.org/search>.\n", "\n", - "To scrape webpages, you firstly need to download them. This can be done using the `requests` library that was explained above. The code below scrapes data from a page on the [Internet Movie Database](https://www.gutenberg.org) website, listing the top rated movies." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", + "Following the `q` parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. Use this API to find the longitude and the latitude of the addresses in the following list:\n", "\n", - "url = 'https://www.imdb.com/chart/top?ref_=ft_250'\n", - "response = requests.get( url )\n", + "```\n", + "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", + "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", + "```\n", "\n", - "if response:\n", - " response.encoding = 'utf-8'\n", - " html_page = response.text " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once you have obtained the contents of a webpage, in the form of an HTML document, you can begin to extract the data values that you are interested in. This tutorial explains how you can extract the title of these movies and the URLs of the pages on IMDB using web scraping. 
\n", + "The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the `json()` method: \n", "\n", - "If you inspect the output of the previous cell (the HTML code), you can see that the information about the movies is encoded as follows:\n", + "```json_data = response.json()```\n", "\n", + "If the result is saved as variable named `json_data`, you should be able to access the latitude and the longitude as follows:\n", "\n", "```\n", - "\n", - "\n", - "\n", - "The Godfather\n", - "\n", - "\n", - "\n", - "\n", + "latitude = json_data[0]['lat']\n", + "longitude = json_data[0]['lon']\n", "```\n", "\n", - "The data can found in a <td> element whose 'class' attribute has value 'titleColumn'. The actual title in given in a hyperlink, encoded using <a>. The URL to the page for the movie is given in an 'href' attribute. 'Scraping' the page really means that we need to extract the values we need from these HTML elements. \n", - "\n", - "\n", - "One of the libraries that you can use in Python for scraping online resources is `Beautiful Soup`. The code below firstly transforms the HTML code that was downloaded into a BeautifulSoup object. If the `bs4` library has been imported, you can use its `BeautifulSoup()` method. This method demands the full contents of an HTML document as a first parameter. As a second parameter, you need to provide the name one of the parsers that are available. Generally, a parser is an application which can process and analyse data. In this context, it refers to a program which can analyse the HTML file. One of the parsers that we can use is `lxml`. Using this parser, the `BeautifulSoup()` method converts the downloaded HTML page into a BeautifulSoup object. \n", + "The `[0]` is used to get the results for the first result.\n", "\n", - "The `prettify()` method of this object creates a more readable version of the HTML file by adding indents and end of line characters." 
+ "Print each address and its latitude and longitude coordinates." ] }, { @@ -263,25 +292,41 @@ "metadata": {}, "outputs": [], "source": [ - "from bs4 import BeautifulSoup\n", + "import requests\n", "\n", - "soup = BeautifulSoup( html_page,\"lxml\")\n", + "addresses = ['Grote Looiersstraat 17 Maastricht' , \n", + " 'Witte Singel 27 Leiden','Singel 425 Amsterdam' , \n", + " 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", "\n", - "print( soup.prettify() )\n" + "for a in addresses:\n", + " # create the API call, with the address in the 'q' parameter\n", + " \n", + " # Get the JSON data and process the data using json()\n", + " \n", + " # Find the latitude and the longitude of the first result\n", + " #latitude = json_data[0]['lat']\n", + " #longitude = json_data[0]['lon']\n", + " \n", + " \n", + "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "### Exercise 10.8.\n", "\n", - "The BeautifulSoup object that was created above has a `find_all()` method, which you can use to find all occurrences of a specific HTML tag. The name of the tag (or element) needs to be mentioned as the first parameter. \n", - "\n", - "In our example, we need to focus on specific types of <td> elements: those which have a 'class' attribute with value 'titleColumn'. Such criteria for the attributes can be given as the second parameter in `find_all()`.\n", + "As was discussed in this notebook, you can use the *MusicBrainz* API to request information about musicians. Via the code that is provided, you can request the names and the type (i.e. are we dealing with a person or with a group?). This specific API can make much more information available, however. Try to add some code with can add the following data about each artist: \n", "\n", - "As we saw in the HTML snippet above, the <td> elements do not contain the title and the url directly. These values are given in the <a> child element. Such child elements, or subelements, can be found using `findChildren()`. 
As a parameter, you need to give the name of the tag you want to find underneath the current element. In the code below, the variable `children` represents all the <a> elements found underneath <td>. \n", + "* The date of birth (in the case of a person) or formation (in the case of a group)\n", + "* The date of death or breakup\n", + "* The place of birth or formation\n", + "* The place of death or breakup\n", + "* Aliases\n", + "* Tags associated with the artist\n", "\n", - "To retrieve only the text of the tag (i.e. the text which is encoded using the tag), we can use the `text` property. To retrieve the value of an attribute of this element, we can use the `get()` method. As an argument, this method demands the name of the attibute we are interested in, `href` in this case. " + "Tip: 'Uncomment' the print statement in the second cell to be able explore the structure of the JSON data. \n" ] }, { @@ -290,86 +335,82 @@ "metadata": {}, "outputs": [], "source": [ - "movies = soup.find_all('td', {'class': 'titleColumn'} )\n", + "import requests\n", + "from requests.utils import requote_uri\n", "\n", - "for m in movies:\n", - " # Find links (a elements) within the cell\\n\",\n", - " children = m.findChildren(\"a\" , recursive=False)\n", - " for c in children:\n", - " movie_title = c.text\n", - " url = c.get('href')\n", - " ## This is an internal link, so we need to prepand the base url\n", - " url = 'https://imdb.com' + url\n", - " print( f'{movie_title}: {url}' ) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once you have created a list of URLs using the method outlined above, you can also download all the texts that were found, using the `get()` method from `requests` library.\n", "\n", - "As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. 
\n", + "root_url = 'https://musicbrainz.org/ws/2/'\n", "\n", + "## The parameters for the API call are defined as variables\n", + "entity = 'artist'\n", + "query = 'David Bowie'\n", + "limit = 5\n", + "fmt = 'json'\n", "\n", - "### Advanced scraping: Scrapy\n", + "query = requote_uri(query)\n", "\n", - "This tutorial has only touched the surface of web scraping. To get specific data from webpages or APIs, you will need to dig into the data that you get and probably learn more about the data formats. A more advanced framework (or toolkit) for webscraping with Python is [Scrapy](https://scrapy.org). This framework simlified the process of building a scraper/crawler considerably by providing basic functionalities out of the box. Although Scrapy does not understand what parts of webpages are of interest to you, it does many things for you, such as making sure you don't send too many requests at the same time or retrying requests that fail. Feel free to look at the [Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html) if you want to experiment with this library. " + "api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'\n", + "response = requests.get( api_call )" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "# Exercises" + "import json\n", + "\n", + "musicbrainz_results = response.json()\n", + "\n", + "for artist in musicbrainz_results['artists']:\n", + " #print(json.dumps(artist, indent=4))\n", + " name = artist.get('name', '[unknown]')\n", + " artist_type = artist.get('type', '[unknown]')\n", + " print(f'{name} ({artist_type})')\n", + " ## Add your code below\n", + " \n", + " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.1.\n", + "### Exercise 10.9.\n", "\n", - "The list below contains a number of URLs. 
They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.\n", + "*[PLOS One](https://journals.plos.org/plosone/)* is a peer reviewed open access journal. The *PLOS One* API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their [DOI](https://www.doi.org/).\n", "\n", - "```\n", - "urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,\n", - "'https://www.gutenberg.org/files/1400/1400-0.txt' ,\n", - "'https://www.gutenberg.org/files/786/786-0.txt' ,\n", - "'https://www.gutenberg.org/files/766/766-0.txt' \n", - "]\n", - "```\n", + "Such requests can be sent using API calls with the following structure:\n", "\n", - "Write a program in Python that downloads all the files in this list and stores them in the current directory.\n", - "As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt').\n", - "The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.\n" + "https://api.plos.org/search?q=id:{doi}\n", + "\n", + "To acquire data about the article with DOI [10.1371/journal.pone.0270739](https://doi.org/10.1371/journal.pone.0270739), for example, you can use the following API call:\n", + "\n", + "https://api.plos.org/search?q=id:10.1371/journal.pone.0270739\n", + "\n", + "Try to write code which can get hold of metadata about the articles with the following DOIs:\n", + "\n", + "* 10.1371/journal.pone.0169045\n", + "* 10.1371/journal.pone.0271074\n", + "* 10.1371/journal.pone.0268993\n", + "\n", + "For each article, print the title, the publication date, the article type, a list of all the authors and the abstract. 
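As a first step, the API calls for these articles can be assembled before any request is sent. The sketch below only builds the URL for each DOI, following the `q=id:{doi}` pattern shown above; the helper name is illustrative, and performing the actual requests (e.g. with `requests.get(api_call)`) and picking the metadata fields out of the JSON response is left to you.

```python
# Sketch: construct the PLOS Search API call for each DOI.
# No request is sent here; pass each URL to requests.get()
# to retrieve the actual article metadata.
dois = ['10.1371/journal.pone.0169045',
        '10.1371/journal.pone.0271074',
        '10.1371/journal.pone.0268993']

def build_plos_call(doi):
    # The DOI is supplied in the 'q' parameter as 'id:{doi}'
    return f'https://api.plos.org/search?q=id:{doi}'

for doi in dois:
    print(build_plos_call(doi))
```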
\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "dois = [ '10.1371/journal.pone.0169045',\n", " '10.1371/journal.pone.0268993',\n", " '10.1371/journal.pone.0271074' ]\n", "\n", "\n", " " ] }, @@ -377,11 +418,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.2.\n", + "## Webscraping\n", "\n", - "Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only.\n", + "When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on a site by making use of web scraping. It is a process in which a computer program tries to process the contents of a given webpage, and to extract the data values that are needed. The aim of such an application is generally to copy information on a web page and to paste it into a local database.\n", "\n", - "*Hint: the tutorial covers the Wikipedia API.*" + "To get the most out of webscraping, you need to have a basic understanding of HTML. This [basic introduction](https://bookandbyte.universiteitleiden.nl/DMT/HTML/HTML.pdf) may provide a start. Other tutorials can be found on the web. 
Web scraping should be used with caution, because it may be not be allowed to download large quantities of data from a specific website. In this tutorial we will only look at extracting information from single pages.\n", + "\n", + "To scrape webpages, you firstly need to download them. This can be done using the `requests` library that was explained above. The code below scrapes data from a page on the [Internet Movie Database](https://www.imdb.com/) website, listing the top rated movies." ] }, { @@ -392,23 +435,39 @@ "source": [ "import requests\n", "\n", - "baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'\n", + "url = 'https://www.imdb.com/chart/top?ref_=ft_250'\n", + "response = requests.get( url )\n", "\n", - "# Get the search results and display them\n", - "\n" + "if response:\n", + " response.encoding = 'utf-8'\n", + " html_page = response.text " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.3.\n", + "Once you have obtained the contents of a webpage, in the form of an HTML document, you can begin to extract the data values that you are interested in. This tutorial explains how you can extract the title of these movies and the URLs of the pages on IMDB using web scraping. \n", "\n", - "Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.\n", + "If you inspect the output of the previous cell (the HTML code), you can see that the information about the movies is encoded as follows:\n", "\n", - "Information about individual ORCID accounts can be obtained by appending their ID to the base URL . The ORCID API returns data in XML by default. 
In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below).\n", + "```\n", + "\n", "\n", - "*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has a quite steep learning curve.*" + "\n", + "The Godfather\n", + "\n", + "\n", + "```\n", + "\n", + "The data can be found in a <td> element whose 'class' attribute has value 'titleColumn'. The actual title is given in a hyperlink, encoded using <a>. The URL to the page for the movie is given in an 'href' attribute. 'Scraping' the page really means that we need to extract the values we need from these HTML elements. \n", + "\n", + "\n", + "One of the libraries that you can use in Python for scraping online resources is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).\n", + "The code below firstly transforms the HTML code that was downloaded into a `BeautifulSoup` object. From the `bs4` library we import the `BeautifulSoup` class. We then *construct* an object of this class, providing the full contents of an HTML document as a first parameter. As a second parameter, you need to provide the name of one of the parsers that are available. Generally, a parser is an application which can process and analyse data. In this context, it refers to a program which can analyse the HTML file. One of the parsers that we can use is `lxml`. Using this parser, the `BeautifulSoup()` method converts the downloaded HTML page into a `BeautifulSoup` object. \n", + "\n", + "The `prettify()` method of this object creates a more readable version of the HTML file by adding indents and end of line characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "# Choose an ORCID to look up, e.g. 
0000-0002-8469-6804\n", - "orcid = ''\n", - "\n", - "\n", - "import re\n", - "import requests\n", - "import xml.etree.ElementTree as ET\n", - "\n", + "from bs4 import BeautifulSoup\n", "\n", - "ns = {'o': 'http://www.orcid.org/ns/orcid' ,\n", - "'s' : 'http://www.orcid.org/ns/search' ,\n", - "'h': 'http://www.orcid.org/ns/history' ,\n", - "'p': 'http://www.orcid.org/ns/person' ,\n", - "'pd': 'http://www.orcid.org/ns/personal-details' ,\n", - "'a': 'http://www.orcid.org/ns/activities' ,\n", - "'e': 'http://www.orcid.org/ns/employment' ,\n", - "'c': 'http://www.orcid.org/ns/common' , \n", - "'w': 'http://www.orcid.org/ns/work'}\n", + "soup = BeautifulSoup(html_page, \"lxml\")\n", "\n", - "# We expect that there may be an error and therefore use `try` and `except`\n", - "try:\n", - " # Construct the API call\n", - " orcidUrl = \"https://pub.orcid.org/v2.0/\" + orcid\n", - " print( orcidUrl )\n", - " \n", - " # Find and print the record creation date\n", - " \n", - " \n", - " # Find and print the titles of the publications\n", - "\n", - " \n", - "except:\n", - " print(\"Data could not be downloaded\")" + "print( soup.prettify() )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.4.\n", - "\n", - "The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. \n", - "\n", - "Following the `q` parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. 
Use this API to find the longitude and the latitude of the addresses in the following list:\n", "\n", - "```\n", - "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", - "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", - "```\n", - "\n", - "The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the `json()` method: \n", + "The BeautifulSoup object that was created above has a `find_all()` method, which you can use to find all occurrences of a specific HTML tag. The name of the tag (or element) needs to be mentioned as the first parameter. \n", "\n", - "```json_data = response.json()```\n", + "In our example, we need to focus on specific types of <td> elements: those which have a 'class' attribute with value 'titleColumn'. Such criteria for the attributes can be given as the second parameter in `find_all()`.\n", "\n", - "If the result is saved as variable named `json_data`, you should be able to access the latitude and the longitude as follows:\n", + "As we saw in the HTML snippet above, the <td> elements do not contain the title and the url directly. These values are given in the <a> child element. Such child elements, or subelements, can be found using `findChildren()`. As a parameter, you need to give the name of the tag you want to find underneath the current element. In the code below, the variable `children` represents all the <a> elements found underneath <td>. \n", "\n", - "```\n", - "latitude = json_data[0]['lat']\n", - "longitude = json_data[0]['lon']\n", - "```" + "To retrieve only the text of the tag (i.e. the text which is encoded using the tag), we can use the `text` property. To retrieve the value of an attribute of this element, we can use the `get()` method. As an argument, this method demands the name of the attibute we are interested in, `href` in this case. 
"
  ]
 },
 {
@@ -485,43 +503,45 @@
  "metadata": {},
  "outputs": [],
  "source": [
+    "movies = soup.find_all('td', {'class': 'titleColumn'} )\n",
    "\n",
-    "import requests\n",
-    "import urllib.parse\n",
-    "import re\n",
-    "import string\n",
-    "from os.path import isfile, join , isdir\n",
-    "import os\n",
+    "for m in movies:\n",
+    "    # Find links (a elements) within the cell\n",
+    "    children = m.findChildren(\"a\" , recursive=False)\n",
+    "    for c in children:\n",
+    "        movie_title = c.text\n",
+    "        url = c.get('href')\n",
+    "        ## This is an internal link, so we need to prepend the base url\n",
+    "        url = 'https://imdb.com' + url\n",
+    "        print( f'{movie_title}: {url}' ) "
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+    "Once you have created a list of URLs using the method outlined above, you can also download all the texts that were found, using the `get()` method from the `requests` library.\n",
    "\n",
-    "addresses = ['Grote Looiersstraat 17 Maastricht' , \n",
-    " 'Witte Singel 27 Leiden','Singel 425 Amsterdam' , \n",
-    " 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n",
+    "As you can see, web scraping can easily become rather difficult. You need to inspect the structure of the HTML source quite carefully, and you often need to work with fairly complicated code to extract only the values that you need. \n",
    "\n",
-    "for a in addresses:\n",
-    "    # create the API call, with the address in the 'q' parameter\n",
-    "    \n",
-    "    # Get the JSON data and process the data using json()\n",
-    "    \n",
-    "    # Find the latitude and the longitude\n",
-    "    #latitude = json_data[0]['lat']\n",
-    "    #longitude = json_data[0]['lon']\n",
-    "    \n",
-    "    \n",
-    "\n"
+    "\n",
+    "### Advanced scraping: Scrapy\n",
+    "\n",
+    "This tutorial has only touched the surface of web scraping. To get specific data from webpages or APIs, you will need to dig into the data that you get and probably learn more about the data formats. 
A more advanced framework (or toolkit) for webscraping with Python is [Scrapy](https://scrapy.org). This framework simplifies the process of building a scraper/crawler considerably by providing basic functionalities out of the box. Although Scrapy does not understand what parts of webpages are of interest to you, it does many things for you, such as making sure you don't send too many requests at the same time or retrying requests that fail. Feel free to look at the [Scrapy tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html) if you want to experiment with this library. "
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "## Exercise 10.5\n",
+    "### Exercise 10.5.\n",
    "\n",
    "The webpage below offers access to the complete work of the author H.P. Lovecraft. \n",
    "\n",
-    "http://www.hplovecraft.com/writings/texts/\n",
+    "https://www.hplovecraft.com/writings/texts/\n",
    "\n",
    " \n",
-    "Write code in Python to find the URLs of all the texts that are listed. The links are all encoded in an element named <a>. The attribute `href` mentions the links, and the body of the <a> element mentions the title. List only the web pages that end in '.aspx'. \n"
+    "Write code in Python to find and print the URLs of all the texts that are listed. The links are all encoded in an element named <a>. The attribute `href` mentions the links, and the body of the <a> element mentions the title. List only the web pages that end in '.aspx'. 
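The '.aspx' filter that this exercise asks for boils down to a regular-expression test on each URL. A minimal sketch, using invented link paths rather than the live page:

```python
import re

# Invented example paths; on the real page these would come from the 'href'
# attributes of the <a> elements
urls = ['fiction/cc.aspx', 'poetry/p289.aspx', 'mailto:someone@example.com']

# Keep only the URLs that end in '.aspx'
aspx_urls = [u for u in urls if re.search(r'\.aspx$', u)]
print(aspx_urls)
# ['fiction/cc.aspx', 'poetry/p289.aspx']
```

The `$` anchor ties the match to the end of the string, so links that merely contain 'aspx' somewhere in the middle would not slip through.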
\n" ] }, { @@ -534,18 +554,20 @@ "import requests\n", "import re\n", "\n", - "base_url = \"http://www.hplovecraft.com/writings/texts/\"\n" + "base_url = \"https://www.hplovecraft.com/writings/texts/\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.6\n", + "### Exercise 10.6.\n", "\n", "Using `requests` and `BeautifulSoup`, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.\n", "\n", - "Also collect data about the capital, the population and the area of all of these countries. " + "Also collect and print data about the capital, the population and the area of all of these countries.\n", + "\n", + "How you print or present the information is not too important here; the challenge in this exercise is to extract the data from the webpage." ] }, { @@ -563,17 +585,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.7\n", + "### Exercise 10.7.\n", "\n", - "Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501 \n", + "Download all the images shown on the following page: \n", "\n", "You can follow these steps:\n", "\n", - "* Download the HTML file\n", - "* 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `` element, try to create a list containing all occurrences of this element. \n", - "* Find the URLS of all the images. Witnin these `` element, there should be a `src` attribute containing the URL of the image. \n", - "* The bbc.com website uses images as part of the user interface. These images all have the word 'line' in their filenames. Try to exclude these images whose file names contain the word 'line'. \n", - "* Download all the images that you found in this way, using the `requests` library. In the `Response` object that is created following a succesful download, you need to work with the `content` property to obtain the actual file. 
Save all these images on your computer, using `open()` and `write()`. In the `open()` function, use the code ‘wb’ as a second parameter (instead of only ‘w’) to make sure that the contents are saved as bytes.\n"
+    "1. Download the HTML file\n",
+    "1. 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `img` element, try to create a list containing all occurrences of this element. \n",
+    "1. Find the URLs of all the images. Within these `img` elements, there should be a `src` attribute containing the URL of the image. \n",
+    "1. The bbc.com website uses images as part of the user interface. These images all have the word 'line' in their filenames. Try to exclude these images whose file names contain the word 'line'. \n",
+    "1. Download all the images that you found in this way, using the `requests` library. In the `Response` object that is created following a successful download, you need to work with the `content` property to obtain the actual file. Save all these images on your computer, using `open()` and `write()`. In the `open()` function, use the string `\"wb\"` (write binary) as a second parameter (instead of only `\"w\"`) to make sure that the contents are saved as bytes.\n"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -587,42 +609,21 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "## Exercise 10.8\n",
+    "## Additional challenges\n",
    "\n",
-    "As was discussed in this notebook, you can use the *MusicBrainz* API to request information about musicians. Via the code that is provided, you can request the names and the type (i.e. are we dealing with a person or with a group?). This specific API can make much more information available, however. 
Try to add some code with can add the following data about each artist: \n",
+    "These are exercises that require more effort to complete.\n",
+    "For example, you would need to study the documentation and inspect the results of the Wikipedia API to understand how it works.\n",
    "\n",
-    "* The date of birth (in the case of a person) or formation (in the case of a group)\n",
-    "* The date of death or breakup\n",
-    "* The place of birth or formation\n",
-    "* The place of death or breakup\n",
-    "* Aliases\n",
-    "* Tags associated with the artist\n",
-    "\n",
-    "Tip: 'Uncomment' the print statement in the second cell to be able explore the structure of the JSON data. \n"
+    "If you need a good challenge, these exercises may be for you!"
  ]
 },
 {
-  "cell_type": "code",
-  "execution_count": null,
+  "cell_type": "markdown",
  "metadata": {},
-  "outputs": [],
  "source": [
-    "import requests\n",
-    "from requests.utils import requote_uri\n",
-    "\n",
-    "\n",
-    "root_url = 'https://musicbrainz.org/ws/2/'\n",
+    "### Exercise 10.2.\n",
    "\n",
-    "## The parameters for the API call are defined as variables\n",
-    "entity = 'artist'\n",
-    "query = 'David Bowie'\n",
-    "limit = 5\n",
-    "fmt = 'json'\n",
-    "\n",
-    "query = requote_uri(query)\n",
-    "\n",
-    "api_call = f'{root_url}{entity}?query={query}&fmt={fmt}&limit={limit}'\n",
-    "response = requests.get( api_call )"
+    "Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only."
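Building such an API call is essentially string formatting. A small sketch of how a base URL and a dictionary of parameters can be combined with `urllib.parse.urlencode`; the parameter names follow the opensearch endpoint used elsewhere in this notebook:

```python
from urllib.parse import urlencode

# The base URL already carries one parameter (action), so further
# parameters are appended with '&'
base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'
params = {'search': 'Dutch', 'limit': 30, 'format': 'json'}

# urlencode takes care of percent-encoding the values where needed
api_call = base_url + '&' + urlencode(params)
print(api_call)
# https://en.wikipedia.org/w/api.php?action=opensearch&search=Dutch&limit=30&format=json
```

Letting `urlencode` (or the `params` keyword of `requests.get()`) do the escaping avoids broken URLs when a search term contains spaces or other special characters.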
]
 },
 {
@@ -631,59 +632,66 @@
  "metadata": {},
  "outputs": [],
  "source": [
-    "import json\n",
+    "import requests\n",
    "\n",
-    "musicbrainz_results = response.json()\n",
+    "base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'\n",
    "\n",
-    "for artist in musicbrainz_results['artists']:\n",
-    "    #print(json.dumps(artist, indent=4))\n",
-    "    name = artist.get('name','[unknown]')\n",
-    "    artist_type = artist.get('type','[unknown]')\n",
-    "    print(f'{name} ({artist_type})')\n",
-    "    ## Add your code below\n",
-    "    \n",
-    "    "
+    "# Get the search results and display them\n",
+    "\n"
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-    "## Exercise 10.9\n",
+    "### Exercise 10.3.\n",
    "\n",
-    "*[PLOS One](https://journals.plos.org/plosone/)* is a peer reviewed open access journal. The *PLOS One* API can be used to request metadata about all the articles that have been published in the journal. In this API, you can refer to specific articles using their [DOI](https://www.doi.org/).\n",
-    "\n",
-    "Such requests can be sent using API calls with the following structure:\n",
-    "\n",
-    "https://api.plos.org/search?q=id:{doi}\n",
-    "\n",
-    "To acquire data about the article with DOI [10.1371/journal.pone.0270739](https://doi.org/10.1371/journal.pone.0270739), for example, you can use the following API call:\n",
-    "\n",
-    "https://api.plos.org/search?q=id:10.1371/journal.pone.0270739\n",
-    "\n",
-    "Try to write code which can get hold of metadata about the articles with the following DOIs:\n",
+    "Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.\n",
    "\n",
-    "* 10.1371/journal.pone.0169045\n",
-    "* 10.1371/journal.pone.0271074\n",
-    "* 10.1371/journal.pone.0268993\n",
+    "Information about individual ORCID accounts can be obtained by appending their ID to the base URL <https://pub.orcid.org/v2.0/>. The ORCID API returns data in XML by default. 
In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below).\n", "\n", - "For each article, print the title, the publication date, the article type, a list of all the authors and the abstract. \n" + "*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has a quite steep learning curve.*" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "# Choose an ORCID to look up, e.g. 0000-0002-8469-6804\n", + "orcid = ''\n", + "\n", + "\n", + "import re\n", "import requests\n", + "import xml.etree.ElementTree as ET\n", "\n", - "dois = [ '10.1371/journal.pone.0169045',\n", - " '10.1371/journal.pone.0268993',\n", - " '10.1371/journal.pone.0271074' ]\n", "\n", + "ns = {'o': 'http://www.orcid.org/ns/orcid' ,\n", + "'s' : 'http://www.orcid.org/ns/search' ,\n", + "'h': 'http://www.orcid.org/ns/history' ,\n", + "'p': 'http://www.orcid.org/ns/person' ,\n", + "'pd': 'http://www.orcid.org/ns/personal-details' ,\n", + "'a': 'http://www.orcid.org/ns/activities' ,\n", + "'e': 'http://www.orcid.org/ns/employment' ,\n", + "'c': 'http://www.orcid.org/ns/common' , \n", + "'w': 'http://www.orcid.org/ns/work'}\n", "\n", - " " + "# We expect that there may be an error and therefore use `try` and `except`\n", + "try:\n", + " # Construct the API call\n", + " orcidUrl = \"https://pub.orcid.org/v2.0/\" + orcid\n", + " print( orcidUrl )\n", + " \n", + " # Find and print the record creation date\n", + " \n", + " \n", + " # Find and print the titles of the publications\n", + "\n", + " \n", + "except:\n", + " print(\"Data could not be downloaded\")" ] } ], @@ -703,7 +711,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.9.7" } }, "nbformat": 4, diff --git 
a/notebooks/Solutions/10 Data_acquisition.ipynb b/notebooks/Solutions/10 Data_acquisition.ipynb index a3dc09f..057f3ae 100644 --- a/notebooks/Solutions/10 Data_acquisition.ipynb +++ b/notebooks/Solutions/10 Data_acquisition.ipynb @@ -11,7 +11,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.1.\n", + "## Direct downloads\n", + "\n", + "### Exercise 10.1.\n", "\n", "The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.\n", "\n", @@ -35,7 +37,7 @@ "outputs": [], "source": [ "import requests\n", - "import os \n", + "import os.path\n", "\n", "# Recreate the given list using copy and paste\n", "urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,\n", @@ -64,131 +66,35 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.2.\n", - "\n", - "Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. 
Your code needs to display the first 30 results only.\n", - "\n", - "*Hint: the tutorial covers the Wikipedia API.*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "import json\n", - "\n", - "# Let's construct the full API call (which is a URL) piece by piece\n", - "baseURL = 'https://en.wikipedia.org/w/api.php?action=opensearch'\n", + "## Acquiring data via APIs\n", "\n", - "searchTerm = \"Dutch\"\n", - "limit = 30\n", - "data_format = 'json'\n", - "\n", - "apiCall = '{}&search={}&limit={}&format={}'.format( baseURL, searchTerm , limit , data_format )\n", - "\n", - "# Get the data using the Requests library\n", - "responseData = requests.get( apiCall )\n", + "### Exercise 10.4.\n", "\n", - "# Because we asked for and got JSON-formatted data, Requests lets us access\n", - "# the data as a Python data structure using the .json() method\n", - "wikiResults = responseData.json()\n", + "Find the coordinates for each address in the given list using [OpenStreetMap](https://www.openstreetmap.org/)'s Nominatim API.\n", "\n", - "# Now we print the search results \n", - "for i in range( 0 , len(wikiResults[1]) ):\n", - " print( 'Title: ' + wikiResults[1][i] )\n", - " print( 'Tagline: ' + wikiResults[2][i] )\n", - " print( 'Url: ' + wikiResults[3][i] + '\\n')\n", - " \n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise 10.3.\n", + "The Nominatim API can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is .\n", "\n", - "Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.\n", + "Following the `q` parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. 
As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. Use this API to find the longitude and the latitude of the addresses in the following list:\n", "\n", - "Information about individual ORCID accounts can be obtained by appending their ID to the base URL . The ORCID API returns data in XML by default. In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below).\n", + "```\n", + "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", + "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", + "```\n", "\n", - "*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. It is very powerful, but has a quite steep learning curve.*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "orcid = '0000-0002-8469-6804'\n", + "The JSON data received via the OpenStreetMap API can be converted to regular Python lists and dictionaries using the `json()` method: \n", "\n", + "```json_data = response.json()```\n", "\n", - "import re\n", - "import requests\n", - "import xml.etree.ElementTree as ET\n", + "If the result is saved as variable named `json_data`, you should be able to access the latitude and the longitude as follows:\n", "\n", - "# Declare namespace abbreviations\n", - "ns = {'o': 'http://www.orcid.org/ns/orcid' ,\n", - "'s' : 'http://www.orcid.org/ns/search' ,\n", - "'h': 'http://www.orcid.org/ns/history' ,\n", - "'p': 'http://www.orcid.org/ns/person' ,\n", - "'pd': 'http://www.orcid.org/ns/personal-details' ,\n", - "'a': 'http://www.orcid.org/ns/activities' ,\n", - "'e': 'http://www.orcid.org/ns/employment' ,\n", - "'c': 'http://www.orcid.org/ns/common' , \n", - "'w': 'http://www.orcid.org/ns/work',\n", - "'r': 
'http://www.orcid.org/ns/record'}\n", - "\n", - "\n", - "try:\n", - " # Construct the API call and retrieve the data\n", - " orcidUrl = \"https://pub.orcid.org/v2.0/\" + orcid\n", - " print( orcidUrl )\n", - " \n", - " response = requests.get( orcidUrl )\n", - " \n", - " # Parse XML string into its Python ElementTree object representation\n", - " root = ET.fromstring(response.text)\n", - " \n", - " # Find and print the ORCID creation date\n", - " creationDate = root.find('h:history/h:submission-date' , ns ).text\n", - " \n", - " print('\\nORCID created on:')\n", - " print(creationDate)\n", - " \n", - " # Print the title and DOI of each work (DOI only when available)\n", - " print('\\nWorks:')\n", - " \n", - " works = root.findall('a:activities-summary/a:works/a:group' , ns )\n", - " for w in works:\n", - " title = w.find('w:work-summary/w:title/c:title' , ns ).text\n", - " print(title)\n", - " doiEl = w.find('c:external-ids/c:external-id/c:external-id-url' , ns )\n", - " if doiEl is not None:\n", - " doi = doiEl.text\n", - " print(doi)\n", - " \n", - "except:\n", - " print(\"Data could not be downloaded\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise 10.4.\n", + "```\n", + "latitude = json_data[0]['lat']\n", + "longitude = json_data[0]['lon']\n", + "```\n", "\n", - "The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. Following the `q` parameter, you need to supply a string describing the locations whose latitude and longitude you want to find. As values for the `format` parameter, you can use `xml` for XML-formatted data or `json` for JSON-formatted data. 
Use this API to find the longitude and the latitude of the addresses in the following list:\n", + "The `[0]` is used to get the results for the first result.\n", "\n", - "```\n", - "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", - "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", - "```" + "Print each address and its latitude and longitude coordinates." ] }, { @@ -197,23 +103,18 @@ "metadata": {}, "outputs": [], "source": [ - "\n", "import requests\n", - "import urllib.parse\n", - "import re\n", - "import string\n", - "from os.path import isfile, join , isdir\n", - "import os\n", "\n", "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", "\n", "\n", "for a in addresses:\n", - " url = 'https://nominatim.openstreetmap.org/search?q='+ a_encoded + '&format=json'\n", + " url = f'https://nominatim.openstreetmap.org/search?q={a}&format=json'\n", "\n", - " response = requests.get( url )\n", - " json_data = response.json() \n", + " response = requests.get( url ) # The spaces in each address are automatically encoded as '%20' by requests\n", + " json_data = response.json()\n", + " # json_data is a list of results; we assume that the first result is always correct(!)\n", " latitude = json_data[0]['lat']\n", " longitude = json_data[0]['lon']\n", " print( f'{latitude},{longitude}')\n" @@ -232,13 +133,9 @@ "metadata": {}, "outputs": [], "source": [ - "\n", "import requests\n", "import urllib.parse\n", "import re\n", - "import string\n", - "from os.path import isfile, join , isdir\n", - "import os\n", "\n", "addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' ,\n", "'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']\n", @@ -247,6 +144,7 @@ "\n", "\n", "for a in addresses:\n", + " # You can 'URL-encode' the spaces in the addresses yourself, instead of leaving it to 
requests\n", " a_encoded = urllib.parse.quote(a)\n", " url = 'https://nominatim.openstreetmap.org/search?q='+ a_encoded + '&format=json'\n", " print(url)\n", @@ -260,44 +158,37 @@ "\n", "\n", "out = open( 'map.html' , 'w' , encoding = 'utf-8')\n", - "import re\n", "\n", "out.write('''\n", "\n", "\n", "\n", "\n", - " Locations\n", - "\n", - " \n", - " \n", + " Locations\n", "\n", - " \n", + " \n", + " \n", "\n", " \n", " \n", "\n", - "\n", - "\n", "\n", "\n", "\n", - "\n", - "\n", "
\n", "\n", "\n", - "\n", - "\n", "\n", "\n", - "\n", "''')\n", "\n", + "# Do not forget to close the file handler!\n", "out.close()\n" ] }, @@ -324,156 +213,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.5\n", - "\n", - "The webpage below offers access to the complete work of the author H.P. Lovecraft. \n", - "\n", - "http://www.hplovecraft.com/writings/texts/\n", - "\n", - " \n", - "Write code in Python to find the URLs of all the texts that are listed. The links are all encoded in an element named <a>. The attribute `href` mentions the links, and the body of the <a> element mentions the title. List only the web pages that end in '.aspx'. \n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from bs4 import BeautifulSoup\n", - "import requests\n", - "import re\n", - "\n", - "base_url = \"http://www.hplovecraft.com/writings/texts/\"\n", - "\n", - "response = requests.get(base_url)\n", - "if response: \n", - " #print(response.text)\n", - " soup = BeautifulSoup( response.text ,\"lxml\")\n", - " links = soup.find_all(\"a\")\n", - " for link in links:\n", - " if link.get('href') is not None:\n", - " title = link.string\n", - " url = base_url + link.get('href')\n", - " if re.search( r'aspx$' , url): \n", - " print( f'{title}\\n{url}')\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise 10.6\n", - "\n", - "Using `requests` and `BeautifulSoup`, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.\n", - "\n", - "Also collect data about the capital, the population and the area of all of these countries. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import requests\n", - "from bs4 import BeautifulSoup\n", - "\n", - "url = 'https://www.scrapethissite.com/pages/simple/'\n", - "\n", - "response = requests.get(url)\n", - "\n", - "if response.status_code == 200:\n", - " response.encoding = 'utf-8'\n", - " html_page = response.text\n", - " \n", - " \n", - "soup = BeautifulSoup( html_page,\"lxml\")\n", - " \n", - "countries = soup.find_all('div', {'class': 'col-md-4 country'} )\n", - "\n", - "\n", - "for c in countries:\n", - " \n", - " name = c.find('h3' , { 'class':'country-name'})\n", - " print(name.text.strip())\n", - " \n", - " # find all elements underneath

\n", - " span = c.find_all(\"span\" )\n", - " \n", - " capital = ''\n", - " population = 0\n", - " area = 0\n", - " \n", - " for s in span:\n", - "\n", - " if s['class'][0] == 'country-capital':\n", - " capital = s.text\n", - " \n", - " if s['class'][0] == 'country-population':\n", - " population = s.text\n", - " \n", - " if s['class'][0] == 'country-area':\n", - " area = s.text\n", - " \n", - " print(f' Capital: {capital}')\n", - " print(f' Population: {population}')\n", - " print(f' Area: {area}')\n", - " print()\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise 10.7\n", - "\n", - "Download all the images shown on the following page: https://www.bbc.com/news/in-pictures-61014501 \n", - "\n", - "You can follow these steps:\n", - "\n", - "* Download the HTML file\n", - "* 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `img` element, try to create a list containing all occurrences of this element. \n", - "* Find the URLS of all the images. Witnin these `img` element, there should be a `src` attribute containing the URL of the image. \n", - "* The bbc.com website uses images as part of the user interface. These images all have the word 'line' in their filenames. Try to exclude these images whose file names contain the word 'line'. \n", - "* Download all the images that you found in this way, using the `requests` library. In the `Response` object that is created following a succesful download, you need to work with the `content` property to obtain the actual file. Save all these images on your computer, using `open()` and `write()`. 
In the `open()` function, use the code ‘wb’ as a second parameter (instead of only ‘w’) to make sure that the contents are saved as bytes.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "url = 'https://www.bbc.com/news/in-pictures-61014501'\n", - "\n", - "response = requests.get(url)\n", - "\n", - "if response:\n", - " html_page = response.text\n", - " soup = BeautifulSoup( html_page,\"lxml\")\n", - " images = soup.find_all('img')\n", - " for i in images:\n", - " img_url = i.get('src')\n", - " if 'line' not in img_url:\n", - " response = requests.get(img_url)\n", - " if response:\n", - " file_name = os.path.basename(img_url)\n", - " print(file_name)\n", - " out = open( file_name , 'wb' )\n", - " out.write(response.content)\n", - " out.close()\n", - " " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercise 10.8\n", + "### Exercise 10.8.\n", "\n", "As was discussed in this notebook, you can use the *MusicBrainz* API to request information about musicians. Via the code that is provided, you can request the names and the type (i.e. are we dealing with a person or with a group?). This specific API can make much more information available, however. Try to add some code with can add the following data about each artist: \n", "\n", @@ -583,7 +323,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Exercise 10.9\n", + "### Exercise 10.9.\n", "\n", "*[PLOS One](https://journals.plos.org/plosone/)* is a peer reviewed open access journal. The *PLOS One* API can be used to request metadata about all the articles that have been published in the journal. 
In this API, you can refer to specific articles using their [DOI](https://www.doi.org/).\n", "\n", @@ -640,6 +380,261 @@ "\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Webscraping\n", + "\n", + "### Exercise 10.5.\n", + "\n", + "The webpage below offers access to the complete work of the author H.P. Lovecraft. \n", + "\n", + "\n", + " \n", + "Write code in Python to find and print the URLs of all the texts that are listed. The links are all encoded in an element named <a>. The attribute `href` mentions the links, and the body of the <a> element mentions the title. List only the web pages that end in '.aspx'. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from bs4 import BeautifulSoup\n", + "import requests\n", + "import re\n", + "\n", + "base_url = \"http://www.hplovecraft.com/writings/texts/\"\n", + "\n", + "response = requests.get(base_url)\n", + "if response: \n", + " #print(response.text)\n", + " soup = BeautifulSoup( response.text ,\"lxml\")\n", + " links = soup.find_all(\"a\")\n", + " for link in links:\n", + " if link.get('href') is not None:\n", + " title = link.string\n", + " url = base_url + link.get('href')\n", + " if re.search( r'aspx$' , url): \n", + " print( f'{title}\\n{url}')\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10.6.\n", + "\n", + "Using `requests` and `BeautifulSoup`, create a list of all the countries mentioned on https://www.scrapethissite.com/pages/simple/.\n", + "\n", + "Also collect and print data about the capital, the population and the area of all of these countries.\n", + "\n", + "How you print or present the information is not too important here; the challenge in this exercise is to extract the data from the webpage." 
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+    "import requests\n",
+    "from bs4 import BeautifulSoup\n",
+    "\n",
+    "url = 'https://www.scrapethissite.com/pages/simple/'\n",
+    "\n",
+    "response = requests.get(url)\n",
+    "\n",
+    "if response.status_code == 200:\n",
+    "    response.encoding = 'utf-8'\n",
+    "    html_page = response.text\n",
+    "    \n",
+    "    \n",
+    "soup = BeautifulSoup( html_page,\"lxml\")\n",
+    "    \n",
+    "countries = soup.find_all('div', {'class': 'col-md-4 country'} )\n",
+    "\n",
+    "\n",
+    "for c in countries:\n",
+    "    \n",
+    "    name = c.find('h3' , { 'class':'country-name'})\n",
+    "    print(name.text.strip())\n",
+    "    \n",
+    "    capital = c.find('span', { 'class':'country-capital'}).text\n",
+    "    population = c.find('span', { 'class':'country-population'}).text\n",
+    "    area = c.find('span', { 'class':'country-area'}).text\n",
+    "    \n",
+    "    print(f' Capital: {capital}')\n",
+    "    print(f' Population: {population}')\n",
+    "    print(f' Area: {area}')\n",
+    "    print()\n"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+    "### Exercise 10.7.\n",
+    "\n",
+    "Download all the images shown on the following page: <https://www.bbc.com/news/in-pictures-61014501>\n",
+    "\n",
+    "You can follow these steps:\n",
+    "\n",
+    "1. Download the HTML file\n",
+    "1. 'Scrape' the HTML file you downloaded. As images in HTML are encoded using the `img` element, try to create a list containing all occurrences of this element. \n",
+    "1. Find the URLs of all the images. Within these `img` elements, there should be a `src` attribute containing the URL of the image. \n",
+    "1. The bbc.com website uses images as part of the user interface. These images all have the word 'line' in their filenames. Try to exclude these images whose file names contain the word 'line'. \n",
+    "1. Download all the images that you found in this way, using the `requests` library. 
In the `Response` object that is created following a successful download, you need to work with the `content` property to obtain the actual file. Save all these images on your computer, using `open()` and `write()`. In the `open()` function, use the string `\"wb\"` (write binary) as a second parameter (instead of only `\"w\"`) to make sure that the contents are saved as bytes.\n"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+    "import os\n",
+    "\n",
+    "url = 'https://www.bbc.com/news/in-pictures-61014501'\n",
+    "\n",
+    "response = requests.get(url)\n",
+    "\n",
+    "if response:\n",
+    "    html_page = response.text\n",
+    "    soup = BeautifulSoup( html_page,\"lxml\")\n",
+    "    images = soup.find_all('img')\n",
+    "    for i in images:\n",
+    "        img_url = i.get('src')\n",
+    "        if 'line' not in img_url:\n",
+    "            response = requests.get(img_url)\n",
+    "            if response:\n",
+    "                file_name = os.path.basename(img_url)\n",
+    "                print(file_name)\n",
+    "                out = open( file_name , 'wb' )\n",
+    "                out.write(response.content)\n",
+    "                out.close()\n",
+    "    "
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+    "## Additional challenges\n",
+    "\n",
+    "### Exercise 10.2.\n",
+    "\n",
+    "Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "# Let's construct the full API call (which is a URL) piece by piece\n", + "base_url = 'https://en.wikipedia.org/w/api.php?action=opensearch'\n", + "\n", + "search_term = 'Dutch'\n", + "limit = 30\n", + "data_format = 'json'\n", + "\n", + "api_call = f'{base_url}&search={search_term}&limit={limit}&format={data_format}'\n", + "\n", + "# Get the data using the requests library\n", + "response_data = requests.get(api_call)\n", + "\n", + "# Because we asked for and received JSON-formatted data, requests lets us access\n", + "# the data as a Python data structure using the .json() method\n", + "wiki_results = response_data.json()\n", + "\n", + "# Now we print the search results\n", + "for i in range(len(wiki_results[1])):\n", + "    print('Title: ' + wiki_results[1][i])\n", + "    print('Tagline: ' + wiki_results[2][i])\n", + "    print('Url: ' + wiki_results[3][i] + '\\n')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 10.3.\n", + "\n", + "Write an application in Python that extracts all the publications that have been added to a specific ORCID account, using the ORCID API.\n", + "\n", + "Information about individual ORCID accounts can be obtained by appending their ID to the base URL `https://pub.orcid.org/v2.0/`. The ORCID API returns data in XML by default. In the XML, the list of publications can be found using the XPath `r:record/a:activities-summary/a:works/a:group` (using the namespace declarations given below).\n", + "\n", + "*Note: we use the [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) library to process the XML data. 
It is very powerful, but has quite a steep learning curve.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import requests\n", + "import xml.etree.ElementTree as ET\n", + "\n", + "orcid = '0000-0002-8469-6804'\n", + "\n", + "# Declare namespace abbreviations\n", + "ns = {'o': 'http://www.orcid.org/ns/orcid',\n", + "      's': 'http://www.orcid.org/ns/search',\n", + "      'h': 'http://www.orcid.org/ns/history',\n", + "      'p': 'http://www.orcid.org/ns/person',\n", + "      'pd': 'http://www.orcid.org/ns/personal-details',\n", + "      'a': 'http://www.orcid.org/ns/activities',\n", + "      'e': 'http://www.orcid.org/ns/employment',\n", + "      'c': 'http://www.orcid.org/ns/common',\n", + "      'w': 'http://www.orcid.org/ns/work',\n", + "      'r': 'http://www.orcid.org/ns/record'}\n", + "\n", + "try:\n", + "    # Construct the API call and retrieve the data\n", + "    orcidUrl = 'https://pub.orcid.org/v2.0/' + orcid\n", + "    print(orcidUrl)\n", + "\n", + "    response = requests.get(orcidUrl)\n", + "\n", + "    # Parse the XML string into its Python ElementTree object representation\n", + "    root = ET.fromstring(response.text)\n", + "\n", + "    # Find and print the ORCID creation date\n", + "    creationDate = root.find('h:history/h:submission-date', ns).text\n", + "\n", + "    print('\nORCID created on:')\n", + "    print(creationDate)\n", + "\n", + "    # Print the title and DOI of each work (DOI only when available)\n", + "    print('\nWorks:')\n", + "\n", + "    works = root.findall('a:activities-summary/a:works/a:group', ns)\n", + "    for w in works:\n", + "        title = w.find('w:work-summary/w:title/c:title', ns).text\n", + "        print(title)\n", + "        doiEl = w.find('c:external-ids/c:external-id/c:external-id-url', ns)\n", + "        if doiEl is not None:\n", + "            doi = doiEl.text\n", + "            print(doi)\n", + "\n", + "except Exception as err:\n", + "    print('Data could not be downloaded:', err)" + ] + }, { "cell_type": "code",
"execution_count": null, @@ -664,7 +659,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.9.7" } }, "nbformat": 4,
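As a small aside to the ORCID exercise above: the `find()` and `findall()` methods of ElementTree accept a dict mapping prefixes to namespace URIs, exactly like the `ns` dict used in the solution. The following self-contained sketch shows this mechanism in isolation; the XML snippet, the `w` prefix and the namespace URI are invented for illustration and are not part of the ORCID data.

```python
import xml.etree.ElementTree as ET

# A small made-up XML document with namespaced elements
xml = '''
<record xmlns:w="http://example.org/ns/work">
    <w:title>First paper</w:title>
    <w:title>Second paper</w:title>
</record>
'''

# Map the prefix used in our search paths to the namespace URI
ns = {'w': 'http://example.org/ns/work'}

root = ET.fromstring(xml)

# The 'w:' prefix in the path is resolved through the ns dict
titles = [el.text for el in root.findall('w:title', ns)]
print(titles)  # ['First paper', 'Second paper']
```

The prefix in the dict does not have to match the prefix used in the XML document itself; only the namespace URI matters.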