{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### GISC 420 T1 2022\n", "# Web scraping in `python`\n", "There's a lot of data out there on the web these days, much of it spatial in nature. How can we get those data into a form we can use in GIS or other geospatial tools? That's what this week's lab materials focus on.\n", "\n", "To get there, we need to understand a little bit about how websites and web pages work.\n", "\n", "## What's in a web page\n", "OK... let's take a look at a [simple web page](https://southosullivan.com/gisc420/other/simple-web-page.html). This is an example of a very simple web page.\n", "\n", "In the browser you are viewing it with, do something like right-click **View page source**.\n", "\n", "You should see something like the following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "
\n", "<html>\n", "\n", "<head>\n", "</head>\n", "\n", "<body>\n", "\n", "<h1>...</h1>\n", "\n", "<p>\n", "This page demonstrates some very basic HTML for use in my classes.\n", "</p>\n", "\n", "<p>\n", "This page was written by Prof David O'Sullivan.\n", "</p>\n", "\n", "<p>\n", "David is from Ireland. Sometimes, he misses Ireland's glorious summers, like the one in\n", "this picture.\n", "</p>\n", "\n", "<figure>\n", "<img src=\"...\">\n", "<figcaption>\n", "That's Muckish Mountain in Donegal.\n", "</figcaption>\n", "</figure>\n", "\n", "<p>\n", "Here is a map.\n", "</p>\n", "\n", "<iframe src=\"...\"></iframe>\n", "\n", "<p>\n", "The picture was taken from roughly <a href=\"...\">here</a>. The google car got better weather than I did.\n", "</p>\n", "\n", "</body>\n", "</html>\n", "
\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although what you see will be syntax-coloured, to help you understand it a little better.\n", "\n", "The key thing from our present perspective is to note that there is *a tree-like structure* to this page.\n", "\n", "At the 'trunk' of the tree is the whole page denoted by `<html>` **tags**.\n", "\n", "Nested below this are the `<head>` and `<body>` **elements**.\n", "\n", "Inside each of these are other elements: headings, paragraphs, an image, a figure caption, and something called an iframe, inside of which is the map.\n", "\n", "This is an exceedingly simple web page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Something a bit more complicated\n", "So now look at [this page](https://southosullivan.com/gisc422/interpolation/#/).\n", "\n", "Let's again take a closer look, and also examine it with your browser's **web console** tools.\n", "\n", "**I'll talk you through this**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summing things up\n", "Web pages are assembled from several different pieces:\n", "\n", "1. **Hyper Text Markup Language** (HTML) tags describe the structure of the page using hierarchically nested tags that define the *elements* which make up the page.\n", "2. **Cascading Style Sheets** (CSS) determine how various elements in a page should be displayed.\n", "3. Client-side code in **JavaScript**, which is often used to build interactive elements in the page that respond to the page's readers.\n", "\n", "There is a fourth 'hidden' element to most pages, which is the server-side database(s) used to populate typical pages with requested information. On human-facing pages, this part of the process is generally handled by a combination of client-side interface elements that obtain information from users (search terms and so on), and form these into query strings that are decoded into database queries on the server-side. 
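\n", "\n", "For example, a search form's input might be encoded into a query string appended to a URL. Here is a minimal sketch in `python` (the endpoint and parameter names are invented for illustration):\n", "\n", "```python\n", "from urllib.parse import urlencode\n", "\n", "# hypothetical search endpoint and parameters\n", "params = {\"q\": \"muckish mountain\", \"num\": 10}\n", "url = \"https://example.com/search?\" + urlencode(params)\n", "print(url)  # https://example.com/search?q=muckish+mountain&num=10\n", "```\n", "\n", "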
Responses from the database populate the page that is eventually seen by a user.\n", "\n", "[DuckDuckGo](https://duckduckgo.com), [Google maps](https://maps.google.com), and [Let Me Google That For You](https://lmgtfy.com) provide simple examples of query strings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So much for simple\n", "The simple structure of the web pages we have looked at is about the limit of what can be built by hand by a semi-skilled human (like me...). Almost all web pages today are assembled in automated fashion. Take a look, for example, at [this page](https://www.pbtech.co.nz/product/BATPAN5254/Panasonic-K-KJ55MCC4TA-3hr-Quick-Charger--4-x-AA-e). Click on another page at the same site, and compare the code you can see in the web console inspector. Or here is [another example](https://en.wikipedia.org/wiki/HTML).\n", "\n", "The point is that every page has a similar structure, and it is this structure that we can use to automate the process of obtaining usable data from web pages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## But... before we go there\n", "OK... it's important to realise that there are situations where automation may not be the point. There are two extremes to this.\n", "\n", "**First**, you might just be interested in a particular dataset in a once-off fashion. Here's an example, [the New Zealand AED map](https://aedlocations.co.nz/), of how that can work.\n", "\n", "**Alternatively**, the website in question may have an API, which allows you to make queries in an organised way. Many companies publish APIs because they _want_ other websites to be able to link to theirs in an interactive way. Here's [an example](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). There are downsides to using an API. For a start, they might not be free. Or at any rate, the free version may be limited in some way. 
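\n", "\n", "In `python`, querying an API usually amounts to building a URL with a query string and sending an HTTP request. A minimal sketch using only the standard library (the endpoint, parameters, and key below are invented for illustration):\n", "\n", "```python\n", "import urllib.request\n", "from urllib.parse import urlencode\n", "\n", "# hypothetical API endpoint, query, and key -- real services document their own\n", "query = urlencode({\"q\": \"Muckish Mountain\", \"format\": \"json\"})\n", "req = urllib.request.Request(\n", "    \"https://api.example.com/v1/search?\" + query,\n", "    headers={\"Authorization\": \"Bearer MY-API-KEY\"},  # many APIs require a key\n", ")\n", "print(req.full_url)\n", "# urllib.request.urlopen(req) would send the request and return the response\n", "```\n", "\n", "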
In many cases you are rate limited in the number of searches allowed in a given time period, and also required to authenticate (i.e. you need an account with the service). Furthermore, even premium versions may not offer all the functionality you would like. On the other hand, APIs are generally well documented, and greatly simplify the business of regularly pulling data from websites of interest.\n", "\n", "A middle ground is that you might find others have written tools to interface with popular APIs. For example, see this list of [python flickr related projects on github](https://github.com/search?q=python+flickr)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting serious about web scraping\n", "Assuming none of the above options are available or useful for whatever reason, you may want to develop code to *scrape* data from websites. There may be additional advantages to going down this path:\n", "\n", "* no registration necessary\n", "* no rate limiting (although this comes with **serious caveats**)\n", "\n", "Ultimately, anything viewable on a website can be scraped. Having said that:\n", "\n", "* Check terms and conditions before scraping\n", "* Be nice: it's unfortunately easy to inadvertently launch a denial of service attack on a website if you hit it too often with too many requests\n", "* Websites change all the time, so scraping code that works one day might break at any moment; be prepared to have to continually update your code\n", "\n", "Getting into web scraping requires us to delve a bit more deeply into that tree structure of web pages that I have already mentioned. And that means..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ... getting to know the DOM\n", "Like everything else in computing, that tree-like structure has an acronym. 
It is the key idea behind the **Document Object Model** by which web pages are organised.\n", "\n", "The DOM is a language-independent API that allows the contents and appearance of a document to be inspected and changed programmatically. It also allows automatic navigation and search of the structure of a document.\n", "\n", "So, how does it work?\n", "\n", "### A tree\n", "Every document is organised as a nested set of *elements* in tree-like fashion.\n", "\n", "