#### GISC 420 T1 2022
# Web scraping in `python`
There's a lot of data out there on the web these days, much of it spatial in nature. How can we get those data into a form we can use in GIS or other geospatial tools? That's what this week's lab materials focus on.

To get there, we need to understand a little bit about how websites and web pages work.

## What's in a web page
OK... let's take a look at a [simple web page](https://southosullivan.com/gisc420/other/simple-web-page.html). This is an example of a very simple web page with an embedded map.

In the browser you are viewing it with, right-click and choose something like **View page source**.

You should see the following:

```
<!DOCTYPE html>

<html>
    <head>
        <title>A very simple web page</title>
    </head>

    <body>
        <h1>My first web page
        </h1>
        <p>This page demonstrates some <i>very</i> basic HTML for use in my classes.
        </p>
        <p>This page was written by <a href=
        "https://southosullivan.com/geodos">Prof David O'Sullivan</a>.
        </p>
        <p>David is from Ireland.  Sometimes, he misses Ireland's glorious summers, like the one in
        this picture.
        </p>
        <img src="muckish.jpg" width=640>
        <figcaption>Yes, that picture really was taken in August</figcaption>
        <p>That's Muckish Mountain in Donegal.  Here is a map.</p>
        <iframe src="https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d51581.682424507155!2d-8.022036861268594!3d55.097443300030626!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0000000000000000%3A0xd25d3a3fef4d920c!2sMt.+Muckish!5e0!3m2!1sen!2sus!4v1441395150485" width="400" height="300" frameborder="0" style="border:0" allowfullscreen></iframe>
        <p>The picture was taken from <a href=
"https://www.google.com/maps/@55.214797,-7.978438,3a,75y,205.64h,91.69t/data=!3m4!1e1!3m2!1ssdYiz3D8tz6atCSdR9200w!2e0?hl=en">roughly here</a>.  The google car got better weather than I did.
        </p>
    </body>
</html>
```

What you see will be syntax-coloured, though, to make it a little easier to understand.

The key thing from our present perspective is to note that there is *a tree-like structure* to this page.

At the 'trunk' of the tree is the whole page denoted by `<html></html>` **tags**.

Nested below this are the `<head></head>` and `<body></body>` **elements**.

Inside each of these are other elements: headings, paragraphs, an image, a figure caption, and something called an iframe, inside of which is the map.

This is an exceedingly simple web page. 
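To make the tree idea concrete, here is a minimal sketch that prints the nesting of a cut-down version of the page above. It uses `html.parser` from the Python standard library; the class name `OutlineParser` and the shortened HTML are my own, for illustration only, and this is not how a browser actually does it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain children
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_ELEMENTS:
            self.depth += 1  # whatever follows is nested inside this element

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth -= 1  # back up one level of the tree


# A shortened version of the simple page above
page_source = """<html>
  <head><title>A very simple web page</title></head>
  <body>
    <h1>My first web page</h1>
    <p>Some <i>very</i> basic HTML.</p>
    <img src="muckish.jpg" width=640>
  </body>
</html>"""

parser = OutlineParser()
parser.feed(page_source)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indentation in the printed outline mirrors the nesting: `head` and `body` sit inside `html`, and `i` sits inside `p`.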

## Something a bit more complicated
So now look at [this page](https://southosullivan.com/gisc422/interpolation/#/).

Again, let's take a closer look, and also examine it with your browser's **web console** tools.

**I'll talk you through this**.

## Summing things up
Web pages are assembled from several different pieces:

1. **Hyper Text Markup Language** (HTML) tags describe the structure of the page using hierarchically nested tags that define the *elements* which make up the page.
2. **Cascading Style Sheets** (CSS) determine how various elements in a page should be displayed.
3. Client-side code in **JavaScript** is often used to build interactive elements in the page, which respond to the page's readers.

There is a fourth 'hidden' element to most pages, which is the server-side database(s) used to populate typical pages with requested information. On human-facing pages, this part of the process is generally handled by a combination of client-side interface elements that obtain information from users (search terms and so on), and form these into query strings that are decoded into database queries on the server-side. Responses from the database populate the page that is eventually seen by a user.

[DuckDuckGo](https://duckduckgo.com), [Google Maps](https://maps.google.com) and [Let Me Google That For You](https://lmgtfy.com) provide simple examples of query strings.
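As a rough illustration of what a query string looks like, the sketch below builds one with the standard library and then decodes it again. The parameter name `q` is what DuckDuckGo search URLs appear to use, but treat that as an assumption; the search terms are made up.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://duckduckgo.com/"
params = {"q": "muckish mountain donegal"}  # assumed parameter name, made-up search

# Encode the parameters and attach them to the base URL after a '?'
url = base_url + "?" + urlencode(params)
print(url)

# Going the other way: pull the query string off the URL and decode it
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Everything after the `?` is the query string: `name=value` pairs, joined with `&` when there is more than one, with spaces and other awkward characters encoded.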

## So much for simple
The simple structure of the web pages we have looked at is about the limit of what can be built by hand by a semi-skilled human (like me...). Almost all web pages today are assembled in automated fashion. Take a look for example at [this page](https://www.pbtech.co.nz/product/BATPAN5254/Panasonic-K-KJ55MCC4TA-3hr-Quick-Charger--4-x-AA-e). Click on another page at the same site, and compare the code you can see in the web console inspector. Or here is [another example](https://en.wikipedia.org/wiki/HTML).

The point is that every page has a similar structure, and it is this structure that we can use to automate the process of obtaining useable data from webpages.

## But... before we go there
OK... it's important to realise that there are situations where automation may not be the point. There are two extremes to this.

**First**, you might just be interested in a particular dataset in a once-off fashion. Here's an example of how that can work: [the New Zealand AED map](https://aedlocations.co.nz/).

**Alternatively**, the website in question may have an API, which allows you to make queries in an organised way. Many companies publish APIs because they _want_ other websites to be able to link to theirs in an interactive way. Here's [an example](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). There are downsides to using an API. For a start, it might not be free. Or at any rate, the free version may be limited in some way. In many cases you are rate limited, meaning that only a certain number of searches are allowed in a given time period, and you are also required to authenticate (i.e. you need an account with the service). Furthermore, even premium versions may not offer all the functionality you would like. On the other hand, APIs are generally well documented, and greatly simplify the business of regularly pulling data from websites of interest.

A middle ground is that you might find that others have written tools to interface with popular APIs. For example, see this list of [Python Flickr-related projects on GitHub](https://github.com/search?q=python+flickr).

## Getting serious about web scraping
Assuming none of the above options are available or useful for whatever reason, you may want to develop code to *scrape* data from websites. There may be additional advantages to going down this path:

* no registration necessary
* no rate limiting (although this comes with **serious caveats**)

Ultimately, anything viewable on a website can be scraped. Having said that:

* Check terms and conditions before scraping
* Be nice: it's unfortunately easy to inadvertently launch a denial of service attack on a website if you hit it too often with too many requests
* Websites change all the time, so scraping code that works one day might break at any moment; be prepared to have to continually update your code
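One simple way to 'be nice' is to enforce a minimum gap between successive requests. The sketch below is only one possible approach; the `Throttle` class and the 0.2 second interval are my own illustrative choices (a real scraper would likely want a longer gap), and no actual requests are sent.

```python
import time


class Throttle:
    """Make sure successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock    # injectable, which makes the class easy to test
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        if self.last_call is not None:
            elapsed = self.clock() - self.last_call
            if elapsed < self.min_interval:
                # Too soon since the last call: pause for the remainder
                self.sleep(self.min_interval - elapsed)
        self.last_call = self.clock()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page_number in range(1, 4):
    throttle.wait()  # no pause on the first pass, a pause before the others
    print("would request page", page_number)
elapsed_total = time.monotonic() - start
print(f"three 'requests' took {elapsed_total:.2f} seconds")
```

Calling `throttle.wait()` immediately before each request keeps the pacing logic in one place, whatever else the loop does.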

Getting into web scraping requires us to delve a bit more deeply into that tree structure of web pages that I have already mentioned. And that means...

## ... getting to know the DOM
Like everything else in computing, that tree-like structure has an acronym. It is the key idea behind the **Document Object Model** (DOM), by which web pages are organised.

The DOM is a language-independent API for manipulating a document, so that its contents and appearance can be changed programmatically. It also allows automatic navigation and search of the document's structure.

So, how does it work?

### A tree
Every document is organised as a nested set of *elements* in tree-like fashion.

<img src='dom.png'>

**Source**: [DOM and JQuery](https://cs.wellesley.edu/~cs110/reading/DOM-JQ.html)

This nested structure, and the fact that all the pages on a given site of a given kind will have the same structure, makes it possible to write code to *traverse* the tree and extract the elements of interest, including any data they may contain.

Modern browsers *parse* HTML into the DOM format, and then render the webpage. In web scraping we use the same information the browser uses to render the page to figure out which pieces of the page we want to scrape.
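Because the DOM is language independent, Python has its own small implementation of it in the standard library, `xml.dom.minidom`. The sketch below uses it to navigate and search a tiny document. Note the assumption: `minidom` only accepts well-formed XML, so the fragment here is deliberately tidy, and real-world HTML generally needs a more forgiving parser.

```python
from xml.dom import minidom

# A tidy, well-formed fragment (an assumption: real pages are rarely this clean)
doc = minidom.parseString(
    "<html><head><title>A very simple web page</title></head>"
    "<body><h1>My first web page</h1><p>Here is a map.</p></body></html>"
)

# Search: find elements anywhere in the tree by tag name
title = doc.getElementsByTagName("title")[0]
title_text = title.firstChild.data
print(title_text)

# Navigate: move up to the parent, or down to the children
parent_tag = title.parentNode.tagName
print(parent_tag)

body = doc.getElementsByTagName("body")[0]
body_children = [child.tagName for child in body.childNodes]
print(body_children)
```

The same ideas (search by tag name, move to parents and children) are what scraping tools rely on, whatever library is used.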

### A tree with labels
The HTML elements in the DOM tree usually have three components. 

First there are **tags** which designate the beginning and end of the element. These are the opening `<tagname>` and closing `</tagname>` chunks we have seen all over the web page source material we have been looking at.

Second and third, tags have associated with them **attributes** and **content**. The content is all the material (which may include additional nested elements) between the tag start and end markers. The attributes are additional information describing the tag (in effect, data about the tag itself).
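As a small illustration of the three components, the sketch below picks out the tag, attributes and content of an anchor element from the simple page we looked at earlier. It uses the standard library `html.parser`; the `LinkCollector` class is a hypothetical helper written just for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes and text content of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the <a> element we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs
            self._current = {"attributes": dict(attrs), "content": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["content"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


snippet = (
    '<p>This page was written by '
    '<a href="https://southosullivan.com/geodos">Prof David O\'Sullivan</a>.</p>'
)

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

Here the tag is `a`, the single attribute is `href`, and the content is the link text between the opening and closing tags.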

Some common HTML tags.

 Tag   | Usage
 --- | -----
`<html>` | Designates a HTML document
`<head><body>` | The header and main body of a document
`<div>` | A general container for content used to structure a document into sections
`<span>` | A container for inline material, often used to mark particular items
`<h1><h2><h3><h4><h5><h6>` | Headings (6 levels)
`<ol><ul><li>` | Lists of various kinds
`<a>` | Anchors - most often used for hyperlinks
`<p>` | Paragraphs

### More labels
In addition to the tags themselves, and any inbuilt attributes they might have (e.g. `<a href="..">`, where `href` is an expected attribute of an anchor tag), most web pages have additional `class` or `id` attributes that web designers use in tandem with CSS to control the format of the web page.

We aren't too concerned about the details of this (if you want to know more, then you'll need to learn web design...). For our purposes, the important thing about the CSS class and id features is that they often make it easier to search for particular items of interest on a page.

So the basic idea is to investigate the web page you are interested in scraping, using the web tools, particularly the inspector, to identify the particular combination of tag, class and id selectors associated with the information you are interested in. Hopefully there is a pattern, and hopefully that will enable automated collection of the data.

## Enough already, show us some code!
OK... to do this we need to install a new module.

In [None]:
%pip install beautifulsoup4

[`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python module that makes it fairly easy to parse the DOM of a web page, and thus assemble the information of interest in an automated way. Once installed, we need to import it:

In [None]:
from bs4 import BeautifulSoup

Now we need to point it at some content. To do this we use another module, called [`requests`](http://docs.python-requests.org/en/latest/user/quickstart/):

In [None]:
import requests

`requests` lets you fetch whatever is served at a URL. It's pretty easy to use.

In [None]:
simplepage = requests.get('https://southosullivan.com/gisc420/other/simple-web-page.html')
simplepage.text

So the content of the page at the URL requested is now contained in a text string, and we can analyse it at our leisure.

To do anything serious we probably need to request not just fixed web pages, but ones with associated search parameters. This is also easy with `requests`.

In [None]:
raywhite = requests.get('https://raywhite.co.nz/search/', params={'regionSelect': 7, 'districtSelect': 20})

Don't try to look at all the text (it's a big page):

In [None]:
raywhite.text[:5000]

How did I figure out what parameters to send the site? These are encoded in the extended URL that you see when you've done a search on a website. To figure this out you have to do some sleuthing on the web page. 

You can check that the correct URL was sent:

In [None]:
raywhite.url
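If you want to see how `requests` turns a `params` dictionary into a URL without actually hitting the website, one option is to *prepare* a request and not send it. This is a minimal sketch using the same search parameters as above; nothing goes over the network.

```python
import requests

# Build the request without sending it, so the website is never contacted
prepared = requests.Request(
    "GET",
    "https://raywhite.co.nz/search/",
    params={"regionSelect": 7, "districtSelect": 20},
).prepare()

# The parameters have been encoded into a query string on the end of the URL
print(prepared.url)
```

This is handy while you are still experimenting with parameter names, since you can check the URL as often as you like without adding load to the site.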

### Now pick the data apart
I had a look at the Ray White web page before, and figured out where the addresses are in the structure of the pages. I can use this information to extract only that information from the mess of nonsense in the raw HTML above.

In [None]:
result = BeautifulSoup(raywhite.text, "html.parser")

This has returned a `BeautifulSoup` object, which holds the HTML parsed into a tree, much as a web browser parses it.

In [None]:
type(result)

To see how smart it is, we can ask it to make the mess we saw above prettier, and easier to read. Note that I am only asking for the first 5000 characters to save space.

In [None]:
print(result.prettify()[:5000])

The main thing we need to be able to do is to use `BeautifulSoup` to select only the elements we are interested in. It can make selections using HTML tags, HTML attributes or CSS selectors (named classes or ids). So... for example, to get all paragraphs on a page...

In [None]:
result.find_all("p")

We can refine searches by specifying classes and other aspects of the tags of interest. Again, this is based on sleuthing on the web page with the developer tools in the browser.

In [None]:
result.find_all("h5", class_ = "card-title font-lato font-bold text-grey")

Now we are getting warm! 

In [None]:
for element in result.find_all("h5", class_ = "card-title font-lato font-bold text-grey"):
    print(element.text)

Looking a bit more closely at the content, we might do even better with

In [None]:
for element in result.find_all("h5", "card-title font-lato font-bold text-grey"):
    print(element.text.split("\n")[1].strip())
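Since the live site may well have changed by the time you run this, here is a minimal offline sketch of the same extraction. The HTML fragment and the addresses in it are invented for illustration (only the class names are copied from the search above), so treat the markup as an assumption, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: made up for illustration, reusing the class names above
sample_html = """
<div class="card">
  <h5 class="card-title font-lato font-bold text-grey">
    1 Example Street, Te Aro
    <span>Wellington</span>
  </h5>
</div>
<div class="card">
  <h5 class="card-title font-lato font-bold text-grey">
    22 Sample Road, Karori
    <span>Wellington</span>
  </h5>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

sample_addresses = []
for element in soup.find_all("h5", class_="card-title font-lato font-bold text-grey"):
    # element.text starts with a newline, so the address is on the second line
    sample_addresses.append(element.text.split("\n")[1].strip())
print(sample_addresses)

# A CSS selector that matches on a single class is less brittle than
# matching the full class string
matches = soup.select("h5.card-title")
print(len(matches))
```

Working on a saved or made-up fragment like this is a useful way to develop the extraction step before pointing the code at the real website.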

So, that is the first page of results. How do we get more? Going back to the website, after a bit more sleuthing we can see how the URL for a series of pages of results is structured, with a `result_p` parameter added to the URL.

In [None]:
import time # Now there's a weird command to issue...

And here we are:

In [None]:
addresses = []
# Notice we have a loop starting from 1 and going to 5, not 0 to 4
for i in range(1, 6):
    page = requests.get("https://raywhite.co.nz/search/", 
                        params = {"regionSelect": 7, "districtSelect": 20, "result_p": i})
    # page = requests.get("https://raywhite.co.nz/search/?regionSelect=7&districtSelect=20&result_p=" + str(i))
    print(page.url) # just for some reassurance
    pagecontent = BeautifulSoup(page.text, "html.parser")  # name the parser, as before
    # add the results to our empty list
    for e in pagecontent.find_all("h5", "card-title font-lato font-bold text-grey"):
        addresses.append(e.text.split("\n")[1].strip())
    time.sleep(1)

In [None]:
addresses

That, believe it or not, is more or less all you need to know to scrape websites.

For fun, you might see if you can do a similar search on one of the other property websites. Or maybe change the example above to look for two-bedroom rentals.

### More
There are other web scraping tools out there. One that is popular is [`scrapy`](https://scrapy.org/), a framework for building web crawlers and scrapers, so it handles more of the complexities of multiple searches and so on.

But really that's it! Potentially a useful tool for research projects.