{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### GISC 420 T1 2022\n", "# Web scraping in `python`\n", "There's a lot of data out there on the web these days, much of it spatial in nature. How can we get those data into a form we can use in GIS or other geospatial tools? That's what this week's lab materials focus on.\n", "\n", "To get there, we need to understand a little bit about how websites and web pages work.\n", "\n", "## What's in a web page\n", "OK... let's take a look at a [simple web page](https://southosullivan.com/gisc420/other/simple-web-page.html). This is an example of a very simple web page.\n", "\n", "In the browser you are viewing it with, do something like right-click **View page source**.\n", "\n", "You should see something like the following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "
\n", "<html>\n", "\n", "<head>\n", "</head>\n", "\n", "<body>\n", "\n", "<h1>...</h1>\n", "\n", "<p>\n", "This page demonstrates some very basic HTML for use in my classes.\n", "</p>\n", "\n", "<p>\n", "This page was written by Prof David O'Sullivan.\n", "</p>\n", "\n", "<p>\n", "David is from Ireland. Sometimes, he misses Ireland's glorious summers, like the one in\n", "this picture.\n", "</p>\n", "\n", "<figure>\n", "<img src=\"...\">\n", "<figcaption>\n", "That's Muckish Mountain in Donegal.\n", "</figcaption>\n", "</figure>\n", "\n", "<p>\n", "Here is a map.\n", "</p>\n", "\n", "<iframe src=\"...\"></iframe>\n", "\n", "<p>\n", "The picture was taken from roughly <a href=\"...\">here</a>. The google car got better weather than I did.\n", "</p>\n", "\n", "</body>\n", "</html>\n", "
\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although what you see will be syntax-coloured, to help you understand it a little better.\n", "\n", "The key thing from our present perspective is to note that there is *a tree-like structure* to this page.\n", "\n", "At the 'trunk' of the tree is the whole page denoted by `<html>` **tags**.\n", "\n", "Nested below this are the `<head>` and `<body>` **elements**.\n", "\n", "Inside each of these are other elements: headings, paragraphs, an image, a figure caption, and something called an iframe, inside of which is the map.\n", "\n", "This is an exceedingly simple web page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Something a bit more complicated\n", "So now look at [this page](https://southosullivan.com/gisc422/interpolation/#/).\n", "\n", "Let's again take a closer look, and also examine it with your browser's **web console** tools.\n", "\n", "**I'll talk you through this**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summing things up\n", "Web pages are assembled from several different pieces:\n", "\n", "1. **Hyper Text Markup Language** (HTML) tags describe the structure of the page using hierarchically nested tags that define the *elements* which make up the page.\n", "2. **Cascading Style Sheets** (CSS) determine how various elements in a page should be displayed.\n", "3. Client-side code in **JavaScript**, which is often used to build interactive elements in the page that respond to the page's readers.\n", "\n", "There is a fourth 'hidden' element to most pages, which is the server-side database(s) used to populate typical pages with requested information. On human-facing pages, this part of the process is generally handled by a combination of client-side interface elements that obtain information from users (search terms and so on), and form these into query strings that are decoded into database queries on the server-side. 
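\n", "\n", "For example, a search form's input might be encoded into a query string appended to a URL. Here is a minimal sketch in `python` (the endpoint and parameter names are invented for illustration):\n", "\n", "```python\n", "from urllib.parse import urlencode\n", "\n", "# hypothetical search endpoint and parameters\n", "params = {\"q\": \"muckish mountain\", \"num\": 10}\n", "url = \"https://example.com/search?\" + urlencode(params)\n", "print(url)  # https://example.com/search?q=muckish+mountain&num=10\n", "```\n", "\n", "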
Responses from the database populate the page that is eventually seen by a user.\n", "\n", "[DuckDuckGo](https://duckduckgo.com), [Google maps](https://maps.google.com), and [Let Me Google That For You](https://lmgtfy.com) provide simple examples of query strings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## So much for simple\n", "The simple structure of the web pages we have looked at is about the limit of what can be built by hand by a semi-skilled human (like me...). Almost all web pages today are assembled in automated fashion. Take a look, for example, at [this page](https://www.pbtech.co.nz/product/BATPAN5254/Panasonic-K-KJ55MCC4TA-3hr-Quick-Charger--4-x-AA-e). Click on another page at the same site, and compare the code you can see in the web console inspector. Or here is [another example](https://en.wikipedia.org/wiki/HTML).\n", "\n", "The point is that every page has a similar structure, and it is this structure that we can use to automate the process of obtaining usable data from web pages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## But... before we go there\n", "OK... it's important to realise that there are situations where automation may not be the point. There are two extremes to this.\n", "\n", "**First**, you might just be interested in a particular dataset in a once-off fashion. Here's an example, [the New Zealand AED map](https://aedlocations.co.nz/), of how that can work.\n", "\n", "**Alternatively**, the website in question may have an API, which allows you to make queries in an organised way. Many companies publish APIs because they _want_ other websites to be able to link to theirs in an interactive way. Here's [an example](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). There are downsides to using an API. For a start, they might not be free. Or at any rate, the free version may be limited in some way. 
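\n", "\n", "In `python`, querying an API usually amounts to building a URL with a query string and sending an HTTP request. A minimal sketch using only the standard library (the endpoint, parameters, and key below are invented for illustration):\n", "\n", "```python\n", "import urllib.request\n", "from urllib.parse import urlencode\n", "\n", "# hypothetical API endpoint, query, and key -- real services document their own\n", "query = urlencode({\"q\": \"Muckish Mountain\", \"format\": \"json\"})\n", "req = urllib.request.Request(\n", "    \"https://api.example.com/v1/search?\" + query,\n", "    headers={\"Authorization\": \"Bearer MY-API-KEY\"},  # many APIs require a key\n", ")\n", "print(req.full_url)\n", "# urllib.request.urlopen(req) would send the request and return the response\n", "```\n", "\n", "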
In many cases you are rate limited in the number of searches allowed in a given time period, and also required to authenticate (i.e. you need an account with the service). Furthermore, even premium versions may not offer all the functionality you would like. On the other hand, APIs are generally well documented, and greatly simplify the business of regularly pulling data from websites of interest.\n", "\n", "A middle ground is that you might find others have written tools to interface with popular APIs. For example, see this list of [python flickr related projects on github](https://github.com/search?q=python+flickr)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting serious about web scraping\n", "Assuming none of the above options are available or useful for whatever reason, you may want to develop code to *scrape* data from websites. There may be additional advantages to going down this path:\n", "\n", "* no registration necessary\n", "* no rate limiting (although this comes with **serious caveats**)\n", "\n", "Ultimately, anything viewable on a website can be scraped. Having said that:\n", "\n", "* Check terms and conditions before scraping\n", "* Be nice: it's unfortunately easy to inadvertently launch a denial of service attack on a website if you hit it too often with too many requests\n", "* Websites change all the time, so scraping code that works one day might break at any moment; be prepared to have to continually update your code\n", "\n", "Getting into web scraping requires us to delve a bit more deeply into that tree structure of web pages that I have already mentioned. And that means..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ... getting to know the DOM\n", "Like everything else in computing, that tree-like structure has an acronym. 
It is the key idea behind the **Document Object Model** by which web pages are organised.\n", "\n", "The DOM is a language-independent API that allows the contents and appearance of a document to be inspected and changed programmatically. It also allows automatic navigation and search of the structure of a document.\n", "\n", "So, how does it work?\n", "\n", "### A tree\n", "Every document is organised as a nested set of *elements* in tree-like fashion.\n", "\n", "