
Jacob Learns Scraping. V1

I like to say I like stats, but sometimes I wonder if I don’t have enough exposure to really mean it.

I’ve done some report programming, but exclusively at work. It’s fun enough, but it always involves making reports based on someone else’s requirements, using their existing database with their curated metrics. Having never really studied stats or done anything with them on my own, I can’t say I’m very skillful. I don’t even think it’s accurate to say that it’s a hobby of mine.

Still, graphs are beautiful, and uncovering meaningful insights in data is like an archeological dig. Pulling data and displaying it well scratches the same itch as playing Sudoku. If I want to see whether I actually enjoy stats, I should try to become more skillful in this area. This reminds me of when I first started programming. Not the first class I took, but all the time I spent at home working on it afterwards and on top of that class.

Let’s bring it back to that. Mostly I wrote a lot of Python back then, long since discarded. I hear Python is pretty good at web stuff, so I’m going to try to re-learn Python, focusing on gathering my own data using web scraping. A quick Google search led me here. I recognized “FreeCodeCamp,” so I’ll use this article since I don’t know anything about scraping. One of the things I really enjoyed when I was learning this stuff the first time was using emacs and bash. Since that tutorial starts by throwing some *nix commands out there, let’s do this on Ubuntu! Conveniently now available on the Windows 10 Store for free.

As easy as it seems, once Ubuntu starts up we can do some simple house-keeping immediately: let’s update and upgrade anything that came out of the box.

sudo apt update -y
sudo apt dist-upgrade -y

Then we’ll get my favorite CLI text editor, emacs.

sudo apt install emacs25

We should also check our Python version before we get into things.

python --version

Looks ok.

The last thing we need before we jump right in is some kind of goal. I work in gaming, so let’s pull some data about games.

I want to make a spreadsheet that looks like this:

Starting the tutorial, we need to run two commands for some more installs.

easy_install pip

pip install BeautifulSoup4

Both of those threw errors for me since I was missing stuff the first run through, but apt is usually pretty good at letting you know what to type to get what you need.

Following that tutorial for a bit we get a file that looks like this:

#import libraries
import urllib2
from bs4 import BeautifulSoup

#specify the url
quote_page = 'http://www.bloomberg.com/quote/SPX:IND'

#query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)

#parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# Take out the <div> of name and get its value
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print name

# get the index price
price_box = soup.find('div', attrs={'class': 'price'})
price = price_box.text
print price

The author says “When you run the program, you should be able to see that it prints out the current price of the S&P 500 Index.” So, at this point I try to run my script.

Well, that didn’t quite work.
AttributeError: 'NoneType' object has no attribute 'text'
Looks a lot like a nullref from C#. Searching that error, it basically is a nullref: whatever we were accessing was, instead of what we thought it was, None. Which I guess is Python-ese for null.
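A minimal way to guard against that, sketched in Python 3 syntax (the tutorial’s code is Python 2, and the HTML snippet here is hypothetical — the real Bloomberg page has different markup): find() quietly returns None when nothing matches, so check before touching .text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in snippet; note there is no <h1 class="name"> here
html = '<h1 class="title">S&amp;P 500</h1>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so check before using the result
name_box = soup.find('h1', attrs={'class': 'name'})
if name_box is None:
    print('no match - the page has no <h1 class="name">')
else:
    print(name_box.text.strip())
```

With a guard like this, the failure mode becomes a readable message instead of an AttributeError deep in the script.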

soup.find('h1', attrs={'class': 'name'}) returned None then. I want to know what is in the ‘soup’ var. This leads us to the Beautiful Soup documentation. I look down the nav bar on the left of the page, and “Making the Soup” stands out to me as a good place to start. Reading for a bit, it looks like the ‘soup’ var should contain some kind of tree of tags and strings and junk. Probably printable, so let’s just try printing the whole thing and seeing what it spits out. Thinking back on my original Python era of programming, I would usually keep the live interpreter running in a tab for tests like this. Let’s throw everything in there so we can play with this more interactively.

Well. I suppose I am… That spits out a garbage-y html file source at us with some more stuff about completing a CAPTCHA at the bottom. This is not happening. Let’s try another page and maybe it’ll be more friendly. Let’s try the No Sleep homepage!

That takes so long I have time to get a screenshot of it and crop the image for this post. It does finish successfully though, and printing it gets us… Just an absolute tonne of source code. Probably not a captcha page though, which is cool. Now our soup is made, let’s inspect some page source and see if we can grab anything specific out of it.

Trying to grab the title of the most recent blog on this page, we can find it here in the developer console. That does not look directly searchable, but it is an a tag inside an h3 with a class that includes the substring “blog-title”, so we might be able to get somewhere with that. Back to the docs! After skimming most of the “Searching the tree” section, it looks like there is a find(), a find_all(), a shortcut for find_all() which is to just call the soup object directly, and some kind of multi-object find called select() that takes strings representing common CSS selectors. Very weird, but cool. Let’s start by just calling find() with h3 and the exact class value we want, just like the demo from the tutorial does.
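Those three search styles can be compared side by side on a tiny stand-in page (this markup is a simplified, hypothetical version of the real No Sleep title cards):

```python
from bs4 import BeautifulSoup

# A tiny stand-in page; the real No Sleep markup is more complex
html = """
<h3 class="gdlr-core-blog-title"><a href="/blog/a">A</a></h3>
<h3 class="gdlr-core-blog-title"><a href="/news/b">B</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('h3')                              # first match, or None
every = soup.find_all('h3')                          # ResultSet of all matches
via_css = soup.select('h3.gdlr-core-blog-title a')   # CSS selector syntax

print(first.a.text)                   # A
print(len(every))                     # 2
print([a['href'] for a in via_css])   # ['/blog/a', '/news/b']
```

The select() form is handy precisely for cases like this one, where the thing you want is “an a inside an h3 with some class.”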

>>> blog_title_find_results = soup.find('h3', attrs={'class': 'gdlr-core-blog-title'})
>>> print blog_title_find_results
<h3 class="gdlr-core-blog-title gdlr-core-skin-title"><a href="https://www.nosleep.io/news/radio-violence-is-entering-early-access/">Radio Violence is Entering Early Access</a></h3>

Cool, that did something… Wrong item though. We didn’t look around enough in the page source, apparently all the title cards use that same class. Maybe let’s try getting all of them?

>>> blog_title_find_results = soup.find_all('h3', attrs={'class': 'gdlr-core-blog-title'})
>>> print blog_title_find_results

Sure enough, looks like all the title cards come up at once. Something does stand out in the results though, looks like “blogs” have /blog/ in the url. We should be able to filter on that. We need to know what type of object find_all is actually returning though. In the soup docs they called type(someObject) so let’s try that here.

>>> type(blog_title_find_results)
<class 'bs4.element.ResultSet'>

Neat, so it’s some kind of internal bs4 collection. At this point I remember one of the things I used to do was call help() from the live interpreter and often get better documentation than the remote site. Calling help('bs4') absolutely returns a great resource. I read that, but nothing jumped out at me. Then I tried this:

>>> h3_results = blog_title_find_results
>>> h3_results.find('a', attrs={'href': re.compile('blog')})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/john/.local/lib/python2.7/site-packages/bs4/element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
>>> import re

I cannot use find on this collection even though it’s inside of bs4. At this point I notice, since I have the docs open on the side of my terminal, that I need to import re in order to use regex, which I think I’m going to want. If this is just a collection, I should be able to iterate it and compare elements. Better yet, I think Python implements map and filter by default. I’m gonna look that up real quick. Yes it does. Filtering h3_results with a regex should do the trick of getting us all the blog titles then. Takes me about 30 seconds to realize I’m just trying to find out if a string contains a substring.
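That substring check plus filter() combination can be sketched on plain strings first, before any soup is involved (these URLs are hypothetical stand-ins). One Python 3 wrinkle: filter() returns a lazy iterator there, so you wrap it in list() to see the results — in Python 2 it returns a list directly.

```python
# filter() keeps items where the predicate is truthy; for two strings,
# 'in' is the substring test
urls = [
    'https://www.nosleep.io/news/radio-violence-is-entering-early-access/',
    'https://www.nosleep.io/blog/functional-functional-programming/',
    'https://www.nosleep.io/blog/a-candle-in-the-dark/',
]
just_blogs = list(filter(lambda u: '/blog/' in u, urls))  # list() needed in Python 3
print(just_blogs)
```

The same thing is often written as a list comprehension, `[u for u in urls if '/blog/' in u]`, which many people find easier to read than filter-plus-lambda.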

>>> just_blogs = filter(lambda t: ".io/blog/" in t.a, h3_results)
>>> print just_blogs

That did not work. Obviously my predicate returned false every time, but why? Let’s inspect individual elements.

>>> h3_results[0].a
<a href="https://www.nosleep.io/news/radio-violence-is-entering-early-access/">Radio Violence is Entering Early Access</a>
>>> h3_results[1].a
<a href="https://www.nosleep.io/news/no-sleep-streaming/">No Sleep Streaming</a>
>>> h3_results[2].a
<a href="https://www.nosleep.io/news/no-sleep-at-eglx/">No Sleep At EGLX</a>
>>> h3_results[3].a
<a href="https://www.nosleep.io/blog/functional-functional-programming/">\u201cFunctional\u201d Functional Programming</a>
>>> ".io/blog/" in h3_results[3].a
False

Only a little surprising, because we kind of already knew it would do that from the last thing, but a little confusing to me right now. With a simpler substring?

>>> "blog" in h3_results[3].a
False #Apparently not
>>> 'blog' in h3_results[3].a
False #Python sometimes uses single quotes for strings?
>>> len(h3_results[3].a)
1 #I am proud momentarily that I remembered how to check length of something in python
>>> type(h3_results[3].a)
<class 'bs4.element.Tag'>

That’s the basic object for Beautiful Soup, so we have a type mismatch when we’re trying to grab a substring. To the docs again to look up properties of that object! Looks like I can grab specific attributes out of it with an accessor.
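The dict-style accessor looks like this in isolation (Python 3 here, and the link is a hypothetical stand-in). Tags also have a .get() method, which returns None for a missing attribute instead of raising:

```python
from bs4 import BeautifulSoup

# Hypothetical link; .a grabs the first <a> tag in the parsed tree
html = '<a href="https://www.nosleep.io/blog/example/">Example</a>'
tag = BeautifulSoup(html, 'html.parser').a

# A Tag supports dict-style access to its HTML attributes
href = tag['href']        # raises KeyError if the attribute is missing
safe = tag.get('title')   # .get() returns None instead of raising
print(type(href).__name__, 'blog' in href, safe)
```

In Python 3 the attribute value comes back as a plain str rather than the unicode type the transcript shows, but the substring test works the same way.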

>>> type(h3_results[3].a)
<class 'bs4.element.Tag'>
>>> h3_results[3].a['href']
u'https://www.nosleep.io/blog/functional-functional-programming/'
>>> type(h3_results[3].a['href'])
<type 'unicode'>
>>> 'blog' in h3_results[3].a['href']
True

There we go. Let’s filter again.

>>> just_blogs = filter(lambda t: 'blog' in t.a['href'], h3_results)
>>> just_blogs # I realize that in the live interpreter I don’t need to type print all the time
[<h3 class="gdlr-core-blog-title gdlr-core-skin-title"><a href="https://www.nosleep.io/blog/functional-functional-programming/">\u201cFunctional\u201d Functional Programming</a></h3>, <h3 class="gdlr-core-blog-title gdlr-core-skin-title"><a href="https://www.nosleep.io/blog/a-candle-in-the-dark/">A Candle in the Dark</a></h3>, <h3 class="gdlr-core-blog-title gdlr-core-skin-title"><a href="https://www.nosleep.io/blog/from-code-to-convention-an-eglx-retrospective/">From Code to Convention: an EGLX Retrospective</a></h3>]

That is all the blogs, spitting out of that bs4 collection with the h3 tags still attached. Which means…

>>> just_blogs[0].a.contents
[u'\u201cFunctional\u201d Functional Programming']
>>> just_blogs[0].a.contents.text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'text'
>>> just_blogs[0].a.text
u'\u201cFunctional\u201d Functional Programming'

I’m just playing a formatting game now. We got a value out of the soup, and some more stuff besides. We ended up with all the blogs. Next time we can finish that tutorial and store the results somewhere.
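The whole interpreter session boils down to a few lines once you know the moves. Here is a Python 3 sketch of the pipeline against a canned snippet (the markup and titles are hypothetical stand-ins for the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the No Sleep homepage title cards
html = """
<h3 class="gdlr-core-blog-title"><a href="https://www.nosleep.io/news/x/">News Post</a></h3>
<h3 class="gdlr-core-blog-title"><a href="https://www.nosleep.io/blog/y/">Blog Post</a></h3>
"""
soup = BeautifulSoup(html, 'html.parser')

# Find all the title cards, keep only the ones whose link contains /blog/,
# then pull the link text out of each surviving tag
cards = soup.find_all('h3', attrs={'class': 'gdlr-core-blog-title'})
blog_titles = [h3.a.text for h3 in cards if '/blog/' in h3.a['href']]
print(blog_titles)  # ['Blog Post']
```

Swap the canned string for a fetched page and this is essentially the whole scraper so far.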

I plan on continuing to write episodes of this series in this kind of “stream of consciousness” style side by side adventure as I try to write a (hopefully) useful web-crawling scraper that can be used to gather useful market data about video games for No Sleep to better figure out which projects to take on. I don’t actually know how to do any of that, so I get to learn stuff and play with cool tools and learn about stats, and you get to see how I work on things I don’t understand and how I learn to do this stuff. Would you rather this was done on a live stream or something? I wonder if that would be just too much of me reading docs.

Jacob NoSleep
About the author

Designer, developer, contract-doer, and co-founder of No Sleep Software. I spend a lot of time thinking about systems design and staring at Visual Studio.
