
Jacob Learns V1.1

Welcome to episode two! That means there was one before this; if you haven't read it yet, you can find it here. As a recap:

S1E1:

  1. We tried to follow this tutorial and got about halfway through.
  2. We set a goal to fill a spreadsheet with console information we don't really have access to.
  3. We got our development environment of choice set up (Ubuntu through Windows, Python 2.7.12, CLI emacs).
  4. We navigated around some roadblocks in said tutorial and pulled some interesting information out of the No Sleep homepage instead.

Cool. Not super helpful to our goals, however. We should look at some content that could theoretically feed into our target data set. Let’s look at some platform pages and make some guesses after deconstructing the No Sleep homepage.

Grabbing game specific pages off PlayStation, Nintendo, and Microsoft Store websites we can get an idea about the layouts and content of the average store page:

A couple things stand out upon review of all those pages:

  • They are all location specific: MS and PS have /en-ca/ in the URL, and Nintendo just has /ca/.
  • PlayStation prominently displays the price, sale price, number of reviews, release date, sale duration (didn't think of that before), sale conditions (or that), and platforms of distribution (oh dear, I don't even want to consider that).
  • The Nintendo page includes price, platform of distribution, and that's about it. (The US Store Page for Mario Odyssey comes with a cancer warning. So there's also that.)
  • Microsoft offers us the price, number of reviews, release date, and category (which is nice).
  • The only game on sale was Yakuza Kiwami, so it seems likely that similar sale information would be shown on the other sites as well, if the games I clicked first had also been on sale.
  • A quick review of the source code for each page:
    • PlayStation has human-readable tags, making it likely the easiest to scrape.
    • Microsoft has source that looked generated, but it included readable key phrases.
    • Nintendo's looked mostly intended for internal tools, or exclusively for computers to read.

What exactly is going on in California that this warning was included?

For all my initial judgments, I tried to find Price followed by Title. I thought those two items were the most essential, and in the case of Nintendo, that was pretty much all they had. Obviously, Nintendo will be the most difficult (I say, not actually knowing the final outcomes). We're going to try to grab data off PlayStation first, since they seem the easiest based on my one sample.

We start off by repeating the beginning of the tutorial code, but with different values.

#import libraries
import urllib2
from bs4 import BeautifulSoup

#specify the url, in this case the one for Yakuza Kiwami from PlayStation store
url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'

#query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#demo explicitly calls "html" parser, but documentation said that's the default, so we won't call it explicitly
soup = BeautifulSoup(page)

game_name =

At this point I can't recall what I'm allowed to call to search soup objects, so I refer back to the docs… This will help. We can use the same format we used to search last time, with both "name" and "attrs" (attributes), as shown by the signature of .find(): find(name, attrs, recursive, string, **kwargs). Since we're trying to find the inner value of <h2 class="pdp__title" style="direction:ltr">Yakuza Kiwami</h2>, the name is h2 and the attribute is class with a value of pdp_title. First run of this code looks like this:
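Spelled out, that first attempt is something like this (a sketch, using the class value exactly as I just wrote it):

game_name = soup.find('h2', attrs={'class':'pdp_title'}) #search by tag name plus the class attribute
print game_name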

And it gets me an annoying parser warning and None back as the name. Trying game_name = soup.find("pdp_title") also returns None, and trying game_name = soup.h2 throws me a bunch of errors.

Traceback (most recent call last):
  File "psSingle.py", line 9, in <module>
    page = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>

Looks like we shouldn’t repeatedly try to scrape the same page, like you do when you’re testing code. This won’t do since we really need to play with this a little bit. We need some way to test our code and learn how this works, with realistic data, without hitting security or protocol errors. I think we can scrape once, and just dump the whole soup object into a file locally, and then pull from that to test our code.
To do that we’ll need to open our URL, read the contents into soup, and then write soup out to a file (maybe even just write the page source out to a file).

#import libraries
import urllib2
from bs4 import BeautifulSoup

#specify the url, in this case the one for Yakuza Kiwami from PlayStation store
url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'

#query the website and return the html to the variable 'page'
page = urllib2.urlopen(url)

#demo explicitly calls "html" parser, but documentation said that’s the default, so we won't call it explicitly
soup = BeautifulSoup(page)

So that part stays the same. Then we look up how to write to a file using python. I know that if you open a file, you generally need to close it again, and those keywords are curiously absent from that W3Schools page (normally so good, tsk!). Looking around a little more I come across this page with this nice secure syntax:

with open('output.txt', 'w') as file: # Use file to refer to the file object
    file.write('Hi there!')

That’s nice, it’ll automagically handle closing our file when we’re done with it, or if we error out. Much safer. We should also test code we grab off the internet, no matter how reputable it seems (versions change sometimes). Tossing that snippet into the live interpreter we get:

>>> with open('foo.txt', 'w') as f:
...     f.write('New file!')
...
>>> #no error, I guess it worked.

Let’s try with whatever comes out of our urllib2.urlopen call.

import urllib2

url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'
page = urllib2.urlopen(url)

with open('psSingle.html', 'w') as f:
    f.write(page)

That throws an error.

Traceback (most recent call last):
File "writeLocalSource.py", line 7, in <module> f.write(page)
TypeError: expected a string or other character buffer object

As flexible as Python may be, it doesn't like being handed any old object to write out. OK, we need to make sure it's trying to write strings at least. What are we getting back from that call, and how do we make it strings?
Apparently you can read from the returned object as if it is a file itself, and then write the contents from that.

import urllib2

url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'
page = urllib2.urlopen(url)

with open('psSingle.html', 'w') as f:
    f.write(page.read())

Let’s try that.

Oh goodie, my favorite. No error, no issue, no results. With nothing else to go on, let’s try that again in the live interpreter and see what is happening.
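Something along these lines (a sketch of that session; the real output of the read is far too long to show here):

>>> import urllib2
>>> page = urllib2.urlopen('https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100')
>>> contents = page.read()   #read the response like a file, into one big string
>>> contents[:200]           #peek at the start of it: a wall of markup and script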

That’s a lot of jumbled text in there. Can we write it out of the live interpreter?
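Writing it out from the same session is just the with/open pattern from before (a sketch, continuing with the contents variable from above):

>>> with open('psSingle.html', 'w') as f:   #dump the raw page source into a local file
...     f.write(contents)
...
>>> #no output again, so presumably it worked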

Weird. But OK, now we have this file psSingle.html that holds the source contents for the page we want to scrape, so we don't have to repeatedly call the actual website. That will probably make this easier… if we can open it and make soup from it again. Right from the soup docs: "To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:"

#import libraries
import urllib2
from bs4 import BeautifulSoup

#specify the url
#url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100' #not using this, but it's nice to know where the data came from

#query the website and return the html to the variable 'page'
#page = urllib2.urlopen(url) #we're not using the url to open the page anymore, so this isn't happening either

#open our html file and read that into the soup constructor
with open('psSingle.html') as page:
    soup = BeautifulSoup(page)
    game_name = soup.h2 #we keep getting None from this, so I keep being more and more simple with it in the hopes we get something, anything, back
    print game_name

Let's give that a run; we're looking pretty good right now. Most of the file has been commented out, and one of the things I say is that the less code you run, the fewer places you can get errors.

john:~/bSoup4$ python psSingle.py
psSingle.py:14: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. The code that caused this warning is on line 14 of the file psSingle.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
  soup = BeautifulSoup(page)
<h2 class="pdp__title" style="direction:ltr">Yakuza Kiwami</h2>

Fantastic! A couple things become clear here:

1) It worked!
2) I should really specify my parser; that warning is annoying
3) There is only one h2 tag, and it has the information we want in it!
4) We clearly have no idea how to properly search the soup.
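The parser fix, at least, is just the extra argument the warning asked for; something like:

soup = BeautifulSoup(page, 'html.parser') #explicitly name the parser so the warning goes away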

Referring back to the soup doc, we find that the piece of the result we want is probably the .string property, from the tags section.
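In the script, that's a one-line tweak, roughly:

game_name = soup.h2.string #.string gives back just the text inside the tag
print game_name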

john:~/bSoup4$ python psSingle.py
Yakuza Kiwami

That is way cleaner. Let’s get the price too.

Our code now looks like this:

#import libraries
import urllib2
from bs4 import BeautifulSoup
#specify the url
#url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'

#query the website and return the html to the variable 'page'
#contents of this have been saved into psSingle.html
#page = urllib2.urlopen(url) # using our local data so we don’t bother the website too much
with open('psSingle.html') as page:
    soup = BeautifulSoup(page, 'html.parser')

game_name = soup.h2.string #really nice that this worked
print game_name
game_price = soup.h3.string #let's try just grabbing the front most tag again
print game_price

And the result:

john:~/bSoup4$ python psSingle.py
Yakuza Kiwami
$19.99
john:~/bSoup4$

I'm floored; that worked well. Thanks, Sony, for being consistent in the way you put your pages together. We should get at least one more piece of information before we call this a success: we need something to represent how much attention this product has gotten. In this case, the number of ratings is prominently displayed at the top, so let's get that.

I know that div is a much more widely used tag than h2, so I anticipate the need to use a more specific search. Going back to the live interpreter (it's nice to be able to just run lines of code and see what works right away):

>>> divs = soup.find_all('div', attrs={'class':'provider-info_rating-count'}) # using the same syntax we learned last time, searching div tag with class that has the same content we saw in the source
>>> divs
[]
>>> divs = soup.find_all('div', attrs={'class':'provider-info__rating-count'}) # two underscores? Really?
>>> divs
[<div class="provider-info__rating-count">\n 1898 Ratings\n</div>] # this string is gross, and there is only one result
>>> divs = soup.find('div', attrs={'class':'provider-info__rating-count'}) # if there's only one, I can find instead of find_all, then I won't have a collection, it should just be one object
>>> divs
<div class="provider-info__rating-count">\n 1898 Ratings\n</div> # not that it looks any different, but I know that a collection will be harder to work with
>>> divs.stripped_string
>>> print(div.stripped_string) # remember your variable names Jacob!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'div' is not defined
>>> print(divs.stripped_string)
None
>>> print(divs.stripped_strings)
<generator object stripped_strings at 0x7f1a90700c30>
>>> divs.stripped_strings
<generator object stripped_strings at 0x7f1a90700c30>
>>> for string in divs.stripped_strings:
...     print(repr(string))
...
u'1898 Ratings'
>>> divs.string
u'\n 1898 Ratings\n'

I find it frustrating that this item won’t be as clean looking as the other two.

That <generator object> line is helpful: it tells us that the stripped_strings property is a generator. A generator can be thought of as a special kind of function with some peculiar and sometimes useful properties. It also gives us something to search for. Thanks to StackOverflow, we know we should be able to do this:

>>> next(divs.stripped_strings)
u'1898 Ratings'
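(As an aside, a generator in general looks something like this toy example; it has nothing to do with soup, it just shows the shape of the thing: yield hands back one value at a time, and next() asks for the next one.)

def count_up(n):
    #a generator: each yield hands back one value, then the function pauses until next() is called again
    i = 1
    while i <= n:
        yield i
        i += 1

gen = count_up(3)
print next(gen) #1
print next(gen) #2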

That next() call is much better. Let's add that to our script and remove all that commented code.
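For the record, the cleaned-up script probably ends up looking about like this (a sketch; the variable names for the ratings part are my own choices):

#import libraries
from bs4 import BeautifulSoup

#open our saved copy of the page source and make soup from it
with open('psSingle.html') as page:
    soup = BeautifulSoup(page, 'html.parser')

game_name = soup.h2.string #the title is in the only h2 on the page
game_price = soup.h3.string #the price is in the front-most h3
rating_div = soup.find('div', attrs={'class':'provider-info__rating-count'})
game_ratings = next(rating_div.stripped_strings) #'1898 Ratings'

print game_name
print game_price
print game_ratings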

Running that gets us the name, the price, and the ratings count. Success! A little anti-climactic maybe? This sets us up well for next time though, when we'll try to gather the rest of the information we might want off that page and assemble it into a spreadsheet so we can make use of it. I would love some feedback on these pieces so far: are they too long? Short? Too frequent? Too far apart? Should I spend less time narrating and more time explaining code? Let me know, and ask any questions you have, in the comments below. Or message me; I'm now on Twitter even (@NSJacob1)!

Jacob NoSleep
About the author

Designer, developer, contract-doer, and co-founder of No Sleep Software. I spend a lot of time thinking about systems design and staring at Visual Studio.
