
Welcome to episode three! If you haven’t read the previous post yet, you can find it here. As a recap:

S1E2:

  1. We learned it made more sense to scrape the page once and dump the results into a file for repeated local testing
  2. We got better at searching the soup
  3. We managed to get the name of the game (Yakuza Kiwami), the price, and the number of ratings off the PlayStation page for this game

Our script as we left off last time looks like this:

from bs4 import BeautifulSoup

with open('psSingle.html') as page:
    soup = BeautifulSoup(page, 'html.parser')

game_name = soup.h2.string
print game_name

game_price = soup.h3.string
print game_price

ratings_tag = soup.find('div', attrs={'class':'provider-info__rating-count'})
game_ratings = next(ratings_tag.stripped_strings)
print game_ratings

I think we should now do two things: save a copy of the code we used to scrape the page and dump the results, and try to export our findings into a spreadsheet.

Let’s start by fixing up our script that pulled source off websites and into local files. It didn’t work on our first try, and we ended up finishing that code in the live interpreter, and that means it isn’t saved for later use (and we’ll probably be using it again). It also wasn’t very flexible with those hard-coded strings. Reviewing the code from last week:

import urllib2

url = 'https://store.playstation.com/en-ca/product/UP0177-CUSA06997_00-YAKUZAKIWAMI0100'
page = urllib2.urlopen(url)

with open('psSingle.html', 'w') as f:
    f.write(page.read())

That did not work, but it also didn’t error out. The only difference I can see between that script and the code that worked in the interpreter is that in the interpreter I put the results of page.read() into a variable. I don’t know for sure why that would make a difference, but let’s try doing that in our file and see if it works. We should also take in command line arguments for the URL string and the file name. Let’s google how to do that. According to the linked article we can use sys.argv as a list. Let’s give that a quick try.
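A quick test of that kind could look like the sketch below. The helper name label_args and the script name argTest.py are made up for illustration; only sys.argv itself comes from the standard library.

```python
import sys


def label_args(argv):
    """Pair each command line argument with its position in the list."""
    return ['sys.argv[%d] = %s' % (index, value)
            for index, value in enumerate(argv)]


# Print every argument the script received, one per line.
for line in label_args(sys.argv):
    print(line)
```

To try it, save it as argTest.py and run something like python argTest.py first second.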

And the results:

I love it when things are that easy. Worth noting that sys.argv includes the name of the script we ran, so if we want the next two arguments to be the URL and filename we will need to use sys.argv[1] and sys.argv[2]. That means our new script should look like this:

import urllib2
import sys

url = sys.argv[1]
page = urllib2.urlopen(url)
contents = page.read()

with open(sys.argv[2], 'w') as f:
    f.write(contents)

That works just as intended, and we can re-use it later!
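One caveat: urllib2 only exists in Python 2. For anyone following along on Python 3, a rough equivalent might look like the sketch below. It assumes urllib.request.urlopen as the replacement, and the function name save_page is made up for this example.

```python
import sys
from urllib.request import urlopen


def save_page(url, filename):
    """Download whatever lives at url and write the raw bytes to filename."""
    # Read the whole response into a variable first, as in the script above.
    with urlopen(url) as page:
        contents = page.read()

    # In Python 3, read() returns bytes, so the file is opened in binary mode.
    with open(filename, 'wb') as f:
        f.write(contents)

    return len(contents)


# Only run when both a URL and a file name were given on the command line.
if __name__ == '__main__' and len(sys.argv) == 3:
    save_page(sys.argv[1], sys.argv[2])
```

It would be called the same way as the Python 2 version: the URL first, then the name of the file to write.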

We don’t need this right now, however. Let’s change gears a little bit and try to make a CSV file out of the information we were printing on the screen with our script from last episode. We have that information stored as variables already, we just need to write them to a CSV. The example below from the linked page looks pretty good for the information we have:

import csv

with open('names.csv', 'w') as csvfile:
    fieldnames = ['first_name', 'last_name']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})

Let’s do that with our variables instead of printing them to the screen. But first some housekeeping:

Once things are looking tidier in our working directory, we will open up that new copy of our script from last episode and make some additions:

And we error. Go figure, basically nothing runs on the first try. Let’s review the error that gets thrown out:

john:~/bSoup4$ python psSingleToCSV.py yakuzaKiwamiSource.html firstPSCSV.csv
File "psSingleToCSV.py", line 21
writer.writerow({'game_name':game_name, 'game_price':game_price, 'game_ratings':game_ratings)
^
SyntaxError: invalid syntax

Syntax error. Let’s read that line over very carefully… Do you spot what I did wrong? I can see it now, and I feel silly, but I won’t cut this out. I’m using a glorified text editor; a modern IDE would have highlighted this for me, but no excuses. I must be getting soft using Visual Studio all the time. I’m missing a “}” at the end of that dictionary literal, just before the closing parenthesis. Let’s try again with the required brace.

john:~/bSoup4$ python psSingleToCSV.py yakuzaKiwamiSource.html firstPSCSV.csv
john:~/bSoup4$ ls
demoOriginal.py firstPSCSV.csv psSingleToCSV.py writeURLtoFILE.py
demo.py psSingle.py psSingleToCSV.py~ yakuzaKiwamiSource.html
john:~/bSoup4$ cat firstPSCSV.csv
game_name,game_price,game_ratings
Yakuza Kiwami,$19.99,1916 Ratings
john:~/bSoup4$ rm *~

I run our script again, and this time there are no errors! I use ls (that’s a lowercase LS) to check if our file was made, and it was. I run cat to read our file out to the command line, and it looks good to me. Finally, I run rm *~ to get rid of that backup copy with the “~” at the end (Emacs leaves those behind when you save a file).
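For reference, the CSV-writing part of a script like psSingleToCSV.py could be sketched as below. This is a Python 3 reconstruction based on the DictWriter example and the output above, not the exact code from the run. The function name write_game_row is hypothetical, and the three values are passed in as plain strings rather than pulled from the soup.

```python
import csv


def write_game_row(csv_path, game_name, game_price, game_ratings):
    """Write a header row plus one row of game details to csv_path."""
    fieldnames = ['game_name', 'game_price', 'game_ratings']

    # The Python 3 csv docs recommend newline='' when opening the file.
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        # Note the closing brace before the closing parenthesis.
        writer.writerow({'game_name': game_name,
                         'game_price': game_price,
                         'game_ratings': game_ratings})
```

Calling write_game_row('firstPSCSV.csv', 'Yakuza Kiwami', '$19.99', '1916 Ratings') should produce the same two lines that cat printed above.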

We have one entry in a CSV file, huge success! We will need a lot more entries before this data is useful though. We are going to need some way to automate the collection of this data across a target site. As short as this post was compared to the last two, that sounds like considerably different subject matter, so I’m going to cut this one here.

If you found the shorter post easier to digest, let me know. If you have any questions, please reach out to me here or on Twitter (@NSJacob1). Please share this or any other content you enjoy here, and check back for more updates!

Jacob NoSleep
About the author

Designer, developer, contract-doer, and co-founder of No Sleep Software. I spend a lot of time thinking about systems design and staring at Visual Studio.
