Thursday, October 1, 2015

Using Python to Sound Like a Wine Snob


Web Scraping with Beautiful Soup & Using Markov Chain to Create Wine Snob Gibberish

Some Background

The Markov chain is named after the Russian mathematician Andrey Markov (1856 - 1922). In mathematics, a Markov Chain is a discrete random process with the Markov property. According to Wikipedia, "A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it." In other words, the process hops from state to state in discrete steps, and each random hop depends only on where it is right now. Saaaay what!? Jason Young has a video on YouTube that better explains what a Markov Chain is.
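If the definition still reads like gibberish, a toy example may help. Here's a minimal sketch (written for Python 3.6+, and not part of the wine script) of a two-state "weather" chain. The states and probabilities are made up purely for illustration, and simulate_weather is a hypothetical helper name.

```python
import random

# Hypothetical transition table: for each current state, the possible next
# states and their probabilities. The numbers are invented for illustration.
TRANSITIONS = {
    "sunny": (["sunny", "rainy"], [0.8, 0.2]),
    "rainy": (["sunny", "rainy"], [0.4, 0.6]),
}


def simulate_weather(start, steps, rng=random):
    """Walk the chain for `steps` steps, looking only at the current state."""
    state = start
    history = [state]
    for _ in range(steps):
        next_states, weights = TRANSITIONS[state]
        # The next state is drawn using the current state alone;
        # nothing earlier in `history` is consulted.
        state = rng.choices(next_states, weights=weights)[0]
        history.append(state)
    return history


print(simulate_weather("sunny", 10, random.Random(42)))
```

Tomorrow's weather is picked by looking at today's weather and nothing else. That's the Markov property in a nutshell.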

"Markon, Markov, Markon, Markov" - Mr. Miyagi [not really]

Our lives have all been graced by Markov Chains. Think about those websites you've come across that are full of nonsensical text. These pages are often generated with a Markov Chain. Why do these pages exist? They're used to game search engine rankings (the darker side of SEO). Bummer.

Fun with the Markov Chain

While researching Markov Chains, I came across Tony's (@tonyfischetti) blog post. It inspired me to create a Python script that does essentially what his example does, but uses the BeautifulSoup library to scrape the initial website content.

Requirements

  • beautifulsoup4 # This is for extracting the data we want from the downloaded web content
  • requests # This is for downloading the web content
  • Note: I used Python 2.7.9 when I created this. I haven't tested the code with Python 3.
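For anyone trying this on Python 3, the snippet below is a rough sketch of the Python 2 idioms in this post that would most likely need changing. It runs on its own with made-up demo data and has not been checked against the live site, so treat it as a list of things to look out for, not a tested port.

```python
from random import choice

# Python 2: print "hello"             -> Python 3: print("hello")
# Python 2: xrange(n)                 -> Python 3: range(n)
# Python 2: choice(d.keys())          -> Python 3: choice(list(d.keys()))
# Python 2: text.encode('utf-8') then str methods
#                                     -> Python 3: skip the encode and keep
#                                        working with str (encode gives bytes)

demo = {("I", "love"): ["walking"], ("love", "walking"): ["cats"]}

page = choice(range(1, 815))        # range is lazy in Python 3, like xrange
key = choice(list(demo.keys()))     # keys() is a view and can't be indexed
text = u"fresh red \u2014Thomas Matthews"
review = text.split(u"\u2014")[0].strip()  # split on the em dash as text

print(page, key, review)
```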

Steps

  • Download initial webpage from winespectator.com and determine last "page" number

The idea behind web scraping is to get raw content from a website and extract usable data from it. This is where the Python library, BeautifulSoup, comes in handy.

Let's Begin

We'll start by importing the modules we'll need to download a website and extract the data we want.

import requests
from bs4 import BeautifulSoup
from random import choice

We want to pick a random webpage from the website to feed into our Markov Chain. First, let's find out how many pages this site has at http://www.winespectator.com/dailypicks/category/catid/1/page/???.

Now we'll download the HTML source from a wine website and generate a Beautiful Soup object using the BeautifulSoup function.

base_url = "http://www.winespectator.com/dailypicks/category/catid/1/page"
r = requests.get(base_url)
soup = BeautifulSoup(r.text)

Next, we'll take a look at a section of the website's HTML to figure out what "element" we want to extract in order to get the last page number.

print soup.prettify()[44750:45750]
items -->
      <center>
       <div class="pagination">
        <strong>
         1
        </strong>
        <a href="/dailypicks/category/catid/1/page/2" title="Goto page 2">
         2
        </a>
        <a href="/dailypicks/category/catid/1/page/3" title="Goto page 3">
         3
        </a>
        <a href="/dailypicks/category/catid/1/page/4" title="Goto page 4">
         4
        </a>
        <a href="/dailypicks/category/catid/1/page/5" title="Goto page 5">
         5
        </a>
        <a href="/dailypicks/category/catid/1/page/6" title="Goto page 6">
         6
        </a>
        <a href="/dailypicks/category/catid/1/page/7" title="Goto page 7">
         7
        </a>
        <a href="/dailypicks/category/catid/1/page/2" title="Goto Next Page">
         &gt;&gt;
        </a>
        <a href="/dailypicks/category/catid/1/page/814" title="Goto Last Page">
         Last (814)
        </a>
       </div>
      </center>
     </div>
     <!-- /.mod-container -->
     <div

There's a lot of junk we need to sift through. Looking at the source code from above, we now know we're looking for an element called "div" with class "pagination". Beautiful Soup makes it easy to find and extract this.

As of this writing, we see that the last page is "814". That means the range of possible pages we can download is from 1 - 814. Right on!

Let's extract the last page number (i.e., "814") from the HTML using BeautifulSoup with this knowledge.

html_chunk = soup.find_all("div", class_="pagination")
last_page = (str(html_chunk[0]).split('Last (')[1]).split(')')[0]
print last_page
814
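The chained split() calls do the job, but they raise an IndexError if the pagination block is missing or the link text is worded differently. Here's a slightly more defensive sketch using the standard library's re module on a fragment copied from the HTML above. Note that find_last_page is a hypothetical helper name, and swapping in a regular expression is my substitution, not something Beautiful Soup requires.

```python
import re

# Fragment copied from the prettified HTML shown earlier in this post.
html_fragment = """
<div class="pagination">
 <a href="/dailypicks/category/catid/1/page/2" title="Goto Next Page">
  &gt;&gt;
 </a>
 <a href="/dailypicks/category/catid/1/page/814" title="Goto Last Page">
  Last (814)
 </a>
</div>
"""


def find_last_page(html):
    """Return the N from a 'Last (N)' pagination link as an int."""
    match = re.search(r"Last \((\d+)\)", html)
    if match is None:
        raise ValueError("no 'Last (N)' link found in the HTML")
    return int(match.group(1))


print(find_last_page(html_fragment))
```

Getting an int back also saves the int() conversion later on.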

Now let's pick a random page and generate our URL.

random_page_number = choice(xrange(1, int(last_page) + 1))  # pages run from 1 to last_page, inclusive
url = "{}/{}".format(base_url, random_page_number)
print url
http://www.winespectator.com/dailypicks/category/catid/1/page/288

We'll download this randomly selected page.

r = requests.get(url)
soup = BeautifulSoup(r.text)

Similar to how we found the last page number, let's use BeautifulSoup to look at the HTML we downloaded to figure out what element we're looking for.

print soup.prettify()[41800:42800]
BODEGAS CAMPO VIEJO Rioja Crianza 2007
        </a>
        <h6>
         86 points, $12
        </h6>
        <div class="paragraph">
         Light, firm tannins support a pleasingly plump texture in this fresh red, which offers black cherry, leaf and tobacco notes, with a smoky finish. Drink now through 2013. 50,000 cases imported.
         <em>
          —Thomas Matthews
         </em>
        </div>
        <!-- /.paragraph -->
       </h5>
      </div>
      <!-- /.list-items -->
      <div class="daily-wine-items">
       <span>
        Jan. 11, 2011
       </span>
       <h5>
        <a href="/wine/detail?note_id=288244">
         STANDING STONE Chardonnay Finger Lakes 2009
        </a>
        <h6>
         85 points, $11
        </h6>
        <div class="paragraph">
         Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made.
         <em>
          —James Molesworth
         </em>
        </d

Analyzing the above HTML, we learn that we're looking for an element called "div" with class "paragraph". Let's extract away...

verbiage = soup.find_all("div", class_="paragraph")
print type(verbiage)
<class 'bs4.element.ResultSet'>

We can now begin looking at the text that Beautiful Soup extracted. Each matching element can be accessed by its index.

verbiage[0]
<div class="paragraph">4070 Daily Wine Picks found in this category.</div>

We'll probably want to ignore verbiage[0] later.

verbiage[1]
<div class="paragraph">
   Light, firm tannins support a pleasingly plump texture in this fresh red, which offers black cherry, leaf and tobacco notes, with a smoky finish. Drink now through 2013. 50,000 cases imported.      <em>—Thomas Matthews </em>
</div>

Nice - this looks like some content we want to extract.

Ok, we know we don't want verbiage[0], so we'll start iterating through the entries starting at index 1 (i.e., "[1:]"). We'll also encode the text to UTF-8. Next, we'll want to remove any newlines and tabs that are in the text ... then remove any leading/trailing spaces ... and then split the line so that we ignore the "em" element; we don't care about who wrote the comment on the website. We'll combine all of the sanitized text into a string called scraped_text.

scraped_text = ""
for entry in verbiage[1:]:
    entry = entry.get_text().encode('utf-8')
    entry = entry.replace('\t', '')
    entry = entry.replace('\n', '')
    entry = entry.strip()
    entry = entry.split('—')[0]
    scraped_text += "{} ".format(entry.replace('Back to Top', ''))
scraped_text = str(scraped_text.split('Featured:')[0])

Let's see what we got by printing scraped_text.

print(scraped_text)
Light, firm tannins support a pleasingly plump texture in this fresh red, which offers black cherry, leaf and tobacco notes, with a smoky finish. Drink now through 2013. 50,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made. Tasty, showing citrus, pear and apple flavors that have a pleasant ripeness and a floral quality. Balanced and juicy. Drink now. 40,000 cases made. Vibrant and mouthwatering, with a laser beam of lemon, lime, grapefruit and apricot flavors. Hints of fresh herbs and flowers add to the complexity. Drink now. 250,000 cases imported. Syrah-like, with layers of plum, spice and violet flavors framed by a fine layer of tannins, followed by a focused, tar-tinged finish. Drink now. 60,000 cases made. Browse our exclusive lists of the world's top wine values, top value producers and easy-to-find wines.

Yawn. I like wine. Well, I like to drink wine.
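One footnote on the cleanup loop: it leans on Python 2 string behavior. As a rough sketch, here is how a single entry could be cleaned under Python 3, where the text stays a str and the em dash is matched by its code point. The clean_entry name is hypothetical, and the sample string is the first tasting note from above with its whitespace approximated.

```python
def clean_entry(raw_text):
    """Flatten whitespace and drop the trailing reviewer credit."""
    text = raw_text.replace("\t", "").replace("\n", "").strip()
    # The reviewer's name follows an em dash (U+2014); keep what precedes it.
    return text.split("\u2014")[0].strip()


sample = (
    "\n   Light, firm tannins support a pleasingly plump texture in this "
    "fresh red, which offers black cherry, leaf and tobacco notes, with a "
    "smoky finish. Drink now through 2013. 50,000 cases imported.      "
    "\u2014Thomas Matthews \n"
)

print(clean_entry(sample))
```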

At this point, we have the text we want to work with. Let's create the Markov Chain and generate some new text.

We'll define a function that splits the text passed to it into a dictionary of Markov Chain chunks, returning the new dict once it's done. For example, take the sentence, "I love walking cats in New York City". The sentence is first chunked into bi-grams:

  • I love
  • love walking
  • walking cats
  • cats in
  • in New
  • New York
  • York City

With Python, we'll make these immutable keys in a dictionary (dict):

  • {('I', 'love'): '', ('love', 'walking'): '', ... }

We'll then need to add values to each of the keys. The values will consist of the word that comes after each instance of the bi-grams. So, in the case of "I love", the third word is "walking".

If we feed more data into our function, there may be multiple instances of "I love". For example, "I love walking cats in New York City. I love eating pizza.". The words "walking" and "eating" both come after "I love" (there are two instances of "I love"). The value we assign to the "('I', 'love')" dictionary key is a list consisting of "['walking', 'eating']".

Our dictionary begins to look like:

  • {('I', 'love'): ['walking', 'eating'], ... }

Once completed, we return the dict.
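To see the idea on the toy sentences before running it on real wine reviews, here's a compact sketch for Python 3. It uses collections.defaultdict and zip, which is a stylistic swap of my own, and build_chain is a hypothetical name.

```python
from collections import defaultdict


def build_chain(text):
    """Map each pair of consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    # zip lines up every word with the next two words in the text.
    for first, second, follower in zip(words, words[1:], words[2:]):
        chain[(first, second)].append(follower)
    return dict(chain)


toy_text = "I love walking cats in New York City. I love eating pizza."
chain = build_chain(toy_text)
print(chain[("I", "love")])  # ['walking', 'eating']
```

The function that follows does the same job with a plain dict and an index loop.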

def create_markcov_dict(original_text):
    split_text = original_text.split()
    markcov_dict = {}
    for i in xrange(len(split_text) - 2):
        key_name = (split_text[i], split_text[i+1])
        key_value = split_text[i+2]
        if key_name in markcov_dict:
            markcov_dict[key_name].append(key_value)
        else:
            markcov_dict[key_name] = [key_value]
    return markcov_dict

Let's send the above function our scraped text from the website.

markcov_dict = create_markcov_dict(scraped_text)
print(markcov_dict)
{('top', 'wine'): ['values,'], ('lime,', 'grapefruit'): ['and'], ('wine', 'values,'): ['top'], ('Hints', 'of'): ['fresh'], ('laser', 'beam'): ['of'], ('add', 'to'): ['the'], ('and', 'mouthwatering,'): ['with'], ('which', 'offers'): ['black'], ('green', 'apple,'): ['melon'], ('violet', 'flavors'): ['framed'], ('notes,', 'with'): ['a'], ('value', 'producers'): ['and'], ('tobacco', 'notes,'): ['with'], ('imported.', 'Up'): ['front,'], ('made.', 'Vibrant'): ['and'], ('and', 'easy-to-find'): ['wines.'], ('mouthwatering,', 'with'): ['a'], ('tannins,', 'followed'): ['by'], ('of', 'tannins,'): ['followed'], ('Tasty,', 'showing'): ['citrus,'], ('flavors', 'framed'): ['by'], ('of', 'plum,'): ['spice'], ('of', 'lemon,'): ['lime,'], ('a', 'pleasingly'): ['plump'], ('40,000', 'cases'): ['made.'], ('and', 'apple'): ['flavors'], ('250,000', 'cases'): ['imported.'], ('values,', 'top'): ['value'], ('2013.', '50,000'): ['cases'], ('flavors', 'that'): ['have'], ('butter', 'hints.'): ['Just'], ('ripeness', 'and'): ['a'], ('lists', 'of'): ['the'], ('and', 'butter'): ['hints.'], ('of', 'the'): ["world's"], ('finish.', 'Drink'): ['now', 'now.'], ('now.', '60,000'): ['cases'], ('Drink', 'now.'): ['1,184', '40,000', '250,000', '60,000'], ('and', 'apricot'): ['flavors.'], ('Syrah-like,', 'with'): ['layers'], ('honest.', 'Drink'): ['now.'], ('that', 'have'): ['a'], ('front,', 'with'): ['green'], ('fine', 'layer'): ['of'], ('top', 'value'): ['producers'], ('1,184', 'cases'): ['made.'], ('and', 'flowers'): ['add'], ('all', 'honest.'): ['Drink'], ('cases', 'imported.'): ['Up', 'Syrah-like,'], ('apple,', 'melon'): ['and'], ('Up', 'front,'): ['with'], ('floral', 'quality.'): ['Balanced'], ('texture', 'in'): ['this'], ('the', 'complexity.'): ['Drink'], ('plum,', 'spice'): ['and'], ('to', 'the'): ['complexity.'], ('now.', '40,000'): ['cases'], ('a', 'fine'): ['layer'], ('flavors.', 'Hints'): ['of'], ('juicy.', 'Drink'): ['now.'], ('fresh', 'herbs'): ['and'], ('tar-tinged', 'finish.'): ['Drink'], 
('hints.', 'Just'): ['tangy'], ('and', 'tobacco'): ['notes,'], ('pleasingly', 'plump'): ['texture'], ('framed', 'by'): ['a'], ('Light,', 'firm'): ['tannins'], ('now.', '1,184'): ['cases'], ('of', 'fresh'): ['herbs'], ('with', 'green'): ['apple,'], ('grapefruit', 'and'): ['apricot'], ('melon', 'and'): ['butter'], ('have', 'a'): ['pleasant'], ('leaf', 'and'): ['tobacco'], ('cherry,', 'leaf'): ['and'], ('beam', 'of'): ['lemon,'], ('smoky', 'finish.'): ['Drink'], ('red,', 'which'): ['offers'], ('keep', 'it'): ['all'], ('showing', 'citrus,'): ['pear'], ('the', "world's"): ['top'], ('offers', 'black'): ['cherry,'], ('now', 'through'): ['2013.'], ('in', 'this'): ['fresh'], ('now.', '250,000'): ['cases'], ('complexity.', 'Drink'): ['now.'], ('a', 'laser'): ['beam'], ('made.', 'Tasty,'): ['showing'], ('Balanced', 'and'): ['juicy.'], ('60,000', 'cases'): ['made.'], ('our', 'exclusive'): ['lists'], ('this', 'fresh'): ['red,'], ('firm', 'tannins'): ['support'], ('Drink', 'now'): ['through'], ('flowers', 'add'): ['to'], ('pleasant', 'ripeness'): ['and'], ('imported.', 'Syrah-like,'): ['with'], ('producers', 'and'): ['easy-to-find'], ('Just', 'tangy'): ['enough'], ('apple', 'flavors'): ['that'], ('with', 'layers'): ['of'], ('cases', 'made.'): ['Tasty,', 'Vibrant', 'Browse'], ('focused,', 'tar-tinged'): ['finish.'], ('enough', 'on'): ['the'], ('to', 'keep'): ['it'], ('followed', 'by'): ['a'], ('pear', 'and'): ['apple'], ('quality.', 'Balanced'): ['and'], ('plump', 'texture'): ['in'], ('a', 'pleasant'): ['ripeness'], ('black', 'cherry,'): ['leaf'], ('finish', 'to'): ['keep'], ('Browse', 'our'): ['exclusive'], ('it', 'all'): ['honest.'], ('layer', 'of'): ['tannins,'], ('on', 'the'): ['finish'], ('exclusive', 'lists'): ['of'], ('a', 'floral'): ['quality.'], ('the', 'finish'): ['to'], ('made.', 'Browse'): ['our'], ('a', 'smoky'): ['finish.'], ('with', 'a'): ['smoky', 'laser'], ('through', '2013.'): ['50,000'], ('lemon,', 'lime,'): ['grapefruit'], ('apricot', 'flavors.'): ['Hints'], 
("world's", 'top'): ['wine'], ('and', 'violet'): ['flavors'], ('Vibrant', 'and'): ['mouthwatering,'], ('and', 'a'): ['floral'], ('tangy', 'enough'): ['on'], ('citrus,', 'pear'): ['and'], ('fresh', 'red,'): ['which'], ('50,000', 'cases'): ['imported.'], ('by', 'a'): ['fine', 'focused,'], ('a', 'focused,'): ['tar-tinged'], ('and', 'juicy.'): ['Drink'], ('tannins', 'support'): ['a'], ('layers', 'of'): ['plum,'], ('support', 'a'): ['pleasingly'], ('spice', 'and'): ['violet'], ('herbs', 'and'): ['flowers']}

We'll create a new function that we can feed this Markov'ian dictionary to and have the newly generated text we want returned.

def create_markcov_text(markcov_dict):
    # Pick a random starting point
    selected_words_tuple = choice(markcov_dict.keys())
    markcov_text = [selected_words_tuple[0], selected_words_tuple[1]]

    # Generate the Markov text, ending the Markov text when we create a "key" that doesn't exist
    while selected_words_tuple in markcov_dict:
        next_word = choice(markcov_dict[selected_words_tuple])
        markcov_text.append(next_word)
        selected_words_tuple = (selected_words_tuple[1], next_word)

    # Return our newly generated Markov poem/story/text with its first letter
    # upper-cased (str.capitalize() would also lower-case every other letter)
    joined_text = " ".join(markcov_text)
    return joined_text[0].upper() + joined_text[1:]

We'll pass markcov_dict to the above function.

markcov_text = create_markcov_text(markcov_dict)

Drumroll ... let's finally print our newly generated wine-snobbery text.

print(markcov_text)
Tobacco notes, with a laser beam of lemon, lime, grapefruit and apricot flavors. Hints of fresh herbs and flowers add to the complexity. Drink now. 250,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made. Tasty, showing citrus, pear and apple flavors that have a pleasant ripeness and a floral quality. Balanced and juicy. Drink now. 40,000 cases made. Vibrant and mouthwatering, with a smoky finish. Drink now. 250,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made. Tasty, showing citrus, pear and apple flavors that have a pleasant ripeness and a floral quality. Balanced and juicy. Drink now. 250,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made. Tasty, showing citrus, pear and apple flavors that have a pleasant ripeness and a floral quality. Balanced and juicy. Drink now. 250,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 250,000 cases imported. Up front, with green apple, melon and butter hints. Just tangy enough on the finish to keep it all honest. Drink now. 1,184 cases made. Browse our exclusive lists of the world's top wine values, top value producers and easy-to-find wines.
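The output repeats itself a lot because the walk keeps circling back through "Drink now." and only stops when it lands on a pair of words with no entry in the dictionary. If the last two words of the scraped text also appear earlier in it, there's no dead end at all and the loop would never finish. Here's a hedged sketch for Python 3 that adds a word cap. Both generate_text and max_words are hypothetical names, and the demo chain is hand-made so that it has no dead end.

```python
import random


def generate_text(chain, max_words=60, rng=random):
    """Random-walk the bigram chain, stopping at a dead end or at max_words."""
    current = rng.choice(list(chain.keys()))
    words = list(current)
    while current in chain and len(words) < max_words:
        next_word = rng.choice(chain[current])
        words.append(next_word)
        current = (current[1], next_word)
    text = " ".join(words)
    # Upper-case only the first letter and leave the rest untouched.
    return text[0].upper() + text[1:]


# Hand-made chain with no dead end, so only the cap can stop the walk.
demo_chain = {
    ("Drink", "now."): ["Balanced", "Drink"],
    ("now.", "Balanced"): ["and"],
    ("Balanced", "and"): ["juicy."],
    ("and", "juicy."): ["Drink"],
    ("juicy.", "Drink"): ["now."],
    ("now.", "Drink"): ["now."],
}

print(generate_text(demo_chain, max_words=20, rng=random.Random(1)))
```

Passing a seeded random.Random instance makes a run repeatable, which is handy when you want to show off the same gibberish twice.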

Cheers!
