{ "metadata": { "name": "nlpa-nltk-automated-tagging" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import nltk\n", "import urllib2\n", "import re" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 51 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Automatic Tagging with NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the above results are neat, they aren't all that useful in practice\n", "because most texts we want to visualize in such ways aren't tagged, and tagging\n", "them by hand is costly.\n", "\n", "What we need is an *automated tagger*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a page from Wikipedia and tag it automatically." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Send a browser-like User-agent; Wikipedia may reject urllib2's default one\n", "opener = urllib2.build_opener()\n", "opener.addheaders = [('User-agent', 'Mozilla/5.0')]\n", "infile = opener.open('http://en.wikipedia.org/w/index.php?title=George_Washington&printable=yes')\n", "# Decode the raw bytes to unicode and preview the first 400 characters\n", "page = infile.read().decode(\"utf-8\")\n", "page[:400]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 52, "text": [ "u'\\n\\n\\nGeorge Washington - Wikipedia, the free encyclopedia\\n\\n\\n\\n\\n\\n