February 25, 2012

Restringing a Racket with Haskell

My friend Alex is an Emacs lisp hacker, and is in the process of teaching himself Racket. I want him to learn Haskell. Selfish of me, I know. I also know that programming languages are religions for otherwise rational people. Well, I'm done with keeping my faith to myself! Welcome to my first blog post.

But why would I want to waste my time evangelizing some esoteric language like a Thelemite in the Java Bible Belt? It's because I believe that Haskell has made me a better programmer. It's true. Even when I'm not using Haskell, there's some residual Haskellishness that works its way into all my efforts, helping me to break it down, keep it organized, and sometimes even get it right the first time. I don't know if I'll be able to get any of that across here, but let's give it a try. At the very least, I hope to get across a Haskell workflow that's productive and fun.

So to learn Racket, Alex assigned himself the task of extracting microdata from html5. If this sounds unappealing to you, (a) you're probably not alone, and (b) maybe you can think of it this way: It's just a task of transforming one sort of ugly tree into another (which is 67.5% of what programming is all about). Tree transformation is right up Haskell's alley. Here's the Racket code that Alex sent me:

(require (prefix-in h: html)
         (prefix-in x: xml))

(struct itemprop (name value) #:transparent)
(struct itemscope ([itemtype #:mutable]
                   [properties #:mutable]) #:transparent)

(define (html-itemscope? element)
  (if (h:html-element? element)
    (findf (lambda (a) (eq? (x:attribute-name a) 'itemscope))
           (h:html-element-attributes element))
  #f))

(define (extract-html5-value element scope)
  (when (h:html-full? element)
    (cond [(html-itemscope? element) scope]
          [else (apply string-append
                       (flatten
                        (map (lambda (c)
                               (cond
                                [(x:pcdata? c)
                                 (list (x:pcdata-string c))]
                                [else (list)]))
                             (h:html-full-content element))))])))

(define (extract-microdata root [scope (void)] [result (box '())])
  (let ([new-scope (void)])
    (when (h:html-full? root)
      (for ([element (h:html-full-content root)])
           (match element
             [(h:html-element (list (x:attribute start stop key value) ...))
              (let ([attrs (for/list ([k key] [v value])
                                     (cons k v))])
                (when (assoc 'itemscope attrs)
                  (set! new-scope
                        (itemscope (cdr (or (assoc 'itemtype attrs)
                                            (cons "" #f))) '()))
                  (when (void? scope)
                    (set-box! result (cons new-scope (unbox result)))))

                (for ([prop attrs]
                      #:when (eq? (car prop) 'itemprop))
                     (set-itemscope-properties!
                      scope (cons (itemprop (cdr prop)
                                            (extract-html5-value element
                                               new-scope))
                                  (itemscope-properties scope)))))
              (extract-microdata element new-scope result)]
             [else #f]))))
  (unbox result))

(extract-microdata (h:read-html (open-input-string
              (string-append "
" "
" "

quux

" "
" " Foobar" "
"))))

It's not complete by any means but it's enough code to get your head around Racket. (Hmmm... This doesn't seem like your father's Scheme that I learned way back when... MIT 6001 and all that. Hopefully I can fake my way through it.) I'm not going to explain all of his code here, but I'll attempt to explain it as we go along. Also, I'm going to take this very slowly, so if you're an expert and see some obvious mistake, or point-free (pointless) simplification, just keep it to yourself. I'll just say I did it for expository purposes, and leave it at that.

Getting started with Haskell

First, go download the Haskell Platform for your platform, and get it installed. That should give you the command-line tools we'll be using here (ghc, ghci, cabal, etc). Then, since you're an Emacs user, install the haskell-mode described here. (If you're not an Emacs user, I pity you, but you can just run ghci in the shell.)

Of course you've already read Learn You a Haskell, for Great Good! You haven't? Good god man! I'm not going to review all that stuff about creating lists and defining functions.

I will tell you how to get going in Emacs though. Start by creating a Haskell file (say "microdata.hs") and make sure it's in haskell-mode. Type in the following masterpiece:

foo = 1 + 1

Now type C-c C-l. That will create an inferior Haskell mode buffer named *haskell*, start the interpreter running, and will load your program in, checking syntax and compiling as it goes. You can hop over to that buffer and type in foo at the *Main> prompt. You'll get your answer. Not so inferior after all, is it?

[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )
Ok, modules loaded: Main.
*Main> foo
2
*Main> 

If you happen to get a Haskell error when you evaluate a buffer like this, you can use C-x ` to jump your cursor to the next error message.

Now it's time to take a swing at this Racket.

Parsing HTML

The first 2 lines of the Racket file load Racket's HTML parser. We'll need to find a Haskell equivalent. Since I don't know about these things off the top of my head, I made a stop at hackage to see what might be available. If you click the Packages tab, you'll get a long list of packages that people have uploaded. Turns out there are quite a few HTML packages. After staring at them for a bit, I chose xmlhtml -- partly because I saw that a couple of other nice packages depend on it, and partly because I could find a parseHTML function in the first documentation file I opened.

To make this package available in your Haskell environment, you'll need to use the cabal package manager to download and build it. In your shell:

$ cabal install xmlhtml

This will grind for a while, building that package and all its dependencies. Fortunately, you only have to do this one time. You'll probably want to kill your *haskell* buffer at this point so that you'll start a new ghci session and pick up the newly installed packages.

Now we're ready to include xmlhtml in our "microdata.hs" file. Get rid of the foo = 1 + 1 crap, and replace it with this:

import Text.XmlHtml

tree = parseHTML "" "
"

This says, import everything from Text.XmlHtml and make its definitions immediately available at the top-level of our file. (Later we'll do some qualified imports to avoid some name collisions.) Then define the variable tree to be the parsed version of the HTML string. Blam:

Prelude> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )

/Users/warrenharris/projects/racket-hs/microdata.hs:3:21:
    Couldn't match expected type `Data.ByteString.Internal.ByteString'
                with actual type `[Char]'
    In the second argument of `parseHTML', namely
      `"
"' In the expression: parseHTML "" "
" In an equation for `tree': tree = parseHTML "" "
" Failed, modules loaded: none. Prelude>

Our first error. What going on here is that parseHTML expects a ByteString as an argument (binary data), but we gave it a regular String. Regular strings are lists of characters, which is different from an array of bytes, and BTW, Haskell's String type is a synonym for [Char].

Fortunately, we can roll out another bit of magic here called OverloadedStrings (a language feature). By turning this on, we can make the compiler automatically translate strings into ByteString (or the Text type, which we'll also make use of in this code). You can turn on OverloadedStrings by adding a declaration at the top of your file:

{-# LANGUAGE OverloadedStrings #-}

import Text.XmlHtml

tree = parseHTML "" "
"

Now if we evaluate the file, and then examine tree in the *haskell* buffer, we can see our parse tree:

Prelude> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )
Ok, modules loaded: Main.
*Main> tree
Right (HtmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [Element {elementTag = "div", elementAttrs = [("itemtype","foo")], elementChildren = []}]})
*Main> 

This parse tree is basically a bunch of constructors -- mostly from the Text.XmlHtml package although Right is kind of mysterious. If we examine the type of parseHTML we see that its result is Either String Document, i.e. either a String, or a Document. Haskell provides this type to cover the cases where you can only have one of 2 things. Then if the constructor Left is returned, you get the type on the left (String in this case), or if Right is returned, you get the type on the right (Document in this case). Haskell often uses Either to designate success/fail situations. Left is used for failures, usually returning an error message string, and Right is used for success, returning the parse tree here. (I've often though the 2 constructors of Either should be called Right and Wrong.)

We can also see from the parse tree that's return the names of several other constructors and record accessor functions that we'll need, e.g. HtmlDocument, docContent, Element, elementAttrs, etc. We'll use these in the next section.

Defining data types

Let's make some equivalent data structure definitions for the itemprop and itemscope structs next. There are several ways to do this in Haskell. The simplest might be to write this (for Itemprop):

data Itemprop = Itemprop Text Text

This creates a data structure definition which introduces a new type named Itemprop and a new constructor function (also called Itemprop) that takes 2 parameters where both of them are Text values. (We've used Text instead of String because it's more efficient, and because it's what Text.XmlHtml uses.) When we want to create an instance of this type, we would just write something like Itemprop "text1" "text2" and this would return a new Itemprop value. To access the 2 parameters, we would pattern match against Itemprop t1 t2, binding variables t1 and t2. Nice... but we can do better. Let's write our definition like this, using a record type:

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

This not only defines the type and same constructor function we mentioned above (completely identical), but at the same time defines some accessor functions that extract the 2 fields. By convention, we prefix the name these accessors with the name of the type (lower-cased, of course) so that we avoid collisions with similar accessor functions of other data types.

But what's this deriving (Show, Eq) stuff? This is Haskell's Scrap Your Boilerplate magic that automatically creates instances of the specified type classes. Show defines a show function which can stringify instances of the data type. It's extremely handy when using the interpreter (so you can print out these kinds of objects as results). And Eq defines the == and != functions that let values of these data types be structurally compared. There are a bunch of other things you can derive in this way, avoiding typing and mistakes.

So let's define our Itemscope type in a similar way:

data Itemscope = Itemscope {
  itemscopeType :: Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

and since we're using the Text type, we mustn't forget to import it too (this has to go at the top of the file with the rest of the imports):

import Data.Text (Text)

Here we've said that we only want the Text type constructor from this package. If we want more functions to be imported from this package, we can include their names in the parens.

Our first function

Now let's tackle the first Racket function, html-itemscope?. We'll use the name hasItemscope in Haskell since we can't use hyphens and question marks. Here's a first attempt at a definition:

hasItemscope elt = undefined

Huh? How can this possibly be a definition like the Racket one? What I've done here is used the undefined value for the result of this function. undefined is a magical value that can have any Haskell type -- so your code always compiles. (Go ahead, C-c C-l and you'll see that I'm not kidding.) This can be quite helpful for stubbing out something you want to deal with later. Only at runtime if you encounter an undefined value do you get:

*** Exception: Prelude.undefined

So undefined gives you something that you thought you could only get with dynamically typed code: the ability to defer writing code that you don't need to call yet, and a way to leave lurking time bombs in your code! (It is quite handy though.)

Let's try a little harder though, and deal with the first conditional h:html-element?. In the Text.XmlHtml package, elements vs other types of data are dealt with by the Node type which is an abstract data type with several cases. One of those is Element. We can use pattern matching to test for elements like this:

hasItemscope elt = case elt of
  Element _ _ _ -> undefined
  otherwise -> False

Here Element _ _ _ matches the Element constructor and its 3 arguments (the record fields). Since we don't care about these fields, we use underscores as wildcard "don't care" variables. And since the first thing we do here is match one of the arguments for a data type constructor, there's a more clever way to write this -- with several top-level function cases:

hasItemscope elt@(Element _ _ _) = undefined
hasItemscope _ = False

Here the elt@(Element _ _ _) syntax matches the elt parameter to the Element constructor and binds elt only if it matches and evaluates the body of the function case. If the first case doesn't match, it moves on to the next one. In fact, if you don't use wildcards, the compiler will even tell you if you've covered all the possible cases.

Next, we want to search for an "itemprop" attribute in the element's list of attributes. In the Data.List package there's a find function that's very similar to Racket's findf that takes a test function (we'll need to import Data.List (find) to get it). Let's rewrite the first case to call it:

hasItemscope elt@(Element _ _ _) =
  find hasItemscopeAttr (elementAttrs elt)
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

Instead of using a lambda like in the Racket code, we've defined a local function named hasItemscopeAttr (again with an undefined body) using a where clause (where at the bottom is just using a let at the top). However, if we try to compile this, we'll find a problem already:

*Main> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )

/Users/warrenharris/projects/racket-hs/microdata.hs:29:19:
    Couldn't match expected type `Maybe (Text, Text)'
                with actual type `Bool'
    In the expression: False
    In an equation for `hasItemscope': hasItemscope _ = False
Failed, modules loaded: none.
Prelude> 

This is because the type checker inferred an inconsistency in the two cases. The first case returns a Maybe (Text, Text) (because this is what find returns), but the second case returns a Bool (what we want). We can fix this with a bit more pattern matching:

hasItemscope elt@(Element _ _ _) =
  case find hasItemscopeAttr (elementAttrs elt) of
    Just _ -> True
    Nothing -> False
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

However, since we're translating a Maybe type into a Bool, you might guess that this is quite common and there's a shortcut for it. Let's see if we can find it. This is were the magic of Hoogle comes in. Hoogle is a way to search all Haskell functions uploaded to Hackage by their types. If we type Maybe a -> Bool into the search box, we'll find the first result is:

which is exactly what we want. Let's import Data.Maybe (isJust) and rewrite the function to use it:

hasItemscope elt@(Element _ _ _) =
  isJust (find hasItemscopeAttr (elementAttrs elt))
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

Now we can turn our attention to hasItemscopeAttr. First, what kind of data is passed as that first argument? We can use the *haskell* buffer to ask:

*Main> :t find
find :: (a -> Bool) -> [a] -> Maybe a
*Main> 

Or you can put your cursor on find in your file buffer, and type C-c t.

Since our list is coming from elementAttrs, we know this returns an association list of Text values ([(Text, Text)]). So hasItemscopeAttr should get a single pair. We can use pattern matching in its argument list to isolate the case we're looking for -- when the key is "itemscope":

hasItemscope elt@(Element _ _ _) =
  isJust (find hasItemscopeAttr (elementAttrs elt))
  where hasItemscopeAttr ("itemscope", _) = True
        hasItemscopeAttr _ = False
hasItemscope _ = False

Well, that's pretty good, except we can do one better. When the find function is used to search for a single item, well, that's exactly what the lookup function from the same package is all about. So we didn't really need all this find and hasItemscopeAttr nonsense after all. We can write things much more simply like this:

hasItemscope elt@(Element _ _ _) =
  isJust (lookup "itemscope" (elementAttrs elt))
hasItemscope _ = False

This probably means the Racket code can be simpler too.

We're done... except for one last thing. Haskellers hate parentheses! Absolutely hate them. And they'll go to extreme lengths to avoid them, whenever possible. (Such a shame, given how much Haskell owes to Lisp.) Anyway, the primary tool they use to avoid them is a little infix operator named $. The $ operator just applies the thing on the left-hand side (a function) to the thing on the right-hand side (a value) -- which is completely uninteresting except for the fact that the operator precedence of $ allows a set of parens be avoided. I am not kidding you!

So given the line of code isJust (lookup "itemscope" (elementAttrs elt)), we can use one $ operator to rewrite it to this: isJust $ lookup "itemscope" (elementAttrs elt), and then another $ to rewrite it to this: isJust $ lookup "itemscope" $ elementAttrs elt. $ takes as much stuff as it can on the right-hand side, which is why using 2 $ operators as we've done here parses with the same precedence as our original set of parens (i.e it's right-associative).

And another thing... Good Haskellers always declare the types of their functions. Although the type checker is perfectly happy to figure these out automatically, us humans are much more lazy and prefer to have someone write it down for us. So think of the next guy and write it down for them. Putting it all together, the "professional" version of our hasItemscope function would look like this:

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

Review

Did I say we were done? We've only written 2 types and 1 function so far. But let's review what our code looks like so far:

{-# LANGUAGE OverloadedStrings #-}

import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

For the rest, I'm afraid you're going to have to move on to part 2.

No comments:

Post a Comment