February 25, 2012

Restringing a Racket with Haskell (pt 4)

Houston, we have a problem. If we evaluate result from part 3 and look at it closely, we'll see it:

type ExtractState a = State [Itemscope] a
*Main> result1
Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"}]}]

Uh, let me format that for you:

Right [
  Itemscope {itemscopeType = Nothing,
             itemscopeProperties = [
               Itemprop {itempropName = "name",
                         itempropValue = TextValue "quux"},
               Itemprop {itempropName = "item",
                         itempropValue = ItemscopeValue (
                           Itemscope {itemscopeType = Nothing,
                                      itemscopeProperties = []})}]},
  Itemscope {itemscopeType = Just "type",
             itemscopeProperties = [
               Itemprop {itempropName = "name",
                         itempropValue = TextValue "Foobar"}]}]

and pull up our original HTML for reference:

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
"

Here we can see that the Itemprops aren't in the right places at all. The "name"="quux" one should be inside the empty Itemscope that's the value of "item", and that Itemprop should be a sibling of the "name"="Foobar" one that's below. What's going wrong?

Well, after thinking a bit more about what the algorithm here needs to do, it's clear that we need to traverse one tree (the HTML), and build a new one (the Itemscopes). But although we used foldr to push items onto a stack as we walked down the HTML tree, we didn't pop any items from that stack as we unwound the HTML tree. So we were always left with last pushed item on the top of the stack when we updated the Itemprop values. This is completely broken.

Just to be a little more clear, there are 3 uses of foldr in our code: in extractItemprops, extractMicrodata and in extractMicrodataFromDoc. The one in extractItemprops is fine because it is simply processing a list of attributes that all belong to the same HTML element. Similarly, the one in extractMicrodataFromDoc is fine because it is processing the content of the document which is a list of HTML nodes all at the top-level of the document. The problem is in extractMicrodata which not only processes the immediate children of an element, but calls itself to process each of their children.

It isn't obvious how we can modify extractMicrodata to do this pushing and popping since it isn't until we get deep into its implementation do we know that we're looking at an element with an "itemscope" attribute. A better approach here would be to recurse down the xmlhtml tree, calling our extraction operations in a mutually recursive way, and returning a correctly constructed Itemscope from each call.

By mutually recursively, we mean calling extractMicrodata, extractItemprops, extractNewItemscope from each other as necessary -- so the 3 of these form a recursive cycle. But which one should be called first? To determine that, we need to consider the html structure. Some nodes like the div node in the example above have both itemprop and itemscope attributes. When this occurs we want the itemprop ("item" in this case) to have a value that contains the itemscope and all its sub-properties ("name"="quux" here). This means that extractItemprops should be called first (from extractMicrodataFromDoc), and it should in turn call extractNewItemscope to build a new Itemscope value when the itemscope attribute is present. Finally extractNewItemscope should call extractMicrodata to recursively process the html children, and this can recurse back to extractItemprops once again.

Unlike before when each of these functions would take a stack (list) of Itemscopes, now it's only necessary for them to take the current Itemscope object, and return a (possibly new) Itemscope on the way back out. This is because our call stack will replace the explicit stack (that we never popped) from before.

Let's start with extractMicrodata, since it's at the bottom of this call stack. Rather than calling all 3 extraction functions on each child, it now only calls extractItemprops:

extractMicrodata :: Node -> Itemscope -> Itemscope
extractMicrodata elt@(Element _ _ children) scope =
  foldr extractItemprops scope children
extractMicrodata _ scope = scope

Pretty straightforward. Next, extractMicrodata is called from extractNewItemscope. In the case where we have an element with an itemscope attribute, we'll construct a new Itemscope and pass this to extractMicrodata. Otherwise, we'll continue working with the one we were given:

extractNewItemscope :: Node -> Itemscope -> Itemscope
extractNewItemscope elt@(Element _ attrs _) scope
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
    in extractMicrodata elt newScope
extractNewItemscope node scope = extractMicrodata node scope

Next we can rework extractItemprops (and it's associated routine, extractItemprop) to call extractNewItemscope:

extractItemprop :: Node -> (Text, Text) -> Itemscope -> Itemscope
extractItemprop elt ("itemprop", name) scope =
  addProp name (extractHtml5Value elt scope) scope
extractItemprop _ _ scope = scope

extractItemprops :: Node -> Itemscope -> Itemscope
extractItemprops elt@(Element _ attrs _) scope =
  let scope' = extractNewItemscope elt scope
  in foldr (extractItemprop elt) scope' attrs
extractItemprops _ scope = scope

Now let's test them out on some simple html. To make doing this easier in ghci, let's first define a helper function to parse the html and get rid of the Either node (by throwing an exception in the Left (wrong) case), and a dummy Itemscope to use at the top-level:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

Let's try our new functions out. Be sure to first enable OverloadedStrings in ghci so that we can have the literal html string converted to the correct type for our helper function:

*Main> :set -XOverloadedStrings
*Main> extractNewItemscope (html "

quux

") topscope Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractNewItemscope (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]}

This seem to be working correctly. When we call extractNewItemscope with a node with itemscope, we built a new Itemscope object and store the "name"="quux" property there. And when the node doesn't have itemscope, we store "name"="quux" in the current Itemscope that's passed in (topscope in this case).

But wait a minute... In the first case above, we ended up dropping the passed-in scope on the floor. You can see this because the scope parameter is never used in extractNewItemscope's first function case. This isn't right, because that scope might already contain someone's "good stuff" that they're expecting to see in the final result. What we really need to do in extractNewItemscope is return both the previous scope (our state) and the newly constructed Itemscope object (our result value) if we do indeed construct one. Likewise, in the case where we don't construct one, then the updated scope will be both our state, and our Itemscope result value:

extractNewItemscope :: Node -> Itemscope -> (Itemscope, Itemscope)
extractNewItemscope elt@(Element _ attrs _) scope
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
    in (extractMicrodata elt newScope, scope)
extractNewItemscope node scope =
  let scope' = extractMicrodata node scope
  in (scope', scope')

Since we've changed the type of extractNewItemscope, this requires a similar change in extractItemprops. Now we need to receive both the result value and the updated state from extractNewItemscope and use the new Itemscope value whenever the itemprop keyword is seen:

extractItemprop :: Node -> Itemscope -> (Text, Text) -> Itemscope -> Itemscope
extractItemprop elt is ("itemprop", name) scope =
  addProp name (extractHtml5Value elt is) scope
extractItemprop _ _ _ scope = scope

extractItemprops :: Node -> Itemscope -> Itemscope
extractItemprops elt@(Element _ attrs _) scope =
  let (is, scope') = extractNewItemscope elt scope
  in foldr (extractItemprop elt is) scope' attrs
extractItemprops _ scope = scope

Let's see if this works:

*Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = []}

Hmmm... another problem. In the first case where there is no itemscope attribute things work properly. The topscope is updated to contain the "name"="quux" property as we expect. But in the second case when itemscope is present, there's another problem. We're only getting an empty topscope here. The reason for this is because there is no itemprop that connects the topscope to the inner Itemscope object. Let's add one and verify that this is indeed the problem:

*Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "connect", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]})}]}

This is what we're expecting to see -- the topscope now contains the "connect" property that connects it to the previously orphaned Itemscope containing "name"="quux". So this means that although extractItemprops appears to be working ok for the case where everything is connected, we have an oversight for the case when we have isolated Itemscopes that aren't connected. We need to return these as results too somehow.

The easiest way to do this might be to find the point in the code were we're about to drop an orphaned Itemscope, and store it in a list somewhere so that we can get it when the algorithm completes. This new list becomes another piece of state that we need to pass around.

In order to keep things tidy, let's introduce a State type and use this in all of our stateful functions. That will be a little clearer than making the result be a tuple. Here's our state which includes the current Itemscope that the algorithm is working on, which we had before, and the new list of orphaned Itemscopes:

type State = (Itemscope, [Itemscope])

Now we can rewrite our stateful code as follows:

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

extractItemprop :: Node -> Itemscope -> (Text, Text) -> State -> State
extractItemprop elt is ("itemprop", name) (scope, orphans) =
  (addProp name (extractHtml5Value elt is) scope, orphans)
extractItemprop _ _ _ state = state

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, state') = extractNewItemscope elt state
  in foldr (extractItemprop elt is) state' attrs
extractItemprops _ scope = scope

The problem now is that although we're passing our new result state around, we are never storing anything into it. As the algorithm is implemented, it isn't clear where this should happen... so I'll give you a hint. It happens in the case where the foldr in extractItemprops doesn't encounter an itemprop attribute. I guess foldr isn't what we want anymore then.

Now we could do another pass over the attributes to determine whether itemprop doesn't occur, or we could pass a boolean through the foldr accumulator and update it if anything is found. But let's make it easy on ourselves and say that the itemprop attribute can only occur once in an html node. That allows us to rewrite extractItemprops to use find instead of foldr to look for the itemprop attribute:

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, (scope, orphans)) = extractNewItemscope elt state
  in case lookup "itemprop" attrs of
    Just name -> (addProp name (extractHtml5Value elt is) scope, orphans)
    Nothing -> (scope, is:orphans)
extractItemprops _ state = state

Now in the case where we find something, we add a new property as before. But in the case were we don't, we throw the Itemscope object on the result list instead.

Let's wrap this up by reworking the last function we need, extractMicrodataFromDoc:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let initialState = (Itemscope (Just "top") [], [])
      (scope, orphans) = foldr extractItemprops initialState $ docContent doc
  in scope:orphans

And finally, our debugged, working code:

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

type State = (Itemscope, [Itemscope])

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, (scope, orphans)) = extractNewItemscope elt state
  in case lookup "itemprop" attrs of
    Just name -> (addProp name (extractHtml5Value elt is) scope, orphans)
    Nothing -> (scope, is:orphans)
extractItemprops _ state = state

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let initialState = (Itemscope (Just "top") [], [])
      (scope, orphans) = foldr extractItemprops initialState $ docContent doc
  in scope:orphans

-- Debugging stuff:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

Now are we done? There's always one more thing! In part 5 we'll see how to use the "m" word to simplify this further.

Restringing a Racket with Haskell (pt 3)

Now we tackle the tricky part. extract-microdata uses the dreaded hobgoblin of functional programming: mutable state! Annoyingly convenient little boxes with updatable values, and mutable struct fields. One problem with all this mutable stuff is that it allows the control flow to follow a different path from the state changes. To turn this into functional Haskell code, we'll need to do a bit of untangling so that state can travel along with the control flow, and be replace with new state whenever a change is needed.

Here's the Racket function again, for reference:

(define (extract-microdata root [scope (void)] [result (box '())])
  (let ([new-scope (void)])
    (when (h:html-full? root)
      (for ([element (h:html-full-content root)])
           (match element
             [(h:html-element (list (x:attribute start stop key value) ...))
              (let ([attrs (for/list ([k key] [v value])
                                     (cons k v))])
                (when (assoc 'itemscope attrs)
                  (set! new-scope
                        (itemscope (cdr (or (assoc 'itemtype attrs)
                                            (cons "" #f))) '()))
                  (when (void? scope)
                    (set-box! result (cons new-scope (unbox result)))))

                (for ([prop attrs]
                      #:when (eq? (car prop) 'itemprop))
                     (set-itemscope-properties!
                      scope (cons (itemprop (cdr prop)
                                            (extract-html5-value element
                                              new-scope))
                                  (itemscope-properties scope)))))
              (extract-microdata element new-scope result)]
             [else #f]))))
  (unbox result))

This function iterates over all the content nodes of the document, and when they are elements it potentially creates a new itemscope structs and pushes them onto/into the result box. It also iterates over all the attributes of the elements looking for itemprop attributes. When it finds them it extracts their HTML5 value and pushes it into the properties field of the current itemscope object. Finally it recurses over each of the element's children. Perhaps we can break this down into several more manageable pieces.

First, let's focus on recognizing itemscope attributes and creating new Itemscope values. Let's call this piece extractNewItemscope. Since this function can potentially push a new Itemscope on a stack, it needs to take a list of Itemscope as input, and produce another one as output (linked lists in Haskell make perfect stacks):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt scopes = undefined

Again, we need to make function cases based on whether the Node is an Element or something else, and we need to test for when the attributes of the element contain an "itemscope" key:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes =
  if isJust $ lookup "itemscope" attrs
  then undefined
  else scopes
extractNewItemscope _ scopes = scopes

Since the first thing we're doing in the first function case is another test, we can use a guard clause to further refine it, letting the second case pick up the work the else clause was doing:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    undefined
extractNewItemscope _ scopes = scopes

Now the body of this function only needs to build a new Itemscope value and push it onto the scopes stack. The type of this Itemscope is the value of the "itemtype" attribute (if any), and the properties start out empty. The colon (:) is the "cons" operator that pushes a new value onto the left-hand side of an existing list (scopes here):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

Next, let's define a function to update the properties of an Itemscope when we see an "itemprop" attribute. To make this clearer, we'll first define a helper function to add a new property to an existing Itemscope: (since it's functional, it returns a new Itemscope):

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

Next we can use this to update the the top-most Itemscope on the stack when we see an "itemprop" attribute:

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

The first line of the let creates a new Itemprop value with the name and HTML5 value. The second line pushes this new Itemprop onto the list of existing properties. The third line then creates a new Itemscope value that will replace the one we're starting with. Notice how we pattern matched the entire scope stack with (scope:scopes) so that we can throw the old scope away and return the tail of the list with newScope pushed on the front.

Oops -- when we compile this we see that another tweak is needed. We previously had defined the value field of the Itemprop data type as Text, but in part 2 we discovered that values needed to be of type MicrodataValue. Let's fix that:

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

One nice thing about Haskell modules (the stuff within a single file) is that all the definitions are mutually recursive. That means that you can list functions in top-down dependency order, bottom-up, or however you want to organize things. The same goes for types, which is why we can change this field to MicrodataValue even though it's defined below in the file. (I spent several years programming in O'Caml, and it requires things to be defined in bottom-up dependency order. Believe me, this is better.)

Next, let's define a function that's similar to extractItemprop, but that works with all attributes of an element, not just one. We might be able to use map, but since we want each attribute to be able to update the top-most Itemscope on the stack we need to thread the stack (scopes) through all these map operations. That's exactly what a "fold" does -- it passes an accumulator parameter along and expects you to accumulate your result into the accumulator instead of just returning it like map does. map is a fold where the accumulation is done for you -- the function results are accumulated into a list.

We'll use foldr to do this. foldr folds from right-to-left. This might seem backwards, but its usually the right thing in Haskell. The reason it's often the right thing is because lists are built by pushing newer elements on the left, so if you want to preserve the order of your results with respect to the original list, you need to start from the right. Also, if you're relying on Haskell's laziness to avoid computing values from the tail of the list that never used, then foldr is exactly what you want. (On the other hand, if you don't care about preserving the left-to-right order, then foldl' is way more efficient. See Foldr Foldl Foldl' if you want to understand this better.)

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

That's it for extractItemprops. The scopes stack is passed to each extractItemprop call, and a new scope stack is returned that will be passed to the next call, or returned as the final result. Note that rather than just calling extractItemprop directly we doing a bit of currying magic here, calling (extractItemprop elt) instead. This gives each call the same elt parameter, and the partial application returns a new function that fits the requirement for foldr.

Now we're at a place where we can put these pieces together to write a functional version of extract-microdata. Since it's functional we'll want to pass our state along at every step, and return new state as a result. And since it will want to operate on every child of an element, we'll want to use our frind foldr again. The only tricky part if figuring out what each step performed by foldr actually needs to do:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes = undefined
extractMicrodata _ scopes = scopes

Hopefully at this point, there are no surprises here. The extractMicrodata function looks a lot like extractItemprops that we've seen already. But why name the local function "go"? That's just a common Haskell idiom (in Zetalisp it was always "frob", so this is just another step forward).

The body of go just needs to thread our state (scopes) through each of the steps:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          let scopes' = extractNewItemscope child scopes
              scopes'' = extractItemprops child scopes'
          in extractMicrodata child scopes''
extractMicrodata _ scopes = scopes

First extractNewItemscope is called which potentially pushes a new Itemscope on the stack. We'll call its result scopes' (using a "prime" is just another idiom -- the single quote is a perfectly valid identifier character in Haskell). Then extractItemprops is called with scopes' and returns scopes''. Finally we recursively call extractMicrodata for the child so we can process its children, passing scopes''. This ultimately returns a new scope stack as our result.

Of course we're never happy with this much verbosity in Haskell, so we can write this a little more concisely once again:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          extractMicrodata child $
            extractItemprops child $
              extractNewItemscope child scopes
extractMicrodata _ scopes = scopes

How's that? Well, let me counter your question with how's this?:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          (extractMicrodata child .
           extractItemprops child .
           extractNewItemscope child) scopes
extractMicrodata _ scopes = scopes

We're using the function composition operator, dot (.), to compose the 3 functions that all get passed scopes. Note that the composition operator in Haskell works by calling the right-most function first, and passing the result to the left most -- the way mathematicians do it, but perhaps backwards from what we expect from most programming languages. It is defined as:

(.) :: (b -> c) -> (a -> b) -> a -> c
(f . g) x = f (g x)

One more tweak:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

Now we're dropping the application of the composed functions to scopes, and dropping scopes from the parameter list. Instead, we're just returning the composed function itself since it has the correct type. We're moving into what they call the "point-free style" here (sometimes referred to as "pointless style"). Maybe this is getting too carried away.

Wrap it up...

The only thing we need to do now is put a bow on it, and test it out.

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = case parseHTML "" test1 of Left msg -> Left msg Right tree -> Right $ extractMicrodataFromDoc tree

extractMicrodataFromDoc simply folds extractMicrodata over all the document's content. The variables test1 and result1 are things I've thrown in for testing purposes. You shouldn't leave them in production code, but if you evaluate result1 in the *haskell* buffer, you'll see our handiwork:

*Main> result1
Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"},Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = []}]

Of course, there are always improvements to be made. Note how in the definition of result1 we're reaching into the Either data structure and in the Right case calling another function on what's there. Effectively, we're doing an operation that might be generalized like this:

mapEither :: (a -> b) -> Either c a -> Either c b
mapEither f (Either l r) = Either l (f r)

Well, there's a very common generalization of this pattern called fmap. Instead of operating on an Either data type, it operates on anything that supports the Functor type class.

fmap :: Functor f => (a -> b) -> f a -> f b

The Functor type class is basically one that allows functions applied to data structures to be applied to that data structure's elements. We can use that to simplify the result1 definition:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

fmap takes a bit to get your head around, but it's really powerful and useful. If you look at the definition of mapEither that I gave you can get some idea of how it works. In fact, given that definition, we could have defined the Functor instance for Either like this (if we didn't have a definition already):

instance Functor (Either a) where
  fmap = mapEither

Why (Either a) in the above? Great Good explains it better than I can.

Review

Where are we now?

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

So we're done? Well, if you're an astute reader, you've already realized that this code doesn't actually work as intended. Stay tuned for part 4 where we'll do a little debugging.

Restringing a Racket with Haskell (pt 2)

In part 1, we wasted a lot of time traipsing through Hackage, goofing off with Emacs, Hoogling (yes, that is a word!), and writing undefined functions. We're going to have to pick up the pace if we want to get this done before dinner.

Let's take a look at Alex's next Racket function, extract-html5-value. In case you don't have it memorized:

(define (extract-html5-value element scope)
  (when (h:html-full? element)
    (cond [(html-itemscope? element) scope]
          [else (apply string-append
                       (flatten
                        (map (lambda (c)
                               (cond
                                [(x:pcdata? c)
                                 (list (x:pcdata-string c))]
                                [else (list)]))
                             (h:html-full-content element))))])))

It's fairly clear that this function takes an HTML element as its first parameter, but what is scope? Scope gets returned if html-itemscope? returns true. Great -- that tells us nothing. Oh, but type-wise, it must be the same type as the second case in the cond, and that looks like it returns a string. (Gee, if we had a typed language, this would be a whole lot simpler.) So we want a Haskell function like this:

extractHtml5Value :: Node -> Text -> Text
extractHtml5Value elt scope = undefined

Or do we? The thing is, if you examine the rest of the Racket program, you'll find that what get passed to extract-html5-value is not a string, but the value of new-scope (from line 46) which is an itemscope structure (from line 36). How can we have an itemscope in one case, and a string in the other?

Answer: Alex is implementing this (or at least part of it). In one case (the first one) a value is an itemscope attribute, but in another (the last one) it's a piece of text. This is a job for algebraic data types!

An algebraic data type is a fancy name for a union. Not your Pipelayer's Local 57, but a type that has a number of different cases. In Haskell, we can define algebraic data types by defining a data type with more than one constructor. Let's create a MicrodataValue data type this way:

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

Here we have 2 constructors: ItemscopeValue that takes an Itemscope as a parameter, and TextValue that takes Text as a parameter (the vertical bar separates them). You can think of these constructors as the tags in a tagged union... because that's exactly what they are. You can discriminate these 2 cases by pattern matching. In fact, we've seen this already with the Node data type.

But how did Racket get away with not needing something like this? It's because values in Racket (like all Lisp derivatives) have types at runtime. (In Haskell, types go away after the program is compiled.) Racket's runtime type system can distinguish structs from strings, so it 's happy. However, if you wanted to distinguish strings used for different purposes, you'd have to resort to some sort of home-grown tagging (maybe using another structure type to wrap them). In Haskell we're forced to create wrappers around these different cases whenever we want to to distinguish alternatives at runtime -- by using algebraic data types.

So I guess what we really want is a extractHtml5Value function that looks like this:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope = undefined

but without so much undefined stuff. The first case is easy:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope =
  if hasItemscope elt
  then ItemscopeValue scope
  else undefined

Since the first thing we're doing here is an if statement, we can convert this into a couple of top-level function cases that use a guard clause to restrict whether the first case triggers. This is just a little cleaner:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope = undefined

In the first case we've just wrapped the ItemscopeValue constructor around the Itemscope value we were given. The second case in the Racket code is doing some fancy footwork to concatenate the PCDATA of the element together. To do this in Haskell, we can map over the element's children, and if the child is a TextNode (from the Text.XmlHtml package -- part of the Node algebraic data type) return the text. If it isn't a TextNode, just return some empty text. This will give us a list of Text values that we can concatenate:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ concat $ map getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

Unfortunately, this doesn't work. What's going on?

/Users/warrenharris/projects/racket-hs/microdata.hs:57:29:
    Couldn't match expected type `[a0]' with actual type `Text'
    Expected type: [[a0]]
      Actual type: [Text]
    In the second argument of `($)', namely
      `map getText $ elementChildren elt'
    In the second argument of `($)', namely
      `concat $ map getText $ elementChildren elt'
Failed, modules loaded: none.

We're looking at you, map. Actually, map's return type is what we're expecting, [Text]. It seems that the expected type isn't what we want -- a list of lists of things. (a0 is some type variable that isn't (yet) instantiated. Type variables always start with lower-case letters, whereas concrete types are always upper-case.) So the problem must be concat. What's it's type?

Prelude> :t concat
concat :: [[a]] -> [a]

So this isn't at all what we're expecting. We want something that converts [Text] -> Text. Let's ask Hoogle: [Text] -> Text. (BTW, you can enable Hoogle in ghci if you like.) Doh. We want the concat in Data.Text. Let's add another import:

import Data.Text (Text, concat)

Fail.

/Users/warrenharris/projects/racket-hs/microdata.hs:57:20:
    Ambiguous occurrence `concat'
    It could refer to either `Prelude.concat', imported from Prelude
                          or `Data.Text.concat',
                             imported from Data.Text at /Users/warrenharris/projects/racket-hs/microdata.hs:6:25-30
Failed, modules loaded: none.

Let's hide the version of concat coming from the Prelude module (stuff that's there if you don't as for anything special):

import Prelude hiding (concat)

Works. But we can do better. Concatenating lists that result from mapping is a common pattern, and there's probably a better way to do this. Let's ask Hoogle again (by giving the overall type of the concat $ map combined functions): (a -> Text) -> [a] -> Text. Hmmm... foldMap looks vaguely like something that would fit:

foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m

but it requires a Foldable t type function applied to a (what does that mean) whereas we asked for [a], a list of whatever. Could it be that the list type is a Foldable type function? Could be. And we asked for the result to be Text, but Hoogle gave us something that returns a Monoid (wtf?). Is Text a Monoid? I guess we can give it a shot. Let's import:

import Data.Foldable (foldMap)

and try it out:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

Yeah! Hoogle is awesome.

Review

Where are we now?

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

Ready for part 3?

Restringing a Racket with Haskell

My friend Alex is an Emacs lisp hacker, and is in the process of teaching himself Racket. I want him to learn Haskell. Selfish of me, I know. I also know that programming languages are religions for otherwise rational people. Well, I'm done with keeping my faith to myself! Welcome to my first blog post.

But why would I want to waste my time evangelizing some esoteric language like a Thelemite in the Java Bible Belt? It's because I believe that Haskell has made me a better programmer. It's true. Even when I'm not using Haskell, there's some residual Haskellishness that works its way into all my efforts, helping me to break it down, keep it organized, and sometimes even get it right the first time. I don't know if I'll be able to get any of that across here, but let's give it a try. At the very least, I hope to get across a Haskell workflow that's productive and fun.

So to learn Racket, Alex assigned himself the task of extracting microdata from html5. If this sounds unappealing to you, (a) you're probably not alone, and (b) maybe you can think of it this way: It's just a task of transforming one sort of ugly tree into another (which is 67.5% of what programming is all about). Tree transformation is right up Haskell's alley. Here's the Racket code that Alex sent me:

(require (prefix-in h: html)
         (prefix-in x: xml))

(struct itemprop (name value) #:transparent)
(struct itemscope ([itemtype #:mutable]
                   [properties #:mutable]) #:transparent)

(define (html-itemscope? element)
  (if (h:html-element? element)
    (findf (lambda (a) (eq? (x:attribute-name a) 'itemscope))
           (h:html-element-attributes element))
  #f))

(define (extract-html5-value element scope)
  (when (h:html-full? element)
    (cond [(html-itemscope? element) scope]
          [else (apply string-append
                       (flatten
                        (map (lambda (c)
                               (cond
                                [(x:pcdata? c)
                                 (list (x:pcdata-string c))]
                                [else (list)]))
                             (h:html-full-content element))))])))

(define (extract-microdata root [scope (void)] [result (box '())])
  (let ([new-scope (void)])
    (when (h:html-full? root)
      (for ([element (h:html-full-content root)])
           (match element
             [(h:html-element (list (x:attribute start stop key value) ...))
              (let ([attrs (for/list ([k key] [v value])
                                     (cons k v))])
                (when (assoc 'itemscope attrs)
                  (set! new-scope
                        (itemscope (cdr (or (assoc 'itemtype attrs)
                                            (cons "" #f))) '()))
                  (when (void? scope)
                    (set-box! result (cons new-scope (unbox result)))))

                (for ([prop attrs]
                      #:when (eq? (car prop) 'itemprop))
                     (set-itemscope-properties!
                      scope (cons (itemprop (cdr prop)
                                            (extract-html5-value element
                                               new-scope))
                                  (itemscope-properties scope)))))
              (extract-microdata element new-scope result)]
             [else #f]))))
  (unbox result))

(extract-microdata (h:read-html (open-input-string
              (string-append "
" "
" "

quux

" "
" " Foobar" "
"))))

It's not complete by any means but it's enough code to get your head around Racket. (Hmmm... This doesn't seem like your father's Scheme that I learned way back when... MIT 6001 and all that. Hopefully I can fake my way through it.) I'm not going to explain all of his code here, but I'll attempt to explain it as we go along. Also, I'm going to take this very slowly, so if you're an expert and see some obvious mistake, or point-free (pointless) simplification, just keep it to yourself. I'll just say I did it for expository purposes, and leave it at that.

Getting started with Haskell

First, go download the Haskell Platform for your platform, and get it installed. That should give you the command-line tools we'll be using here (ghc, ghci, cabal, etc). Then, since you're an Emacs user, install the haskell-mode described here. (If you're not an Emacs user, I pity you, but you can just run ghci in the shell.)

Of course you've already read Learn You a Haskell, for Great Good! You haven't? Good god man! I'm not going to review all that stuff about creating lists and defining functions.

I will tell you how to get going in Emacs though. Start by creating a Haskell file (say "microdata.hs") and make sure it's in haskell-mode. Type in the following masterpiece:

foo = 1 + 1

Now type C-c C-l. That will create an inferior Haskell mode buffer named *haskell*, start the interpreter running, and will load your program in, checking syntax and compiling as it goes. You can hop over to that buffer and type in foo at the *Main> prompt. You'll get your answer. Not so inferior after all, is it?

[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )
Ok, modules loaded: Main.
*Main> foo
2
*Main> 

If you happen to get a Haskell error when you evaluate a buffer like this, you can use C-x ` to jump your cursor to the next error message.

Now it's time to take a swing at this Racket.

Parsing HTML

The first 2 lines of the Racket file load Racket's HTML parser. We'll need to find a Haskell equivalent. Since I don't know about these things off the top of my head, I made a stop at hackage to see what might be available. If you click the Packages tab, you'll get a long list of packages that people have uploaded. Turns out there are quite a few HTML packages. After staring at them for a bit, I chose xmlhtml -- partly because I saw that a couple of other nice packages depend on it, and partly because I could find a parseHTML function in the first documentation file I opened.

To make this package available in your Haskell environment, you'll need to use the cabal package manager to download and build it. In your shell:

$ cabal install xmlhtml

This will grind for a while, building that package and all its dependencies. Fortunately, you only have to do this one time. You'll probably want to kill your *haskell* buffer at this point so that you'll start a new ghci session and pick up the newly installed packages.

Now we're ready to include xmlhtml in our "microdata.hs" file. Get rid of the foo = 1 + 1 crap, and replace it with this:

import Text.XmlHtml

tree = parseHTML "" "
"

This says, import everything from Text.XmlHtml and make its definitions immediately available at the top-level of our file. (Later we'll do some qualified imports to avoid some name collisions.) Then define the variable tree to be the parsed version of the HTML string. Blam:

Prelude> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )

/Users/warrenharris/projects/racket-hs/microdata.hs:3:21:
    Couldn't match expected type `Data.ByteString.Internal.ByteString'
                with actual type `[Char]'
    In the second argument of `parseHTML', namely
      `"
"' In the expression: parseHTML "" "
" In an equation for `tree': tree = parseHTML "" "
" Failed, modules loaded: none. Prelude>

Our first error. What going on here is that parseHTML expects a ByteString as an argument (binary data), but we gave it a regular String. Regular strings are lists of characters, which is different from an array of bytes, and BTW, Haskell's String type is a synonym for [Char].

Fortunately, we can roll out another bit of magic here called OverloadedStrings (a language feature). By turning this on, we can make the compiler automatically translate strings into ByteString (or the Text type, which we'll also make use of in this code). You can turn on OverloadedStrings by adding a declaration at the top of your file:

{-# LANGUAGE OverloadedStrings #-}

import Text.XmlHtml

tree = parseHTML "" "
"

Now if we evaluate the file, and then examine tree in the *haskell* buffer, we can see our parse tree:

Prelude> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )
Ok, modules loaded: Main.
*Main> tree
Right (HtmlDocument {docEncoding = UTF8, docType = Nothing, docContent = [Element {elementTag = "div", elementAttrs = [("itemtype","foo")], elementChildren = []}]})
*Main> 

This parse tree is basically a bunch of constructors -- mostly from the Text.XmlHtml package although Right is kind of mysterious. If we examine the type of parseHTML we see that its result is Either String Document, i.e. either a String, or a Document. Haskell provides this type to cover the cases where you can only have one of 2 things. Then if the constructor Left is returned, you get the type on the left (String in this case), or if Right is returned, you get the type on the right (Document in this case). Haskell often uses Either to designate success/fail situations. Left is used for failures, usually returning an error message string, and Right is used for success, returning the parse tree here. (I've often though the 2 constructors of Either should be called Right and Wrong.)

We can also see from the parse tree that's return the names of several other constructors and record accessor functions that we'll need, e.g. HtmlDocument, docContent, Element, elementAttrs, etc. We'll use these in the next section.

Defining data types

Let's make some equivalent data structure definitions for the itemprop and itemscope structs next. There are several ways to do this in Haskell. The simplest might be to write this (for Itemprop):

data Itemprop = Itemprop Text Text

This creates a data structure definition which introduces a new type named Itemprop and a new constructor function (also called Itemprop) that takes 2 parameters where both of them are Text values. (We've used Text instead of String because it's more efficient, and because it's what Text.XmlHtml uses.) When we want to create an instance of this type, we would just write something like Itemprop "text1" "text2" and this would return a new Itemprop value. To access the 2 parameters, we would pattern match against Itemprop t1 t2, binding variables t1 and t2. Nice... but we can do better. Let's write our definition like this, using a record type:

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

This not only defines the type and same constructor function we mentioned above (completely identical), but at the same time defines some accessor functions that extract the 2 fields. By convention, we prefix the name these accessors with the name of the type (lower-cased, of course) so that we avoid collisions with similar accessor functions of other data types.

But what's this deriving (Show, Eq) stuff? This is Haskell's Scrap Your Boilerplate magic that automatically creates instances of the specified type classes. Show defines a show function which can stringify instances of the data type. It's extremely handy when using the interpreter (so you can print out these kinds of objects as results). And Eq defines the == and != functions that let values of these data types be structurally compared. There are a bunch of other things you can derive in this way, avoiding typing and mistakes.

So let's define our Itemscope type in a similar way:

data Itemscope = Itemscope {
  itemscopeType :: Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

and since we're using the Text type, we mustn't forget to import it too (this has to go at the top of the file with the rest of the imports):

import Data.Text (Text)

Here we've said that we only want the Text type constructor from this package. If we want more functions to be imported from this package, we can include their names in the parens.

Our first function

Now let's tackle the first Racket function, html-itemscope?. We'll use the name hasItemscope in Haskell since we can't use hyphens and question marks. Here's a first attempt at a definition:

hasItemscope elt = undefined

Huh? How can this possibly be a definition like the Racket one? What I've done here is used the undefined value for the result of this function. undefined is a magical value that can have any Haskell type -- so your code always compiles. (Go ahead, C-c C-l and you'll see that I'm not kidding.) This can be quite helpful for stubbing out something you want to deal with later. Only at runtime if you encounter an undefined value do you get:

*** Exception: Prelude.undefined

So undefined gives you something that you thought you could only get with dynamically typed code: the ability to defer writing code that you don't need to call yet, and a way to leave lurking time bombs in your code! (It is quite handy though.)

Let's try a little harder though, and deal with the first conditional h:html-element?. In the Text.XmlHtml package, elements vs other types of data are dealt with by the Node type which is an abstract data type with several cases. One of those is Element. We can use pattern matching to test for elements like this:

hasItemscope elt = case elt of
  Element _ _ _ -> undefined
  otherwise -> False

Here Element _ _ _ matches the Element constructor and its 3 arguments (the record fields). Since we don't care about these fields, we use underscores as wildcard "don't care" variables. And since the first thing we do here is match one of the arguments for a data type constructor, there's a more clever way to write this -- with several top-level function cases:

hasItemscope elt@(Element _ _ _) = undefined
hasItemscope _ = False

Here the elt@(Element _ _ _) syntax matches the elt parameter to the Element constructor and binds elt only if it matches and evaluates the body of the function case. If the first case doesn't match, it moves on to the next one. In fact, if you don't use wildcards, the compiler will even tell you if you've covered all the possible cases.

Next, we want to search for an "itemprop" attribute in the element's list of attributes. In the Data.List package there's a find function that's very similar to Racket's findf that takes a test function (we'll need to import Data.List (find) to get it). Let's rewrite the first case to call it:

hasItemscope elt@(Element _ _ _) =
  find hasItemscopeAttr (elementAttrs elt)
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

Instead of using a lambda like in the Racket code, we've defined a local function named hasItemscopeAttr (again with an undefined body) using a where clause (where at the bottom is just using a let at the top). However, if we try to compile this, we'll find a problem already:

*Main> :load "/Users/warrenharris/projects/racket-hs/microdata.hs"
[1 of 1] Compiling Main             ( /Users/warrenharris/projects/racket-hs/microdata.hs, interpreted )

/Users/warrenharris/projects/racket-hs/microdata.hs:29:19:
    Couldn't match expected type `Maybe (Text, Text)'
                with actual type `Bool'
    In the expression: False
    In an equation for `hasItemscope': hasItemscope _ = False
Failed, modules loaded: none.
Prelude> 

This is because the type checker inferred an inconsistency in the two cases. The first case returns a Maybe (Text, Text) (because this is what find returns), but the second case returns a Bool (what we want). We can fix this with a bit more pattern matching:

hasItemscope elt@(Element _ _ _) =
  case find hasItemscopeAttr (elementAttrs elt) of
    Just _ -> True
    Nothing -> False
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

However, since we're translating a Maybe type into a Bool, you might guess that this is quite common and there's a shortcut for it. Let's see if we can find it. This is were the magic of Hoogle comes in. Hoogle is a way to search all Haskell functions uploaded to Hackage by their types. If we type Maybe a -> Bool into the search box, we'll find the first result is:

which is exactly what we want. Let's import Data.Maybe (isJust) and rewrite the function to use it:

hasItemscope elt@(Element _ _ _) =
  isJust (find hasItemscopeAttr (elementAttrs elt))
  where hasItemscopeAttr _ = undefined
hasItemscope _ = False

Now we can turn our attention to hasItemscopeAttr. First, what kind of data is passed as that first argument? We can use the *haskell* buffer to ask:

*Main> :t find
find :: (a -> Bool) -> [a] -> Maybe a
*Main> 

Or you can put your cursor on find in your file buffer, and type C-c t.

Since our list is coming from elementAttrs, we know this returns an association list of Text values ([(Text, Text)]). So hasItemscopeAttr should get a single pair. We can use pattern matching in its argument list to isolate the case we're looking for -- when the key is "itemscope":

hasItemscope elt@(Element _ _ _) =
  isJust (find hasItemscopeAttr (elementAttrs elt))
  where hasItemscopeAttr ("itemscope", _) = True
        hasItemscopeAttr _ = False
hasItemscope _ = False

Well, that's pretty good, except we can do one better. When the find function is used to search for a single item, well, that's exactly what the lookup function from the same package is all about. So we didn't really need all this find and hasItemscopeAttr nonsense after all. We can write things much more simply like this:

hasItemscope elt@(Element _ _ _) =
  isJust (lookup "itemscope" (elementAttrs elt))
hasItemscope _ = False

This probably means the Racket code can be simpler too.

We're done... except for one last thing. Haskellers hate parentheses! Absolutely hate them. And they'll go to extreme lengths to avoid them, whenever possible. (Such a shame, given how much Haskell owes to Lisp.) Anyway, the primary tool they use to avoid them is a little infix operator named $. The $ operator just applies the thing on the left-hand side (a function) to the thing on the right-hand side (a value) -- which is completely uninteresting except for the fact that the operator precedence of $ allows a set of parens be avoided. I am not kidding you!

So given the line of code isJust (lookup "itemscope" (elementAttrs elt)), we can use one $ operator to rewrite it to this: isJust $ lookup "itemscope" (elementAttrs elt), and then another $ to rewrite it to this: isJust $ lookup "itemscope" $ elementAttrs elt. $ takes as much stuff as it can on the right-hand side, which is why using 2 $ operators as we've done here parses with the same precedence as our original set of parens (i.e it's right-associative).

And another thing... Good Haskellers always declare the types of their functions. Although the type checker is perfectly happy to figure these out automatically, us humans are much more lazy and prefer to have someone write it down for us. So think of the next guy and write it down for them. Putting it all together, the "professional" version of our hasItemscope function would look like this:

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

Review

Did I say we were done? We've only written 2 types and 1 function so far. But let's review what our code looks like so far:

{-# LANGUAGE OverloadedStrings #-}

import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

For the rest, I'm afraid you're going to have to move on to part 2.