February 25, 2012

Restringing a Racket with Haskell (pt 2)

In part 1, we wasted a lot of time traipsing through Hackage, goofing off with Emacs, Hoogling (yes, that is a word!), and writing undefined functions. We're going to have to pick up the pace if we want to get this done before dinner.

Let's take a look at Alex's next Racket function, extract-html5-value. In case you don't have it memorized:

(define (extract-html5-value element scope)
  (when (h:html-full? element)
    (cond [(html-itemscope? element) scope]
          [else (apply string-append
                       (flatten
                        (map (lambda (c)
                               (cond
                                [(x:pcdata? c)
                                 (list (x:pcdata-string c))]
                                [else (list)]))
                             (h:html-full-content element))))])))

It's fairly clear that this function takes an HTML element as its first parameter, but what is scope? Scope gets returned if html-itemscope? returns true. Great -- that tells us nothing. Oh, but type-wise, it must be the same type as the second case in the cond, and that looks like it returns a string. (Gee, if we had a typed language, this would be a whole lot simpler.) So we want a Haskell function like this:

extractHtml5Value :: Node -> Text -> Text
extractHtml5Value elt scope = undefined

Or do we? The thing is, if you examine the rest of the Racket program, you'll find that what gets passed to extract-html5-value is not a string, but the value of new-scope (from line 46), which is an itemscope structure (from line 36). How can we have an itemscope in one case, and a string in the other?

Answer: Alex is implementing the HTML5 microdata spec (or at least part of it). In one case (the first one) a value is an itemscope, but in another (the last one) it's a piece of text. This is a job for algebraic data types!

An algebraic data type is a fancy name for a union. Not your Pipelayer's Local 57, but a type that has a number of different cases. In Haskell, we can define algebraic data types by defining a data type with more than one constructor. Let's create a MicrodataValue data type this way:

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

Here we have 2 constructors: ItemscopeValue that takes an Itemscope as a parameter, and TextValue that takes Text as a parameter (the vertical bar separates them). You can think of these constructors as the tags in a tagged union... because that's exactly what they are. You can discriminate these 2 cases by pattern matching. In fact, we've seen this already with the Node data type.
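To make the pattern-matching point concrete, here is a minimal sketch. The record types are copied from the review listing at the end of this post; describe is a hypothetical helper that is not part of Alex's program or of this port. It has one equation per constructor, and the constructor name in the pattern is the tag being checked.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T

-- Record types as they appear in the review listing.
data Itemprop = Itemprop
  { itempropName  :: Text
  , itempropValue :: Text
  } deriving (Show, Eq)

data Itemscope = Itemscope
  { itemscopeType       :: Maybe Text
  , itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

-- Hypothetical helper (an assumption for illustration only).
-- Each equation matches exactly one constructor of MicrodataValue.
describe :: MicrodataValue -> Text
describe (ItemscopeValue scope) =
  maybe "untyped itemscope" (T.append "itemscope of type ") (itemscopeType scope)
describe (TextValue text) =
  T.append "text: " text
```

If you leave out one of the two equations and compile with warnings turned on, GHC can tell you which constructor you forgot.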

But how did Racket get away with not needing something like this? It's because values in Racket (like all Lisp derivatives) have types at runtime. (In Haskell, types go away after the program is compiled.) Racket's runtime type system can distinguish structs from strings, so it's happy. However, if you wanted to distinguish strings used for different purposes, you'd have to resort to some sort of home-grown tagging (maybe using another structure type to wrap them). In Haskell we're forced to create wrappers around these different cases whenever we want to distinguish alternatives at runtime -- by using algebraic data types.

So I guess what we really want is an extractHtml5Value function that looks like this:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope = undefined

but without so much undefined stuff. The first case is easy:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope =
  if hasItemscope elt
  then ItemscopeValue scope
  else undefined

Since the first thing we're doing here is an if expression, we can convert this into a couple of top-level function cases that use a guard clause to restrict whether the first case triggers. This is just a little cleaner:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope = undefined
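If the guard trick is new to you, here is the same transformation on a toy function. This is only a sketch: labelIf and labelGuard are made-up names, and Int stands in for Node so that nothing needs to be imported. When a guard evaluates to False, Haskell falls through to the next equation, which is what makes the two versions equivalent.

```haskell
-- Hypothetical stand-in example: the same decision written two ways.

-- Version 1: a single equation with an if expression.
labelIf :: Int -> String
labelIf n =
  if n < 0
  then "negative"
  else "non-negative"

-- Version 2: two equations. The first fires only when its guard holds;
-- otherwise matching falls through to the second equation.
labelGuard :: Int -> String
labelGuard n | n < 0 = "negative"
labelGuard _ = "non-negative"
```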

In the first case we've just wrapped the ItemscopeValue constructor around the Itemscope value we were given. The second case in the Racket code is doing some fancy footwork to concatenate the PCDATA of the element together. To do this in Haskell, we can map over the element's children, and if the child is a TextNode (from the Text.XmlHtml module -- part of the Node algebraic data type) return the text. If it isn't a TextNode, just return some empty text. This will give us a list of Text values that we can concatenate:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ concat $ map getText $ elementChildren elt
  where getText (TextNode text) = text
        getText _ = ""

Unfortunately, this doesn't work. What's going on?

/Users/warrenharris/projects/racket-hs/microdata.hs:57:29:
    Couldn't match expected type `[a0]' with actual type `Text'
    Expected type: [[a0]]
      Actual type: [Text]
    In the second argument of `($)', namely
      `map getText $ elementChildren elt'
    In the second argument of `($)', namely
      `concat $ map getText $ elementChildren elt'
Failed, modules loaded: none.

We're looking at you, map. Actually, map's return type is what we're expecting, [Text]. It seems that the expected type isn't what we want -- a list of lists of things. (a0 is some type variable that isn't (yet) instantiated. Type variables always start with lower-case letters, whereas concrete types always start with upper-case letters.) So the problem must be concat. What's its type?

Prelude> :t concat
concat :: [[a]] -> [a]

So this isn't at all what we're expecting. We want something that converts [Text] -> Text. Let's ask Hoogle: [Text] -> Text. (BTW, you can enable Hoogle in ghci if you like.) Doh. We want the concat in Data.Text. Let's add another import:

import Data.Text (Text, concat)

Fail.

/Users/warrenharris/projects/racket-hs/microdata.hs:57:20:
    Ambiguous occurrence `concat'
    It could refer to either `Prelude.concat', imported from Prelude
                          or `Data.Text.concat',
                             imported from Data.Text at /Users/warrenharris/projects/racket-hs/microdata.hs:6:25-30
Failed, modules loaded: none.

Let's hide the version of concat coming from the Prelude module (stuff that's there if you don't ask for anything special):

import Prelude hiding (concat)
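A common alternative to hiding the Prelude name is to import Data.Text qualified, so that both versions of concat stay available under different names. A minimal sketch, where joinFragments and flattenLists are hypothetical helpers made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper: Data.Text's concat, reached through the T prefix.
joinFragments :: [Text] -> Text
joinFragments fragments = T.concat fragments

-- Hypothetical helper: Prelude's concat is still in scope, unqualified,
-- for ordinary lists of lists.
flattenLists :: [[Int]] -> [Int]
flattenLists = concat
```

We'll stick with the hiding version here, since it leaves the code we already wrote untouched.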

Works. But we can do better. Concatenating lists that result from mapping is a common pattern, and there's probably a better way to do this. Let's ask Hoogle again (by giving the overall type of the concat $ map combined functions): (a -> Text) -> [a] -> Text. Hmmm... foldMap looks vaguely like something that would fit:

foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m

but it requires a Foldable t type function applied to a (what does that mean?) whereas we asked for [a], a list of whatever. Could it be that the list type is a Foldable type function? Could be. And we asked for the result to be Text, but Hoogle gave us something that returns a Monoid (wtf?). Is Text a Monoid? I guess we can give it a shot. Let's import:

import Data.Foldable (foldMap)

and try it out:

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText (TextNode text) = text
        getText _ = ""

Yeah! Hoogle is awesome.
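Why did that work? Lists are an instance of Foldable and Text is an instance of Monoid, so foldMap maps each element to a Text and glues the results together with the Monoid's append. The sketch below checks that against the concat-and-map version. Child is a hypothetical stand-in for Node, so the example doesn't need an HTML library, and childText, viaConcatMap and viaFoldMap are made-up names.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.Monoid (mempty)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for Node: a child is either text or something else.
data Child = TextChild Text | OtherChild
  deriving (Show, Eq)

-- Text for a text child; the Monoid identity (empty text) for anything else.
childText :: Child -> Text
childText (TextChild text) = text
childText OtherChild       = mempty

-- The first approach: map, then Data.Text's concat.
viaConcatMap :: [Child] -> Text
viaConcatMap = T.concat . map childText

-- The second approach: foldMap does the mapping and the appending in one go.
viaFoldMap :: [Child] -> Text
viaFoldMap = foldMap childText
```

Both functions should agree on any list of children, including the empty list, where each returns empty text.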

Review

Where are we now?

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: Text
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText (TextNode text) = text
        getText _ = ""

Ready for part 3?
