March 25, 2012

A comic look at Haskell (pt 2)

In the last part, we defined some Haskell foreign functions for the libarchive library. Now we'll get down to the actual business of translating a few Racket routines into Haskell, redesigning them a bit, and ultimately showing how lazy IO can be a very nice way to interact with this library.

Let's begin by translating the archive-read-current-file function that we saw previously:

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

Here, the low-level archive_read_data native method wants to be passed a buffer to write to. In the last section, we handled this by making archiveReadData write into a ByteString value (which can be mutated, as you saw, if you call the right "unsafe" operations, e.g. unsafeUseAsCStringLen). So now our problem is simply to provide an appropriately sized ByteString for the Haskell version of this routine to write to. We can do this with Data.ByteString.replicate (which acts like malloc plus memset rolled together: it allocates a ByteString of the given length, filled with a given byte). Then, based on the actual number of bytes read, we can trim the ByteString to the exact size with take:

archiveReadCurrentFile :: Archive -> Int -> IO ByteString
archiveReadCurrentFile ar sz = do
  let bs = replicate sz 0
  cnt <- archiveReadData ar bs
  return $ take cnt bs

This probably isn't the most efficient way to do this: replicate zero-fills the whole buffer up front, and the take slice keeps the oversized buffer alive. A better API might use Data.ByteString.Internal.createAndTrim, which allocates without the zero-fill and trims to the number of bytes actually read. But this would require redesigning archiveReadData to take a Ptr Word8 to write to instead, or simply calling archive_read_data directly. Something we can revisit later.
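For the curious, here's roughly what that might look like -- just a sketch, assuming the raw archive_read_data import and the unArchive accessor described in the last part, plus an extra import of createAndTrim from Data.ByteString.Internal:

archiveReadCurrentFile' :: Archive -> Int -> IO ByteString
archiveReadCurrentFile' ar sz =
  withForeignPtr (unArchive ar) $ \a ->
    createAndTrim sz $ \buf -> do
      -- write straight into the buffer that createAndTrim allocates,
      -- and report how many bytes were actually read so it can trim
      cnt <- archive_read_data a (castPtr buf) (fromIntegral sz)
      return $ fromIntegral cnt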

Next, let's translate the archive-read-next-file-entry function:

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

This is pretty straightforward. It simply calls the archive_read_next_header function, checks the result, and builds a Racket archive-entry structure (note the dash instead of underscore in the name) from 3 parts of the C archive_entry struct that's returned. Now, I don't know why it's actually necessary to build another structure containing the archive_entry components (perhaps because Alex wanted to pattern match against this structure?), but in Haskell, I'll just rely on the native methods we designed earlier to access the components of the archive_entry:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar = do
  rv <- archiveReadNextHeader ar
  case rv of
    Left err -> error (show err)
    Right entry -> return entry

Earlier we had made archiveReadNextHeader return a result of type Either Result ArchiveEntry. And if you recall, an Either value has Left and Right constructors that are used to distinguish either of 2 cases. Usually the Left case is used to specify an error result, and the Right case is for the common, non-error result. We could have returned a pair, (Result, Maybe ArchiveEntry), rather like the Racket code has done, but Either is a slightly cleaner way to do the same thing.

Now rather than returning our Either result, we want this routine to throw an exception for the error case, and simply return the ArchiveEntry for the common case. The case statement accomplishes this. Also, if you recall from earlier, we would like to be able to catch these exceptions, so we're going to swap out the error function with ioError and userError. And one more thing: The whole routine can be simplified by using the either function to say what to do in either of the 2 cases:

archiveReadNextFileEntry :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry ar =
  archiveReadNextHeader ar >>= either (ioError . userError . show) return

This says: Call the archiveReadNextHeader function (that's in the IO monad), and pass its result to the either function (rather like a unix pipe). Then either takes 2 parameters: the first says if you get a Left value, call the composed function (ioError . userError . show) on it (remember, this is equivalent to (\left -> ioError (userError (show left))) -- yes the dots seem to work backwards from most languages we're familiar with). In the Right case, simply call return on the value.
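If the point-free style reads like line noise at first, it may help to compare it with the spelled-out version it replaces (shown only for comparison; the one-liner above is the one we'll keep):

archiveReadNextFileEntry' :: Archive -> IO ArchiveEntry
archiveReadNextFileEntry' ar = do
  rv <- archiveReadNextHeader ar
  case rv of
    Left err    -> ioError (userError (show err))  -- throw a catchable IOError
    Right entry -> return entry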

Folding archives

Now we get to the fun part. Alex thought that a good API for visiting the elements of an archive file would be to fold over them. A fold is like a map that passes a piece of data along for the ride -- an accumulator value that it can change or replace as each file is visited. (When the accumulator is a list that each result gets consed onto, you get a map operation -- reversed, as we'll see shortly.) Here's the Racket fold routine:

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

This calls archive-read-next-file-entry, and when it returns an archive-entry the user's proc is called to handle its data. The proc is passed the archive-entry (actually a copy of it -- tsk, tsk), a function that can be called to read the current file (a thunk -- a nullary function that's used when we want to delay evaluation -- rather like Haskell's laziness, don't you think?), and the accumulator value, init-val. The result of the user's proc becomes the new accumulator that's passed to a recursive call to archive-folder, which will go on to read the next entry.

In Haskell, we can write this almost identically:

archiveFolder :: Archive -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                 a -> IO a
archiveFolder ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f entry (readCurrentEntry entry) acc
        archiveFolder ar f acc')
    (\_ -> return acc)
  where readCurrentEntry entry = do
          sz <- archiveEntrySize entry
          bs <- archiveReadCurrentFile ar sz
          return bs

There are a few subtle differences here. Instead of testing the result of archiveReadNextFileEntry, we're going to rely on catching an exception to know when to terminate the fold. And instead of using a thunk (which is rarely needed in a lazy language like Haskell), we're just going to pass an IO operation which is the result of calling the local function readCurrentEntry.

Let's define a few additional routines in a separate file so that we can test this out:

module Main where

import Codec.Archive.Libarchive
import Data.ByteString hiding (map)
import Data.Int
import qualified Data.ByteString.Lazy as L
import Prelude hiding (read, take)

bufSize :: Int
bufSize = 10240

archiveReadOpen :: String -> IO Archive
archiveReadOpen path = do
  ar <- archiveReadNew
  expectOK $ archiveReadSupportCompressionAll ar
  expectOK $ archiveReadSupportFormatAll ar
  expectOK $ archiveReadOpenFile ar path bufSize
  return ar

archiveFoldFile :: String -> (ArchiveEntry -> IO ByteString -> a -> IO a) ->
                   a -> IO a
archiveFoldFile path f acc = do
  ar <- archiveReadOpen path
  archiveFolder ar f acc

readEntry :: ArchiveEntry -> IO ByteString -> [(Int, ByteString)] ->
             IO [(Int, ByteString)]
readEntry entry reader acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- reader
            let sub = take (min sz 10) bs -- just keep some of the data for test purposes
            return $ (sz, sub):acc
    else return acc

path = "/Users/warrenharris/projects/blog/comic/v-vendetta-0.cbz"

test1 = archiveFoldFile path readEntry []

Let's call it:

*Main> test1
[(2535,"-=Book=-\r\r"),(956,"V For Vend"),(590199,"\255\216\255\224\NUL\DLEJFIF"),(374813,"\255\216\255\224\NUL\DLEJFIF"),(508622,"\255\216\255\224\NUL\DLEJFIF"),(518735,"\255\216\255\224\NUL\DLEJFIF"),(527499,"\255\216\255\224\NUL\DLEJFIF"),(451816,"\255\216\255\224\NUL\DLEJFIF"),(541239,"\255\216\255\224\NUL\DLEJFIF"),(481156,"\255\216\255\224\NUL\DLEJFIF"),(516998,"\255\216\255\224\NUL\DLEJFIF"),(510813,"\255\216\255\224\NUL\DLEJFIF"),(478150,"\255\216\255\224\NUL\DLEJFIF"),(437432,"\255\216\255\224\NUL\DLEJFIF"),(310424,"\255\216\255\224\NUL\DLEJFIF"),(265642,"\255\216\255\224\NUL\DLEJFIF"),(301822,"\255\216\255\224\NUL\DLEJFIF"),(676613,"\255\216\255\224\NUL\DLEJFIF"),(53109,"\255\216\255\224\NUL\DLEJFIF"),(452430,"\255\216\255\224\NUL\DLEJFIF"),(305523,"\255\216\255\224\NUL\DLEJFIF"),(319714,"\255\216\255\224\NUL\DLEJFIF"),(298048,"\255\216\255\224\NUL\DLEJFIF")]

One thing you'll notice here is that the data got reversed. The last entry (beginning with "-=Book=-\r\r") got put at the front of the list. This is because the fold routine that we wrote is essentially a left-fold, which means it starts from the left (the beginning of the archive contents) and pushes its results onto the front of a list as it goes. So the list has the first element on the right. No big deal. We can simply reverse it if we want.
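For example (test1Ordered is just a made-up name for this snippet), putting things back in archive order is a one-liner:

test1Ordered :: IO [(Int, ByteString)]
test1Ordered = fmap reverse test1  -- same fold, entries back in archive order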

Folding archives -- take 2

Now the first thing the astute reader will notice about archiveFolder is that it really isn't necessary to provide readCurrentEntry at all. This is something the user's function can implement as long as it has the Archive and ArchiveEntry values in hand. We might rewrite this fold routine as follows:

archiveFolder2 :: Archive -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFolder2 ar f acc =
  catch
    (do entry <- archiveReadNextFileEntry ar
        acc' <- f ar entry acc
        archiveFolder2 ar f acc')
    (\_ -> return acc)

Now let's provide a few additional routines so that we can test this out:

archiveFoldFile2 :: String -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a
archiveFoldFile2 path f acc = do
  ar <- archiveReadOpen path
  archiveFolder2 ar f acc

readEntry2 :: Archive -> ArchiveEntry -> [(Int, ByteString)] ->
              IO [(Int, ByteString)]
readEntry2 ar entry acc = do
  t <- archiveEntryFiletype entry
  if t == ae_ifreg
    then do sz <- archiveEntrySize entry
            bs <- archiveReadCurrentFile ar sz
            let sub = take (min sz 10) bs
            return $ (sz, sub):acc
    else return acc

test2 = archiveFoldFile2 path readEntry2 []

We get the exact same result if we call this one. All we've done is move a bit of the work into the readEntry2 function.

Step away from the fold

Now the problem with using a fold operation to traverse an archive file is that it doesn't give you a way to control which parts of the archive get visited. Granted, the user's proc can decide not to slurp in the contents of an archive_entry, but it can't decide that it doesn't want to continue visiting entries. The fold operation is going to continue marching to the end of the archive, whether you've got enough data or not.

Now it might be possible to use exceptions, continuations or delimited control to provide an escape hatch for fold, but a nicer abstraction would be one that didn't have to resort to escape hatches at all -- one that just builds the data structure you're about to use, as you decide you need it. Just like a file read operation that streams in chunks of the file each successive time read is called, wouldn't it be nice if we could provide a structured stream of data that only created the data values from the raw archive data as necessary? That's exactly what lazy IO can do for you.

If you remember what I said earlier, the IO monad isn't lazy at all. It's all about sequencing "real world" operations in an order that's predictable and timely. So before we get into the lazy part of lazy IO, let's just design a plain old IO operation that reads the entire archive.

To read an archive, we need to alternately read headers and archive data -- so there are essentially 2 states to the reading process. And as we read, we need to accumulate this data somewhere. Perhaps the most natural place to accumulate it is as a list of ByteStrings -- one for each archive_entry.

However, instead of regular (strict) ByteStrings, we're going to use lazy ByteStrings. There are 2 reasons for this: First, a lazy ByteString provides the same interface abstraction as a regular ByteString, but underneath it uses a sequence of buffer segments to allow data to be brought in as needed. Second, it has laziness built into it from the get-go. Although we're not going to use the lazy aspect of L.ByteString just yet, we'll need it eventually -- so just hold on.

Let's start from the bottom up, by first providing a function to return a lazy ByteString for each archive_entry. It's easy to make an L.ByteString from a list of strict ones simply by calling L.fromChunks:

archiveReadEntry :: Ptr ARCHIVE -> IO L.ByteString
archiveReadEntry a = do
  chunks <- archiveReadEntryChunks a
  return $ L.fromChunks chunks

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if length bs == 0
    then return []
    else do rest <- archiveReadEntryChunks a
            return (bs:rest)

archiveReadEntryChunk :: Ptr ARCHIVE -> IO ByteString
archiveReadEntryChunk a = do
  buf <- mallocArray bufSize
  rv <- archive_read_data a buf (fromIntegral bufSize)
  unsafePackCStringFinalizer (castPtr buf) (fromIntegral rv) (free buf)

Here I've cheated a bit and made all these functions take a Ptr ARCHIVE instead of an Archive. That's so that I can unwrap the Archive once at the beginning, and also because I don't plan to export these functions from the module. Note that we're mallocing fixed-size chunks and calling archive_read_data to fill them. Then we're using unsafePackCStringFinalizer to associate each chunk with its free finalizer. That's a little more efficient than what we had before. (One small housekeeping note: the length test here is ByteString's length, so it needs to be added to the Data.ByteString import list in our Libarchive module and hidden from the Prelude import.)

Next we can define a function that alternately reads headers and entry contents, and does this recursively until the end of the archive is reached:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

This says to read an archive, first read a header. If there is no error, check the archive entry's file type, and if it's a regular file, read the archive's contents followed by a recursive call to read the rest of the archive. If it's not a regular file, we'll skip that entry and continue on to the next one. We'll continue doing this until archiveReadNextHeader returns an error.

Now we're almost done. The only thing that remains is this little detail I mentioned earlier -- the fact that the IO monad is strict, and as written this will pull the entire contents of the archive into memory, building the list of (fully read in) L.ByteStrings as it goes. What we need to do is introduce a little delay in a couple of critical places, so that some of these IO operations aren't performed until they're needed. This is a job for unsafeInterleaveIO.

There are 2 places where unsafeInterleaveIO is needed. First, we need to delay reading in the tail of the list of L.ByteStrings until more entries are needed. Second, we need to delay reading in the entire contents of any individual entry, so that its L.ByteString isn't fully materialized until the data is actually consumed. The first one of these goes in archiveRead itself, at the point where the recursive call is about to be made:

archiveRead :: Archive -> IO [L.ByteString]
archiveRead ar = do
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- archiveReadNextHeader ar
    case rv of
      Right entry -> do
        t <- archiveEntryFiletype entry
        if t == ae_ifreg
          then do bs <- archiveReadEntry a
                  rest <- unsafeInterleaveIO $ archiveRead ar
                  return (bs:rest)
          else archiveRead ar
      Left _ -> return []

The second one goes in archiveReadEntryChunks, again at the point where it is about to be called recursively:

archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString]
archiveReadEntryChunks a = do
  bs <- archiveReadEntryChunk a
  if length bs == 0
    then return []
    else do rest <- unsafeInterleaveIO $ archiveReadEntryChunks a
            return (bs:rest)

Now the cool thing about lazy IO is that nothing gets read in until it is needed, but due to the sequencing savviness of monads, everything happens in exactly the same order as if it had been read in eagerly. So we get the best of both worlds -- an elegant abstraction (a list of ByteStrings), and the benefits of streaming.
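To make that concrete, here's a hypothetical consumer (firstPagePrefix isn't part of the library -- it's just an illustration built on archiveReadOpen and the imports from our Main module). Thanks to the two unsafeInterleaveIO calls, only a small prefix of the archive ever gets pulled off disk:

firstPagePrefix :: String -> IO L.ByteString
firstPagePrefix p = do
  ar <- archiveReadOpen p
  entries <- archiveRead ar          -- lazily; the bulk of the archive isn't read yet
  case entries of
    (e:_) -> return $ L.take 10 e    -- forcing this never touches the rest of the archive
    []    -> return L.empty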

Conclusion

At this point I'm sure you're disappointed to find that this article has had very little to do with comics, and in general hasn't been very funny at all. But stay tuned for part 3, where I'll attempt to put this library to use and actually build that web app for your comic viewing pleasure.

A comic look at Haskell

Well, if you're one of the 2 people that read my last set of blog articles, Restringing a Racket with Haskell, you'll know that Alex is attempting to branch out from the ancient esoteric art of Emacs Lisp hacking to learn the well-mannered sport of Racket. I, on the other hand, am trying to steer him towards the abyss. As Oliver Wendell Holmes once said, "The mind, once expanded to the dimensions of larger ideas, never returns to its original size." Here's to Alex with a big head!

Now last time I made a wild and unsubstantiated claim that Haskell makes me a better programmer, and then proceeded to show you a bunch of boring bits about trying to get a silly HTML microdata extractor to work. Well, this time, you'll be happy to know that I'm going to stand by my claim, and likewise go on to bore you with the likes of foreign functions and lazy IO. Why am I wasting your time, you ask? First, you're the one choosing to read this! Second, we've got to learn a little bit of the Haskell landscape before the rockinghorse people can get down to gnashing those marshmallow pies. So if I can get Alex, or anyone else out there in the studio audience, over the initial hurdle, I've done my bit.

So this time, I was tossed a comic book archive file (.cbz) and told that our goal was to create a little web app that displays the comic book pages. Along with this, I was given the following bit of Racket foreign function interface to libarchive, a nifty little package developed by another friend of ours, Tim, for reading all sorts of archive formats.

#lang racket/base
(require ffi/unsafe
         ffi/unsafe/define
         ffi/unsafe/alloc
         racket/match)

(define-ffi-definer define-archive (ffi-lib "libarchive" (list "12")))

(define ARCHIVE_OK 0)
(define ARCHIVE_EOF 1)

(define _archive-pointer (_cpointer 'archive))
(define _archive_entry-pointer (_cpointer 'archive-entry))

(define-archive archive_read_close
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_free
  (_fun _archive-pointer -> _int) #:wrap (deallocator))
(define-archive archive_read_new (_fun -> _archive-pointer)
  #:wrap (allocator archive_read_free))
(define-archive archive_read_support_compression_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_support_format_all
  (_fun _archive-pointer -> _int))
(define-archive archive_read_open_file
  (_fun _archive-pointer _string _int -> _int)
  #:wrap (allocator archive_read_close))
(define-archive archive_read_next_header
  (_fun _archive-pointer (o : (_ptr o _archive_entry-pointer))
        -> (r : _int)
        -> (values o r)))
(define-archive archive_entry_pathname
  (_fun _archive_entry-pointer -> _string))
(define-archive archive_entry_size
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_entry_filetype
  (_fun _archive_entry-pointer -> _int))
(define-archive archive_read_data
  (_fun _archive-pointer _pointer _int -> _int))

(define (archive-read-current-file archive entry-size)
  (let ([buffer (malloc 'atomic entry-size)])
    (memset buffer 0 entry-size)
    (archive_read_data archive buffer entry-size)
    (make-sized-byte-string buffer entry-size)))

(struct archive-entry (path size filetype) #:transparent)

(define (rewrite-file-mode mode)
  (match mode
    [#o0170000 'fmt]
    [#o0100000 'file]
    [#o0120000 'link]
    [#o0140000 'sock]
    [#o0020000 'chr]
    [#o0060000 'blk]
    [#o0040000 'dir]
    [#o0010000 'fifo]))

(define (archive-read-next-file-entry archive)
  (let-values ([(entry ret) (archive_read_next_header archive)])
    (if (= ret ARCHIVE_OK)
        (archive-entry (archive_entry_pathname entry)
                       (archive_entry_size entry)
                       (rewrite-file-mode (archive_entry_filetype entry)))
        (if (= ret ARCHIVE_EOF)
            #f
            (raise (list "error" ret))))))

(define (archive-folder archive proc init-val)
  (match (archive-read-next-file-entry archive)
    [(archive-entry path size filetype)
     (archive-folder
      archive proc
      (apply proc
             (list (archive-entry path size filetype)
                   (lambda () (archive-read-current-file archive size))
                   init-val)))]
    [#f init-val]))

(define (archive-fold-files archive-path proc init-val)
  (let ([archive (archive_read_new)]) ; check that archive != NULL
    (archive_read_support_compression_all archive) ; check return
    (archive_read_support_format_all archive) ; check return
    (archive_read_open_file archive archive-path 10240) ; check return
    (archive-folder archive proc init-val)))

(archive-fold-files
"/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
 (lambda (entry reader val)
   (if (equal? (archive-entry-filetype entry) 'file)
       (cons (cons (archive-entry-path entry)
                   (let ([data (reader)])
                    (subbytes data 0 (min (bytes-length data) 10)))) val)
       val))
 '())

;; (archive-fold-files
;;  "/home/alexbl/Downloads/Judge Dredd The Complete Case Files/Judge Dredd The Complete Case Files 001 (Digital Copy Colour Edit).cbr"
;;  (lambda (entry reader val)
;;    (cons (cons (archive-entry-path entry) (archive-entry-size entry)) val))
;;  '())

(provide archive-fold-files
         archive-entry)

As you can see from this file, there are a bunch of actual foreign function definitions (like archive_read_close), and then a number of higher-level routines to make it more convenient to get at an archive's contents.

Before we can begin translating this to Haskell foreign functions, there's a lot to learn about how Haskell's FFI works. There are a bunch of articles on that topic on the Haskell Wiki, as well as a very nice series of articles at Inside 233. But like always, I'm just going to cut to the chase and give you the answer, because we're both in a hurry.

To do this, we're going to use the lowest-level of Haskell's FFI tools, a preprocessor called "hsc2hs", to convert a foreign function definition into working Haskell code. There are several other ways to generate FFI interfaces for Haskell, but hsc2hs isn't too painful, and it's always good to know the basics. Maybe later we'll revisit this with c2hs or something else.

FFI Boilerplate

We start by creating a .hsc file in the right place for Haskell's module hierarchy. If we look on Hackage, we see that many archive handling packages are in the Codec namespace. So we'll do the same and create a file called "Codec/Archive/Libarchive.hsc" and give it some initial boilerplate:

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls, DeriveDataTypeable,
             ScopedTypeVariables #-}

#include <archive.h>
#include <archive_entry.h>

module Codec.Archive.Libarchive where

import Data.ByteString (ByteString, replicate, take)
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Unsafe (unsafePackCString, unsafeUseAsCStringLen,
                               unsafePackCStringFinalizer)
import Foreign.C (CInt, CSize, CString, withCAString)
import Foreign.Marshal.Alloc (alloca, free)
import Foreign.Marshal.Array (mallocArray)
import Foreign.Storable (peek)
import Foreign.Ptr (Ptr, FunPtr, castFunPtr, castPtr)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, newForeignPtr_,
                           withForeignPtr)
import Prelude hiding (replicate, take)
import System.IO.Unsafe (unsafeInterleaveIO)

I've listed the imports this way so it's easy to see which functions come from which modules, in case you want to go look them up. You really don't need to declare the individual functions you use in this way, provided that there aren't any conflicts between modules. (You'll notice that in the above, ByteString's replicate and take conflict with the Prelude versions, which is why those are hidden from the Prelude import. Also, the Data.ByteString.Lazy module defines a whole parallel set of lazy ByteString functions that we'll get at by using the "L" prefix.)

Now one cool thing about hsc2hs is that you can just include C headers directly, like I've done with the 2 libarchive headers. We'll see how these get used in just a few minutes. But first a bit of bad news. Unlike normal Haskell .hs files, you can't sit in haskell-mode in Emacs and evaluate .hsc files in the same way. I know that's really disappointing, but maybe some Emacs Lisp hacker will come along and fix that sad situation eventually. In the mean time, we'll resort to Haskell's big fancy make system, Cabal, to build this file. Here's the cabal definition, a file we'll name "libarchive.cabal" that lives in our top-level directory:

name: libarchive
version: 0.1
license: BSD3
cabal-version: >= 1.8
copyright: (c) 2012 Warren Harris
author: Warren Harris <warrensomebody@gmail.com>
maintainer: Warren Harris <warrensomebody@gmail.com>
stability: alpha
synopsis: libarchive bindings
description: libarchive bindings
category: Codec
build-type: Simple

Library
    exposed-modules:    Codec.Archive.Libarchive
    build-depends:      base >= 3 && < 5,
                        bytestring >= 0.9
    extra-libraries:    archive
    ghc-options:        -Wall

First a little aside: If you're on a Mac and you don't yet have libarchive installed, you can get it via MacPorts. I tried this the other day, and decided to run "port upgrade outdated" at the same time. Eleven (yes, 11) hours later, I had all sorts of shiny new toys like gcc45 and Erlang (Erlang?). This was the last straw, so I decided to switch to Homebrew, which provides a much saner package manager for the Mac. Unfortunately, you need a bit of magic to install the development headers for libarchive with Homebrew:

$ brew tap homebrew/dupes
$ brew install `brew --prefix`/Library/Formula/libarchive.rb

Now we're ready to configure and build our (empty) libarchive module:

$ cabal configure
$ cabal install

We'll need to re-run the cabal install command whenever we make a change to our Libarchive.hsc file. At least until we get that Emacs integration.

Our first foreign functions

Now, let's create our first foreign function definitions, starting with the interface to archive_entry. We'll put these at the end of the Libarchive.hsc file:

data ARCHIVE_ENTRY      -- native type
newtype ArchiveEntry = ArchiveEntry { unArchiveEntry :: ForeignPtr ARCHIVE_ENTRY }
                     deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_entry_pathname"
  archive_entry_pathname :: Ptr ARCHIVE_ENTRY -> IO CString

archiveEntryPathname :: ArchiveEntry -> IO ByteString
archiveEntryPathname ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    cstr <- archive_entry_pathname a
    unsafePackCString cstr

foreign import ccall unsafe "archive.h archive_entry_size"
  archive_entry_size :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntrySize :: ArchiveEntry -> IO Int
archiveEntrySize ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    sz <- archive_entry_size a
    return $ fromIntegral sz

foreign import ccall unsafe "archive.h archive_entry_filetype"
  archive_entry_filetype :: Ptr ARCHIVE_ENTRY -> IO CInt

archiveEntryFiletype :: ArchiveEntry -> IO Int
archiveEntryFiletype ae =
  withForeignPtr (unArchiveEntry ae) $ \a -> do
    mode <- archive_entry_filetype a
    return $ fromIntegral mode

The first line defines a new data type, ARCHIVE_ENTRY, with no constructors. The only reason for this is so that we can distinguish native pointers to C archive_entry structures from other kinds of pointers. In Haskell, you'll see Ptr ARCHIVE_ENTRY as the type of the first argument of the archive_entry accessors.

The second line defines a higher-level wrapper type for archive_entry structures. Here we point to our archive_entry with a Haskell ForeignPtr instead of a regular Ptr. ForeignPtrs allow us to associate a finalizer callback with a C pointer, so that when the garbage collector decides the object is no longer needed, the C structure can be automatically freed. You aren't required to design foreign function interfaces with ForeignPtrs. One can always provide APIs that require the user to explicitly free structures, or manage them in a block scope -- but here using finalizers seems like a good way to go. Now, unfortunately in the above routines we don't actually see the finalizer get associated with the instances. That's because all these structs are managed by the archive structure, and its finalizer will automatically destroy the archive_entry structs. So we'll come back to this finalizer thing later.

Another thing: newtype is a cool way to define a type that incurs no runtime overhead. Unlike a typedef in C, which only defines a type synonym, newtype defines a type that's logically distinct from the thing it wraps, so you can't accidentally use a ForeignPtr ARCHIVE_ENTRY when an ArchiveEntry is needed. The ArchiveEntry constructor that wraps these values is only there at compile time, so there's no runtime overhead for using a newtype.
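To make that concrete, wrapping and unwrapping are ordinary functions that the compiler erases -- a little sketch (these names are mine, not part of the module) that could live alongside the definitions above:

wrapEntry :: ForeignPtr ARCHIVE_ENTRY -> ArchiveEntry
wrapEntry = ArchiveEntry          -- no allocation, no copying; purely a relabeling

unwrapEntry :: ArchiveEntry -> ForeignPtr ARCHIVE_ENTRY
unwrapEntry = unArchiveEntry      -- likewise free at runtime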

Finally, you'll see that each of these foreign function definitions has two parts: a low-level C function declaration -- the foreign import statement with its associated type information -- and a higher-level user entry point that deals with unwrapping the ForeignPtrs and converting from C types like CInt to more convenient Haskell types like Int. Let's look at the first foreign import statement, archive_entry_pathname. It uses the unsafe declaration to say that this function is "unsafe" in the sense that it can't call back into the Haskell runtime. If it did, the Haskell runtime would become seriously confused. However, be aware that even if your native function doesn't call back into Haskell, you might want to declare it safe if you want to allow Haskell's lightweight threads to preempt it. All unsafe functions run to completion, and hog the entire process while they're running.
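For comparison, a "safe" version of the same import would look like the following (hypothetical -- we don't actually need it here, since libarchive never calls back into Haskell, but it shows where the annotation goes):

foreign import ccall safe "archive.h archive_entry_pathname"
  archive_entry_pathname_safe :: Ptr ARCHIVE_ENTRY -> IO CString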

All of these foreign import declarations also state that the native functions are in the IO monad (the final result such as CString is wrapped with IO). This might seem odd for a function like archive_entry_size which is purely functional in nature. The reason I'm doing this is due to the way libarchive works. Since it streams through the archive file's contents, if you call archive_entry_size after you move on to the next archive header, it returns 0. So it's really critical that when you use it in the IO monad it really runs right away, and isn't run lazily when the value is actually needed. If you allow it to run lazily, it runs too late and comes up empty handed. I found this out the hard way.

This brings us to an interesting point about the IO monad. Like all monads, the IO monad is about sequencing operations together, so that they happen in the right order. However, unlike most monads, the IO monad is strict (i.e. not lazy), which means that the operations happen as soon as you call the functions, rather than when the monad's result value is really needed. We'll come back to this when we talk about lazy IO later.

Moving on to the higher-level functions: I've named these using the standard Haskell camel-case conventions to distinguish them from the low-level routines. You'll notice that my module declaration doesn't declare which functions are exported (so everything will be exported by default), but the intention is to simply export the high-level functions and types, and hide the lower-level (and possibly dangerous) native methods. Each high-level function like archiveEntryPathname first unwraps the ArchiveEntry type by calling the unArchiveEntry record accessor to get at the ForeignPtr, and then uses withForeignPtr to get at the low-level Ptr ARCHIVE_ENTRY. This is then passed to the native method along with other arguments (translated into their C types), and the results are similarly translated back into Haskell types. For instance, unsafePackCString creates a ByteString from a null-terminated CString, and fromIntegral is the numeric workhorse that converts all sorts of numbers to whatever's needed.

Now one more thing we'll need when we call archiveEntryFiletype is a set of constants for the various modes it returns (things like AE_IFREG in the "archive_entry.h" header). We could have defined these with data constructors, but since I'm lazy I'll just define a set of octal constants here. Note that they have to begin with a lower-case letter since they're values, not types:

ae_ifmt :: Int
ae_ifmt = 0o170000

ae_ifreg :: Int
ae_ifreg = 0o100000

ae_iflnk :: Int
ae_iflnk = 0o120000

ae_ifsock :: Int
ae_ifsock = 0o140000

ae_ifchr :: Int
ae_ifchr = 0o020000

ae_ifblk :: Int
ae_ifblk = 0o060000

ae_ifdir :: Int
ae_ifdir = 0o040000

ae_ififo :: Int
ae_ififo = 0o010000

Skipping right along...

Before we define foreign functions for archive operations, let's first define some routines for dealing with these functions' return types. Instead of testing an integer result, let's use a data type to define the possible result codes, and some helper functions for calling these functions and converting their int results to this type:

data Result = OK | EOF | RETRY | WARN | FAILED | FATAL
            deriving (Eq, Show)

result :: CInt -> Result
result 0 = OK
result 1 = EOF
result (-10) = RETRY
result (-20) = WARN
result (-25) = FAILED
result (-30) = FATAL
result _ = FATAL  -- unexpected result

withResult :: (Ptr ARCHIVE -> IO CInt) -> Archive -> IO Result
withResult f ar =
  withForeignPtr (unArchive ar) $ \a -> do
    rv <- f a
    return $ result rv

expectOK :: IO Result -> IO ()
expectOK m = m >>= go
  where go OK = return ()
        go err = ioError $ userError (show err)

The function expectOK is something that wraps an IO operation returning a Result into one that throws an exception instead. This is just for the cases where we'd like to streamline the common case, and not deal with error handling every step of the way. I'm not sure that this function makes any sense in a production-quality version of this library, but we'll use it here when it's convenient. Also note that I originally wrote the last line as go err = error (show err) (err is the Result value, and error is a function that aborts the computation by raising an error). However, the exceptions raised by error can't be caught like those raised by ioError, so I switched.
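The same pattern extends easily. For instance, here's a hypothetical variant (not used in the rest of this code) that tags the error with which step failed, which can make a chain of setup calls easier to debug:

expectOKAt :: String -> IO Result -> IO ()
expectOKAt step m = m >>= go
  where go OK  = return ()
        go err = ioError $ userError (step ++ ": " ++ show err)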

Finally, here are the rest of the FFI definitions we'll need. They aren't all that interesting after you've grokked the above, so I'm not going to say any more about them:

data ARCHIVE     -- native type
newtype Archive = Archive { unArchive :: ForeignPtr ARCHIVE }
                deriving (Show, Eq)

foreign import ccall unsafe "archive.h archive_read_new"
  archive_read_new :: IO (Ptr ARCHIVE)

foreign import ccall unsafe "archive.h &archive_read_finish"
  p_archive_read_finish :: FunPtr (Ptr ARCHIVE -> IO CInt)

archiveReadNew :: IO Archive
archiveReadNew = do
  ar <- archive_read_new
  fp <- newForeignPtr (castFunPtr p_archive_read_finish) ar
  return $ Archive fp

foreign import ccall unsafe "archive.h archive_read_support_compression_all"
  archive_read_support_compression_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportCompressionAll :: Archive -> IO Result
archiveReadSupportCompressionAll ar =
  withResult archive_read_support_compression_all ar

foreign import ccall unsafe "archive.h archive_read_support_format_all"
  archive_read_support_format_all :: Ptr ARCHIVE -> IO CInt

archiveReadSupportFormatAll :: Archive -> IO Result
archiveReadSupportFormatAll ar =
  withResult archive_read_support_format_all ar

foreign import ccall unsafe "archive.h archive_read_open_file"
  archive_read_open_file :: Ptr ARCHIVE -> CString -> CSize -> IO CInt

archiveReadOpenFile :: Archive -> String -> Int -> IO Result
archiveReadOpenFile ar filename blocksize =
  withCAString filename $ \c_filename -> do
  let sz = fromIntegral blocksize
  withResult (\a -> archive_read_open_file a c_filename sz) ar

foreign import ccall unsafe "archive.h archive_read_next_header"
  archive_read_next_header :: Ptr ARCHIVE -> Ptr (Ptr ARCHIVE_ENTRY) -> IO CInt

archiveReadNextHeader :: Archive -> IO (Either Result ArchiveEntry)
archiveReadNextHeader ar =
  alloca $ \ptr -> do
    rv <- withResult (\a -> archive_read_next_header a ptr) ar
    case rv of
      OK -> do entry <- peek ptr
               fp <- newForeignPtr_ entry
               return $ Right $ ArchiveEntry fp
      err -> return $ Left err

foreign import ccall unsafe "archive.h archive_read_close"
  archive_read_close :: Ptr ARCHIVE -> IO CInt

archiveReadClose :: Archive -> IO Result
archiveReadClose ar = withResult archive_read_close ar

foreign import ccall unsafe "archive.h archive_read_finish"
  archive_read_finish :: Ptr ARCHIVE -> IO CInt

archiveReadFinish :: Archive -> IO Result
archiveReadFinish ar = withResult archive_read_finish ar

foreign import ccall unsafe "archive.h archive_read_data"
  archive_read_data :: Ptr ARCHIVE -> Ptr CString -> CSize -> IO CSize

archiveReadData :: Archive -> ByteString -> IO Int
archiveReadData ar bs =
  withForeignPtr (unArchive ar) $ \a ->
    unsafeUseAsCStringLen bs $ \(buf, len) -> do
    rv <- archive_read_data a (castPtr buf) (fromIntegral len)
    return $ fromIntegral rv

Ok, I will say one thing. In archiveReadNextHeader, you can see that the native archive_entry is wrapped as a ForeignPtr using newForeignPtr_. Unlike newForeignPtr, newForeignPtr_ doesn't take a finalizer. This is because the archive_entry struct will be destroyed when we destroy the archive structure, not when the Haskell ArchiveEntry object goes out of scope. However, in archiveReadNew you can see where we do associate the archive_read_finish native function as a destructor for the Haskell Archive object. (Note how we have to use FunPtr when we want a native pointer to a function. Also, the castFunPtr is used because we're ignoring the result of archive_read_finish. Haskell expects finalizers to return IO ().)
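With all the pieces in place, a quick smoke test ties them together (printFirstEntry is a throwaway helper, not part of the module; 10240 is the same block size the Racket code used):

printFirstEntry :: String -> IO ()
printFirstEntry path = do
  ar <- archiveReadNew
  expectOK $ archiveReadSupportCompressionAll ar
  expectOK $ archiveReadSupportFormatAll ar
  expectOK $ archiveReadOpenFile ar path 10240
  rv <- archiveReadNextHeader ar
  case rv of
    Right entry -> archiveEntryPathname entry >>= print  -- pathname of the first entry
    Left err    -> putStrLn $ "no entries: " ++ show err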

Next, in part 2 we'll get down to translating higher-level Racket routines like archive-fold-files. Don't doze off yet...

March 2, 2012

Restringing a Racket with Haskell (pt 5)

In the last section we finished up an initial version of our microdata extractor -- one that actually works this time. Admittedly, the code was starting to get a little ugly with all the state propagation going around. But don't worry -- with a simple monad we can simplify things and make them pretty once again.

Oh! Did I say monad? We aren't going to write yet another monad tutorial here, I hope! There are far too many of them in the world already. Instead I'm just going to port the code from part 3 to use the State monad to show that it's not that scary, and actually a little bit helpful.

First, we must import Control.Monad.State. If you go look at the package documentation, you'll see that it really uses Control.Monad.State.Lazy by default (there's a strict variant, but we won't go into it here).

Next, we need to figure out where to use this crazy thing. In the last part, we talked about propagating the scope stack -- the state -- through various operations. This was always apparent because their signatures ended with either Itemscope -> Itemscope or State -> State. Seeing something like this is a tell-tale sign that the State monad might be useful.

The two main operations that State provides are get and put. In many ways you can think of them as having a little mutable memory location where you can squirrel away things you need to update -- rather like box in Racket. However, unlike box, State follows the control flow of your program rather than just floating around in memory somewhere. (Haskell provides actual mutable reference cells (IORefs) that do "float around in memory", but let's not go into that here.) State is cooler than some of the alternatives because it really is functional even though it feels imperative. Awesomeness abounds.

To use State with our microdata extractor program, we simply need to define a state monad that our program's operations will be "in". By "in," I mean that the final type in these operations' signatures will be this state monad, or in other words, they'll return a monad value as their result. What's a monad value? It's just a function. In our case, it's a function that takes our state (the scope stack) and returns a result along with a new state (possibly a different one than the one you gave it).

Monads are a truly clever trick, and one that's well worth getting your head around in general. At a fundamental level, they're all about sequencing operations together (and Haskell provides built-in syntax for making this sequencing easier -- the do statement). But here, our interest is just a very specific kind of monad that deals with state.

So let's define a state monad for microdata extraction called ExtractState:

type ExtractState a = State (Itemscope, [Itemscope]) a

The type keyword defines a type synonym (like typedef in C++), and this type is parametrized by the variable a, which will be the result type of a state operation. ExtractState a is equivalent to the State type with the state type fixed to (Itemscope, [Itemscope]), the state we used in the last section. That's it.

Now the magic of defining this state type is that when we're in the ExtractState monad, calling get will get us our state -- seemingly out of the ether. And calling put will allow us to store a new scope stack into the background radiation. Really what's happening is that there's one more level of functional argument passing just below the monad surface that is shuttling our state around. Best not to think about it too deeply. Let's just use it.
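If you do want to peek under the surface, here's a simplified sketch of the idea (MiniState is a toy type of mine, not the real library's, which is layered over StateT and Identity; the Monad instance that chains these operations together is omitted):

newtype MiniState s a = MiniState { runMiniState :: s -> (a, s) }

miniGet :: MiniState s s
miniGet = MiniState $ \s -> (s, s)       -- hand back the threaded state as the result

miniPut :: s -> MiniState s ()
miniPut s' = MiniState $ \_ -> ((), s')  -- replace the threaded state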

Let's begin by defining a few helper functions to access the various parts of our state. We can define these in terms of get and put that State defines:

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = do (_, orphans) <- get; put (scope, orphans)

You can see in putScope that the old scope is dropped on the floor, and the new scope is stored in its place. If we want to get fancier, we can rewrite putScope in terms of modify, which passes the old state to a function that should return a new state to put:

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

We can also write a state accessor to save an otherwise orphaned Itemscope onto the list of orphans:

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

With these, we're now ready to rewrite our code to use the ExtractState monad. Let's begin with extractMicrodata which used to look like this:

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

We can shift it into ExtractState gear by rewriting it like this (deferring the body of the first case for a moment):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) = undefined
extractMicrodata _ = return ()

From this, you can see that the signature changed from Node -> State -> State to Node -> ExtractState (). ExtractState hides the details of the state transitions inside the monad operations that are used to implement this type -- functions that take a state and return a state. At the same time, the state parameter went away because we can access it with our monad accessor functions when it's needed. The () (called "unit") type parameter is like void in C++. It simply means that there's no interesting value returned from this function (although the real "value" here is the monad itself). We've used the standard monad return function to return a unit value in the second function case. That's because we don't modify the state, and don't really have a return value otherwise.

Now filling in the details:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  foldr extractItemprops children
extractMicrodata _ = return ()

we'll get into a bit of trouble:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[a0] -> b0'
    Expected type: ExtractState ()
      Actual type: [a0] -> b0
    In the return type of a call of `foldr'
    In the expression: foldr extractItemprops children
Failed, modules loaded: none.

We can see that the expected type was ExtractState (), whereas the actual type was a function from lists to something: [a0] -> b0. We didn't even get the arity right here since foldr is expecting another argument (the initial accumulator which was our scope stack). Using map instead isn't much better:

/Users/warrenharris/projects/racket-hs/microdata5.hs:64:3:
    Couldn't match expected type `StateT
                                    (Itemscope, [Itemscope]) Data.Functor.Identity.Identity ()'
                with actual type `[b0]'
    Expected type: ExtractState ()
      Actual type: [b0]
    In the return type of a call of `map'
    In the expression: map extractItemprops children
Failed, modules loaded: none.

Now we're returning a list instead of a monad. What we need is a monadic version of map. Fortunately, there is just such a thing, called mapM:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  do _ <- mapM extractItemprops children
     return ()
extractMicrodata _ = return ()

Since mapM returns a list of results, we're discarding it with the wildcard bind, and returning unit instead. We could also write this using Haskell's lower-level monad operation >> which throws away the result of the left-hand side:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  mapM extractItemprops children >> return ()
extractMicrodata _ = return ()

But this is still unsatisfying because we're constructing a list that we're just going to throw away. What we need is a version of mapM that doesn't build lists at all, but simply iterates over the input list, sequencing the monadic operations accordingly and throwing away whatever they return. Fortunately there is just such an operation, called mapM_ (the underscore designates "returns nothing"). And since what we're doing looks more like iteration than anything, there's a variant of mapM_ called forM_ that reverses its argument order so that it looks more like a foreach construct:

*Main> :t mapM
mapM :: Monad m => (a -> m b) -> [a] -> m [b]
*Main> :t mapM_
mapM_ :: Monad m => (a -> m b) -> [a] -> m ()
*Main> :t forM_
forM_ :: Monad m => [a] -> (a -> m b) -> m ()

Let's use forM_ to simplify once again:

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children $ (\child ->
    extractItemprops child)
extractMicrodata _ = return ()

Hey, that looks a lot like imperative code! For each child in children, run the operation extractItemprops, passing it the child. (The backslash here is a lambda, and in fact, since a lambda's body extends as far to the right as possible, we don't need the parens around the lambda expression either.) But since child is the last parameter passed to extractItemprops, we can simplify this just a little further (because \x -> f x is just f):

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

Let's move on to extractNewItemscope. Here was our previous definition:

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

This time we don't have a simple State -> State at the end of the function signature. Instead, the function returns a value along with the new state: State -> (Itemscope, State). Fortunately, we can describe this in the State monad as ExtractState Itemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = undefined
extractNewItemscope node = undefined

Filling in the details now:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs = do
    prevScope <- getScope
    let newScope = Itemscope (lookup "itemtype" attrs) []
    putScope newScope
    extractMicrodata elt
    scopeVal <- getScope
    putScope prevScope
    return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope
  return scopeVal  -- return current scope as scopeVal

From this you can see that we use the getScope operation to get the current scope. The back-arrow operator (<-) lets us bind a pattern to the result of a monad operation, and here we're binding the variable prevScope. We create a newScope using a let statement (note that a let in a do sequence isn't followed by the in keyword). Next we use putScope to store the new scope, then call extractMicrodata, then use getScope again to get the modified version of newScope (now called scopeVal). Then we use putScope again to put the previous scope back, and finally return scopeVal.

Now don't confuse let foo = ... and do foo <- .... They both more or less do something similar -- assign a variable -- but the latter works with monad operations. If we had used a let with a monad operation instead, we'd be getting our hands on the monad itself, not the return value that's in the monad. Using let with monads can be quite useful when manipulating control flow, or plugging pieces of programs together -- just not what we're interested in doing here. Monads are really powerful because they decouple the work of screwing the plumbing together from running the crap through the pipes (if you catch my drift).
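Here's a tiny (purely illustrative) example of the difference in our own monad:

letVsBind :: ExtractState Itemscope
letVsBind = do
  let action = getScope   -- action is the operation itself; nothing has run yet
  scope <- action         -- <- actually runs it and binds its result
  return scope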

Of course the code above is kind of ugly. Most of it is just manipulating previous and current scope. We can wrap this up in a nice little helper function that lets us lexically set the current scope:

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

Here, withScope takes a scope and another monad operation, m. It sets up the new scope, invokes m, reverts the scope, and returns m's value. (If you were worried about m raising exceptions, you'd want to put an exception handler in here to revert to the previous scope value, but we don't raise any exceptions in our code.)

We can now simplify extractNewItemscope:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $ do
      extractMicrodata elt
      scopeVal <- getScope
      return scopeVal
extractNewItemscope node = do
  extractMicrodata node
  scopeVal <- getScope  -- return current scope as scopeVal
  return scopeVal

or using the >> operation to further simplify:

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

Moving on to the extractItemprops function:

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name -> do
      scope <- getScope
      putScope $ addProp name (extractHtml5Value elt scopeVal) scope
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

There shouldn't be any surprises at this point. We're just using getScope to get the current scope, modifying it with addProp, and using putScope to store the new value back. This could be simpler too if we had a version of modify that only operated on the scope part of the state:

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

Monads running free

So I've shown you how to rewrite some simple state-propagating code to look cleaner and neater using the State monad, but how do you set it in motion -- supplying the initial state so that the monad can actually run? Well, every monad has some kind of run function that does this, and for State it's called (appropriately enough) runState. We can use this to rewrite extractMicrodataFromDoc:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

What runState does is take an initial state (here, a top-level Itemscope paired with an empty orphan list) and return a pair of the final answer (unit here, ()) and the final state (our (scope, orphans) pair). Since we want extractMicrodataFromDoc to return the itemscopes as its answer, we just discard the unit result and return scope:orphans instead.
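Incidentally, since all we care about here is the final state, execState (runState's sibling that throws away the answer and keeps the state) would let us skip the tuple pattern -- a sketch:

extractMicrodataFromDoc' :: Document -> [Itemscope]
extractMicrodataFromDoc' doc =
  let (scope, orphans) = execState (mapM_ extractItemprops (docContent doc))
                                   (Itemscope (Just "top") [], [])
  in scope : orphans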

Note that extractMicrodataFromDoc is not itself a monadic operation (since its return type isn't ExtractState). That means we can call it just like before in our test case:

test1 = "
\ \ Foobar\ \
\ \

quux

\ \
\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

We're done

Here's the complete program, with all of the pieces in place:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (forM_)
import Control.Monad.State
import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

type ExtractState a = State (Itemscope, [Itemscope]) a

getScope :: ExtractState Itemscope
getScope = do (scope, _) <- get; return scope

putScope :: Itemscope -> ExtractState ()
putScope scope = modify $ \(_, orphans) -> (scope, orphans)

saveOrphan :: Itemscope -> ExtractState ()
saveOrphan r = modify $ \(scope, orphans) -> (scope, r:orphans)

withScope :: Itemscope -> ExtractState a -> ExtractState a
withScope scope m = do
  prevScope <- getScope
  putScope scope
  rv <- m
  putScope prevScope
  return rv

modifyScope :: (Itemscope -> Itemscope) -> ExtractState ()
modifyScope f = modify $ \(scope, orphans) -> (f scope, orphans)

extractMicrodata :: Node -> ExtractState ()
extractMicrodata elt@(Element _ _ children) =
  forM_ children extractItemprops
extractMicrodata _ = return ()

extractNewItemscope :: Node -> ExtractState Itemscope
extractNewItemscope elt@(Element _ attrs _)
  | isJust $ lookup "itemscope" attrs =
    withScope (Itemscope (lookup "itemtype" attrs) []) $
      extractMicrodata elt >> getScope
extractNewItemscope node =
  extractMicrodata node >> getScope

extractItemprops :: Node -> ExtractState ()
extractItemprops elt@(Element _ attrs _) = do
  scopeVal <- extractNewItemscope elt
  case lookup "itemprop" attrs of
    Just name ->
      modifyScope $ addProp name (extractHtml5Value elt scopeVal)
    Nothing -> saveOrphan scopeVal
extractItemprops _ = return ()

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let f = mapM_ extractItemprops $ docContent doc
      initialState = (Itemscope (Just "top") [], [])
      ((), (scope, orphans)) = runState f initialState
  in scope:orphans

-- Debugging stuff:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1
I hope you got something out of this long-winded tour. Programming in Haskell can be pretty nice once you get over a few hurdles.

February 25, 2012

Restringing a Racket with Haskell (pt 4)

Houston, we have a problem. If we evaluate result1 from part 3 and look at it closely, we'll see it:

*Main> result1
Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"}]}]

Uh, let me format that for you:

Right [
  Itemscope {itemscopeType = Nothing,
             itemscopeProperties = [
               Itemprop {itempropName = "name",
                         itempropValue = TextValue "quux"},
               Itemprop {itempropName = "item",
                         itempropValue = ItemscopeValue (
                           Itemscope {itemscopeType = Nothing,
                                      itemscopeProperties = []})}]},
  Itemscope {itemscopeType = Just "type",
             itemscopeProperties = [
               Itemprop {itempropName = "name",
                         itempropValue = TextValue "Foobar"}]}]

and pull up our original HTML for reference:

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
"

Here we can see that the Itemprops aren't in the right places at all. The "name"="quux" one should be inside the empty Itemscope that's the value of "item", and that Itemprop should be a sibling of the "name"="Foobar" one that's below. What's going wrong?

Well, after thinking a bit more about what the algorithm here needs to do, it's clear that we need to traverse one tree (the HTML), and build a new one (the Itemscopes). But although we used foldr to push items onto a stack as we walked down the HTML tree, we didn't pop any items from that stack as we unwound the HTML tree. So we were always left with the last pushed item on the top of the stack when we updated the Itemprop values. This is completely broken.

Just to be a little more clear, there are 3 uses of foldr in our code: in extractItemprops, extractMicrodata and in extractMicrodataFromDoc. The one in extractItemprops is fine because it is simply processing a list of attributes that all belong to the same HTML element. Similarly, the one in extractMicrodataFromDoc is fine because it is processing the content of the document which is a list of HTML nodes all at the top-level of the document. The problem is in extractMicrodata which not only processes the immediate children of an element, but calls itself to process each of their children.

It isn't obvious how we can modify extractMicrodata to do this pushing and popping, since it isn't until we get deep into its implementation that we know we're looking at an element with an "itemscope" attribute. A better approach here would be to recurse down the xmlhtml tree, calling our extraction operations in a mutually recursive way, and returning a correctly constructed Itemscope from each call.

By mutually recursive, we mean calling extractMicrodata, extractItemprops, and extractNewItemscope from each other as necessary -- so the 3 of these form a recursive cycle. But which one should be called first? To determine that, we need to consider the html structure. Some nodes like the div node in the example above have both itemprop and itemscope attributes. When this occurs we want the itemprop ("item" in this case) to have a value that contains the itemscope and all its sub-properties ("name"="quux" here). This means that extractItemprops should be called first (from extractMicrodataFromDoc), and it should in turn call extractNewItemscope to build a new Itemscope value when the itemscope attribute is present. Finally extractNewItemscope should call extractMicrodata to recursively process the html children, and this can recurse back to extractItemprops once again.
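
Sketched as a call graph, the plan looks like this:

-- extractMicrodataFromDoc
--   calls extractItemprops            (handles the itemprop attribute)
--     which calls extractNewItemscope (handles the itemscope attribute)
--       which calls extractMicrodata  (walks the element's children)
--         which calls extractItemprops again, closing the cycle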

Unlike before when each of these functions would take a stack (list) of Itemscopes, now it's only necessary for them to take the current Itemscope object, and return a (possibly new) Itemscope on the way back out. This is because our call stack will replace the explicit stack (that we never popped) from before.

Let's start with extractMicrodata, since it's at the bottom of this call stack. Rather than calling all 3 extraction functions on each child, it now only calls extractItemprops:

extractMicrodata :: Node -> Itemscope -> Itemscope
extractMicrodata elt@(Element _ _ children) scope =
  foldr extractItemprops scope children
extractMicrodata _ scope = scope

Pretty straightforward. Next, extractMicrodata is called from extractNewItemscope. In the case where we have an element with an itemscope attribute, we'll construct a new Itemscope and pass this to extractMicrodata. Otherwise, we'll continue working with the one we were given:

extractNewItemscope :: Node -> Itemscope -> Itemscope
extractNewItemscope elt@(Element _ attrs _) scope
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
    in extractMicrodata elt newScope
extractNewItemscope node scope = extractMicrodata node scope

Next we can rework extractItemprops (and its associated routine, extractItemprop) to call extractNewItemscope:

extractItemprop :: Node -> (Text, Text) -> Itemscope -> Itemscope
extractItemprop elt ("itemprop", name) scope =
  addProp name (extractHtml5Value elt scope) scope
extractItemprop _ _ scope = scope

extractItemprops :: Node -> Itemscope -> Itemscope
extractItemprops elt@(Element _ attrs _) scope =
  let scope' = extractNewItemscope elt scope
  in foldr (extractItemprop elt) scope' attrs
extractItemprops _ scope = scope

Now let's test them out on some simple html. To make doing this easier in ghci, let's first define a helper function to parse the html and strip off the Either wrapper (by throwing an exception in the Left (error) case), and a dummy Itemscope to use at the top level:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

Let's try our new functions out. Be sure to first enable OverloadedStrings in ghci so that we can have the literal html string converted to the correct type for our helper function:

*Main> :set -XOverloadedStrings
*Main> extractNewItemscope (html "

quux

") topscope Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractNewItemscope (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]}

This seems to be working correctly. When we call extractNewItemscope with a node that has itemscope, we build a new Itemscope object and store the "name"="quux" property there. And when the node doesn't have itemscope, we store "name"="quux" in the current Itemscope that's passed in (topscope in this case).

But wait a minute... In the first case above, we ended up dropping the passed-in scope on the floor. You can see this because the scope parameter is never used in extractNewItemscope's first function case. This isn't right, because that scope might already contain someone's "good stuff" that they're expecting to see in the final result. What we really need to do in extractNewItemscope is return both the previous scope (our state) and the newly constructed Itemscope object (our result value) if we do indeed construct one. Likewise, in the case where we don't construct one, then the updated scope will be both our state, and our Itemscope result value:

extractNewItemscope :: Node -> Itemscope -> (Itemscope, Itemscope)
extractNewItemscope elt@(Element _ attrs _) scope
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
    in (extractMicrodata elt newScope, scope)
extractNewItemscope node scope =
  let scope' = extractMicrodata node scope
  in (scope', scope')

Since we've changed the type of extractNewItemscope, this requires a similar change in extractItemprops. Now we need to receive both the result value and the updated state from extractNewItemscope and use the new Itemscope value whenever the itemprop keyword is seen:

extractItemprop :: Node -> Itemscope -> (Text, Text) -> Itemscope -> Itemscope
extractItemprop elt is ("itemprop", name) scope =
  addProp name (extractHtml5Value elt is) scope
extractItemprop _ _ _ scope = scope

extractItemprops :: Node -> Itemscope -> Itemscope
extractItemprops elt@(Element _ attrs _) scope =
  let (is, scope') = extractNewItemscope elt scope
  in foldr (extractItemprop elt is) scope' attrs
extractItemprops _ scope = scope

Let's see if this works:

*Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]} *Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = []}

Hmmm... another problem. In the first case, where there is no itemscope attribute, things work properly: topscope is updated to contain the "name"="quux" property as we expect. But in the second case, when itemscope is present, we only get back an empty topscope. The reason is that there is no itemprop that connects the topscope to the inner Itemscope object. Let's add one and verify that this is indeed the problem:

*Main> extractItemprops (html "

quux

") topscope Itemscope {itemscopeType = Just "top", itemscopeProperties = [Itemprop {itempropName = "connect", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]})}]}

This is what we're expecting to see -- the topscope now contains the "connect" property that connects it to the previously orphaned Itemscope containing "name"="quux". So this means that although extractItemprops appears to be working ok for the case where everything is connected, we have an oversight for the case when we have isolated Itemscopes that aren't connected. We need to return these as results too somehow.

The easiest way to do this might be to find the point in the code where we're about to drop an orphaned Itemscope, and store it in a list somewhere so that we can get it when the algorithm completes. This new list becomes another piece of state that we need to pass around.

In order to keep things tidy, let's introduce a State type and use this in all of our stateful functions. That will be a little clearer than making the result be a tuple. Here's our state which includes the current Itemscope that the algorithm is working on, which we had before, and the new list of orphaned Itemscopes:

type State = (Itemscope, [Itemscope])

Now we can rewrite our stateful code as follows:

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

extractItemprop :: Node -> Itemscope -> (Text, Text) -> State -> State
extractItemprop elt is ("itemprop", name) (scope, orphans) =
  (addProp name (extractHtml5Value elt is) scope, orphans)
extractItemprop _ _ _ state = state

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, state') = extractNewItemscope elt state
  in foldr (extractItemprop elt is) state' attrs
extractItemprops _ state = state

The problem now is that although we're passing our new result state around, we are never storing anything into it. As the algorithm is implemented, it isn't clear where this should happen... so I'll give you a hint. It happens in the case where the foldr in extractItemprops doesn't encounter an itemprop attribute. I guess foldr isn't what we want anymore then.

Now we could do another pass over the attributes to determine whether itemprop occurs at all, or we could pass a boolean through the foldr accumulator and update it if anything is found. But let's make it easy on ourselves and say that the itemprop attribute can only occur once in an html node. That allows us to rewrite extractItemprops to use lookup instead of foldr to find the itemprop attribute:

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, (scope, orphans)) = extractNewItemscope elt state
  in case lookup "itemprop" attrs of
    Just name -> (addProp name (extractHtml5Value elt is) scope, orphans)
    Nothing -> (scope, is:orphans)
extractItemprops _ state = state

Now in the case where we find something, we add a new property as before. But in the case where we don't, we throw the Itemscope object on the result list instead.

Let's wrap this up by reworking the last function we need, extractMicrodataFromDoc:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let initialState = (Itemscope (Just "top") [], [])
      (scope, orphans) = foldr extractItemprops initialState $ docContent doc
  in scope:orphans

And finally, our debugged, working code:

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

type State = (Itemscope, [Itemscope])

extractMicrodata :: Node -> State -> State
extractMicrodata elt@(Element _ _ children) state =
  foldr extractItemprops state children
extractMicrodata _ state = state

extractNewItemscope :: Node -> State -> (Itemscope, State)
extractNewItemscope elt@(Element _ attrs _) (scope, orphans)
  | isJust $ lookup "itemscope" attrs =
    let newScope = Itemscope (lookup "itemtype" attrs) []
        (scopeVal, orphans') = extractMicrodata elt (newScope, orphans)
    in (scopeVal, (scope, orphans'))
extractNewItemscope node state =
  let state'@(scopeVal, _) = extractMicrodata node state
  in (scopeVal, state')  -- return current scope as scopeVal

extractItemprops :: Node -> State -> State
extractItemprops elt@(Element _ attrs _) state =
  let (is, (scope, orphans)) = extractNewItemscope elt state
  in case lookup "itemprop" attrs of
    Just name -> (addProp name (extractHtml5Value elt is) scope, orphans)
    Nothing -> (scope, is:orphans)
extractItemprops _ state = state

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc =
  let initialState = (Itemscope (Just "top") [], [])
      (scope, orphans) = foldr extractItemprops initialState $ docContent doc
  in scope:orphans

-- Debugging stuff:

html str = case parseHTML "" str of
  Left msg -> error msg
  Right rslt -> head $ docContent rslt

topscope = Itemscope (Just "top") []

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

Now are we done? There's always one more thing! In part 5 we'll see how to use the "m" word to simplify this further.

Restringing a Racket with Haskell (pt 3)

Now we tackle the tricky part. extract-microdata uses the dreaded hobgoblin of functional programming: mutable state! Annoyingly convenient little boxes with updatable values, and mutable struct fields. One problem with all this mutable stuff is that it allows the control flow to follow a different path from the state changes. To turn this into functional Haskell code, we'll need to do a bit of untangling so that state can travel along with the control flow, and be replaced with new state whenever a change is needed.

Here's the Racket function again, for reference:

(define (extract-microdata root [scope (void)] [result (box '())])
  (let ([new-scope (void)])
    (when (h:html-full? root)
      (for ([element (h:html-full-content root)])
           (match element
             [(h:html-element (list (x:attribute start stop key value) ...))
              (let ([attrs (for/list ([k key] [v value])
                                     (cons k v))])
                (when (assoc 'itemscope attrs)
                  (set! new-scope
                        (itemscope (cdr (or (assoc 'itemtype attrs)
                                            (cons "" #f))) '()))
                  (when (void? scope)
                    (set-box! result (cons new-scope (unbox result)))))

                (for ([prop attrs]
                      #:when (eq? (car prop) 'itemprop))
                     (set-itemscope-properties!
                      scope (cons (itemprop (cdr prop)
                                            (extract-html5-value element
                                              new-scope))
                                  (itemscope-properties scope)))))
              (extract-microdata element new-scope result)]
             [else #f]))))
  (unbox result))

This function iterates over all the content nodes of the document, and when they are elements it potentially creates new itemscope structs and pushes them into the result box. It also iterates over all the attributes of the elements looking for itemprop attributes. When it finds one, it extracts its HTML5 value and pushes it onto the properties field of the current itemscope object. Finally it recurses over each of the element's children. Perhaps we can break this down into several more manageable pieces.
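
As a rough roadmap, here are the signatures of the pieces we'll end up with by the end of this part (definitions to follow):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]                 -- push a new scope when we see itemscope
extractItemprop     :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope] -- handle a single attribute
extractItemprops    :: Node -> [Itemscope] -> [Itemscope]                 -- fold over an element's attributes
extractMicrodata    :: Node -> [Itemscope] -> [Itemscope]                 -- recurse over an element's children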

First, let's focus on recognizing itemscope attributes and creating new Itemscope values. Let's call this piece extractNewItemscope. Since this function can potentially push a new Itemscope on a stack, it needs to take a list of Itemscope as input, and produce another one as output (linked lists in Haskell make perfect stacks):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt scopes = undefined

Again, we need to make function cases based on whether the Node is an Element or something else, and we need to test for when the attributes of the element contain an "itemscope" key:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes =
  if isJust $ lookup "itemscope" attrs
  then undefined
  else scopes
extractNewItemscope _ scopes = scopes

Since the first thing we're doing in the first function case is another test, we can use a guard clause to further refine it, letting the second case pick up the work the else clause was doing:

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    undefined
extractNewItemscope _ scopes = scopes

Now the body of this function only needs to build a new Itemscope value and push it onto the scopes stack. The type of this Itemscope is the value of the "itemtype" attribute (if any), and the properties start out empty. The colon (:) is the "cons" operator that pushes a new value onto the left-hand side of an existing list (scopes here):

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

Next, let's define a function to update the properties of an Itemscope when we see an "itemprop" attribute. To make this clearer, we'll first define a helper function to add a new property to an existing Itemscope (since it's functional, it returns a new Itemscope):

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)
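
For example, in ghci (with OverloadedStrings enabled so the string literals become Text):

*Main> addProp "name" (TextValue "quux") (Itemscope (Just "type") [])
Itemscope {itemscopeType = Just "type", itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "quux"}]}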

Next we can use this to update the top-most Itemscope on the stack when we see an "itemprop" attribute:

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

The body uses addProp to build a new Itemprop from the name and the element's HTML5 value, push it onto the list of existing properties, and produce a new Itemscope to replace the one we started with. Notice how we pattern matched the entire scope stack with (scope:scopes) so that we can throw the old top scope away and return the tail of the list with the updated scope consed onto the front.

Oops -- when we compile this we see that another tweak is needed. We previously had defined the value field of the Itemprop data type as Text, but in part 2 we discovered that values needed to be of type MicrodataValue. Let's fix that:

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

One nice thing about Haskell modules (the stuff within a single file) is that all the definitions are mutually recursive. That means that you can list functions in top-down dependency order, bottom-up, or however you want to organize things. The same goes for types, which is why we can change this field to MicrodataValue even though it's defined below in the file. (I spent several years programming in O'Caml, and it requires things to be defined in bottom-up dependency order. Believe me, this is better.)

Next, let's define a function that's similar to extractItemprop, but that works with all attributes of an element, not just one. We might be able to use map, but since we want each attribute to be able to update the top-most Itemscope on the stack we need to thread the stack (scopes) through all these map operations. That's exactly what a "fold" does -- it passes an accumulator parameter along and expects you to accumulate your result into the accumulator instead of just returning it like map does. map is a fold where the accumulation is done for you -- the function results are accumulated into a list.

We'll use foldr to do this. foldr folds from right to left. This might seem backwards, but it's usually the right thing in Haskell. The reason it's often the right thing is that lists are built by pushing newer elements on the left, so if you want to preserve the order of your results with respect to the original list, you need to start from the right. Also, if you're relying on Haskell's laziness to avoid computing values from the tail of the list that are never used, then foldr is exactly what you want. (On the other hand, if you don't care about preserving the left-to-right order, then foldl' is way more efficient. See Foldr Foldl Foldl' if you want to understand this better.)
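
To see the direction concretely, here's how foldr groups its applications, and why it preserves list order (while a left fold that conses onto its accumulator would reverse it):

-- foldr f z [a, b, c]            ==  f a (f b (f c z))
-- foldr (:) [] [1, 2, 3]         ==  1 : (2 : (3 : []))  ==  [1, 2, 3]
-- foldl (flip (:)) [] [1, 2, 3]  ==  [3, 2, 1]

With that in hand, here's extractItemprops: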

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

That's it for extractItemprops. The scopes stack is passed to each extractItemprop call, and a new scope stack is returned that will be passed to the next call, or returned as the final result. Note that rather than just calling extractItemprop directly we're doing a bit of currying magic here, calling (extractItemprop elt) instead. This gives each call the same elt parameter, and the partial application returns a new function that fits the requirement for foldr.
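
If partial application is new to you, here's the same trick on a throwaway example (add3 and addTen aren't part of our extractor):

-- add3 wants three arguments...
add3 :: Int -> Int -> Int -> Int
add3 a b c = a + b + c

-- ...so applying it to just one gives back a function of the remaining two:
addTen :: Int -> Int -> Int
addTen = add3 10

-- map (add3 1 2) [10, 20]  ==  [13, 23]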

Now we're at a place where we can put these pieces together to write a functional version of extract-microdata. Since it's functional we'll want to pass our state along at every step, and return new state as a result. And since it will want to operate on every child of an element, we'll want to use our friend foldr again. The only tricky part is figuring out what each step performed by foldr actually needs to do:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes = undefined
extractMicrodata _ scopes = scopes

Hopefully at this point, there are no surprises here. The extractMicrodata function looks a lot like extractItemprops that we've seen already. But why name the local function "go"? That's just a common Haskell idiom (in Zetalisp it was always "frob", so this is just another step forward).

The body of go just needs to thread our state (scopes) through each of the steps:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          let scopes' = extractNewItemscope child scopes
              scopes'' = extractItemprops child scopes'
          in extractMicrodata child scopes''
extractMicrodata _ scopes = scopes

First extractNewItemscope is called which potentially pushes a new Itemscope on the stack. We'll call its result scopes' (using a "prime" is just another idiom -- the single quote is a perfectly valid identifier character in Haskell). Then extractItemprops is called with scopes' and returns scopes''. Finally we recursively call extractMicrodata for the child so we can process its children, passing scopes''. This ultimately returns a new scope stack as our result.

Of course we're never happy with this much verbosity in Haskell, so we can write this a little more concisely once again:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          extractMicrodata child $
            extractItemprops child $
              extractNewItemscope child scopes
extractMicrodata _ scopes = scopes

How's that? Well, let me counter your question with how's this?:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child scopes =
          (extractMicrodata child .
           extractItemprops child .
           extractNewItemscope child) scopes
extractMicrodata _ scopes = scopes

We're using the function composition operator, dot (.), to compose the 3 functions that all get passed scopes. Note that the composition operator in Haskell works by calling the right-most function first, and passing the result to the leftmost -- the way mathematicians do it, but perhaps backwards from what we expect from most programming languages. It is defined as:

(.) :: (b -> c) -> (a -> b) -> a -> c
(f . g) x = f (g x)
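
A quick ghci check of that ordering:

*Main> ((*2) . (+1)) 3
8
*Main> ((+1) . (*2)) 3
7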

One more tweak:

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

Now we're dropping the application of the composed functions to scopes, and dropping scopes from the parameter list. Instead, we're just returning the composed function itself since it has the correct type. We're moving into what they call the "point-free style" here (sometimes referred to as "pointless style"). Maybe this is getting too carried away.
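
For a flavor of the difference on a throwaway function (not part of our extractor):

-- pointful: the argument xs is named explicitly
sumOfSquares :: [Int] -> Int
sumOfSquares xs = sum (map (^ 2) xs)

-- point-free: the argument disappears into a composition
sumOfSquares' :: [Int] -> Int
sumOfSquares' = sum . map (^ 2)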

Wrap it up...

The only thing we need to do now is put a bow on it, and test it out.

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = case parseHTML "" test1 of Left msg -> Left msg Right tree -> Right $ extractMicrodataFromDoc tree

extractMicrodataFromDoc simply folds extractMicrodata over all the document's content. The variables test1 and result1 are things I've thrown in for testing purposes. You shouldn't leave them in production code, but if you evaluate result1 in the *haskell* buffer, you'll see our handiwork:

*Main> result1
Right [Itemscope {itemscopeType = Nothing, itemscopeProperties = [Itemprop {itempropName = "name", itempropValue = TextValue "Foobar"},Itemprop {itempropName = "name", itempropValue = TextValue "quux"},Itemprop {itempropName = "item", itempropValue = ItemscopeValue (Itemscope {itemscopeType = Nothing, itemscopeProperties = []})}]},Itemscope {itemscopeType = Just "type", itemscopeProperties = []}]

Of course, there are always improvements to be made. Note how in the definition of result1 we're reaching into the Either data structure and in the Right case calling another function on what's there. Effectively, we're doing an operation that might be generalized like this:

mapEither :: (a -> b) -> Either c a -> Either c b
mapEither f (Right r) = Right (f r)
mapEither _ (Left l)  = Left l
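
For example, in ghci:

*Main> mapEither length (Right "quux" :: Either String String)
Right 4
*Main> mapEither length (Left "oops" :: Either String String)
Left "oops"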

Well, there's a very common generalization of this pattern called fmap. Instead of operating on an Either data type, it operates on anything that supports the Functor type class.

fmap :: Functor f => (a -> b) -> f a -> f b

The Functor type class is basically one that allows functions applied to data structures to be applied to that data structure's elements. We can use that to simplify the result1 definition:

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

fmap takes a bit to get your head around, but it's really powerful and useful. If you look at the definition of mapEither that I gave, you can get some idea of how it works. In fact, given that definition, we could have defined the Functor instance for Either like this (if we didn't have a definition already):

instance Functor (Either a) where
  fmap = mapEither

Why (Either a) in the above? Great Good explains it better than I can.
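
A few ghci experiments show fmap doing the same job across different functors (note that the Either results match what mapEither gave us above):

*Main> fmap (+1) (Just 3)
Just 4
*Main> fmap (+1) [1, 2, 3]
[2,3,4]
*Main> fmap length (Right "quux" :: Either String String)
Right 4
*Main> fmap length (Left "oops" :: Either String String)
Left "oops"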

Review

Where are we now?

{-# LANGUAGE OverloadedStrings #-}

import Data.Foldable (foldMap)
import Data.List (find)
import Data.Maybe (isJust)
import Data.Text (Text, concat)
import Prelude hiding (concat)
import Text.XmlHtml

data Itemprop = Itemprop {
  itempropName :: Text,
  itempropValue :: MicrodataValue
  } deriving (Show, Eq)

data Itemscope = Itemscope {
  itemscopeType :: Maybe Text,
  itemscopeProperties :: [Itemprop]
  } deriving (Show, Eq)

addProp :: Text -> MicrodataValue -> Itemscope -> Itemscope
addProp name value (Itemscope t props) =
  Itemscope t (Itemprop name value : props)

hasItemscope :: Node -> Bool
hasItemscope elt@(Element _ _ _) =
  isJust $ lookup "itemscope" $ elementAttrs elt
hasItemscope _ = False

data MicrodataValue = ItemscopeValue Itemscope
                    | TextValue Text
                    deriving (Show, Eq)

extractHtml5Value :: Node -> Itemscope -> MicrodataValue
extractHtml5Value elt scope | hasItemscope elt = ItemscopeValue scope
extractHtml5Value elt scope =
  TextValue $ foldMap getText $ elementChildren elt
  where getText elt@(TextNode text) = text
        getText _ = ""

extractNewItemscope :: Node -> [Itemscope] -> [Itemscope]
extractNewItemscope elt@(Element _ attrs _) scopes
  | isJust $ lookup "itemscope" attrs =
    Itemscope (lookup "itemtype" attrs) [] : scopes
extractNewItemscope _ scopes = scopes

extractItemprop :: Node -> (Text, Text) -> [Itemscope] -> [Itemscope]
extractItemprop elt ("itemprop", name) (scope:scopes) =
  addProp name (extractHtml5Value elt scope) scope : scopes
extractItemprop _ _ scopes = scopes

extractItemprops :: Node -> [Itemscope] -> [Itemscope]
extractItemprops elt@(Element _ attrs _) scopes =
  foldr (extractItemprop elt) scopes attrs
extractItemprops _ scopes = scopes

extractMicrodata :: Node -> [Itemscope] -> [Itemscope]
extractMicrodata elt@(Element _ _ children) scopes =
  foldr go scopes children
  where go child =
          extractMicrodata child .
          extractItemprops child .
          extractNewItemscope child
extractMicrodata _ scopes = scopes

extractMicrodataFromDoc :: Document -> [Itemscope]
extractMicrodataFromDoc doc = foldr extractMicrodata [] $ docContent doc

test1 = "
\ \
\ \

quux

\ \
\ \ Foobar\ \
" result1 = fmap extractMicrodataFromDoc $ parseHTML "" test1

So we're done? Well, if you're an astute reader, you've already realized that this code doesn't actually work as intended. Stay tuned for part 4 where we'll do a little debugging.